Asked by: Alok  Asked: 10/5/2023  Last edited by: Alok  Updated: 10/5/2023  Views: 53
Parse a file and create a data structure
Question:
We want to parse a file and build some kind of data structure from it for later use (in Python). The file content looks like this:
plan HELLO
feature A
measure X :
src = "Type ,Name"
endmeasure //X
measure Y :
src = "Type ,Name"
endmeasure //Y
feature Aa
measure AaX :
src = "Type ,Name"
"Type ,Name2"
"Type ,Name3"
endmeasure //AaX
measure AaY :
src = "Type ,Name"
endmeasure //AaY
feature Aab
.....
endfeature // Aab
endfeature //Aa
endfeature // A
feature B
......
endfeature //B
endplan
plan HOLA
endplan //HOLA
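So the file contains one or more plans, each plan contains one or more features, and each feature contains measures that carry information (src: type, name). A feature can also contain further nested features.
We need to parse the file and create a data structure that looks like this: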
plan (HELLO)
├── Feature A
│   ├── Measure X
│   ├── Measure Y
│   └── Feature Aa
│       ├── Measure AaX
│       ├── Measure AaY
│       └── Feature Aab
│           └── .......
└── Feature B
    └── ........
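One possible way to model the hierarchy sketched above is a few small dataclasses. This is only an illustration: the class names Plan, Feature and Measure and their field names are assumptions made for this sketch, not something the file format prescribes.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Measure:
    name: str
    # Each src entry is assumed to be a (type, name) pair.
    src: List[Tuple[str, str]] = field(default_factory=list)


@dataclass
class Feature:
    name: str
    measures: List[Measure] = field(default_factory=list)
    features: List["Feature"] = field(default_factory=list)  # nested features


@dataclass
class Plan:
    name: str
    features: List[Feature] = field(default_factory=list)


# A hand-built fragment of the HELLO plan from the sample file.
plan = Plan(
    "HELLO",
    features=[
        Feature(
            "A",
            measures=[Measure("X", [("Type", "Name")])],
            features=[Feature("Aa")],
        )
    ],
)
```

A parser would then only need to decide, for each opening line, which list of the enclosing object the new object belongs to.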
I am trying to parse the file line by line and build a list of lists that holds plan -> feature -> measure, feature:
def getplans(s):
    stack = [{}]
    stack_list = []
    for line in s.splitlines():
        if ": " in line:  # leaf
            temp_stack = {}
            key, value = line.split(": ", 1)
            key = key.replace("source","").replace("=","").replace("\"","").replace(";","")
            value = value.replace("\"","").replace(",","").replace(";","")
            temp_stack[key.strip()] = value.strip()
            stack_list.append(temp_stack)
            stack[-1]["MEASURED_VAL"] = stack_list
        elif line.strip()[:3] == "end":
            stack.pop()
            stack_list = []
        elif line.strip():
            collection, name, *_ = line.split()
            stack.append({})
            stack[-2].setdefault(collection, {})[name] = stack[-1]
    return stack[0]
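Answer:
Looking at the file, I would try converting the "plan" / "feature" / "measure" blocks to tags and then parsing the result with an HTML parser, for example beautifulsoup (or you could try the same with YAML and then use a YAML parser):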
text = """\
plan HELLO
feature A
measure X :
src = "Type ,Name"
endmeasure //X
measure Y :
src = "Type ,Name"
endmeasure //Y
feature Aa
measure AaX :
src = "Type ,Name"
"Type ,Name2"
"Type ,Name3"
endmeasure //AaX
measure AaY :
src = "Type ,Name"
"Type ,Name2"
"Type ,Name3"
endmeasure //AaY
feature Aab
.....
endfeature // Aab
endfeature //Aa
endfeature // A
feature B
......
endfeature //B
endplan
plan HOLA
endplan //HOLA"""
import re
from bs4 import BeautifulSoup

# Turn "plan X" / "feature X" / "measure X :" lines into opening tags.
data = re.sub(r"\b(plan|feature|measure)\s+([^:\s]+).*", r'<\g<1> name="\g<2>">', text)
# Turn "endplan" / "endfeature" / "endmeasure" lines into closing tags.
data = re.sub(r"\b(?:end)(plan|feature|measure).*", r"</\g<1>>", data)
# Wrap the quoted src values in a <src> tag.
data = re.sub(r'src\s*=\s*((?:"[^"]+"\s*)+)', r"<src>\g<1></src>", data)

soup = BeautifulSoup(data, "html.parser")

for m in soup.select("measure"):
    # find parent plan:
    print("Plan:", m.find_parent("plan")["name"])
    # find parent feature:
    print("Parent Feature:", m.find_parent("feature")["name"])
    print("Name:", m["name"])
    for line in m.text.splitlines():
        # use a separate name so the converted text in `data` is not overwritten
        row = list(map(str.strip, line.strip(' "').split(",")))
        if len(row) == 2:
            print(row)
The converted text will be:
<plan name="HELLO">
<feature name="A">
<measure name="X">
<src>"Type ,Name"
</src></measure>
<measure name="Y">
<src>"Type ,Name"
</src></measure>
<feature name="Aa">
<measure name="AaX">
<src>"Type ,Name"
"Type ,Name2"
"Type ,Name3"
</src></measure>
<measure name="AaY">
<src>"Type ,Name"
"Type ,Name2"
"Type ,Name3"
</src></measure>
<feature name="Aab">
.....
</feature>
</feature>
</feature>
<feature name="B">
......
</feature>
</plan>
<plan name="HOLA">
</plan>
And the output is:
Plan: HELLO
Parent Feature: A
Name: X
['Type', 'Name']
Plan: HELLO
Parent Feature: A
Name: Y
['Type', 'Name']
Plan: HELLO
Parent Feature: Aa
Name: AaX
['Type', 'Name']
['Type', 'Name2']
['Type', 'Name3']
Plan: HELLO
Parent Feature: Aa
Name: AaY
['Type', 'Name']
['Type', 'Name2']
['Type', 'Name3']
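If you would rather not depend on BeautifulSoup, the same tag conversion can probably be fed to the standard library's xml.etree.ElementTree instead, and the resulting tree turned into nested dictionaries. This is a sketch under assumptions: the helper names to_tree and to_dict are made up for illustration, the converted text is wrapped in an artificial <root> element because XML allows only one top-level element, and names or values containing characters such as "<" or "&" would need escaping first.

```python
import re
import xml.etree.ElementTree as ET


def to_tree(text):
    """Convert plan/feature/measure text to XML and parse it (hypothetical helper)."""
    data = re.sub(r"\b(plan|feature|measure)\s+([^:\s]+).*", r'<\g<1> name="\g<2>">', text)
    data = re.sub(r"\b(?:end)(plan|feature|measure).*", r"</\g<1>>", data)
    data = re.sub(r'src\s*=\s*((?:"[^"]+"\s*)+)', r"<src>\g<1></src>", data)
    # XML needs a single root element, so wrap everything in one.
    return ET.fromstring(f"<root>{data}</root>")


def to_dict(elem):
    """Recursively turn an element into {tag: {name: children}} dictionaries."""
    node = {}
    for child in elem:
        if child.tag == "src":
            # One [type, name] list per quoted line.
            node["src"] = [
                [part.strip() for part in line.strip(' "').split(",")]
                for line in (child.text or "").splitlines()
                if line.strip()
            ]
        else:
            node.setdefault(child.tag, {})[child.get("name")] = to_dict(child)
    return node


sample = """\
plan HELLO
feature A
measure X :
src = "Type ,Name"
      "Type ,Name2"
endmeasure //X
feature Aa
endfeature //Aa
endfeature // A
endplan
plan HOLA
endplan //HOLA"""

result = to_dict(to_tree(sample))
```

Here result["plan"]["HELLO"]["feature"]["A"] holds both the measures and the nested features of feature A, so the hierarchy can be walked with plain dictionary lookups.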
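I don't see why you would need to call replace, or where "source" and ";" come in, nor why you are trying to create a "MEASURED_VAL" key. But having seen your previous question, I would extend the previous answer by making a list for "src" so that it can collect multi-line data: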
def getplans(s):
    stack = [{}]
    stack_list = None
    for line in s.splitlines():
        if "=" in line:  # leaf
            key, value = line.split("=", 1)
            stack_list = [value.strip(' "')]  # create list for multiple entries
            stack[-1][key.strip()] = stack_list
        elif line.strip()[:3] == "end":
            stack.pop()
            stack_list = None
        elif stack_list is not None:  # continuation of leaf data
            stack_list.append(line.strip(' "'))  # extend the list for `src`
        elif line.strip():
            collection, name, *_ = line.split()
            stack.append({})
            stack[-2].setdefault(collection, {})[name] = stack[-1]
    return stack[0]
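The dictionary returned by getplans nests everything under the keys "plan", "feature" and "measure", so a small recursive generator is enough to list every measure together with the path that leads to it. The following is a sketch only: iter_measures is a hypothetical helper, and the tree literal is a hand-written example of the shape getplans is expected to produce.

```python
def iter_measures(node, path=()):
    """Yield (path, measure_name, src_list) for every measure below node."""
    for name, measure in node.get("measure", {}).items():
        yield path, name, measure.get("src", [])
    # Plans and features can both contain further blocks, so recurse into them.
    for kind in ("plan", "feature"):
        for name, child in node.get(kind, {}).items():
            yield from iter_measures(child, path + (name,))


# Hand-written example in the shape returned by getplans (an assumption).
tree = {
    "plan": {
        "HELLO": {
            "feature": {
                "A": {
                    "measure": {"X": {"src": ["Type ,Name"]}},
                    "feature": {
                        "Aa": {
                            "measure": {
                                "AaX": {"src": ["Type ,Name", "Type ,Name2"]}
                            }
                        }
                    },
                }
            }
        }
    }
}

for path, name, src in iter_measures(tree):
    print(" -> ".join(path), name, src)
```

The path tuple gives the enclosing plan followed by the chain of features, which covers the "find the parent" lookups from the first answer without a separate parser library.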
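It looks like you are trying to parse the structure of this file and build a convenient data structure that represents the hierarchy of plans, features and measures. Your current approach uses a stack to track the nesting, which is quite reasonable.
A few things to note:
Stripping strings such as "source", "=", the double quote and ";" from the keys and values looks unnecessary. Unless there is a specific reason for it, it is better to keep them in their original form to preserve the data.
It is important to handle the end of each block (for example "endmeasure" and "endfeature") correctly. Popping the top element off the stack whenever a block ends keeps the nesting correct.
Below is an updated version of the code that takes these points into account: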
def parse_file(s):
    data = {}
    stack = [data]  # stack[-1] is the block currently being filled
    for line in s.splitlines():
        line = line.strip()
        if line.startswith(("plan", "feature", "measure")):
            # Opening line: attach a new dict to the enclosing block,
            # so nested features end up inside their parent feature.
            name = line.split()[1]
            stack[-1][name] = {}
            stack.append(stack[-1][name])
        elif line.startswith(("endplan", "endfeature", "endmeasure")):
            # Closing line: go back to the enclosing block.
            stack.pop()
    return data
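This code creates a nested dictionary that mirrors the plans, features and measures in the input file. It records only the nesting; the "src" values are not stored. You can use this data structure for further processing as needed.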
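The two points above (keeping the data intact and handling block ends carefully) can be combined in one stack-based parser that also stores the "src" lines and checks that every end line closes the block that is actually open. This is a sketch under assumptions, not a definitive implementation: parse_checked and its error messages are hypothetical, continuation lines are assumed to start with a double quote, and any other line (such as the "....." placeholders) is silently ignored.

```python
def parse_checked(text):
    """Parse plan/feature/measure blocks and verify that end lines match."""
    root = {}
    stack = [("root", root)]  # (block kind, dict being filled)
    src = None                # list collecting the current src lines
    for lineno, raw in enumerate(text.splitlines(), 1):
        line = raw.strip()
        if not line:
            continue
        parts = line.split()
        word = parts[0]
        if word in ("plan", "feature", "measure"):
            if len(parts) < 2:
                raise ValueError(f"line {lineno}: {word} without a name")
            node = {}
            # "measure X:" and "measure X :" both give the name X.
            stack[-1][1].setdefault(word, {})[parts[1].rstrip(":")] = node
            stack.append((word, node))
            src = None
        elif word.startswith("end"):
            # "endfeature" may only close a block opened by "feature", etc.
            if len(stack) == 1 or stack[-1][0] != word[3:]:
                raise ValueError(f"line {lineno}: unexpected {word!r}")
            stack.pop()
            src = None
        elif word.startswith("src") and "=" in line:
            value = line.partition("=")[2]
            src = [value.strip(' "')]
            stack[-1][1]["src"] = src
        elif src is not None and line.startswith('"'):
            src.append(line.strip(' "'))  # continuation of a multi-line src
    if len(stack) != 1:
        raise ValueError(f"unclosed {stack[-1][0]!r} block at end of input")
    return root
```

Raising an error on a mismatched end line is a design choice: it surfaces a malformed file early instead of silently attaching later blocks to the wrong parent.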