Parse a file and create a data structure

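Asked by: Alok · Asked: 10/5/2023 · Last edited by: Alok · Updated: 10/5/2023 · Views: 53

Q:

We want to parse a file and create some kind of data structure for later use (in Python). The contents of the file look like this: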

plan HELLO
   feature A 
       measure X :
          src = "Type ,Name"
       endmeasure //X

       measure Y :
        src = "Type ,Name"
       endmeasure //Y

       feature Aa
           measure AaX :
              src = "Type ,Name"
                    "Type ,Name2"
                    "Type ,Name3"
           endmeasure //AaX

           measure AaY :
              src = "Type ,Name"
           endmeasure //AaY
           
           feature Aab
              .....
           endfeature // Aab
         
       endfeature //Aa
 
   endfeature // A
   
   feature B
     ......
   endfeature //B
endplan

plan HOLA
endplan //HOLA

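So the file contains one or more plans, each plan contains one or more features, and each feature contains measures that hold information (src, type, name). A feature can also contain further features.

We need to parse the file and create a data structure with the following shape: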

                     plan (HELLO) 
            ------------------------------
             ↓                          ↓ 
          Feature A                  Feature B
  ----------------------------          ↓
   ↓           ↓             ↓           ........
Measure X    Measure Y    Feature Aa
                         ------------------------------
                            ↓           ↓             ↓ 
                       Measure AaX   Measure AaY   Feature Aab
                                                        ↓
                                                        .......
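
For illustration, the tree above could be modelled with a few small classes. This is only a sketch: the class names `Plan`, `Feature` and `Measure`, and the choice to store each src entry as a `(type, name)` tuple, are assumptions that do not come from the post.

```python
from dataclasses import dataclass, field


@dataclass
class Measure:
    name: str
    src: list = field(default_factory=list)  # (type, name) pairs


@dataclass
class Feature:
    name: str
    measures: list = field(default_factory=list)  # Measure objects
    features: list = field(default_factory=list)  # nested Feature objects


@dataclass
class Plan:
    name: str
    features: list = field(default_factory=list)  # top-level Feature objects


# Part of the HELLO plan from the diagram, built by hand.
feature_aa = Feature("Aa", measures=[
    Measure("AaX", [("Type", "Name"), ("Type", "Name2"), ("Type", "Name3")]),
    Measure("AaY", [("Type", "Name")]),
])
feature_a = Feature(
    "A",
    measures=[Measure("X", [("Type", "Name")]), Measure("Y", [("Type", "Name")])],
    features=[feature_aa],
)
hello = Plan("HELLO", features=[feature_a, Feature("B")])
```

Keeping measures and nested features in separate lists means a feature and a measure may share a name without overwriting each other.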

I am trying to parse the file line by line and build a list of lists containing plan -> feature -> measure, feature:

def getplans(s):
    stack = [{}]
    stack_list = []
    
    for line in s.splitlines():
        if ": " in line:  # leaf
            temp_stack = {}
            key, value = line.split(": ", 1)
            key = key.replace("source","").replace("=","").replace("\"","").replace(";","")
            value = value.replace("\"","").replace(",","").replace(";","")
            temp_stack[key.strip()] = value.strip()
            stack_list.append(temp_stack)
            stack[-1]["MEASURED_VAL"] = stack_list
        elif line.strip()[:3] == "end":
            stack.pop()
            stack_list = []
        elif line.strip():
            collection, name, *_ = line.split()
            stack.append({})
            stack[-2].setdefault(collection, {})[name] = stack[-1] 
    return stack[0]

Tags: python, list, data-structures, readlines, fileparse

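Comments

0 votes · Random Davis · 10/5/2023
Well, it looks like you forgot to ask a question, but when you remember, please edit your post to include it. See How to Ask for more details.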

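A:

0 votes · Andrej Kesely · 10/5/2023 · #1

Looking at the file, I would try to convert plan/feature/measure to tags and then parse the result with an HTML parser, e.g. beautifulsoup (or you could try to do the same with YAML and then use a YAML parser):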

text = """\
plan HELLO
   feature A
       measure X :
          src = "Type ,Name"
       endmeasure //X

       measure Y :
        src = "Type ,Name"
       endmeasure //Y

       feature Aa
           measure AaX :
              src = "Type ,Name"
                    "Type ,Name2"
                    "Type ,Name3"
           endmeasure //AaX

           measure AaY :
              src = "Type ,Name"
                    "Type ,Name2"
                    "Type ,Name3"
           endmeasure //AaY

           feature Aab
              .....
           endfeature // Aab

       endfeature //Aa

   endfeature // A

   feature B
     ......
   endfeature //B
endplan

plan HOLA
endplan //HOLA"""

import re

from bs4 import BeautifulSoup

data = re.sub(r"\b(plan|feature|measure)\s+([^:\s]+).*", r'<\g<1> name="\g<2>">', text)
data = re.sub(r"\b(?:end)(plan|feature|measure).*", r"</\g<1>>", data)
data = re.sub(r'src\s*=\s*((?:"[^"]+"\s*)+)', r"<src>\g<1></src>", data)

soup = BeautifulSoup(data, "html.parser")

for m in soup.select("measure"):
    # find parent PLAN:
    print("Plan:", m.find_parent("plan")["name"])
    # find the enclosing FEATURE:
    print("Parent Feature:", m.find_parent("feature")["name"])
    print("Name:", m["name"])
    for line in m.text.splitlines():
        data = list(map(str.strip, line.strip(' "').split(",")))
        if len(data) == 2:
            print(data)

The transformed text will be:

<plan name="HELLO">
   <feature name="A">
       <measure name="X">
          <src>"Type ,Name"
       </src></measure>
                                                    
       <measure name="Y">
        <src>"Type ,Name"
       </src></measure>
                                                    
       <feature name="Aa">
           <measure name="AaX">
              <src>"Type ,Name"                
                    "Type ,Name2"
                    "Type ,Name3"
           </src></measure>

           <measure name="AaY">
              <src>"Type ,Name"
                    "Type ,Name2"
                    "Type ,Name3"
           </src></measure>

           <feature name="Aab">
              .....
           </feature>

       </feature>

   </feature>

   <feature name="B">
     ......
   </feature>
</plan>

<plan name="HOLA">
</plan>

And the output:

Plan: HELLO
Parent Feature: A
Name: X
['Type', 'Name']
Plan: HELLO
Parent Feature: A
Name: Y
['Type', 'Name']
Plan: HELLO
Parent Feature: Aa
Name: AaX
['Type', 'Name']
['Type', 'Name2']
['Type', 'Name3']
Plan: HELLO
Parent Feature: Aa
Name: AaY
['Type', 'Name']
['Type', 'Name2']
['Type', 'Name3']
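
If a nested dictionary is wanted rather than printed lines, the same tag-conversion idea can be sketched with the standard library's `xml.etree.ElementTree` in place of BeautifulSoup. This is a swap of parser, not the approach shown above. The helper names `to_tree` and `to_dict` and the synthetic `<root>` wrapper are assumptions for illustration, and the sketch assumes block names contain no XML special characters such as `&` or `<`.

```python
import re
import xml.etree.ElementTree as ET


def to_tree(text):
    """Rewrite plan/feature/measure blocks as XML tags and parse them."""
    data = re.sub(r"\b(plan|feature|measure)\s+([^:\s]+).*", r'<\g<1> name="\g<2>">', text)
    data = re.sub(r"\bend(plan|feature|measure).*", r"</\g<1>>", data)
    data = re.sub(r'src\s*=\s*((?:"[^"]+"\s*)+)', r"<src>\g<1></src>", data)
    # XML allows a single top-level element, so wrap all plans in a made-up <root>.
    return ET.fromstring(f"<root>{data}</root>")


def to_dict(elem):
    """Turn plans/features into nested dicts and measures into lists of [type, name]."""
    result = {}
    for child in elem:
        if child.tag == "measure":
            src = child.find("src")
            lines = src.text.splitlines() if src is not None and src.text else []
            result[child.get("name")] = [
                [part.strip() for part in line.strip(' "').split(",")]
                for line in lines
                if line.strip()
            ]
        else:
            result[child.get("name")] = to_dict(child)
    return result
```

Calling `to_dict(to_tree(text))` returns one top-level key per plan. Placeholder lines such as `.....` end up as element text and are ignored by `to_dict`.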
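
0 votes · trincot · 10/5/2023 · #2

I don't understand why you need the calls to replace, or source, or ;, nor why you try to create a MEASURED_VAL key. But having seen your previous question, I would extend the previous answer by making src a list, so that it can collect multi-line data: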

def getplans(s):
    stack = [{}]
    stack_list = None
    
    for line in s.splitlines():
        if "=" in line:  # leaf
            key, value = line.split("=", 1)
            stack_list = [value.strip(' "')]  # create list for multiple entries
            stack[-1][key.strip()] = stack_list
        elif line.strip()[:3] == "end":
            stack.pop()
            stack_list = None
        elif stack_list is not None:  # continuation of leaf data
            stack_list.append(line.strip(' "'))  # extend the list for `src`
        elif line.strip():
            collection, name, *_ = line.split()
            stack.append({})
            stack[-2].setdefault(collection, {})[name] = stack[-1] 
    return stack[0]
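
A possible variation on the function above, sketched under a few assumptions: each src entry is split into a `(type, name)` tuple, and lines with fewer than two words (such as the `.....` placeholders in the sample) are skipped instead of raising an error on unpacking. The name `parse_plans` and the tuple representation are illustrative choices, not part of the answer.

```python
def parse_plans(s):
    root = {}
    stack = [root]   # stack[-1] is the block currently being filled
    entries = None   # list receiving the src entries of the current measure

    def pair(raw):
        # '"Type ,Name"' -> ("Type", "Name")
        kind, _, name = raw.strip(' "').partition(",")
        return kind.strip(), name.strip()

    for line in s.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.startswith("end"):
            stack.pop()
            entries = None
        elif "=" in stripped:  # first line of a leaf such as src
            key, value = stripped.split("=", 1)
            entries = [pair(value)]
            stack[-1][key.strip()] = entries
        elif entries is not None:  # continuation line of the leaf
            entries.append(pair(stripped))
        else:
            parts = stripped.split()
            if len(parts) < 2:
                continue  # placeholder or unrecognised line
            collection, name = parts[:2]
            child = {}
            stack[-1].setdefault(collection, {})[name] = child
            stack.append(child)
    return root
```

The result keeps the same shape as before, for example `result["plan"]["HELLO"]["feature"]["A"]["measure"]["X"]["src"]`, with a list of tuples at the leaf.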
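
0 votes · errorcode505 · 10/5/2023 · #3

It looks like you are trying to parse the structure of this file and build a convenient data structure that represents the hierarchy of plans, features, and measures. Your current approach uses a stack to keep track of the nesting, which is quite reasonable.

A few points to note:

Removing "source", "=", double quotes, and ";" from the keys and values looks somewhat unnecessary. Unless there is a specific reason, it is better to keep them in their original form to preserve the data.

It is important to handle the end of each block (e.g. "endmeasure" and "endfeature") correctly. Popping an element from the stack whenever the end of a block is reached keeps the nesting correct.

Here is an updated version of the code that takes these points into account: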

def parse_file(s):
    data = {}
    stack = [data]  # stack[-1] is always the block currently being filled

    for line in s.splitlines():
        line = line.strip()

        if line.startswith(("plan", "feature", "measure")):
            # Attach the new block to its enclosing block rather than to a
            # fixed plan/feature path, so nested features land under their parent.
            name = line.split()[1]
            stack[-1][name] = {}
            stack.append(stack[-1][name])
        elif line.startswith(("endplan", "endfeature", "endmeasure")):
            stack.pop()

    return data

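This code creates a data structure that mirrors the plans, features, and measures in the input file. You can use this structure for further data manipulation as needed.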
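
The point about block ends can be taken one step further by checking that every end keyword closes the block that is actually open. The following is a minimal sketch of such a check; the function name `check_nesting` and the wording of the messages are assumptions made for illustration.

```python
def check_nesting(s):
    """Return a list of (line_number, message) for unbalanced blocks."""
    opened = []    # stack of (keyword, name, line_number) for open blocks
    problems = []

    for number, line in enumerate(s.splitlines(), start=1):
        words = line.split()
        if not words:
            continue
        keyword = words[0]
        if keyword in ("plan", "feature", "measure") and len(words) > 1:
            opened.append((keyword, words[1], number))
        elif keyword in ("endplan", "endfeature", "endmeasure"):
            if not opened:
                problems.append((number, f"{keyword} without an open block"))
                continue
            kind, name, start = opened.pop()
            if kind != keyword[3:]:  # "endfeature"[3:] == "feature"
                problems.append(
                    (number, f"{keyword} closes {kind} {name} opened on line {start}")
                )

    # Anything still on the stack was never closed.
    for kind, name, start in opened:
        problems.append((start, f"{kind} {name} is never closed"))
    return problems
```

An empty list means the blocks are balanced. Running this before parsing makes it easier to tell a malformed file apart from a parser bug.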