Python - Convert and filter structured text into object

Q:

Current problem: I'm working with a set of data files that basically look like this:

{39107,
    {31685,
        {   f24c4ec6-1e59-47a0-9736-8c823eda0d28,
            "N",
            7
        },
        {   c71dce36-4295-49e4-be03-7c60969b96c3,
            "A",
            8
        },
        {   f80fce14-f001-4b20-84d5-7a00f0788f6b,
            "A",
            9
        },
    }
}

and

{0,
    {4659,
        {
                        7c90ea6a-12f5-4c54-bfe0-e38120a6e364,
                        "fieldname27472",
                        "N",
                        27472,
                        "",
                        {3,
                                {"field1",
                                        0,
                                        {1,
                                                {
                                                        "B",
                                                        16,
                                                        0,
                                                        "",
                                                        0
                                                }
                                        },
                                        "",
                                        0
                                },
                                {"field2",
                                        0,
                                        {1,
                                                {
                                                "T",
                                                0,
                                                0,
                                                "",
                                                0}
                                        },
                                        "",
                                        0
                                },
                                {"field3",
                                        0,
                                        {1,
                                                {
                                                        "L",
                                                        0,
                                                        0,
                                                        "",
                                                        0
                                                }
                                        },
                                        "",
                                        0
                                },
                        },
                        {0},
                        {1,
                                {
                                        edcba,
                                        "ByID",
                                        abcde,
                                        1,
                                        {1,
                                                "ID"
                                        },
                                        1,
                                        0,
                                        0
                                }
                        },
                        1,
                        "S",
                        {0},
                        {0},
                        "",
                        0,
                        0
                }
        }
}

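The number in front of a dataset, e.g. 4659, is the number of data containers that follow. Some values are not enclosed in quotes, such as the UUIDs in this example, or random strings.

My goal is to convert these data structures into Python objects (such as lists or tuples) and then convert those to JSON for external processing.

Right now I have a two-stage process. Stage 1 does the initial conversion and data evaluation. Stage 2 filters the data, removing redundant values (such as the element count that precedes the actual elements) and unnecessary list nesting.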

import json

file = 'stack1.json'

def stage1(msg):
    buffer = ''
    st,fh,delim,encase = '[',']',',', '"'
    msg = msg.translate(str.maketrans('{}',st+fh)).replace('\n', '').replace('\r', '').replace('\t', '')
    while True:
        fhpos = msg.find(fh)
        if fhpos >= 0:
            head = msg[:fhpos+1]
            if head:
                stpos = head.rfind(st)
                if stpos>=0:
                    teststring = head[stpos+1:fhpos].split(delim)
                    for idx,sent in enumerate(teststring):
                        if not (sent.startswith(encase) or sent.endswith(encase)) or sent.count('-') == 4:
                            teststring[idx] = (f'"{teststring[idx]}"')
                            break
                    buffer+= head[:stpos+1]+','.join(teststring)+fh
                else: buffer+=fh
            msg = msg[fhpos+1:]
        else:
            break
    return buffer

def stage2(lst):
    if not any([isinstance(i,list) for i in lst]):
        return tuple(lst)
    if not isinstance(lst[0],list) and all([isinstance(j,list) for j in lst[1:]]):
        lst = stage2(lst[1:])
        if all([isinstance(j,(list,tuple)) for j in lst]) and len(lst) == 1:
            lst, = lst
    for idx,i in enumerate(lst):
        if isinstance(i,list):
            lst[idx] = stage2(i)
        else:
            continue
    return stage2(lst)

with open(file, 'r') as f:
    data = f.read()
    try:
        s1 = stage1(data)
        print("STAGE1\n",s1)
        s2 = stage2(json.loads(s1))
        print("STAGE2\n",json.dumps(s2, indent=2))
    except Exception as e: print(e)

Current results:

Example 1:

STAGE1
[39107,[31685,["f24c4ec6-1e59-47a0-9736-8c823eda0d28","N",7],["c71dce36-4295-49e4-be03-7c60969b96c3","A",8],["f80fce14-f001-4b20-84d5-7a00f0788f6b","A",9]]]
STAGE2
 [
  [
    "f24c4ec6-1e59-47a0-9736-8c823eda0d28",
    "N",
    7
  ],
  [
    "c71dce36-4295-49e4-be03-7c60969b96c3",
    "A",
    8
  ],
  [
    "f80fce14-f001-4b20-84d5-7a00f0788f6b",
    "A",
    9
  ]
]

Example 2:

STAGE1
[0,[4659,[7c90ea6a-12f5-4c54-bfe0-e38120a6e364,"fieldname27472","N",27472,"",[3,[aa-aa-a-a-a,"field1",0,[1,["B","16",0,"",0]]],["field2",0,[1,["T","0",0,"",0]]],["field3",0,[1,["L","0",0,"",0]]]],["0"],[1,[edcba,"ByID",abcde,1,["1","ID"]]],1,"S",["0"],["0"]]]]
STAGE2
Expecting ',' delimiter: line 1 column 12 (char 11)

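Example 2 fails because not all values are quoted.

Which libraries might be suitable for this? The datasets are fairly large: the first example is currently ~5M characters, and stage1 takes up to 1 minute to process it.

Future problem: what is the best way to convert and filter data like this? I think converting AND filtering in the same pass would be faster than doing several full scans. I have read about PLY and PEG, but I don't think they are the right tools for this job.

json python-3.x parsing data-structures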

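A:

The input looks close to JSON, with these differences to fix:

Braces that serve as array boundaries should be replaced with square brackets.
Trailing commas (after the last array element) should be removed.
Hex values (possibly including hyphens) should be wrapped in quotes. (Alternatively, they could be encoded with a 0x prefix, but longer digit sequences would have to be broken into multiple parts, so I would not do that.)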

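I also assume:

An opening brace that acts as an array boundary always appears at the start of a line (ignoring spacing).
A closing brace that acts as an array boundary (possibly followed immediately by a trailing comma) always appears at the end of a line.
An unquoted hex value appears at the start of a line (ignoring spacing), except that one opening brace may precede it.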
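These layout assumptions can be tested on a real file before relying on them. The following is only a minimal sketch: `check_layout`, `QUOTED` and `LEADING` are hypothetical names, and it additionally assumes that quoted strings never contain an escaped quote.

```python
import re

# Hypothetical helper: flags lines that break the three layout assumptions.
QUOTED = re.compile(r'"[^"]*"')              # assumes no escaped quotes inside strings
LEADING = re.compile(r'^\s*\{?\s*[\w\-]*')   # indentation, optional '{', first bare token

def check_layout(text):
    """Return (line_number, reason) pairs for lines that break the assumptions."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        # Blank out quoted strings so braces and letters inside them are ignored.
        bare = QUOTED.sub('""', line)
        # 1. Every '{' must be the first non-blank character of its line.
        if '{' in bare.lstrip()[1:]:
            problems.append((number, "opening brace not at start of line"))
        # 2. Every '}' must end its line, optionally followed by one comma
        #    (trailing whitespace counts as a violation here).
        if re.search(r'\}(?!,?$)', bare):
            problems.append((number, "closing brace not at end of line"))
        # 3. Unquoted words may only come first, after an optional '{'.
        rest = LEADING.sub('', bare, count=1)
        if re.search(r'[A-Za-z\-]', rest):
            problems.append((number, "unquoted value not at start of line"))
    return problems
```

An empty result means no line was flagged. Note that a negative number that is not the first value on its line would be reported by the third check, because of its hyphen.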

If all of these assumptions hold, then the following should work:

import re
import json

def process(s):
    # replace braces with square brackets
    s = re.sub(r"^(\s*){\n?", r"\1[\n", s, flags=re.M)
    s = re.sub(r"}(,?)$", r"]\1", s, flags=re.M)
    # remove trailing commas (not valid in JSON)
    s = re.sub(r",$(\s+])", r"\1", s, flags=re.M)
    # wrap hex in quotes
    s = re.sub(r'^(\s*)(?=.*[\-a-z])([\w\-]+)', r'\1"\2"', s, flags=re.M)
    return json.loads(s)

with open("stack.json", 'r') as f:
    data = process(f.read())
    print(data)
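The regex version depends on the line layout and does not do the filtering that the question's stage 2 performs. As an alternative, here is a rough sketch (not part of the approach above) of a small tokenizer that converts and filters in one pass. The names `TOKEN`, `convert`, `strip_count` and `parse` are hypothetical. It assumes that strings contain no escaped quotes, that numbers are plain integers, and that a leading non-container value followed only by containers is a count that can be dropped, which mirrors the rule in the question's `stage2`.

```python
import re

# One token is a brace, a comma, a quoted string, or a run of other characters.
TOKEN = re.compile(r'[{},]|"[^"]*"|[^\s{},"]+')

def convert(token):
    """Turn one non-structural token into a Python value."""
    if token.startswith('"'):
        return token[1:-1]              # quoted string, quotes removed
    try:
        return int(token)               # plain integer
    except ValueError:
        return token                    # bare word such as a UUID stays a string

def strip_count(items):
    """Drop a leading count when everything after it is a container."""
    if (len(items) > 1 and not isinstance(items[0], list)
            and all(isinstance(x, list) for x in items[1:])):
        items = items[1:]
        if len(items) == 1:             # only one container left: unwrap it
            return items[0]
    return items

def parse(text):
    """Return the list of top-level values found in text."""
    stack = [[]]                        # stack[0] collects the top-level values
    for match in TOKEN.finditer(text):
        token = match.group()
        if token == '{':
            stack.append([])            # start a new container
        elif token == '}':
            if len(stack) == 1:
                raise ValueError("unexpected closing brace")
            finished = strip_count(stack.pop())
            stack[-1].append(finished)  # filter as soon as a container closes
        elif token != ',':              # commas, including trailing ones, are skipped
            stack[-1].append(convert(token))
    if len(stack) != 1:
        raise ValueError("unbalanced braces")
    return stack[0]
```

For a file holding a single top-level container, `json.dumps(parse(text)[0])` would then give the JSON for it. Because commas are simply skipped, trailing commas need no special handling, and unquoted values may appear anywhere on a line. The count heuristic can misfire if a real value happens to be followed only by containers, so it is worth checking against the actual data.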
