Regex to parse malformed python dict

Asked by: Sandy  Asked: 3/28/2023  Last edited by: Sandy  Updated: 3/29/2023  Views: 75

Q:

I have a JSON list that looks like this:

{
    'title' : 'Lorem Ipsum',
    'title2' : '_(Lorem Ipsum)',
    'description' : 'Lorem Ipsum is simply dummy text of the printing and typesetting industry.  <br>' + Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.
         <br>' +
        'It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages',
    'urls' : [
        "https://www.lipsum.com/",
        "http://lv.lipsum.com/",
        "http://nl.lipsum.com/",
        "http://ro.lipsum.com/"
    ]
},

I want valid JSON, like this:

    {
        "title" : "Lorem Ipsum",
        "title2" : "_(Lorem Ipsum)",
        "description" : "Lorem Ipsum is simply dummy text of the printing and typesetting industry.  <br>' + Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.
             <br>' +
            'It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages",
        "urls" : [
            "https://www.lipsum.com/",
            "http://lv.lipsum.com/",
            "http://nl.lipsum.com/",
            "http://ro.lipsum.com/"
        ]
    }

I am using this Python regex, but it does not match the "description" and "urls" fields:

import re
regex = re.compile(r'\'([a-zA-Z]*)\' : \'(.*)\',', re.MULTILINE)
res = re.sub(regex, r'"\1": "\2",', d)

Tags: python, regex

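Comments

2 upvotes Silvio Mayolo 3/28/2023
You have some badly quoted JSON and want to correct the quoting? I don't envy you; this doesn't look like a fun job :(
0 upvotes Barmar 3/28/2023
This isn't JSON. It's Python code (except for the newlines inside the string).
1 upvote Vercingatorix 3/28/2023
Am I missing something? Why don't you use the json module and json.dumps()?
0 upvotes Tim Roberts 3/29/2023
Is that continuation line the only problem? The whole structure looks like valid Python code, and you can certainly use json.dumps as @Vercingatorix recommends.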
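
A minimal sketch of the route the commenters describe, assuming the text really were a valid Python literal. The posted input is not (its description string is broken across lines), so the `source` value below is a hypothetical cleaned-up variant, and the use of `ast.literal_eval` to read it is an assumption of this sketch rather than something stated in the thread:

```python
import ast
import json

# Hypothetical, well-formed variant of the input: every string literal is
# closed, and the multi-line description relies on Python's implicit
# concatenation of adjacent string literals instead of '+'.
source = """
{
    'title' : 'Lorem Ipsum',
    'description' : 'first part <br>'
                    'second part',
    'urls' : [
        "https://www.lipsum.com/",
        "http://lv.lipsum.com/"
    ]
}
"""

# ast.literal_eval only evaluates literals (no names, no function calls),
# which makes it a safer choice than eval() for text from elsewhere.
data = ast.literal_eval(source.strip())

# json.dumps then emits the double-quoted JSON form.
as_json = json.dumps(data, indent=4)
print(as_json)
```

Note that `ast.literal_eval` does not accept string concatenation written with `+`, so this only works when the producer's output can be brought into literal form first.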

A:

0 upvotes Robert 3/29/2023 #1

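You should not use a regex to "parse" this; a regex is the wrong tool for the job because it does not understand the structure/grammar of the input.

I would first try asking whoever creates this garbage output to write valid JSON or even XML, anything you can parse with standard tools.

If that is not possible, you can butcher it with regexes, but that will be an ugly hack that will most likely break silently if your input data changes in the wrong way. In particular, in this example it will break if the description field contains a single quote followed by a comma, or if any field or name contains a double quote.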

#!/usr/bin/env python3
import re
import json

garbage = """
{
    'title' : 'Lorem Ipsum',
    'title2' : '_(Lorem Ipsum)',
    'description' : 'Lorem Ipsum is simply dummy text of the printing and typesetting industry.  <br>' + Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.
         <br>' +
        'It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages',
    'urls' : [
        "https://www.lipsum.com/",
        "http://lv.lipsum.com/",
        "http://nl.lipsum.com/",
        "http://ro.lipsum.com/"
    ]
},
"""

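res = garbage
# Re-quote 'key' : 'value', pairs; re.DOTALL lets the value span several lines.
field_regex = re.compile(r"'([^']+)'\s*:\s*'(.+?)',", re.MULTILINE | re.DOTALL)
res = re.sub(field_regex, r'"\1": "\2",', res)
# Re-quote keys that introduce a list.
list_regex = re.compile(r"'([^']+)'\s*:\s*\[", re.MULTILINE)
res = re.sub(list_regex, r'"\1": [', res)
# Raw newlines are not valid inside JSON strings, so turn them into spaces.
# (The fourth positional argument of re.sub is count, not flags, so no flag is passed here.)
res = re.sub("\n", " ", res)
# Drop the trailing comma after the closing brace.
res = re.sub(r",\s*$", "", res)

j = json.loads(res)
print(json.dumps(j, indent=4))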

This butchers the garbage input, parses it as JSON, and then prints it:

{
    "title": "Lorem Ipsum",
    "title2": "_(Lorem Ipsum)",
    "description": "Lorem Ipsum is simply dummy text of the printing and typesetting industry.  <br>' + Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.          <br>' +         'It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages",
    "urls": [
        "https://www.lipsum.com/",
        "http://lv.lipsum.com/",
        "http://nl.lipsum.com/",
        "http://ro.lipsum.com/"
    ]
}

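Compared with your original code, I changed these details:

  • For readability: when a regex contains single quotes, define it with double quotes so that you do not have to escape the single quotes.

  • Replaced the match for the field name so that it looks for everything up to the closing single quote rather than for letters only, to make it more robust. Feel free to change it back.

  • Likewise replaced the single space with \s*, which matches zero or more whitespace characters, to try to be more robust and ignore whitespace changes.

  • Added re.DOTALL so that . also matches newlines.

  • Added another regex to match the list/array structure.

  • Added another regex to replace newlines with spaces, because control characters such as newlines are not valid inside JSON strings. I am assuming the newlines do not matter here. If they do, you need to replace them with escaped newlines (\n) instead.

  • Removed the trailing comma at the end of the input. You could also use a JSON parser that can be told to be lenient about trailing commas; Python's built-in json module is not one of them.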
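
If the newlines inside the values do matter, the point about escaped newlines could be handled roughly as below. This is only a sketch under assumptions: the `sample` text is a hypothetical shortened input, and `requote_field` is an illustrative helper that is not part of the code above. It escapes newlines only inside the captured value, so the newlines between fields stay ordinary whitespace. As an alternative, `json.loads(..., strict=False)` from the standard library accepts raw control characters inside strings.

```python
import json
import re

field_regex = re.compile(r"'([^']+)'\s*:\s*'(.+?)',", re.DOTALL)
list_regex = re.compile(r"'([^']+)'\s*:\s*\[")


def requote_field(match):
    """Re-quote one key/value pair, escaping newlines inside the value only."""
    value = match.group(2).replace("\n", "\\n")
    return f'"{match.group(1)}": "{value}",'


# Hypothetical shortened input with a raw newline inside the description.
sample = """{
    'description' : 'line one
line two',
    'urls' : ["https://www.lipsum.com/"]
}"""

# Variant 1: escape the newline while re-quoting, then parse strictly.
res = field_regex.sub(requote_field, sample)
res = list_regex.sub(r'"\1": [', res)
data = json.loads(res)
print(repr(data["description"]))

# Variant 2: leave the raw newline in place and let the parser tolerate
# control characters inside strings.
res2 = field_regex.sub(r'"\1": "\2",', sample)
res2 = list_regex.sub(r'"\1": [', res2)
data2 = json.loads(res2, strict=False)
print(data2 == data)
```

Both variants keep the line break in the parsed value; neither makes the regex approach any less fragile.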

Again, this is a terrible hack. It should not be used in production. Fix the producer of this output.
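
The two failure modes named earlier in this answer can be reproduced with a few lines. The inputs below are made-up one-field examples, and `requote` is a hypothetical wrapper around the same field regex; the sketch only illustrates how the hack goes wrong and is not a fix.

```python
import json
import re

field_regex = re.compile(r"'([^']+)'\s*:\s*'(.+?)',", re.DOTALL)


def requote(text):
    """Apply the field regex from the answer to a piece of text."""
    return field_regex.sub(r'"\1": "\2",', text)


# Case 1: the value contains a single quote followed by a comma.
# The lazy (.+?) stops at the first "'," so the value is cut short
# and the rest of the text is left behind unquoted.
truncated = requote("'note' : 'he said 'no', then left',")
print(truncated)

# Case 2: the value contains a double quote.
# It is copied through unescaped, so the result is not valid JSON.
broken = requote("'size' : '27\" screen',")
try:
    json.loads("{" + broken.rstrip(",") + "}")
    parse_failed = False
except json.JSONDecodeError:
    parse_failed = True
print(parse_failed)
```

In the second case the parser at least raises an error. Depending on what follows in the input, the first case can be harder to notice, which is why fixing the producer is the safer route.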