Regex to parse malformed python dict

Q:

I have a JSON-like list that looks like this:

{
    'title' : 'Lorem Ipsum',
    'title2' : '_(Lorem Ipsum)',
    'description' : 'Lorem Ipsum is simply dummy text of the printing and typesetting industry.  <br>' + Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.
         <br>' +
        'It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages',
    'urls' : [
        "https://www.lipsum.com/",
        "http://lv.lipsum.com/",
        "http://nl.lipsum.com/",
        "http://ro.lipsum.com/"
    ]
},

I want valid JSON, like this:

    {
        "title" : "Lorem Ipsum",
        "title2" : "_(Lorem Ipsum)",
        "description" : "Lorem Ipsum is simply dummy text of the printing and typesetting industry.  <br>' + Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.
             <br>' +
            'It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages",
        "urls" : [
            "https://www.lipsum.com/",
            "http://lv.lipsum.com/",
            "http://nl.lipsum.com/",
            "http://ro.lipsum.com/"
        ]
    }

I am using this Python regex, but it does not match the "description" and "urls" fields:

import re
# d holds the malformed text shown above
regex = re.compile(r'\'([a-zA-Z]*)\' : \'(.*)\',', re.MULTILINE)
res = re.sub(regex, r'"\1": "\2",', d)

python regex

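If fixing whatever produces this data is not possible, you can butcher it with regular expressions, but it will be an ugly hack that is likely to break silently if your input data changes in the wrong way. In particular, in this example it will break if the description field contains a single quote followed by a comma, or if any field value or name contains a double quote.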
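To make the first failure mode concrete, here is a minimal sketch that applies the same field pattern used in the script below to a made-up line. The input string `tricky` is hypothetical and not taken from the question.

```python
import re

# Same field pattern as in the script below.
field_regex = re.compile(r"'([^']+)'\s*:\s*'(.+?)',", re.MULTILINE | re.DOTALL)

# Hypothetical input: the value itself contains a single quote followed by a comma.
tricky = "'description' : 'He said 'hi', then left',"

# The lazy .+? stops at the first "'," it sees, so the captured value is cut short.
captured = field_regex.search(tricky).group(2)
result = field_regex.sub(r'"\1": "\2",', tricky)

print(captured)
print(result)
```

The captured value ends at `He said 'hi`, and the rest of the sentence is left behind outside the quotes, so the rewritten text is no longer the field the producer intended.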

#!/usr/bin/env python3
import re
import json

garbage = """
{
    'title' : 'Lorem Ipsum',
    'title2' : '_(Lorem Ipsum)',
    'description' : 'Lorem Ipsum is simply dummy text of the printing and typesetting industry.  <br>' + Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.
         <br>' +
        'It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages',
    'urls' : [
        "https://www.lipsum.com/",
        "http://lv.lipsum.com/",
        "http://nl.lipsum.com/",
        "http://ro.lipsum.com/"
    ]
},
"""

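res = garbage
# Re-quote 'name' : 'value', pairs; DOTALL lets the lazy .+? run across line breaks.
field_regex = re.compile(r"'([^']+)'\s*:\s*'(.+?)',", re.MULTILINE | re.DOTALL)
res = re.sub(field_regex, r'"\1": "\2",', res)
# Re-quote names that introduce a list, e.g. 'urls' : [
list_regex = re.compile(r"'([^']+)'\s*:\s*\[", re.MULTILINE)
res = re.sub(list_regex, r'"\1": [', res)
# Raw newlines are not allowed inside JSON strings, so flatten them to spaces.
res = re.sub(r"\n", " ", res)
# Drop the trailing comma after the closing brace.
res = re.sub(r",\s*$", "", res)

j = json.loads(res)
print(json.dumps(j, indent=4))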

This butchers the garbage input, parses it as JSON, and then prints it:

{
    "title": "Lorem Ipsum",
    "title2": "_(Lorem Ipsum)",
    "description": "Lorem Ipsum is simply dummy text of the printing and typesetting industry.  <br>' + Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.          <br>' +         'It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages",
    "urls": [
        "https://www.lipsum.com/",
        "http://lv.lipsum.com/",
        "http://nl.lipsum.com/",
        "http://ro.lipsum.com/"
    ]
}

Compared to your original code, I changed these details:

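For readability: when a regex contains single quotes, define it in a double-quoted string so that you do not have to escape the single quotes.
Replaced the field-name match so that it looks for anything that is not a single quote instead of letters only, to make it more robust. Feel free to change it back.
Again, replaced the single space with \s*, which matches zero or more whitespace characters, to be more robust against whitespace changes.
Added re.DOTALL so that . also matches newlines.
Added another regex to match the list/array structure.
Added another substitution to replace newlines with spaces, because control characters such as newlines are not valid inside JSON strings. I assume the newlines do not matter here. If they do, you need to replace them with escaped newlines (\n) instead.
Removed the trailing comma at the end of the input. Alternatively, a lenient parser that tolerates trailing commas could be used. Note that Python's built-in json module does not accept them, and its strict=False option only permits control characters inside strings.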
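If the newlines do matter, or a value may contain double quotes, one possible variation is to let json.dumps do the quoting of each captured group through a replacement function. This is only a sketch: the helper name `quote_field` and the input `sample` are made up for illustration, and the pattern still has the single-quote-plus-comma weakness described above.

```python
import json
import re

field_regex = re.compile(r"'([^']+)'\s*:\s*'(.+?)',", re.DOTALL)

def quote_field(match):
    # Hypothetical helper: json.dumps escapes newlines, double quotes and
    # backslashes in the captured text, so the result is a valid JSON string.
    return f"{json.dumps(match.group(1))}: {json.dumps(match.group(2))},"

# Made-up input: the value spans two lines and contains double quotes.
sample = """{
    'text' : 'say "hi"
and bye',
    "n" : 1
}"""

fixed = field_regex.sub(quote_field, sample)
parsed = json.loads(fixed)
print(parsed["text"])
```

Because the escaping happens only inside the captured values, the line breaks between fields stay as ordinary JSON whitespace and do not need to be flattened.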

Again, this is a horrible hack. It should not be used in production. Fix the producer of this output.
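What fixing the producer looks like depends on what generates the data, which the question does not say. As a sketch, assuming the producer is Python code that already holds the record as a dict (the `record` below is a shortened, hypothetical stand-in), serializing with json.dumps instead of building the text by hand avoids the problem entirely:

```python
import json

# Hypothetical record; in a real producer this would be the actual data.
record = {
    "title": "Lorem Ipsum",
    "title2": "_(Lorem Ipsum)",
    "description": "It's the industry's dummy text. <br>\nIt has survived five centuries.",
    "urls": ["https://www.lipsum.com/", "http://lv.lipsum.com/"],
}

# json.dumps handles quoting and escaping, including quotes and newlines.
text = json.dumps(record, indent=4)

# The output parses back to the same data without any regex cleanup.
round_tripped = json.loads(text)
print(round_tripped == record)
```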
