Create nested dictionary based on some custom rules

Asked by: spectre  Asked: 10/26/2023  Last edited by: XMehdi01spectre  Updated: 11/7/2023  Views: 51

Q:

I have a Python dictionary that looks like this:

ip_dict = {
    "img_folder/144-64ee3d9bb7-3.png": "COMMERCIAL PROPERTY ",
    "img_folder/144-64ee3d9bb7-2.png": "CBIC COMMERCIAL ",
    "img_folder/144-64ee3d9bb7-4.png": "CBIC COMMERCIAL GENERAL",
    "img_folder/144-64ee3d9bb7-1.png": "Contractors Bonding",
    "img_folder/144-64ee3d9bb7-5.png": "CBIC",
    "img_folder/Excess-Liability-8.png": "  Energy laswance ",
    "img_folder/144-64ee3d9bb7-0.png": "CONTRACTORS BONDING AND INSURANCE ",
    "img_folder/Excess-Liability-10.png": "  FOLLOWING FORM",
    "img_folder/Excess-Liability-14.png": "  (2) property and",
    "img_folder/Excess-Liability-0.png": "  Energy ",
    "img_folder/Excess-Liability-5.png": "  The additional premium",
    "img_folder/Excess-Liability-3.png": "Ein Enos asurance Maral",
    "img_folder/Excess-Liability-4.png": "  IV. Conditions ",
    "img_folder/Excess-Liability-13.png": "  FOLLOWING FORM ",
    "img_folder/Excess-Liability-12.png": "  FOLLOWING FORM EXCESS",
    "img_folder/Excess-Liability-9.png": "  Surplus Lines",
    "img_folder/Excess-Liability-11.png": "  ALL OTHER TERMS",
    "img_folder/Excess-Liability-2.png": "  Il. Limit of",
    "img_folder/Excess-Liability-6.png": "  (G) Notice of",
    "img_folder/Excess-Liability-7.png": "Ss So Ss   The ",
    "img_folder/Excess-Liability-1.png": "eee ee ee"
}

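It contains text extracted from the pages of 2 different pdf files (144-64ee3d9bb7 and Excess-Liability). I want to convert the dictionary above into a nested dictionary in which each top-level key is a pdf name and the nested dictionary holds that pdf's entries, unchanged from above. So the output should look like this: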

op_dict = {
    "144-64ee3d9bb7.png": {
        "img_folder/144-64ee3d9bb7-3.png": "COMMERCIAL PROPERTY ",
        "img_folder/144-64ee3d9bb7-2.png": "CBIC COMMERCIAL ",
        "img_folder/144-64ee3d9bb7-4.png": "CBIC COMMERCIAL GENERAL",
        "img_folder/144-64ee3d9bb7-1.png": "Contractors Bonding",
        "img_folder/144-64ee3d9bb7-5.png": "CBIC",
        "img_folder/144-64ee3d9bb7-0.png": "CONTRACTORS BONDING AND INSURANCE "
    },
    "Excess-Liability.png": {
        "img_folder/Excess-Liability-8.png": "  Energy laswance ",
        "img_folder/Excess-Liability-10.png": "  FOLLOWING FORM",
        "img_folder/Excess-Liability-14.png": "  (2) property and",
        "img_folder/Excess-Liability-0.png": "  Energy ",
        "img_folder/Excess-Liability-5.png": "  The additional premium",
        "img_folder/Excess-Liability-3.png": "Ein Enos asurance Maral",
        "img_folder/Excess-Liability-4.png": "  IV. Conditions ",
        "img_folder/Excess-Liability-13.png": "  FOLLOWING FORM ",
        "img_folder/Excess-Liability-12.png": "  FOLLOWING FORM EXCESS",
        "img_folder/Excess-Liability-9.png": "  Surplus Lines",
        "img_folder/Excess-Liability-11.png": "  ALL OTHER TERMS",
        "img_folder/Excess-Liability-2.png": "  Il. Limit of",
        "img_folder/Excess-Liability-6.png": "  (G) Notice of",
        "img_folder/Excess-Liability-7.png": "Ss So Ss   The ",
        "img_folder/Excess-Liability-1.png": "eee ee ee"
    }
}

I tried the following logic, but it does not work as expected:

op_dict = {}
for key, value in ip_dict.items():
    doc_name = key.split("/")[-1]
    if doc_name not in op_dict:
        op_dict[doc_name] = {}
    op_dict[doc_name][key] = value

Any help is appreciated!

python dictionary nested data

A:

0 votes  Marcin Mrugas  10/26/2023  #1

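Your loop uses the full file name (page number included) as the key, so every page ends up in its own group. You also need to strip the trailing page number and then add the extension back to the file name.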

op_dict = {}
for key, value in ip_dict.items():
    doc_name_with_number = key.split("/")[-1]
    array_without_number = doc_name_with_number.split("-")[:-1]
    doc_name = "-".join(array_without_number)
    doc_name_with_extension = f"{doc_name}.png"
    if doc_name_with_extension not in op_dict:
        op_dict[doc_name_with_extension] = {}
    op_dict[doc_name_with_extension][key] = value
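
The same idea can probably be written a little more compactly with str.rsplit and dict.setdefault. This is only a sketch, not the code from the answer above: it assumes every key ends in "-<page number>.png", hard-codes the ".png" extension as the answer does, and uses a shortened subset of ip_dict from the question.

```python
# Subset of ip_dict from the question, shortened for illustration.
ip_dict = {
    "img_folder/144-64ee3d9bb7-3.png": "COMMERCIAL PROPERTY ",
    "img_folder/144-64ee3d9bb7-2.png": "CBIC COMMERCIAL ",
    "img_folder/Excess-Liability-8.png": "  Energy laswance ",
    "img_folder/Excess-Liability-10.png": "  FOLLOWING FORM",
}

op_dict = {}
for key, value in ip_dict.items():
    file_name = key.split("/")[-1]          # e.g. "144-64ee3d9bb7-3.png"
    # Split only on the LAST hyphen so hyphens inside the name survive.
    doc_name = file_name.rsplit("-", 1)[0]  # e.g. "144-64ee3d9bb7"
    # setdefault creates the inner dict the first time a name is seen.
    op_dict.setdefault(f"{doc_name}.png", {})[key] = value

print(list(op_dict))  # ['144-64ee3d9bb7.png', 'Excess-Liability.png']
```

Splitting from the right matters here because the document names themselves contain hyphens.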

0 votes  krisstinkou  10/26/2023  #2

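As I understand it, you need to remove the unique page number from the document name. You can do it like this (if you need the file extension):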

import re

op_dict = {}
for key, value in ip_dict.items():
    doc_name = key.split("/")[-1]
    doc_name = "".join(re.split(r"-\d+(\.\w+)$", doc_name))
    if doc_name not in op_dict:
        op_dict[doc_name] = {}
    op_dict[doc_name][key] = value

In this case you will get the following names: 144-64ee3d9bb7.png, Excess-Liability.png
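
Joining the pieces works because re.split includes the text of any capturing group in the list it returns. A small sketch of the intermediate values, using one file name from the question; the re.sub line at the end is an alternative I am suggesting, not part of the code above:

```python
import re

file_name = "Excess-Liability-10.png"

# The capturing group (\.\w+) keeps the extension in the result list;
# the "-10" part is matched but not captured, so it is dropped.
parts = re.split(r"-\d+(\.\w+)$", file_name)
print(parts)           # ['Excess-Liability', '.png', '']
print("".join(parts))  # Excess-Liability.png

# Alternative: substitute the whole match with just the captured extension.
same_name = re.sub(r"-\d+(\.\w+)$", r"\1", file_name)
print(same_name)       # Excess-Liability.png
```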

Or, if you only need the name (without the file extension):

import re

op_dict = {}
for key, value in ip_dict.items():
    doc_name = key.split("/")[-1]
    doc_name = re.split(r"-\d+\.\w+$", doc_name)[0]
    if doc_name not in op_dict:
        op_dict[doc_name] = {}
    op_dict[doc_name][key] = value

In this case you will get the following names: 144-64ee3d9bb7, Excess-Liability
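
If the grouping is needed in more than one place, it could be wrapped in a small function. The sketch below is one possible way to do that, assuming POSIX-style keys that end in "-<page number>.<extension>"; the name group_by_document is made up for illustration, and the sample data is a subset of ip_dict from the question.

```python
import re
from collections import defaultdict
from pathlib import PurePosixPath


def group_by_document(pages):
    """Group {page_path: text} by document name (hypothetical helper)."""
    grouped = defaultdict(dict)  # inner dicts are created on demand
    for path, text in pages.items():
        stem = PurePosixPath(path).stem        # e.g. "Excess-Liability-10"
        doc_name = re.sub(r"-\d+$", "", stem)  # e.g. "Excess-Liability"
        grouped[doc_name][path] = text
    return dict(grouped)


# Subset of ip_dict from the question, shortened for illustration.
sample = {
    "img_folder/144-64ee3d9bb7-0.png": "CONTRACTORS BONDING AND INSURANCE ",
    "img_folder/Excess-Liability-10.png": "  FOLLOWING FORM",
    "img_folder/Excess-Liability-1.png": "eee ee ee",
}

result = group_by_document(sample)
print(list(result))  # ['144-64ee3d9bb7', 'Excess-Liability']
```

Using defaultdict removes the explicit "if doc_name not in op_dict" check, and taking the stem first means the pattern does not depend on which extension the files have.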