Parse XML file and output JSON with Python

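Question:

I am very new to Python. I am currently trying to parse XML files, extract their information, and print it as JSON.

I have managed to parse the XML files, but I can't print them as JSON. Also, my printjson function does not run over all the results; it only prints once. The parse function works and runs over all the input files, but printjson does not. My code is below.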

from xml.dom import minidom
import os
import json

#input multiple files
def get_files(d):
        return [os.path.join(d, f) for f in os.listdir(d) if os.path.isfile(os.path.join(d,f))]

#parse xml
def parse(files):
    for xml_file in files:
        
        #indentify all xml files
        tree = minidom.parse(xml_file)

        #Get some details
        NCT_ID = ("NCT ID : %s" % tree.getElementsByTagName("nct_id")[0].firstChild.data)
        brief_title = ("brief title : %s" % tree.getElementsByTagName("brief_title")[0].firstChild.data)
        official_title = ("official title : %s" % tree.getElementsByTagName("official_title")[0].firstChild.data)

        return NCT_ID,brief_title,official_title

#print result in json
def printjson(results):
        for result in results:
                output_json = json.dumps(result)
                print(output_json)

printjson(parse(get_files('my files path')))

Output when running the file:

"NCT ID : NCT00571389"
"brief title : Isolation and Culture of Immune Cells and Circulating Tumor Cells From Peripheral Blood and Leukapheresis Products"
"official title : A Study to Facilitate Development of an Ex-Vivo Device Platform for Circulating Tumor Cell and Immune Cell Harvesting, Banking, and Apoptosis-Viability Assay"

Expected output:

{
"NCT ID" : "NCT00571389",
"brief title" : "Isolation and Culture of Immune Cells and Circulating Tumor Cells From Peripheral Blood and Leukapheresis Products",
"official title" : "A Study to Facilitate Development of an Ex-Vivo Device Platform for Circulating Tumor Cell and Immune Cell Harvesting, Banking, and Apoptosis-Viability Assay"
}

The sample XML files I am using come from the COVID-19 Clinical Trials dataset, which can be found on Kaggle.

python json dom xml-parsing


def parse(files):
    results = []
    for xml_file in files:

        # Parse each XML file
        tree = minidom.parse(xml_file)
        dicJson = {}
        dicJson["NCT ID"] = tree.getElementsByTagName("nct_id")[0].firstChild.data
        dicJson["brief title"] = tree.getElementsByTagName("brief_title")[0].firstChild.data
        dicJson["official title"] = tree.getElementsByTagName("official_title")[0].firstChild.data
        # Collect one dictionary per file instead of keeping only the last one
        results.append(dicJson)
    return results

And in the printJson function:

def printJson(results):
    # json.dumps returns the data as a JSON string; print it (or write it to a file).
    print(json.dumps(results))
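
To write the result to a file rather than print it, `json.dump` (without the trailing `s`) takes an open file object. The following is only a sketch: the `write_json` name and its `path` argument are placeholders chosen for illustration, not part of the original code.

```python
import json

def write_json(results, path):
    """Write results (a dict or a list of dicts) to path as pretty-printed JSON."""
    with open(path, "w", encoding="utf-8") as fh:
        # ensure_ascii=False keeps non-ASCII characters readable in the file
        json.dump(results, fh, indent=4, ensure_ascii=False)
```

Calling `write_json(parse(get_files('my files path')), 'output.json')` would then store everything in one file; the `output.json` name is likewise just an example.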

1 upvote  Lee Kai Xuan  12/15/2022  #2

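The problem is that your parse function returns too early: it returns right after getting the details from the first XML file. Instead, it should return a list of dictionaries that stores this information, so that each item in the list represents a different file and each dictionary contains the necessary information about the corresponding XML file.

Here is the updated code: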

def parse(files):
    xml_information = []
    for xml_file in files:

        # Parse each XML file
        tree = minidom.parse(xml_file)

        # Get some details (raw values only; the labels become the dictionary keys)
        NCT_ID = tree.getElementsByTagName("nct_id")[0].firstChild.data
        brief_title = tree.getElementsByTagName("brief_title")[0].firstChild.data
        official_title = tree.getElementsByTagName("official_title")[0].firstChild.data

        xml_information.append({"NCT ID": NCT_ID, "brief title": brief_title, "official title": official_title})
    return xml_information

def printresults(results):
        for result in results:
                print(result)

printresults(parse(get_files('my files path')))

If you absolutely want the output formatted as JSON, you can call json.dumps on each dictionary.
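
A minimal sketch of that last step, assuming each result is a plain dictionary of strings like the ones built above. The `to_json` helper name is made up for illustration, and the sample record simply reuses the values from the question; `indent` is a standard `json.dumps` parameter that pretty-prints the output.

```python
import json

def to_json(record):
    """Serialize one result dictionary to a pretty-printed JSON string."""
    return json.dumps(record, indent=4)

# Sample record standing in for one parsed XML file (values from the question).
sample = {
    "NCT ID": "NCT00571389",
    "brief title": "Isolation and Culture of Immune Cells and Circulating Tumor Cells From Peripheral Blood and Leukapheresis Products",
}

print(to_json(sample))
```

Inside printresults, replacing `print(result)` with `print(json.dumps(result, indent=4))` would have the same effect for every file.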

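Note: if you have a lot of XML files, I recommend using yield in the function instead of returning the whole list of dictionaries, to improve speed and performance.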
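
One way the yield suggestion could look, as a sketch rather than the answerer's exact code. The `iter_parse` and `text_of` names are hypothetical, and `text_of` assumes each tag holds plain text; it returns None when a tag is missing or empty instead of raising an error.

```python
from xml.dom import minidom

def text_of(tree, tag):
    """Return the text of the first <tag> element, or None if it is missing or empty."""
    nodes = tree.getElementsByTagName(tag)
    if nodes and nodes[0].firstChild is not None:
        return nodes[0].firstChild.data
    return None

def iter_parse(files):
    """Yield one dictionary per XML file instead of building a full list."""
    for xml_file in files:
        tree = minidom.parse(xml_file)
        yield {
            "NCT ID": text_of(tree, "nct_id"),
            "brief title": text_of(tree, "brief_title"),
            "official title": text_of(tree, "official_title"),
        }
```

Because `iter_parse` is a generator, a loop such as `for record in iter_parse(get_files('my files path')): print(json.dumps(record))` handles one file at a time, so only the current record has to be held in memory.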
