Check an XML file for data items listed in an Excel file and return a listing if there is a match - code is not working, does not find a match

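Asked by: Janek  Asked: 10/3/2023  Last edited by: Janek  Updated: 10/4/2023  Views: 30

Q:

Aloha, and thank you for helping me with this. I am learning to write Python code and have run into a challenge. Here is what I am trying to achieve:

I have an Excel file (xlsx) containing a list of elements that I need to look up in an XML data file (the list of elements that must be present in the report file).
-- I need to check whether each row's element occurs at least once in the XML file under an exactly matching name.
-- As the result of running the program, I need a listing that shows which elements were found at least once in the XML data and which are missing, written/exported to a new Excel file.

Snippet from the Excel file: cell A9 contains addrAtDxState [image: snippet of the Excel file used]

I have an XML-based data file with a large number of data rows; each item in the Excel file is a line item in the XML data identified by its naaccrId.

Snippet from the XML (actual values removed):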

<?xml version="1.0" encoding="utf-8"?>
<NaaccrData baseDictionaryUri="" recordType="" timeGenerated="" specificationVersion="" xmlns="">
  <Item naaccrId="recordType"></Item>
  <Item naaccrId="naaccrRecordVersion"></Item>
  <Item naaccrId="registryId"></Item>
  <Patient>
    <Item naaccrId="birthplaceCountry"></Item>
    <Item naaccrId="birthplaceState"></Item>
    <Item naaccrId="causeOfDeath"></Item>
    <Item naaccrId="dateOfBirth"></Item>
    <Item naaccrId="dateOfLastContact"></Item>
    <Item naaccrId="icdRevisionNumber"></Item>
    <Item naaccrId="patientIdNumber"></Item>
    <Item naaccrId="race1"></Item>
    <Item naaccrId="race2"></Item>
    <Item naaccrId="race3"></Item>
    <Item naaccrId="race4"></Item>
    <Item naaccrId="race5"></Item>
    <Item naaccrId="sex"></Item>
    <Item naaccrId="spanishHispanicOrigin"></Item>
    <Item naaccrId="vitalStatus"></Item>
    <Tumor>
      <Item naaccrId="countyAtDxAnalysis"></Item>
      <Item naaccrId="addrAtDxPostalCode"></Item>
      <Item naaccrId="addrAtDxState"></Item>
      <Item naaccrId="ageAtDiagnosis"></Item>
      <Item naaccrId="behaviorCodeIcdO3"></Item>
      <Item naaccrId="casefindingSource"></Item>
      <Item naaccrId="censusTrCertainty2010"></Item>
      <Item naaccrId="censusTrPovertyIndictr"></Item>
      <Item naaccrId="censusTract2000"></Item>
      <Item naaccrId="censusTract2010"></Item>
      <Item naaccrId="censusTract2020"></Item>
      <Item naaccrId="cocAccreditedFlag"></Item>
    </Tumor>
  </Patient>
</NaaccrData>

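So the program would take addrAtDxState (from the Excel file) and check whether it occurs at least once in the XML file with the full name matching.

If yes - add it to the "Found Rows" worksheet in the output Excel file.

If not - add that line item to the "Not Found Rows" worksheet in the output Excel file.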
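
The check described above can be sketched with the standard library alone. This is a minimal sketch under stated assumptions, not a definitive implementation: it assumes the names to compare live in the naaccrId attribute of each element, and the helper names collect_naaccr_ids and split_found_missing, the inline sample XML, and the hard-coded name list (standing in for the Excel column) are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened stand-in for Data.xml.
SAMPLE_XML = """<NaaccrData>
  <Item naaccrId="recordType"></Item>
  <Patient>
    <Item naaccrId="sex"></Item>
    <Tumor>
      <Item naaccrId="addrAtDxState"></Item>
    </Tumor>
  </Patient>
</NaaccrData>"""


def collect_naaccr_ids(root):
    """Return the set of naaccrId attribute values found at any depth."""
    ids = set()
    for element in root.iter():  # iter() walks the whole tree, not only direct children
        naaccr_id = element.get("naaccrId")
        if naaccr_id is not None:
            ids.add(naaccr_id)
    return ids


def split_found_missing(names, xml_ids):
    """Split names into (found, missing), ignoring surrounding whitespace."""
    found, missing = [], []
    for name in names:
        cleaned = str(name).strip()
        (found if cleaned in xml_ids else missing).append(cleaned)
    return found, missing


xml_ids = collect_naaccr_ids(ET.fromstring(SAMPLE_XML))
# Assumed stand-in for the Excel column of required names.
required = ["addrAtDxState", "  sex ", "censusTract2020"]
found, missing = split_found_missing(required, xml_ids)
print("found:", found)
print("missing:", missing)
```

Each of the two lists could then become a one-column DataFrame and be written to its own worksheet with pd.ExcelWriter.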

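I tried the Python code below, which I partly wrote myself and partly pieced together from reading entries here.

The Excel and XML files are in the same folder as my Python file (I use Jupyter for coding).

The code runs and completes the comparison, but in the exported Excel file every element ends up in the "Not Found Rows" worksheet, so the code does not seem to find any matches.

I tried various forms of the names in the Excel file: copied verbatim from the XML file (with the leading spaces), with spaces, and without spaces (the uploaded Excel snippet shows what I mean).

I am out of ideas. I cannot explain why no match is made, even when a row in Excel corresponds one-to-one to a line that appears the same way in the XML, and why the code does not recognize that a match exists.

Could it be that the names sit inside the XML elements' content? If so, how do I tell Python to "open them up and look inside the element content"?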
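
For reference, here is a minimal sketch of where xml.etree.ElementTree exposes each part of an element. It assumes the structure of the sample shown above; the inline XML string is a hypothetical stand-in. In that sample the name is an attribute value (element.attrib or element.get), not the element content (element.text), and plain iteration over an element yields only its direct children.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the structure shown above.
root = ET.fromstring(
    '<NaaccrData>'
    '<Patient><Tumor><Item naaccrId="addrAtDxState"></Item></Tumor></Patient>'
    '</NaaccrData>'
)

item = root.find("./Patient/Tumor/Item")
tag = item.tag                    # element name
text = item.text                  # element content; None for an empty element
attributes = item.attrib          # dict of all attributes
naaccr_id = item.get("naaccrId")  # lookup of a single attribute

direct_children = [child.tag for child in root]                # one level only
all_items = [el.get("naaccrId") for el in root.iter("Item")]   # any depth

print(tag, text, attributes, naaccr_id)
print(direct_children, all_items)
```

If the real file declares a default namespace (the xmlns value was removed from the sample), ElementTree reports tags in the form {namespace-uri}Item, so the path in find and the tag passed to iter would need that qualified name.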

Any input/constructive criticism would be greatly appreciated. Thank you,

import xml.etree.ElementTree as ET
import pandas as pd

def read_xml(xml_file):
    tree = ET.parse(xml_file)
    root = tree.getroot()
    data = []
    
    for row in root:
        data_row = {}
        for col in row:
            data_row[col.tag] = col.text
        data.append(data_row)
    
    return data

def read_excel(excel_file):
    df = pd.read_excel(excel_file)
    return df.to_dict(orient='records')

def check_rows_in_xml(xml_data, excel_data):
    results = []
    
    for excel_row in excel_data:
        excel_values = tuple(sorted(excel_row.items()))
        found_in_xml = any(excel_values == tuple(sorted(xml_row.items())) for xml_row in xml_data)
        results.append((excel_row, found_in_xml))
    
    return results

if __name__ == "__main__":
    xml_file = 'Data.xml'
    excel_file = 'List4.xlsx'
    
    xml_data = read_xml(xml_file)
    excel_data = read_excel(excel_file)
    
    results = check_rows_in_xml(xml_data, excel_data)
    
    # Separate the results into found and not found rows
    found_rows = [dict(row) for row, found in results if found]
    not_found_rows = [dict(row) for row, found in results if not found]
    
    # Create a Pandas DataFrame for the found rows
    found_df = pd.DataFrame(found_rows)
    
    # Create a Pandas DataFrame for the not found rows
    not_found_df = pd.DataFrame(not_found_rows)
    
    # Write the DataFrames to an Excel file
    output_excel_file = 'output_results.xlsx'
    with pd.ExcelWriter(output_excel_file) as writer:
        found_df.to_excel(writer, sheet_name='Found Rows', index=False)
        not_found_df.to_excel(writer, sheet_name='Not Found Rows', index=False)
    
    print(f"Results saved to {output_excel_file}")
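
One way to see why nothing matches is to print what the read_xml loop returns for a small input. The sketch below applies the same loop to a hypothetical, shortened sample; the sample string, the helper name read_xml_rows, and the excel_row dict (including its column header "Element") are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened stand-in for Data.xml.
SAMPLE_XML = """<NaaccrData>
  <Item naaccrId="recordType"></Item>
  <Patient>
    <Item naaccrId="sex"></Item>
    <Tumor>
      <Item naaccrId="addrAtDxState"></Item>
    </Tumor>
  </Patient>
</NaaccrData>"""


def read_xml_rows(root):
    """Same loop as read_xml above, applied to an already parsed root."""
    data = []
    for row in root:          # direct children of the root only
        data_row = {}
        for col in row:       # direct children of each of those
            data_row[col.tag] = col.text
        data.append(data_row)
    return data


xml_rows = read_xml_rows(ET.fromstring(SAMPLE_XML))
print(xml_rows)

# Assumed shape of one record from df.to_dict(orient="records").
excel_row = {"Element": "addrAtDxState"}
matches = [row for row in xml_rows
           if sorted(row.items()) == sorted(excel_row.items())]
print(matches)
```

The rows hold tag names and element text only. No naaccrId value appears anywhere in them, so an equality comparison against the Excel records cannot succeed on this input.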

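python xml parsing

0 upvotes Hermann12 10/3/2023

Please don't share pictures of data; we won't type out your XML file. Share the XML snippet as text instead.

0 upvotes Janek 10/4/2023

@Hermann12 Thank you for educating me on this. I have removed the XML picture and added it as text/code. I wasn't sure about the Excel sheet, so I left it as is (an image snippet). If time permits, could you take a look at the code and help me figure out what I am missing here? Thanks

A: No answers yet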
