Asked by: JM Nel Asked: 10/31/2023 Updated: 11/1/2023 Views: 28
Parsing "XML Spreadsheet 2003" into a Pandas dataframe
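Q:
I have 6000+ XML Spreadsheet files that I want to parse into Pandas DataFrames. Preferably, I would like to do this with the pd.read_xml method by supplying the right arguments (an XPath?). I would like to avoid parsing the files with lxml myself, and I need this to be as performant as possible.
An example of my file: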
<?mso-application progid='Excel.Sheet'?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet" xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
<Styles>
<Style ss:ID="Default" ss:Name="Normal">
<Alignment ss:Vertical="Bottom"/>
</Style>
<Style ss:ID="ShortDate">
<NumberFormat ss:Format="Short Date"/>
</Style>
<Style ss:ID="Titles">
<Font ss:Bold="1"/>
<Interior ss:Color="#C0C0C0" ss:Pattern="Solid"/>
</Style>
<Style ss:ID="ColorShortDate">
<Font ss:Color="#808080"/>
<NumberFormat ss:Format="Short Date"/>
</Style>
<Style ss:ID="MakeCellColor">
<Font ss:Color="#808080"/>
</Style>
<Style ss:ID="DateTimeStamp">
<NumberFormat ss:Format="dd/mm/yyyy'' hh:mm:ss"/>
</Style>
</Styles>
<Worksheet ss:Name="Details" xml:lang="en-US" xmlns="urn:schemas-microsoft-com:office:spreadsheet" xmlns:Composites="Composites" xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
<Table>
<Column ss:Width="114.75"/>
<Column ss:Width="116.25"/>
<Column ss:Width="53.25"/>
<Column ss:Width="171.75"/>
<Row ss:StyleID="Titles">
<Cell>
<Data ss:Type="String">ShortName</Data>
</Cell>
<Cell>
<Data ss:Type="String">FieldCode</Data>
</Cell>
<Cell>
<Data ss:Type="String">Date</Data>
</Cell>
<Cell>
<Data ss:Type="String">Value</Data>
</Cell>
</Row>
<Row>
<Cell>
<Data ss:Type="String">CHF vs EUR</Data>
</Cell>
<Cell>
<Data ss:Type="String">Calculate</Data>
</Cell>
<Cell>
<Data ss:Type="String"/>
</Cell>
<Cell>
<Data ss:Type="String">yes</Data>
</Cell>
</Row>
<Row>
<Cell>
<Data ss:Type="String">CHF vs EUR</Data>
</Cell>
<Cell>
<Data ss:Type="String">Calendar</Data>
</Cell>
<Cell>
<Data ss:Type="String"/>
</Cell>
<Cell>
<Data ss:Type="String">Working Days</Data>
</Cell>
</Row>
<Row>
<Cell>
<Data ss:Type="String">CHF vs EUR</Data>
</Cell>
<Cell>
<Data ss:Type="String">FromCurrency</Data>
</Cell>
<Cell>
<Data ss:Type="String"/>
</Cell>
<Cell>
<Data ss:Type="String">EUR</Data>
</Cell>
</Row>
<Row>
<Cell>
<Data ss:Type="String">CHF vs EUR</Data>
</Cell>
<Cell>
<Data ss:Type="String">TranslatableName</Data>
</Cell>
<Cell>
<Data ss:Type="String">Inception</Data>
</Cell>
<Cell>
<Data ss:Type="String">CHF vs EUR</Data>
</Cell>
</Row>
<Row>
<Cell>
<Data ss:Type="String">CHF vs EUR</Data>
</Cell>
<Cell>
<Data ss:Type="String">ToCurrency</Data>
</Cell>
<Cell>
<Data ss:Type="String"/>
</Cell>
<Cell>
<Data ss:Type="String">CHF</Data>
</Cell>
</Row>
<Row>
<Cell>
<Data ss:Type="String">CHF vs EUR</Data>
</Cell>
<Cell>
<Data ss:Type="String">InceptionDate</Data>
</Cell>
<Cell>
<Data ss:Type="String"/>
</Cell>
<Cell>
<Data ss:Type="String">1995-01-01</Data>
</Cell>
</Row>
<Row>
<Cell>
<Data ss:Type="String">CHF vs EUR</Data>
</Cell>
<Cell>
<Data ss:Type="String">LastModificationDate</Data>
</Cell>
<Cell>
<Data ss:Type="String"/>
</Cell>
<Cell>
<Data ss:Type="String">2011-08-19</Data>
</Cell>
</Row>
</Table>
<WorksheetOptions xmlns="urn:schemas-microsoft-com:office:excel">
<Selected/>
<FreezePanes/>
<SplitHorizontal>1</SplitHorizontal>
<TopRowBottomPane>1</TopRowBottomPane>
<ActivePane>2</ActivePane>
</WorksheetOptions>
</Worksheet>
</Workbook>
The DataFrame I expect is:
ShortName | FieldCode | Date | Value |
---|---|---|---|
CHF vs EUR | Calculate | | yes |
CHF vs EUR | Calendar | | Working Days |
CHF vs EUR | FromCurrency | | EUR |
CHF vs EUR | TranslatableName | Inception | CHF vs EUR |
CHF vs EUR | ToCurrency | | CHF |
CHF vs EUR | InceptionDate | | 1995-01-01 |
CHF vs EUR | LastModificationDate | | 2011-08-19 |
A:
0 votes
Timeless
11/1/2023
#1
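If you insist on using only pandas (even though lxml is still used under the hood), you can try:
import numpy as np
import pandas as pd

N = 4  # number of columns in the sheet
xp = ".//ss:Row/ss:Cell"  # one result row per <Cell>
ns = {"ss": "urn:schemas-microsoft-com:office:spreadsheet"}
# Each <Cell> has a single <Data> child, so this is one column; squeeze() makes it a Series.
tmp = pd.read_xml("file.xml", xpath=xp, namespaces=ns).squeeze()
# The first N values are the header; the rest are reshaped into rows of N cells.
df = pd.DataFrame(np.reshape(tmp[N:].to_numpy(), (-1, N)), columns=tmp[:N]).fillna("")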
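The reshape step only works if every row holds exactly N cells, because the cells arrive as one flat sequence in document order. As a minimal sketch of that idea in plain Python (no pandas or NumPy), the snippet below splits a flat list into a header and rows. The helper name `to_table` is hypothetical, the sample values are copied from the question's file, and `None` stands in for an empty `<Data/>` element:

```python
def to_table(values, n_cols):
    """Split a flat list of cell values into a header and rows of n_cols each."""
    if len(values) % n_cols:
        # A ragged sheet (a row with missing cells) would silently shift columns.
        raise ValueError("cell count is not a multiple of the column count")
    chunks = [values[i:i + n_cols] for i in range(0, len(values), n_cols)]
    return chunks[0], chunks[1:]


# Flat cell values in document order; None marks an empty <Data/> element.
cells = [
    "ShortName", "FieldCode", "Date", "Value",
    "CHF vs EUR", "Calculate", None, "yes",
    "CHF vs EUR", "TranslatableName", "Inception", "CHF vs EUR",
]

header, rows = to_table(cells, 4)
print(header)
print(rows)
```

If some files might contain rows with fewer cells, a length check like the one in this sketch would catch them before the columns get misaligned.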
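Or this more general approach:
import pandas as pd
from lxml import etree

ns = {"ss": "urn:schemas-microsoft-com:office:spreadsheet"}
# Collect every <Row>, then the text of each <Data> element inside it.
rows = etree.parse("file.xml").xpath(".//ss:Row", namespaces=ns)
headata = [[col.text for col in row.xpath(
    "./ss:Cell/ss:Data", namespaces=ns)] for row in rows]
# The first row is the header; empty cells come back as None, hence fillna.
df = pd.DataFrame(headata[1:], columns=headata[0]).fillna("")
Output: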
print(df)
Data   ShortName             FieldCode       Date         Value
0     CHF vs EUR             Calculate                      yes
1     CHF vs EUR              Calendar             Working Days
2     CHF vs EUR          FromCurrency                      EUR
3     CHF vs EUR      TranslatableName  Inception    CHF vs EUR
4     CHF vs EUR            ToCurrency                      CHF
5     CHF vs EUR         InceptionDate               1995-01-01
6     CHF vs EUR  LastModificationDate               2011-08-19
[7 rows x 4 columns]
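For comparison, the row-by-row approach can also be sketched with only the standard library's `xml.etree.ElementTree`, which avoids a direct lxml dependency. This is an illustrative sketch rather than a benchmarked solution: the function name `read_rows` and the shortened two-column sample are assumptions made for the example, and whether it is faster than lxml on real files would need to be measured.

```python
import xml.etree.ElementTree as ET

# Namespace used by XML Spreadsheet 2003 files (taken from the question's file).
NS = {"ss": "urn:schemas-microsoft-com:office:spreadsheet"}


def read_rows(xml_text):
    """Return one list of cell texts per <Row>; empty cells become ''."""
    root = ET.fromstring(xml_text)
    rows = []
    for row in root.iterfind(".//ss:Row", NS):
        # data.text is None for an empty <Data/> element.
        rows.append([data.text or "" for data in row.iterfind("ss:Cell/ss:Data", NS)])
    return rows


# Shortened, hypothetical sample with the same structure as the question's file.
sample = """<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
          xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
<Worksheet ss:Name="Details"><Table>
<Row><Cell><Data ss:Type="String">ShortName</Data></Cell>
     <Cell><Data ss:Type="String">Value</Data></Cell></Row>
<Row><Cell><Data ss:Type="String">CHF vs EUR</Data></Cell>
     <Cell><Data ss:Type="String"/></Cell></Row>
</Table></Worksheet></Workbook>"""

header, *body = read_rows(sample)
print(header)
print(body)
# With pandas installed, pd.DataFrame(body, columns=header) would build the frame.
```

With 6000+ files, one option is to build one frame per file this way and combine them with a single `pd.concat` call at the end, instead of appending to a frame inside the loop.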