Parsing "XML Spreadsheet 2003" into a Pandas dataframe

Asked by: JM Nel  Asked: 10/31/2023  Updated: 11/1/2023  Views: 28

Q:

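I have 6000+ XML spreadsheet files that I want to parse into Pandas DataFrames. Ideally I would like to do this with the pd.read_xml method by supplying the right parameters (an XPath?). I want to avoid parsing the files with lxml. I need this to be as performant as possible.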

An example of my files:

<?mso-application progid='Excel.Sheet'?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet" xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
    <Styles>
        <Style ss:ID="Default" ss:Name="Normal">
            <Alignment ss:Vertical="Bottom"/>
        </Style>
        <Style ss:ID="ShortDate">
            <NumberFormat ss:Format="Short Date"/>
        </Style>
        <Style ss:ID="Titles">
            <Font ss:Bold="1"/>
            <Interior ss:Color="#C0C0C0" ss:Pattern="Solid"/>
        </Style>
        <Style ss:ID="ColorShortDate">
            <Font ss:Color="#808080"/>
            <NumberFormat ss:Format="Short Date"/>
        </Style>
        <Style ss:ID="MakeCellColor">
            <Font ss:Color="#808080"/>
        </Style>
        <Style ss:ID="DateTimeStamp">
            <NumberFormat ss:Format="dd/mm/yyyy'' hh:mm:ss"/>
        </Style>
    </Styles>
    <Worksheet ss:Name="Details" xml:lang="en-US" xmlns="urn:schemas-microsoft-com:office:spreadsheet" xmlns:Composites="Composites" xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
        <Table>
            <Column ss:Width="114.75"/>
            <Column ss:Width="116.25"/>
            <Column ss:Width="53.25"/>
            <Column ss:Width="171.75"/>
            <Row ss:StyleID="Titles">
                <Cell>
                    <Data ss:Type="String">ShortName</Data>
                </Cell>
                <Cell>
                    <Data ss:Type="String">FieldCode</Data>
                </Cell>
                <Cell>
                    <Data ss:Type="String">Date</Data>
                </Cell>
                <Cell>
                    <Data ss:Type="String">Value</Data>
                </Cell>
            </Row>
            <Row>
                <Cell>
                    <Data ss:Type="String">CHF vs EUR</Data>
                </Cell>
                <Cell>
                    <Data ss:Type="String">Calculate</Data>
                </Cell>
                <Cell>
                    <Data ss:Type="String"/>
                </Cell>
                <Cell>
                    <Data ss:Type="String">yes</Data>
                </Cell>
            </Row>
            <Row>
                <Cell>
                    <Data ss:Type="String">CHF vs EUR</Data>
                </Cell>
                <Cell>
                    <Data ss:Type="String">Calendar</Data>
                </Cell>
                <Cell>
                    <Data ss:Type="String"/>
                </Cell>
                <Cell>
                    <Data ss:Type="String">Working Days</Data>
                </Cell>
            </Row>
            <Row>
                <Cell>
                    <Data ss:Type="String">CHF vs EUR</Data>
                </Cell>
                <Cell>
                    <Data ss:Type="String">FromCurrency</Data>
                </Cell>
                <Cell>
                    <Data ss:Type="String"/>
                </Cell>
                <Cell>
                    <Data ss:Type="String">EUR</Data>
                </Cell>
            </Row>
            <Row>
                <Cell>
                    <Data ss:Type="String">CHF vs EUR</Data>
                </Cell>
                <Cell>
                    <Data ss:Type="String">TranslatableName</Data>
                </Cell>
                <Cell>
                    <Data ss:Type="String">Inception</Data>
                </Cell>
                <Cell>
                    <Data ss:Type="String">CHF vs EUR</Data>
                </Cell>
            </Row>
            <Row>
                <Cell>
                    <Data ss:Type="String">CHF vs EUR</Data>
                </Cell>
                <Cell>
                    <Data ss:Type="String">ToCurrency</Data>
                </Cell>
                <Cell>
                    <Data ss:Type="String"/>
                </Cell>
                <Cell>
                    <Data ss:Type="String">CHF</Data>
                </Cell>
            </Row>
            <Row>
                <Cell>
                    <Data ss:Type="String">CHF vs EUR</Data>
                </Cell>
                <Cell>
                    <Data ss:Type="String">InceptionDate</Data>
                </Cell>
                <Cell>
                    <Data ss:Type="String"/>
                </Cell>
                <Cell>
                    <Data ss:Type="String">1995-01-01</Data>
                </Cell>
            </Row>
            <Row>
                <Cell>
                    <Data ss:Type="String">CHF vs EUR</Data>
                </Cell>
                <Cell>
                    <Data ss:Type="String">LastModificationDate</Data>
                </Cell>
                <Cell>
                    <Data ss:Type="String"/>
                </Cell>
                <Cell>
                    <Data ss:Type="String">2011-08-19</Data>
                </Cell>
            </Row>
        </Table>
        <WorksheetOptions xmlns="urn:schemas-microsoft-com:office:excel">
            <Selected/>
            <FreezePanes/>
            <SplitHorizontal>1</SplitHorizontal>
            <TopRowBottomPane>1</TopRowBottomPane>
            <ActivePane>2</ActivePane>
        </WorksheetOptions>
    </Worksheet>
</Workbook>

The DataFrame I expect is:

ShortName   FieldCode             Date       Value
CHF vs EUR  Calculate                        yes
CHF vs EUR  Calendar                         Working Days
CHF vs EUR  FromCurrency                     EUR
CHF vs EUR  TranslatableName      Inception  CHF vs EUR
CHF vs EUR  ToCurrency                       CHF
CHF vs EUR  InceptionDate                    1995-01-01
CHF vs EUR  LastModificationDate             2011-08-19
pandas xpath parsing xml-spreadsheet lxml

A:

0 votes  Timeless  11/1/2023  #1

If you insist on using only pandas (even though lxml is used under the hood), you can try:

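import numpy as np
import pandas as pd

N = 4  # number of columns
xp = ".//ss:Row/ss:Cell"  # every cell of every row
ns = {"ss": "urn:schemas-microsoft-com:office:spreadsheet"}

# Each <Cell> has one <Data> child, so read_xml returns a single "Data" column;
# squeeze() turns that one-column frame into a Series.
tmp = pd.read_xml("file.xml", xpath=xp, namespaces=ns).squeeze()
# First N values are the header; reshape the remaining values into rows of N.
df = pd.DataFrame(np.reshape(tmp[N:].to_numpy(), (-1, N)), columns=tmp[:N]).fillna("")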

Or this more general approach:

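import pandas as pd
from lxml import etree

ns = {"ss": "urn:schemas-microsoft-com:office:spreadsheet"}
rows = etree.parse("file.xml").xpath(".//ss:Row", namespaces=ns)

# One list of cell texts per row; an empty <Data/> element yields None.
headata = [[col.text for col in row.xpath(
    "./ss:Cell/ss:Data", namespaces=ns)] for row in rows]

# The first row holds the column names, the rest is data.
df = pd.DataFrame(headata[1:], columns=headata[0]).fillna("")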
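
Since the question asks to avoid lxml, the same row-by-row extraction can also be sketched with the standard library's xml.etree.ElementTree. This is only a minimal sketch, not a benchmarked solution: the function name parse_rows and the inline sample string are hypothetical, and it assumes the first Row holds the column titles and that every Row has the same number of Cell elements, as in the example file.

```python
import xml.etree.ElementTree as ET

# Namespace of the "XML Spreadsheet 2003" format, taken from the sample file.
NS = {"ss": "urn:schemas-microsoft-com:office:spreadsheet"}


def parse_rows(xml_text):
    """Return (header, rows) for all <Row> elements in the document.

    Hypothetical helper: assumes the first Row holds the column titles
    and maps empty <Data/> elements to empty strings.
    """
    root = ET.fromstring(xml_text)
    table = [
        [(data.text or "") for data in row.findall("./ss:Cell/ss:Data", NS)]
        for row in root.iterfind(".//ss:Row", NS)
    ]
    return table[0], table[1:]


# Tiny stand-in for a real file (assumption: same layout as the sample above).
sample = (
    '<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet" '
    'xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">'
    '<Worksheet ss:Name="Details"><Table>'
    '<Row><Cell><Data ss:Type="String">ShortName</Data></Cell>'
    '<Cell><Data ss:Type="String">Date</Data></Cell></Row>'
    '<Row><Cell><Data ss:Type="String">CHF vs EUR</Data></Cell>'
    '<Cell><Data ss:Type="String"/></Cell></Row>'
    '</Table></Worksheet></Workbook>'
)

header, rows = parse_rows(sample)
print(header)
print(rows)
```

The result can then be handed to pandas with pd.DataFrame(rows, columns=header).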

Output:

print(df)

Data   ShortName             FieldCode       Date         Value
0     CHF vs EUR             Calculate                      yes
1     CHF vs EUR              Calendar             Working Days
2     CHF vs EUR          FromCurrency                      EUR
3     CHF vs EUR      TranslatableName  Inception    CHF vs EUR
4     CHF vs EUR            ToCurrency                      CHF
5     CHF vs EUR         InceptionDate               1995-01-01
6     CHF vs EUR  LastModificationDate               2011-08-19

[7 rows x 4 columns]
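
For the 6000+ files mentioned in the question, one possible way to keep memory use flat is to stream each file with ElementTree's iterparse and discard each Row once it has been read. The sketch below is an illustration under assumptions, not a measured recommendation: iter_rows and load_files are hypothetical names, the first Row of each file is assumed to be the header, and whether this is actually faster than pd.read_xml would have to be timed on the real data.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

SS = "urn:schemas-microsoft-com:office:spreadsheet"
ROW, DATA = f"{{{SS}}}Row", f"{{{SS}}}Data"  # fully qualified tag names


def iter_rows(path):
    """Yield one list of cell strings per <Row>, streaming the file."""
    for _event, elem in ET.iterparse(path, events=("end",)):
        if elem.tag == ROW:
            yield [(d.text or "") for d in elem.iter(DATA)]
            elem.clear()  # drop the cells of the row that was just emitted


def load_files(paths):
    """Return {file name: (header, rows)} for each path (hypothetical helper)."""
    result = {}
    for path in map(Path, paths):
        rows = list(iter_rows(path))
        result[path.name] = (rows[0], rows[1:])
    return result
```

Each (header, rows) pair can be turned into a frame with pd.DataFrame(rows, columns=header), and the frames can be combined with pd.concat if a single table is wanted.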