Parsing "XML Spreadsheet 2003" into a Pandas DataFrame

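Question:

I have 6000+ XML spreadsheet files that I want to parse into Pandas DataFrames. Ideally I would like to do this with the pd.read_xml method by supplying the right arguments (an XPath?). I would like to avoid parsing the files with lxml myself, and I need the solution to be as performant as possible.

An example of my files: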

<?mso-application progid='Excel.Sheet'?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet" xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
    <Styles>
        <Style ss:ID="Default" ss:Name="Normal">
            <Alignment ss:Vertical="Bottom"/>
        </Style>
        <Style ss:ID="ShortDate">
            <NumberFormat ss:Format="Short Date"/>
        </Style>
        <Style ss:ID="Titles">
            <Font ss:Bold="1"/>
            <Interior ss:Color="#C0C0C0" ss:Pattern="Solid"/>
        </Style>
        <Style ss:ID="ColorShortDate">
            <Font ss:Color="#808080"/>
            <NumberFormat ss:Format="Short Date"/>
        </Style>
        <Style ss:ID="MakeCellColor">
            <Font ss:Color="#808080"/>
        </Style>
        <Style ss:ID="DateTimeStamp">
            <NumberFormat ss:Format="dd/mm/yyyy'' hh:mm:ss"/>
        </Style>
    </Styles>
    <Worksheet ss:Name="Details" xml:lang="en-US" xmlns="urn:schemas-microsoft-com:office:spreadsheet" xmlns:Composites="Composites" xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
        <Table>
            <Column ss:Width="114.75"/>
            <Column ss:Width="116.25"/>
            <Column ss:Width="53.25"/>
            <Column ss:Width="171.75"/>
            <Row ss:StyleID="Titles">
                <Cell>
                    <Data ss:Type="String">ShortName</Data>
                </Cell>
                <Cell>
                    <Data ss:Type="String">FieldCode</Data>
                </Cell>
                <Cell>
                    <Data ss:Type="String">Date</Data>
                </Cell>
                <Cell>
                    <Data ss:Type="String">Value</Data>
                </Cell>
            </Row>
            <Row>
                <Cell>
                    <Data ss:Type="String">CHF vs EUR</Data>
                </Cell>
                <Cell>
                    <Data ss:Type="String">Calculate</Data>
                </Cell>
                <Cell>
                    <Data ss:Type="String"/>
                </Cell>
                <Cell>
                    <Data ss:Type="String">yes</Data>
                </Cell>
            </Row>
            <Row>
                <Cell>
                    <Data ss:Type="String">CHF vs EUR</Data>
                </Cell>
                <Cell>
                    <Data ss:Type="String">Calendar</Data>
                </Cell>
                <Cell>
                    <Data ss:Type="String"/>
                </Cell>
                <Cell>
                    <Data ss:Type="String">Working Days</Data>
                </Cell>
            </Row>
            <Row>
                <Cell>
                    <Data ss:Type="String">CHF vs EUR</Data>
                </Cell>
                <Cell>
                    <Data ss:Type="String">FromCurrency</Data>
                </Cell>
                <Cell>
                    <Data ss:Type="String"/>
                </Cell>
                <Cell>
                    <Data ss:Type="String">EUR</Data>
                </Cell>
            </Row>
            <Row>
                <Cell>
                    <Data ss:Type="String">CHF vs EUR</Data>
                </Cell>
                <Cell>
                    <Data ss:Type="String">TranslatableName</Data>
                </Cell>
                <Cell>
                    <Data ss:Type="String">Inception</Data>
                </Cell>
                <Cell>
                    <Data ss:Type="String">CHF vs EUR</Data>
                </Cell>
            </Row>
            <Row>
                <Cell>
                    <Data ss:Type="String">CHF vs EUR</Data>
                </Cell>
                <Cell>
                    <Data ss:Type="String">ToCurrency</Data>
                </Cell>
                <Cell>
                    <Data ss:Type="String"/>
                </Cell>
                <Cell>
                    <Data ss:Type="String">CHF</Data>
                </Cell>
            </Row>
            <Row>
                <Cell>
                    <Data ss:Type="String">CHF vs EUR</Data>
                </Cell>
                <Cell>
                    <Data ss:Type="String">InceptionDate</Data>
                </Cell>
                <Cell>
                    <Data ss:Type="String"/>
                </Cell>
                <Cell>
                    <Data ss:Type="String">1995-01-01</Data>
                </Cell>
            </Row>
            <Row>
                <Cell>
                    <Data ss:Type="String">CHF vs EUR</Data>
                </Cell>
                <Cell>
                    <Data ss:Type="String">LastModificationDate</Data>
                </Cell>
                <Cell>
                    <Data ss:Type="String"/>
                </Cell>
                <Cell>
                    <Data ss:Type="String">2011-08-19</Data>
                </Cell>
            </Row>
        </Table>
        <WorksheetOptions xmlns="urn:schemas-microsoft-com:office:excel">
            <Selected/>
            <FreezePanes/>
            <SplitHorizontal>1</SplitHorizontal>
            <TopRowBottomPane>1</TopRowBottomPane>
            <ActivePane>2</ActivePane>
        </WorksheetOptions>
    </Worksheet>
</Workbook>

The DataFrame I expect is:

ShortName	FieldCode	Date	Value
CHF vs EUR	Calculate		yes
CHF vs EUR	Calendar		Working Days
CHF vs EUR	FromCurrency		EUR
CHF vs EUR	TranslatableName	Inception	CHF vs EUR
CHF vs EUR	ToCurrency		CHF
CHF vs EUR	InceptionDate		1995-01-01
CHF vs EUR	LastModificationDate		2011-08-19

pandas xpath parsing xml-spreadsheet lxml

Answer:

0 votes Timeless 11/1/2023 #1

If you insist on using only pandas (even though lxml is used under the hood), you can try:

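import pandas as pd

N = 4  # number of columns
xp = ".//ss:Row/ss:Cell"
ns = {"ss": "urn:schemas-microsoft-com:office:spreadsheet"}

# One record per <Cell>; squeeze() collapses the single "Data" column into a Series
tmp = pd.read_xml("file.xml", xpath=xp, namespaces=ns).squeeze()
# The first N values are the header; the remaining values are reshaped into rows of N
df = pd.DataFrame(tmp[N:].to_numpy().reshape(-1, N), columns=tmp[:N]).fillna("")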
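
Since the question asks to keep lxml out of the picture, it may be worth noting that pd.read_xml also accepts parser="etree", which relies on the standard library's ElementTree instead. The following is only a sketch of the same flatten-and-reshape idea under that option; the function name read_spreadsheet_xml, the n_cols parameter, and the two-column inline sample are illustrative assumptions, and no claim is made about its speed relative to lxml.

```python
import io

import pandas as pd

NS = {"ss": "urn:schemas-microsoft-com:office:spreadsheet"}

# Hypothetical two-column sample in the same layout as the question's file.
SAMPLE = """<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
          xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
  <Worksheet ss:Name="Details">
    <Table>
      <Row>
        <Cell><Data ss:Type="String">ShortName</Data></Cell>
        <Cell><Data ss:Type="String">Date</Data></Cell>
      </Row>
      <Row>
        <Cell><Data ss:Type="String">CHF vs EUR</Data></Cell>
        <Cell><Data ss:Type="String">Inception</Data></Cell>
      </Row>
      <Row>
        <Cell><Data ss:Type="String">CHF vs EUR</Data></Cell>
        <Cell><Data ss:Type="String"/></Cell>
      </Row>
    </Table>
  </Worksheet>
</Workbook>"""


def read_spreadsheet_xml(source, n_cols):
    """Flatten every <Cell> into one column, then reshape into n_cols columns.

    Assumes every row holds exactly n_cols cells, as in the sample file.
    """
    cells = pd.read_xml(
        source,
        xpath=".//ss:Row/ss:Cell",
        namespaces=NS,
        parser="etree",  # standard-library parser instead of lxml
    )["Data"].to_numpy()
    header, body = cells[:n_cols], cells[n_cols:]
    return pd.DataFrame(body.reshape(-1, n_cols), columns=header).fillna("")


df = read_spreadsheet_xml(io.StringIO(SAMPLE), n_cols=2)
print(df)
```

Passing a file path instead of the StringIO buffer should work the same way. The reshape fails loudly if the number of cells is not a multiple of n_cols, which is a cheap check that the fixed-width assumption holds for a given file.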

Or this more general approach:

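import pandas as pd
from lxml import etree

ns = {"ss": "urn:schemas-microsoft-com:office:spreadsheet"}
rows = etree.parse("file.xml").xpath(".//ss:Row", namespaces=ns)

# One list of cell texts per <Row>; the first list is the header
headata = [[col.text for col in row.xpath(
    "./ss:Cell/ss:Data", namespaces=ns)] for row in rows]

df = pd.DataFrame(headata[1:], columns=headata[0]).fillna("")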

Output:

print(df)

Data   ShortName             FieldCode       Date         Value
0     CHF vs EUR             Calculate                      yes
1     CHF vs EUR              Calendar             Working Days
2     CHF vs EUR          FromCurrency                      EUR
3     CHF vs EUR      TranslatableName  Inception    CHF vs EUR
4     CHF vs EUR            ToCurrency                      CHF
5     CHF vs EUR         InceptionDate               1995-01-01
6     CHF vs EUR  LastModificationDate               2011-08-19

[7 rows x 4 columns]
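
Because the question involves 6000+ files, the row-by-row approach can be wrapped in a function and applied to a whole folder. The sketch below swaps lxml for the standard library's xml.etree.ElementTree, which is a substitution made here and not part of the answer above. The names workbook_to_frame and load_folder, the "*.xml" pattern, and the added source_file column are all hypothetical. It has not been benchmarked, so treat it as a starting point and not as the fastest option.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

import pandas as pd

NS = {"ss": "urn:schemas-microsoft-com:office:spreadsheet"}


def workbook_to_frame(path):
    """Parse one workbook: the first <Row> is the header, the rest are records."""
    rows = ET.parse(path).getroot().findall(".//ss:Row", NS)
    table = [
        # findtext returns "" for an empty <Data/> and for a missing <Data>
        [cell.findtext("ss:Data", default="", namespaces=NS)
         for cell in row.findall("ss:Cell", NS)]
        for row in rows
    ]
    return pd.DataFrame(table[1:], columns=table[0])


def load_folder(folder, pattern="*.xml"):
    """Parse every matching file and stack the results into one DataFrame."""
    frames = []
    for path in sorted(Path(folder).glob(pattern)):
        frame = workbook_to_frame(path)
        frame["source_file"] = path.name  # remember where each row came from
        frames.append(frame)
    # A single concat at the end avoids repeatedly copying a growing frame
    return pd.concat(frames, ignore_index=True)
```

A call such as load_folder("exports") would return one combined frame, with "exports" standing in for whatever directory holds the files. Collecting the frames in a list and concatenating once is usually cheaper than appending inside the loop.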
