Asked by: M-- Asked: 3/11/2023 Last edited by: M-- Updated: 4/9/2023 Views: 125
Extract XML data in tabular format
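Q:
I have an XML file that I want to extract data from. Ultimately, what I need is a table that shows the node names (i.e. NODE36 and NODE44) together with the information from the embedded table (see the desired output below).
Is there a way to extract the data as a table using regex or an XML parser?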
<?xml version="1.0" encoding="UTF-8"?>
<Document>
<name>culverts.XML</name>
<StyleMap id="m_ylw-pushpin29">
<Pair>
<key>normal</key>
<styleUrl>#s_ylw-pushpin00</styleUrl>
</Pair>
<Pair>
<key>highlight</key>
<styleUrl>#s_ylw-pushpin_hl25</styleUrl>
</Pair>
</StyleMap>
<Folder>
<name>culverts.XML</name>
<open>1</open>
<description>Culvert</description>
<Placemark>
<name>NODE36</name>
<description><![CDATA[<br><br><br>
<table border="1" padding="0">
<tr><td>Objectid</td><td>1</td></tr>
<tr><td>On_route</td><td>Mid Turnpike</td></tr>
<tr><td>Road_numbe</td><td>54</td></tr>
<tr><td>Recommenda</td><td>Continue to monitor.</td></tr>]]></description>
<styleUrl>#m_ylw-pushpin29</styleUrl>
<Point>
<extrude>1</extrude>
<altitudeMode>relativeToGround</altitudeMode>
<coordinates>-74.249045,45.997986,0</coordinates>
</Point>
</Placemark>
<Placemark>
<name>NODE44</name>
<description><![CDATA[<br><br><br>
<table border="1" padding="0">
<tr><td>Objectid</td><td>2</td></tr>
<tr><td>On_route</td><td>Mid Turnpike</td></tr>
<tr><td>Road_numbe</td><td>54</td></tr>
<tr><td>Recommenda</td><td>Not Available.</td></tr>]]></description>
<styleUrl>#m_ylw-pushpin29</styleUrl>
<Point>
<extrude>1</extrude>
<altitudeMode>relativeToGround</altitudeMode>
<coordinates>-74.24906300000001,45.998057,0</coordinates>
</Point>
</Placemark>
</Folder>
</Document>
Desired output:
Name | Objectid | On_route | Road_numbe | Recommenda |
---|---|---|---|---|
NODE36 | 1 | Mid Turnpike | 54 | Continue to monitor. |
NODE44 | 2 | Mid Turnpike | 54 | Not Available. |
I tried to use regex to extract the data between <Placemark> and </Placemark>, but to no avail:
library(qdapRegex)
my_tbl <- rm_between(file_str, 'Placemark', '/Placemark', extract=TRUE)[[1]]
or
my_tbl <- str_extract_all(file_str, "Placemark((.|\n)*)/Placemark")
Error in stri_extract_all_regex(string, pattern, simplify = simplify, :
Regular expression backtrack stack overflow. (U_REGEX_STACK_OVERFLOW)
I could not get this to work in R. Even if I could, it would match from the first occurrence of <Placemark> to the last occurrence of </Placemark>; see here: https://regex101.com/r/bQOdDJ/1
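The first-to-last matching described above comes from a greedy quantifier. A minimal sketch of the lazy alternative in base R follows; the `file_str` value here is a shortened, hypothetical stand-in for the real file contents, and regex on XML remains fragile compared with a real parser.

```r
# Shortened stand-in for the file contents (hypothetical toy input).
file_str <- paste(
  "<Folder>",
  "<Placemark>", "<name>NODE36</name>", "</Placemark>",
  "<Placemark>", "<name>NODE44</name>", "</Placemark>",
  "</Folder>",
  sep = "\n"
)

# (?s) lets "." match newlines; ".*?" is lazy, so each match stops at the
# nearest closing tag instead of running on to the last one.
pattern <- "(?s)<Placemark>.*?</Placemark>"
blocks <- regmatches(file_str, gregexpr(pattern, file_str, perl = TRUE))[[1]]

# Pull the <name> value out of each matched block.
node_names <- sub("(?s).*<name>(.*?)</name>.*", "\\1", blocks, perl = TRUE)
print(node_names)
```

With `perl = TRUE` the pattern is handled by PCRE rather than the ICU engine that raised the stack overflow, and the lazy quantifier yields one match per Placemark.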
Answers:
1 upvote
Onyambu
3/11/2023
#1
library(rvest)
library(tidyverse)

# your_page: the XML file to read
read_html(your_page, options = "HUGE") %>%
  html_node('table') %>%
  html_table(fill = TRUE) %>%
  # each 'Objectid' key starts a new record
  mutate(row = cumsum(X1 == 'Objectid')) %>%
  pivot_wider(names_from = X1, values_from = X2) %>%
  type.convert(as.is = TRUE)
# A tibble: 3 × 5
row Objectid On_route Road_numbe Recommenda
<int> <int> <chr> <int> <chr>
1 1 1 Mid Turnpike 54 Continue to monitor.
2 2 2 Mid Turnpike 54 Not Available.
Comments
1 upvote
M--
3/11/2023
I made an edit and added options = "HUGE" to read_html. This now works like a charm (it just doesn't give me the name column).
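One possible way to recover the missing name column is to read the Placemark names separately with xml2 and bind them to the result. This is only a sketch: `xml_txt` and `res` are hypothetical stand-ins for the real file and for the table produced by the pipeline above, and it assumes one table row per Placemark, in document order.

```r
library(xml2)

# Shortened stand-in for the real file (hypothetical toy input).
xml_txt <- '<Document><Folder><name>culverts.XML</name>
<Placemark><name>NODE36</name></Placemark>
<Placemark><name>NODE44</name></Placemark>
</Folder></Document>'

doc <- read_xml(xml_txt)

# Restrict the path to Placemark children so the Folder's own <name>
# is not picked up.
node_names <- xml_text(xml_find_all(doc, "//Placemark/name"))

# Stand-in for the table returned by the rvest pipeline (assumed shape).
res <- data.frame(Objectid = 1:2, On_route = "Mid Turnpike")

# Assumes rows and Placemarks line up one-to-one in the same order.
res <- cbind(name = node_names, res)
print(res)
```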
2 upvotes
MrFlick
3/11/2023
#2
Here's an approach with a helper function that converts an HTML table to a data.frame. Basically, we need to do a lot of iterating over, and parsing of, the HTML data.
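library(xml2)
library(purrr)
library(dplyr)  # bind_rows(), bind_cols(), tibble()

# xx is assumed to hold the XML document (a file path or the XML text)
doc <- xml2::read_xml(xx)

# Turn a two-column (key, value) HTML table into a one-row data frame
table_to_dataframe <- function(x) {
  x |> xml_find_all(".//tr") |>
    map(function(x) {
      x |> xml_find_all("./td") |> xml_text()
    }) |>
    do.call("rbind", args = _) |>
    (function(x) setNames(x[, 2], x[, 1]))() |>
    bind_rows()
}

doc |>
  xml_find_all("//Placemark") |>
  map_df(function(p) {
    name <- p |> xml_find_first("./name") |> xml_text()
    # the description is CDATA text, so it has to be parsed again as HTML
    sub <- p |> xml_find_first("./description") |> xml_text() |> read_html()
    bind_cols(tibble(name), table_to_dataframe(sub))
  })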
which returns
name Objectid On_route Road_numbe Recommenda
<chr> <chr> <chr> <chr> <chr>
1 NODE36 1 Mid Turnpike 54 Continue to monitor.
2 NODE44 2 Mid Turnpike 54 Not Available.
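The core of this approach is a two-pass parse: the CDATA content first comes back as plain text, and only a second parse as HTML makes the cells addressable. A stripped-down sketch for a single Placemark, using a hypothetical shortened `xml_txt` rather than the real file:

```r
library(xml2)

# Shortened stand-in for one Placemark (hypothetical toy input).
xml_txt <- '<Document><Placemark><name>NODE36</name>
<description><![CDATA[<table>
<tr><td>Objectid</td><td>1</td></tr>
<tr><td>On_route</td><td>Mid Turnpike</td></tr>
</table>]]></description>
</Placemark></Document>'

doc <- read_xml(xml_txt)
pm <- xml_find_first(doc, "//Placemark")

# First pass: the CDATA content is returned as text, not as nodes.
html_txt <- xml_text(xml_find_first(pm, "./description"))

# Second pass: parse that text as HTML so the <td> cells can be queried.
cells <- xml_text(xml_find_all(read_html(html_txt), "//td"))

# Cells alternate key, value; fold them into a named character vector.
keys <- cells[c(TRUE, FALSE)]
values <- setNames(cells[c(FALSE, TRUE)], keys)
print(values)
```

The named vector is the per-Placemark record that the helper above turns into a one-row data frame.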
1 upvote
Michael Kay
3/11/2023
#3
This is pretty tricky because of the nested HTML, but here's an XSLT solution that uses the parse-html() function proposed for XSLT 4.0 and implemented in Saxon 12:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="4.0" expand-text="yes">
<xsl:output method="html" indent="yes"/>
<xsl:variable name="tables">
<xsl:for-each select="//Placemark">
<data>
<xsl:copy-of select="name"/>
<description>
<xsl:sequence select="parse-html(description)//*:table"/>
</description>
</data>
</xsl:for-each>
</xsl:variable>
<xsl:template match="/">
<table>
<thead>
<tr>
<th>Name</th>
<xsl:for-each select="$tables/data[1]//*:tr">
<th>{*:td[1]}</th>
</xsl:for-each>
</tr>
</thead>
<tbody>
<xsl:for-each select="$tables/data">
<tr>
<td>{name}</td>
<xsl:for-each select="description//*:tr">
<td>{*:td[2]}</td>
</xsl:for-each>
</tr>
</xsl:for-each>
</tbody>
</table>
</xsl:template>
</xsl:stylesheet>
The output is:
<table>
<thead>
<tr>
<th>Name</th>
<th>Objectid</th>
<th>On_route</th>
<th>Road_numbe</th>
<th>Recommenda</th>
</tr>
</thead>
<tbody>
<tr>
<td>NODE36</td>
<td>1</td>
<td>Mid Turnpike</td>
<td>54</td>
<td>Continue to monitor.</td>
</tr>
<tr>
<td>NODE44</td>
<td>2</td>
<td>Mid Turnpike</td>
<td>54</td>
<td>Not Available.</td>
</tr>
</tbody>
</table>