Asked by: BleepBloop  Asked: 7/26/2022  Last edited by: PhilBleepBloop  Updated: 7/27/2022  Views: 48
Removing a node in an XML before flattening via the flatxml package in R
Q:
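I have multiple XML files that I want to extract data from. They were created from PDFs using a machine learning library.
When importing the XML with the flatxml package in R, I run into a problem: whenever there is a <ref> tag inside a <p> node, all of the data in the <p> node after the </ref> is cut off. So, in the sample XML below, fxml_importXMLFlat() drops everything from ". Predominantly amorphous (non-crystalline..." up to the closing </p>.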
<text xml:lang="en">
<body>
<div xmlns="http://www.tei-c.org/ns/1.0">
<head n="1.">Introduction</head>
<p>
Thermal properties of confectionary products, including the melting temperature (Tm) of crystalline components and glass transition temperature (Tg) of amorphous components, as well as the crystalline to amorphous ratio, significantly impact system texture and stability
<ref type="bibr" target="#b23">(Levine and Slade, 1986)</ref>
. Predominantly amorphous (non-crystalline, disordered solid) candies are formed by heating ingredients to a set temperature and then quickly cooling the resultant supersatured sugar solution to below the temperature range in which recrystallization of sugars can occur, between Tg and Tm of the material.
</p>
</div>
</body>
</text>
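To work around this, I plan to first import the file into R as XML, remove the <ref> tags, and then flatten it via flatxml.
I tried to find and remove the <ref> tags using the XML package with the following code: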
xml1 <- read_xml("https://file.io/vAiDRi5s68Gm")
ref <- xml_find_all(xml1, "//ref")
rm(ref)
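This returns nothing in the ref object. When I inspect the XML after reading it in, it does not appear to contain any <ref> tags either.
I also tried the following, but it does not seem to find the <ref> tags either:
xml1 <- xmlToList("https://file.io/vAiDRi5s68Gm")
I also tried the following, but got the error shown below it:
xml1 <- xmlToDataFrame("https://file.io/vAiDRi5s68Gm")
Error in `[<-.data.frame`(`*tmp*`, i, names(nodes[[i]]), value = c(text = "\n\t\t", : duplicate subscripts for columns
As I understand it, this is because the XML file is deeply nested.
My goal is to extract data from hundreds of XML files, so I need something that can be applied to all of them rather than to one specific file. Any ideas would be greatly appreciated!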
A:
Here is an XSLT stylesheet that removes all ref elements.
Input XML
<text xml:lang="en">
<body>
<div xmlns="http://www.tei-c.org/ns/1.0">
<head n="1.">Introduction</head>
<p>Thermal properties of confectionary products, including the melting temperature (Tm) of crystalline components and glass transition temperature (Tg) of amorphous components, as well as the crystalline to amorphous ratio, significantly impact system texture and stability
<ref type="bibr" target="#b23">(Levine and Slade, 1986)</ref>. Predominantly amorphous (non-crystalline, disordered solid) candies are formed by heating ingredients to a set temperature and then quickly cooling the resultant supersatured sugar solution to below the temperature range in which recrystallization of sugars can occur, between Tg and Tm of the material.</p>
</div>
</body>
</text>
XSLT
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:ns1="http://www.tei-c.org/ns/1.0">
<xsl:output method="xml" encoding="utf-8" indent="yes" omit-xml-declaration="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
<!-- remove ref elements -->
<xsl:template match="ns1:ref"/>
</xsl:stylesheet>
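One way to run the stylesheet above from R is the xslt package, which operates on xml2 documents. The following is a minimal sketch, assuming the stylesheet has been saved to a file. The helper name strip_refs_in_dir and the directory and file names in the usage comment are hypothetical and chosen only for illustration.

```r
library(xml2)
library(xslt)

# Hypothetical helper: apply one stylesheet to every .xml file in in_dir
# and write the transformed copies to out_dir under the same file names.
strip_refs_in_dir <- function(in_dir, out_dir, stylesheet_path) {
  style <- read_xml(stylesheet_path)
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  files <- list.files(in_dir, pattern = "\\.xml$", full.names = TRUE)
  for (f in files) {
    doc <- read_xml(f)
    # xml_xslt() returns the transformed document
    cleaned <- xml_xslt(doc, style)
    write_xml(cleaned, file.path(out_dir, basename(f)))
  }
  invisible(files)
}

# Usage (all paths are assumptions):
# strip_refs_in_dir("xml_raw", "xml_clean", "remove_ref.xsl")
```

The files written to the output directory should then be usable as input to fxml_importXMLFlat(), since they no longer contain ref elements.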
Output XML
<text xml:lang="en">
<body>
<div xmlns="http://www.tei-c.org/ns/1.0">
<head n="1.">Introduction</head>
<p>Thermal properties of confectionary products, including the melting temperature (Tm) of crystalline components and glass transition temperature (Tg) of amorphous components, as well as the crystalline to amorphous ratio, significantly impact system texture and stability
. Predominantly amorphous (non-crystalline, disordered solid) candies are formed by heating ingredients to a set temperature and then quickly cooling the resultant supersatured sugar solution to below the temperature range in which recrystallization of sugars can occur, between Tg and Tm of the material.
</p>
</div>
</body>
</text>
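As an alternative that stays within xml2: the xml_find_all(xml1, "//ref") call in the question most likely returns nothing because the div element declares a default namespace (http://www.tei-c.org/ns/1.0), and an unprefixed name in XPath only matches elements that are in no namespace. Note also that rm(ref) only deletes the R variable and does not change the document. The sketch below uses a shortened copy of the sample document and assumes the TEI namespace is the only default namespace in the file, so that xml2 exposes it under the generated prefix d1.

```r
library(xml2)

# Shortened version of the sample document from the question
doc <- read_xml('<text xml:lang="en">
  <body>
    <div xmlns="http://www.tei-c.org/ns/1.0">
      <head n="1.">Introduction</head>
      <p>Thermal properties significantly impact system texture and stability
<ref type="bibr" target="#b23">(Levine and Slade, 1986)</ref>. Predominantly amorphous candies are formed by heating ingredients.</p>
    </div>
  </body>
</text>')

# An unprefixed name only matches elements in no namespace,
# so this query does not find the TEI ref elements
no_ns_hits <- xml_find_all(doc, "//ref")

# xml2 registers default namespaces under generated prefixes (d1, d2, ...)
refs <- xml_find_all(doc, "//d1:ref", xml_ns(doc))

# xml_remove() detaches the nodes from the document itself
xml_remove(refs)

# The text that followed </ref> is still part of the <p> element
p_text <- xml_text(xml_find_first(doc, "//d1:p", xml_ns(doc)))

# Write the cleaned document to disk (file name is an assumption):
# write_xml(doc, "cleaned.xml")
```

Wrapping these steps in a loop over list.files() would apply the same cleanup to every file before flattening.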