Asked by: Robur_131 Asked: 10/9/2020 Updated: 10/9/2020 Views: 59
Removing tags from a field in an XML file
Q:
I have an XML file that looks like this:
<?xml version="1.0" encoding="utf-8"?>
<posts>
<row Id="1" PostTypeId="1" AcceptedAnswerId="8" CreationDate="2012-12-11T20:37:08.823" Score="67" ViewCount="17934" Body="<p>Assuming the world in the One Piece universe is round, then there is not really a beginning or an end of the Grand Line.</p>

<p>The Straw Hats started out from the first half and are now sailing across the second half.</p>

<p>Wouldn't it have been quicker to set sail in the opposite direction from where they started? </p>
" OwnerUserId="21" LastEditorUserId="1398" LastEditDate="2015-04-17T19:06:38.957" LastActivityDate="2015-05-26T12:50:40.920" Title="The treasure in One Piece is at the end of the Grand Line. But isn't that the same as the beginning?" Tags="<one-piece>" AnswerCount="5" CommentCount="0" FavoriteCount="2" />
<row Id="2" PostTypeId="1" AcceptedAnswerId="33" CreationDate="2012-12-11T20:39:40.780" Score="13" ViewCount="279" Body="<p>In the middle of <em>The Dark Tournament</em>, Yusuke Urameshi gets to fully inherit Genkai's power of the <em>Spirit Wave</em> by absorbing a ball of energy from her.</p>

<p>However, this process turns into an excruciating trial for Yusuke, almost killing him, and keeping him doubled over in extreme pain for a long period of time, so much so that his Spirit Animal, Poo, is also in pain and flies to him to try to help.</p>

<p>My question is, why is it such a painful procedure to learn and absorb this power?</p>
" OwnerUserId="26" LastEditorUserId="247" LastEditDate="2013-02-26T17:02:31.570" LastActivityDate="2013-06-20T03:31:39.187" Title="Why does absorbing the Spirit Wave from Genkai involve such a painful process?" Tags="<yu-yu-hakusho>" AnswerCount="1" CommentCount="0" />
<row Id="3" PostTypeId="1" AcceptedAnswerId="148" CreationDate="2012-12-11T20:42:47.447" Score="9" ViewCount="3022" Body="<p>In Sora no Otoshimono, Ikaros carries around a watermelon like a pet and likes watermelons and pretty much anything else round. At one point she even has a watermelon garden and attacks all the bugs that get near the melons.</p>

<p>What's the significance of the watermelon and why does she carry one around?</p>
" OwnerUserId="29" LastActivityDate="2014-01-15T21:01:55.043" Title="What's the significance of the watermelon in Sora no Otoshimono?" Tags="<sora-no-otoshimono>" AnswerCount="2" CommentCount="1" />
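Specifically, the file contains many rows. Each row starts with a row tag. What I want to do is capture the Body field inside the row tag. For example, the Body field for Id = 2 is: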
"<p>In the middle of <em>The Dark Tournament</em>, Yusuke Urameshi gets to fully inherit Genkai's power of the <em>Spirit Wave</em> by absorbing a ball of energy from her.</p>

<p>However, this process turns into an excruciating trial for Yusuke, almost killing him, and keeping him doubled over in extreme pain for a long period of time, so much so that his Spirit Animal, Poo, is also in pain and flies to him to try to help.</p>

<p>My question is, why is it such a painful procedure to learn and absorb this power?</p>
"
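I have already done this using ElementTree. What I want to do next is parse the words in the Body field of each row. For that, I need to strip any HTML tags from the Body field. For example, after stripping the HTML tags, the Body field of Id = 2 should look like this: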
In the middle of The Dark Tournament Yusuke Urameshi gets to fully inherit Genkai's power of the .... (continued)
So far, I have tried:
import bs4

def remove_html_tags(text):
    return bs4.BeautifulSoup(text, "html.parser").text
This results in:
pin the middle of emthe dark tournamentem yusuke urameshi gets to fully inherit genkais power of the emspirit waveem by absorbing a ball of energy from herp
phowever this process turns into an excruciating trial for yusuke almost killing him and keeping him doubled over in extreme pain for a long period of time so much so that his spirit animal poo is also in pain and flies to him to try to helpp
pmy question is why is it such a painful procedure to learn and absorb this powerp
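As you can see, the angle-bracket symbols are gone, but the text that was enclosed in them (p, em) is still there. What can I do to remove it?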
Answers:
1 upvote
MendelG
10/9/2020
#1
Try this:
import re
from bs4 import BeautifulSoup
xml = """
<?xml version="1.0" encoding="utf-8"?>
<posts>
<row Id="1" PostTypeId="1" AcceptedAnswerId="8" CreationDate="2012-12-11T20:37:08.823" Score="67" ViewCount="17934" Body="<p>Assuming the world in the One Piece universe is round, then there is not really a beginning or an end of the Grand Line.</p>

<p>The Straw Hats started out from the first half and are now sailing across the second half.</p>

<p>Wouldn't it have been quicker to set sail in the opposite direction from where they started? </p>
" OwnerUserId="21" LastEditorUserId="1398" LastEditDate="2015-04-17T19:06:38.957" LastActivityDate="2015-05-26T12:50:40.920" Title="The treasure in One Piece is at the end of the Grand Line. But isn't that the same as the beginning?" Tags="<one-piece>" AnswerCount="5" CommentCount="0" FavoriteCount="2" />
<row Id="2" PostTypeId="1" AcceptedAnswerId="33" CreationDate="2012-12-11T20:39:40.780" Score="13" ViewCount="279" Body="<p>In the middle of <em>The Dark Tournament</em>, Yusuke Urameshi gets to fully inherit Genkai's power of the <em>Spirit Wave</em> by absorbing a ball of energy from her.</p>

<p>However, this process turns into an excruciating trial for Yusuke, almost killing him, and keeping him doubled over in extreme pain for a long period of time, so much so that his Spirit Animal, Poo, is also in pain and flies to him to try to help.</p>

<p>My question is, why is it such a painful procedure to learn and absorb this power?</p>
" OwnerUserId="26" LastEditorUserId="247" LastEditDate="2013-02-26T17:02:31.570" LastActivityDate="2013-06-20T03:31:39.187" Title="Why does absorbing the Spirit Wave from Genkai involve such a painful process?" Tags="<yu-yu-hakusho>" AnswerCount="1" CommentCount="0" />
<row Id="3" PostTypeId="1" AcceptedAnswerId="148" CreationDate="2012-12-11T20:42:47.447" Score="9" ViewCount="3022" Body="<p>In Sora no Otoshimono, Ikaros carries around a watermelon like a pet and likes watermelons and pretty much anything else round. At one point she even has a watermelon garden and attacks all the bugs that get near the melons.</p>

<p>What's the significance of the watermelon and why does she carry one around?</p>
" OwnerUserId="29" LastActivityDate="2014-01-15T21:01:55.043" Title="What's the significance of the watermelon in Sora no Otoshimono?" Tags="<sora-no-otoshimono>" AnswerCount="2" CommentCount="1" />
"""
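# html.parser lowercases attribute names, so the Body attribute is read as "body"
soup = BeautifulSoup(xml, "html.parser")
for tag in soup.select("posts row"):
    # Drop anything that looks like a tag, e.g. <p> or </em>
    result = re.sub(r"<.*?>", "", tag["body"])
    print(result.strip())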
Output:
Assuming the world in the One Piece universe is round, then there is not really a beginning or an end of the Grand Line.
The Straw Hats started out from the first half and are now sailing across the second half.
Wouldn't it have been quicker to set sail in the opposite direction from where they started?
In the middle of The Dark Tournament, Yusuke Urameshi gets to fully inherit Genkai's power of the Spirit Wave by absorbing a ball of energy from her.
However, this process turns into an excruciating trial for Yusuke, almost killing him, and keeping him doubled over in extreme pain for a long period of time, so much so that his Spirit Animal, Poo, is also in pain and flies to him to try to help.
My question is, why is it such a painful procedure to learn and absorb this power?
In Sora no Otoshimono, Ikaros carries around a watermelon like a pet and likes watermelons and pretty much anything else round. At one point she even has a watermelon garden and attacks all the bugs that get near the melons.
What's the significance of the watermelon and why does she carry one around?
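The same result can be had with only the standard library, which may suit a pipeline that already reads the file with ElementTree. The following is a minimal sketch, not the answerer's method. It assumes the real file is well-formed XML, so the markup inside Body is entity-escaped (&lt;p&gt; rather than a literal <p>), and it uses an abridged, made-up version of row 2 as sample data. TextExtractor and strip_tags are hypothetical names introduced here.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: keeps only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags; the tags themselves are never stored.
        self.parts.append(data)


def strip_tags(html_fragment):
    parser = TextExtractor()
    parser.feed(html_fragment)
    parser.close()
    # Collapse the newlines between paragraphs into single spaces.
    return " ".join("".join(parser.parts).split())


# Assumed sample: abridged from row 2, with the Body markup entity-escaped.
SAMPLE = (
    '<posts>'
    '<row Id="2" Body="&lt;p&gt;In the middle of &lt;em&gt;The Dark Tournament&lt;/em&gt;, '
    'Yusuke inherits the &lt;em&gt;Spirit Wave&lt;/em&gt;.&lt;/p&gt;&#xA;&#xA;'
    '&lt;p&gt;Why is it so painful?&lt;/p&gt;&#xA;" />'
    '</posts>'
)

root = ET.fromstring(SAMPLE)
# ElementTree unescapes the attribute, so strip_tags sees real <p> and <em> tags.
bodies = {row.get("Id"): strip_tags(row.get("Body", "")) for row in root.iter("row")}
print(bodies["2"])
```

Unlike the regular expression, the parser also decodes character references such as &amp; that appear in the text.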
1 upvote
dabingsou
10/9/2020
#2
Another approach.
from simplified_scrapy import SimplifiedDoc, utils, req
xml = '''<?xml version="1.0" encoding="utf-8"?>
<posts>
<row Id="1" PostTypeId="1" AcceptedAnswerId="8" CreationDate="2012-12-11T20:37:08.823" Score="67" ViewCount="17934" Body="<p>Assuming the world in the One Piece universe is round, then there is not really a beginning or an end of the Grand Line.</p>

<p>The Straw Hats started out from the first half and are now sailing across the second half.</p>

<p>Wouldn't it have been quicker to set sail in the opposite direction from where they started? </p>
" OwnerUserId="21" LastEditorUserId="1398" LastEditDate="2015-04-17T19:06:38.957" LastActivityDate="2015-05-26T12:50:40.920" Title="The treasure in One Piece is at the end of the Grand Line. But isn't that the same as the beginning?" Tags="<one-piece>" AnswerCount="5" CommentCount="0" FavoriteCount="2" />
<row Id="2" PostTypeId="1" AcceptedAnswerId="33" CreationDate="2012-12-11T20:39:40.780" Score="13" ViewCount="279" Body="<p>In the middle of <em>The Dark Tournament</em>, Yusuke Urameshi gets to fully inherit Genkai's power of the <em>Spirit Wave</em> by absorbing a ball of energy from her.</p>

<p>However, this process turns into an excruciating trial for Yusuke, almost killing him, and keeping him doubled over in extreme pain for a long period of time, so much so that his Spirit Animal, Poo, is also in pain and flies to him to try to help.</p>

<p>My question is, why is it such a painful procedure to learn and absorb this power?</p>
" OwnerUserId="26" LastEditorUserId="247" LastEditDate="2013-02-26T17:02:31.570" LastActivityDate="2013-06-20T03:31:39.187" Title="Why does absorbing the Spirit Wave from Genkai involve such a painful process?" Tags="<yu-yu-hakusho>" AnswerCount="1" CommentCount="0" />
<row Id="3" PostTypeId="1" AcceptedAnswerId="148" CreationDate="2012-12-11T20:42:47.447" Score="9" ViewCount="3022" Body="<p>In Sora no Otoshimono, Ikaros carries around a watermelon like a pet and likes watermelons and pretty much anything else round. At one point she even has a watermelon garden and attacks all the bugs that get near the melons.</p>

<p>What's the significance of the watermelon and why does she carry one around?</p>
" OwnerUserId="29" LastActivityDate="2014-01-15T21:01:55.043" Title="What's the significance of the watermelon in Sora no Otoshimono?" Tags="<sora-no-otoshimono>" AnswerCount="2" CommentCount="1" />
'''
doc = SimplifiedDoc(xml)
rows = doc.selects('row>Body()')
print([doc.removeHtml(doc.unescape(row)) for row in rows])
Result:
['Assuming the world in the One Piece universe is round, then there is not really a beginning or an end of the Grand Line. The Straw Hats started out from the first half and are now sailing across the second half. Wouldn', 'In the middle of The Dark Tournament, Yusuke Urameshi gets to fully inherit Genkai', 'In Sora no Otoshimono, Ikaros carries around a watermelon like a pet and likes watermelons and pretty much anything else round. At one point she even has a watermelon garden and attacks all the bugs that get near the melons. What']
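Whichever library strips the tags, the order of the cleaning steps matters. The lowercased, punctuation-free output shown in the question (pin the middle of emthe ...) is what one would get if punctuation were removed before the tags; this is an inference from that output, not something the question states. A minimal sketch of the two orders, using hypothetical helper names and a shortened sentence from row 2:

```python
import re
import string

TAG_RE = re.compile(r"<[^>]+>")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def words_wrong_order(body):
    # Removing punctuation first deletes "<", ">" and "/", so the tag names
    # "p" and "em" are left behind and fuse with the neighbouring words.
    no_punct = body.lower().translate(PUNCT_TABLE)
    return TAG_RE.sub("", no_punct).split()


def words_right_order(body):
    # Strip tags while the angle brackets are still present, then normalise.
    text = TAG_RE.sub(" ", body)
    return text.lower().translate(PUNCT_TABLE).split()


body = (
    "<p>In the middle of <em>The Dark Tournament</em>, "
    "Yusuke Urameshi gets to fully inherit Genkai's power</p>"
)

print(words_wrong_order(body)[:5])
print(words_right_order(body)[:5])
```

Replacing each tag with a space rather than an empty string keeps words on either side of a tag from running together.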
Comments
0 upvotes
Robur_131
10/9/2020
What if my XML content is in a file, say Anime.xml? Do I pass the file path to SimplifiedDoc?
0 upvotes
dabingsou
10/9/2020
@Robur_131 You can do it like this:
xml = utils.getFileContent('Anime.xml')
doc = SimplifiedDoc(xml)
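For the file case, the standard library can also read from a path directly, without loading the whole document into a string first. This is a minimal sketch under the assumption that the file is well-formed XML with entity-escaped Body values; iter_bodies is a hypothetical helper, and the two-row file written here is a made-up stand-in for Anime.xml so that the example runs on its own.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def iter_bodies(path):
    """Yield (Id, Body) pairs from a posts file, one row at a time."""
    for _event, elem in ET.iterparse(path, events=("end",)):
        if elem.tag == "row":
            yield elem.get("Id"), elem.get("Body", "")
            elem.clear()  # release the row once it has been handled


# Hypothetical stand-in for Anime.xml.
sample = (
    '<posts>'
    '<row Id="1" Body="&lt;p&gt;Hello&lt;/p&gt;" />'
    '<row Id="2" Body="&lt;p&gt;World&lt;/p&gt;" />'
    '</posts>'
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "Anime.xml")
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(sample)
    pairs = list(iter_bodies(path))

print(pairs)
```

Each Body value comes back already unescaped, so it can be handed straight to whichever tag-stripping function is in use.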