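New items added to list from HTML Table using HTMLParser in Python 2.7

Q:

With the help of the HTMLParser documentation and this stackoverflow post, I tried to extract the data from a table. However, while extracting the data between <td>..</td>, the parser appends a new item to the list whenever a start tag occurs inside the cell.

A small example below explains my problem: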

from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.in_td = False
        self._out = []

    def handle_starttag(self, tag, attrs):
        if tag == 'td':
            self.in_td = True

    def handle_endtag(self, tag):
        self.in_td = False

    def handle_data(self, data):
        if self.in_td:
            #print(data)
            self._out.append(data)


if __name__ == "__main__":
    parser = MyHTMLParser()
    link_raw = """
<html><p><center><h1>  Clash Report 1  </h1></center></p><p><table border=on>  <th> Errors </th><th>  Elements </th>
<tr>  <td>  Delete one of those.  </td>
<td>  060 : <Room Separation> : Model Lines : id 549036  <br>  060 : <Room Separation> : Model Lines : id 549042</td></tr>
<tr>  <td>  Delete one of those.  </td>
<td>  060 : <Room Separation> : Model Lines : id 549036  <br>  060 : <Room Separation> : Model Lines : id 549081</td></tr>
"""
    #<html><head><title>Test</title></head><body><tr><td>yes</td><td>no</td></tr></body></html>

    parser.feed(link_raw)
    print (parser._out)

Output

['  Delete one of those.  ', '  060 : ', ' : Model Lines : id 549036  ', '  060 : ', ' : Model Lines : id 549042', '  Delete one of those.  ', '  060 : ', ' : Model Lines : id 549036  ', '  060 : ', ' : Model Lines : id 549081']

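How can I ignore tags such as <Room Separation> and <br> and append the data between <td>..</td> as a single item, as shown below?

Desired output: [' Delete one of those. ', ' 060 : : Model Lines : id 549036 ', ' 060 : : Model Lines : id 549042', ' Delete one of those. ', ' 060 : : Model Lines : id 549036 ', ' 060 : : Model Lines : id 549081']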

Tags: python-2.7, html-table, html-parsing

A:

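# The question targets Python 2.7, where the module is named HTMLParser;
# html.parser is the Python 3 name. Try both so the script runs on either.
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2.7

class MyHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self._stack = []  # currently open tags, innermost last
        self._out = []    # text fragments collected inside <td> cells

    def handle_starttag(self, tag, attrs):
        # <br> and <Room Separation> (parsed as tag 'room') have no end tag
        # in the sample input, so keep them off the stack.
        if tag in ['br', 'room']: return
        self._stack.append(tag)

    def handle_endtag(self, tag):
        # Assumes every pushed tag is closed in order, as in the sample input.
        self._stack.pop()

    def handle_data(self, data):
        # Collect text only while the innermost open tag is <td>.
        if self._stack and self._stack[-1] == 'td':
            self._out.append(data)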


if __name__ == "__main__":
    parser = MyHTMLParser()
    link_raw = """
<html><p><center><h1>  Clash Report 1  </h1></center></p><p><table border=on>  <th> Errors </th><th>  Elements </th>
<tr>  <td>  Delete one of those.  </td>
<td>  060 : <Room Separation> : Model Lines : id 549036  <br>  060 : <Room Separation> : Model Lines : id 549042</td></tr>
<tr>  <td>  Delete one of those.  </td>
<td>  060 : <Room Separation> : Model Lines : id 549036  <br>  060 : <Room Separation> : Model Lines : id 549081</td></tr>
"""
    #<html><head><title>Test</title></head><body><tr><td>yes</td><td>no</td></tr></body></html>

    parser.feed(link_raw)
    result = parser._out
    print (len(result))
    print (result)

Output:

10
['  Delete one of those.  ', '  060 : ', ' : Model Lines : id 549036  ', '  060 : ', ' : Model Lines : id 549042', '  Delete one of those.  ', '  060 : ', ' : Model Lines : id 549036  ', '  060 : ', ' : Model Lines : id 549081']

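First, the output of my code is easily converted to the output you want. Consider one instance. You want [' Delete one of those. ', ' 060 : : Model Lines : id 549036 ', and I give you [' Delete one of those. ', ' 060 : ', ' : Model Lines : id 549036 ',. Simply join the pairs of strings in the series.

Second, the tags you mention (html, for example) are parsed out of the HTML by HTMLParser. My code just puts them on the stack when it finds them and pops them off when it finds their corresponding end tags. That is how it knows when it is "inside" a td cell and must collect characters.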
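The joining step can also happen inside the parser. Below is a minimal sketch, not the answer's original code: it buffers the text fragments of a cell and emits one joined item at each <br> and at </td>. The names CellTextParser and extract_cells are hypothetical, and the sketch assumes that <br> should start a new item, that runs of whitespace may be collapsed, and that empty items can be dropped.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2.7


class CellTextParser(HTMLParser):
    """Collects <td> text, joining fragments split by inline tags."""

    def __init__(self):
        HTMLParser.__init__(self)
        self._parts = None  # None outside <td>, a list of fragments inside
        self.cells = []

    def _flush(self):
        # Join the buffered fragments and collapse whitespace runs.
        text = ' '.join(''.join(self._parts).split())
        if text:  # assumption: empty items are not wanted
            self.cells.append(text)
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag == 'td':
            if self._parts is not None:
                self._flush()  # previous <td> was never closed
            self._parts = []
        elif tag == 'br' and self._parts is not None:
            self._flush()  # assumption: <br> separates items
        # Any other tag (e.g. 'room' from <Room Separation>) is ignored,
        # so the text around it stays in the same buffer.

    def handle_endtag(self, tag):
        if tag == 'td' and self._parts is not None:
            self._flush()
            self._parts = None

    def handle_data(self, data):
        if self._parts is not None:
            self._parts.append(data)


def extract_cells(html):
    parser = CellTextParser()
    parser.feed(html)
    parser.close()
    return parser.cells


if __name__ == "__main__":
    sample = ("<table><tr><td> A </td>"
              "<td> 060 : <Room Separation> : Model Lines : id 1 <br>"
              " 060 : <Room Separation> : Model Lines : id 2</td></tr></table>")
    print(extract_cells(sample))
```

Because whitespace is collapsed, the items come out without the leading and trailing spaces shown in the desired output; drop the split/join in _flush if the original spacing matters.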

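0 votes Bill Bell 8/18/2017

You can check my first assertion by opening an editor and putting my result on the line directly below your desired result. That is what I just did.

0 votes Watarap 8/18/2017

I compared the two as you suggested and did not really find any difference. They are the same, letter for letter.

0 votes Bill Bell 8/18/2017

I meant that you should compare what you want with what my script produces.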
