How to iterate over an HTML file and parse specific data into a DataFrame?

Question:

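I've looked at various approaches, from XML parsers to BeautifulSoup, and I think there must be a simpler way to iterate through an HTML file and parse the information into a DataFrame table. There is a lot of information under specific chapter headings: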

<h2 class="chapter-header-western">CHAPTER 1</h2>
    <p class="western" style="line-height: 100%; margin-bottom: 0.08in"><b>1</b>text
    <p class="western" style="line-height: 100%; margin-bottom: 0.08in"><b>Header</b></p>
    <p align="left" style="line-height: 100%; margin-bottom: 0.08in"><b>2</b>text
    </p>
    <p align="left" style="line-height: 120%; margin-left: 0.3in; text-indent: -0.3in; margin-bottom: 0.08in">
    <b>3 </b>text
    </p>
    <p class="western" style="line-height: 100%; margin-bottom: 0.08in">text <b>4 </b>text
    <b>5 </b>text<b>6 </b>text
    </p>

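The HTML was converted from a docx file and is a bit messy, but all I need to do is parse each piece of text that follows a bold number (<b>#</b>) into its own row: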

Chapter	Number	Text
1	1	text
1	2	text
1	3	text

Perhaps I need to treat the <b>#</b> tag as a delimiter?

I tried using BeautifulSoup's find_all, but that only returns the strings between the tags. I need a way to return the text that follows a set of tags.
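To make the distinction concrete, here is a minimal sketch. The one-line sample and the variable names are hypothetical and not taken from the original post. It shows that the text inside each `<b>` tag is only the number, while the tag's `.next_sibling` is the text node that follows it.

```python
from bs4 import BeautifulSoup

# Hypothetical one-paragraph sample, much simpler than the docx export
sample = '<p><b>1</b>first text <b>2 </b>second text</p>'
soup = BeautifulSoup(sample, 'html.parser')

bold_tags = soup.find_all('b')

# Text *inside* each <b> tag: just the numbers
inside = [b.get_text(strip=True) for b in bold_tags]

# Text node immediately *after* each <b> tag
after = [str(b.next_sibling).strip() for b in bold_tags]

print(inside)
print(after)
```

This only works when a text node directly follows the tag; if another tag comes next, `.next_sibling` is that tag rather than a string.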

python dataframe web-scraping beautifulsoup html-parsing

Example

import pandas as pd
from bs4 import BeautifulSoup
html = '''
<h2 class="chapter-header-western">CHAPTER 1</h2>
    <p class="western" style="line-height: 100%; margin-bottom: 0.08in"><b>1</b>text
    <p class="western" style="line-height: 100%; margin-bottom: 0.08in"><b>Header</b></p>
    <p align="left" style="line-height: 100%; margin-bottom: 0.08in"><b>2</b>text
    </p>
    <p align="left" style="line-height: 120%; margin-left: 0.3in; text-indent: -0.3in; margin-bottom: 0.08in">
    <b>3 </b>text
    </p>
    <p class="western" style="line-height: 100%; margin-bottom: 0.08in">text <b>4 </b>text
    <b>5 </b>text<b>6 </b>text
    </p>
'''
# name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(html, 'html.parser')

data = []

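for e in soup.select('b'):
    # keep only bold tags that hold a number; this skips <b>Header</b>
    if e.get_text(strip=True).isnumeric():
        data.append({
            # the nearest preceding <h2> reads "CHAPTER 1"; keep its last word
            'chapter': e.find_previous('h2').get_text(strip=True).split()[-1],
            'number': e.get_text(strip=True),
            # the text node right after the <b> tag, if the next node is text
            'text': e.next_sibling.strip() if isinstance(e.next_sibling, str) else None
        })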

df = pd.DataFrame(data)
print(df)

Output

	chapter	number	text
0	1	1	text
1	1	2	text
2	1	3	text
3	1	4	text
4	1	5	text
5	1	6	text
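The example above reads only the single text node after each bold number. If the text after a number can contain inline markup, or the file holds several chapters, one possible extension is sketched below. The function name `parse_numbered_text`, the `STOP_TAGS` set, and the sample HTML are assumptions made for illustration and are not part of the original answer.

```python
import pandas as pd
from bs4 import BeautifulSoup, Tag

# Assumption: a new <b>, <p>, or <h2> marks the end of the current entry
STOP_TAGS = {'b', 'p', 'h2'}

def parse_numbered_text(html):
    """Return one dict per bold number: chapter, number, following text."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for b in soup.find_all('b'):
        number = b.get_text(strip=True)
        if not number.isnumeric():
            continue  # skip bold headers such as <b>Header</b>
        heading = b.find_previous('h2')
        chapter = heading.get_text(strip=True).split()[-1] if heading else None
        # Gather following siblings until the next stop tag
        parts = []
        for node in b.next_siblings:
            if isinstance(node, Tag):
                if node.name in STOP_TAGS:
                    break
                parts.append(node.get_text())  # inline tag such as <i>
            else:
                parts.append(str(node))
        # Collapse runs of whitespace left over from the docx export
        text = ' '.join(''.join(parts).split())
        rows.append({'chapter': chapter, 'number': number, 'text': text})
    return rows

# Hypothetical sample with two chapters and one inline <i> tag
sample = '''
<h2>CHAPTER 1</h2>
<p><b>1</b>first <i>verse</i> text</p>
<p><b>Header</b></p>
<p>lead-in <b>2 </b>second text <b>3 </b>third text</p>
<h2>CHAPTER 2</h2>
<p><b>1</b>new chapter text</p>
'''

rows = parse_numbered_text(sample)
df = pd.DataFrame(rows)
print(df)
```

Stopping at `<p>` as well as `<b>` is a deliberate choice for messy exports: when a paragraph is left unclosed, `html.parser` nests the following paragraphs inside it, and breaking at the nested `<p>` keeps their text out of the current row.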
