Asked by: 30ThreeDegrees Asked: 5/28/2023 Updated: 5/28/2023 Views: 47
How to parse a specific part of html table data using pandas
Q:
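I have been learning how to scrape web pages with pandas, but I have hit a snag: I can't extract one specific piece of data from inside the table.
This is the HTML that pandas is parsing: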
<tr data-country="Bulgaria">
<td><i aria-hidden="true" class="circle-country-flags-22 flags-22-bulgaria display-inline-block"></i>
<a title="Bulgaria Economic Calendar" href="https://www.myfxbook.com/forex-economic-calendar/bulgaria">Bulgaria</a></td>
<td>BNB</td>
<td> <a title="Bulgaria Interest Rates" href="https://www.myfxbook.com/forex-economic-calendar/bulgaria/interest-rate-decision">Bulgarian National Bank</a> </td>
<td class="green"> 2.17% </td>
<td>1.82%</td>
<td> 35bp </td>
<td data-custom-date="2023-04-28 00:00:00.0">Apr 28, 2023</td>
<td data-custom-date="2023-05-29 10:00:00.0">1 day</td>
</tr>
This is what my response array looks like:
{'Central Bank': 'Bulgarian National Bank',
'Change': '35bp',
'Country': 'Bulgaria',
'Current Rate': '2.17%',
'Last Meeting': 'Apr 28, 2023',
'Next Meeting': '1 day',
'Previous Rate': '1.82%',
'Unnamed: 1': 'BNB'}
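The row I am specifically looking at is the "1 day" one.
I am trying to get "2023-05-29 10:00:00.0" in the response instead of "1 day".
This is the code I have written so far: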
import pandas as pd
import requests
import pprint
from datetime import datetime, timedelta
url = "https://www.myfxbook.com/forex-economic-calendar/interest-rates"
r = requests.get(url)
tables = pd.read_html(r.text) # this parses all the tables in webpages to a list
# Extract the first table from the list of parsed tables
parsed_table = tables[0]
# Convert DataFrame to list of dictionaries
list_of_dicts = parsed_table.to_dict(orient='records')
# Print the list of dictionaries
data = []
for row in list_of_dicts:
    data.append(row)
pp = pprint.PrettyPrinter(depth=4)
pp.pprint(data)
I have been searching the internet but have not found a solution so far, so any help with this would be appreciated.
A:
1 upvote
Andrej Kesely
5/28/2023
#1
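A simple solution is to use an HTML parser (e.g. beautifulsoup) and replace the text of the <td> tags with the value of their data-custom-date attribute. Then use pd.read_html to get the dataframe: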
import pprint
from io import StringIO
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = "https://www.myfxbook.com/forex-economic-calendar/interest-rates"
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
# select all tags with data-custom-date= attribute
for tag in soup.select('[data-custom-date]'):
    # replace the text of these tags with value of this attribute
    tag.string.replace_with(tag['data-custom-date'])
# wrap the markup in StringIO: newer pandas versions deprecate passing a literal HTML string
parsed_table = pd.read_html(StringIO(str(soup)))[0]
data = parsed_table.to_dict(orient="records")
pp = pprint.PrettyPrinter(depth=4)
pp.pprint(data)
Prints:
[{'Central Bank': 'Bulgarian National Bank',
'Change': '35bp',
'Country': 'Bulgaria',
'Current Rate': '2.17%',
'Last Meeting': '2023-04-28 00:00:00.0',
'Next Meeting': '2023-05-29 10:00:00.0',
'Previous Rate': '1.82%',
'Unnamed: 1': 'BNB'},
{'Central Bank': 'Central Bank of Kenya',
'Change': '75bp',
'Country': 'Kenya',
'Current Rate': '9.5%',
'Last Meeting': '2023-03-29 00:00:00.0',
'Next Meeting': '2023-05-29 13:30:00.0',
'Previous Rate': '8.75%',
'Unnamed: 1': 'CBK'},
{'Central Bank': 'National Bank of the Kyrgyz Republic',
'Change': '0bp',
'Country': 'Kyrgyzstan',
'Current Rate': '13.0%',
...and so on.
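If you want to try the idea without hitting the site, here is a minimal offline sketch. It assumes a trimmed-down table: the Bulgaria row is taken from the question, but the <table> wrapper, the header row, and the reduced set of columns are my own stand-ins and not the page's real markup. It also assigns to tag.string instead of calling replace_with, and converts the two meeting columns with pd.to_datetime, which the original answer does not do. Like the code above, pd.read_html needs a parser backend such as lxml installed.

```python
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup

# Assumption: a simplified stand-in for the real page. Only the Bulgaria
# row comes from the question; the wrapper and header row are made up.
HTML = """
<table>
  <thead>
    <tr>
      <th>Country</th><th>Central Bank</th><th>Current Rate</th>
      <th>Last Meeting</th><th>Next Meeting</th>
    </tr>
  </thead>
  <tbody>
    <tr data-country="Bulgaria">
      <td>Bulgaria</td>
      <td>Bulgarian National Bank</td>
      <td class="green"> 2.17% </td>
      <td data-custom-date="2023-04-28 00:00:00.0">Apr 28, 2023</td>
      <td data-custom-date="2023-05-29 10:00:00.0">1 day</td>
    </tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Overwrite the visible text of every cell that carries the attribute
# with the attribute's value, so pandas sees the full timestamp.
for tag in soup.select("[data-custom-date]"):
    tag.string = tag["data-custom-date"]

table = pd.read_html(StringIO(str(soup)))[0]

# Optional extra step: turn the timestamp strings into real datetimes.
for column in ("Last Meeting", "Next Meeting"):
    table[column] = pd.to_datetime(table[column])

print(table[["Country", "Last Meeting", "Next Meeting"]])
```

Assigning to tag.string replaces whatever the cell contains, so it should also cope with a cell whose text is wrapped in another tag, where tag.string would be None and replace_with would fail.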
Comments
1 upvote
30ThreeDegrees
5/28/2023
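Thank you so much! After seeing your implementation I have been reading up on bs4 this morning, and I can now see what I was missing. Thanks again, I really appreciate it; your reply was exactly what I was looking for.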