Trying to use pd.read_html to extract information and export data to a Pandas dataframe

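Question:

I am trying to extract information from the tables on this Wikipedia page to automate data collection.

Page link: https://en.wikipedia.org/wiki/List_of_members_of_the_17th_Lok_Sabha

I am using pd.read_html to read the information into a dataframe. I would like to be able to export the data to a .csv file so that it can be stored and analyzed.

When I convert the data to a regular dataframe, I can only extract information from the first table. I would like to be able to extract information from all of the tables on the page.

I could create a separate dataframe for each table, but that is not very efficient, nor is it best practice.

This is the code I am using to extract the information, but it only returns one table.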

import pandas as pd
import requests
from bs4 import BeautifulSoup

members = []

wikiUrl = 'https://en.wikipedia.org/wiki/List_of_members_of_the_17th_Lok_Sabha'
page = requests.get(wikiUrl)
soup = BeautifulSoup(page.text, 'html.parser')
tables = soup.find_all('table', {'class': 'wikitable sortable'})
for table in tables:
    members.append(table)

frame = pd.read_html(str(members))

df=pd.DataFrame(frame[0])
print(df)

I tried using just the "frame" variable so that I could process all of the tables.

import pandas as pd
import requests
from bs4 import BeautifulSoup

members = []

wikiUrl = 'https://en.wikipedia.org/wiki/List_of_members_of_the_17th_Lok_Sabha'
page = requests.get(wikiUrl)
soup = BeautifulSoup(page.text, 'html.parser')
tables = soup.find_all('table', {'class': 'wikitable sortable'})
for table in tables:
    members.append(table)

frame = pd.read_html(str(members))

df=pd.DataFrame(frame, dtype=object)
print(df)

df.to_csv('India 17th Lok Sabha.csv')

But it prints the following:

VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
  values = np.array([convert(v) for v in values])
                                                    0
0        #     Constituency  ... Party            ...
1      #    Constituency          Name  Party     ...
2        #              Constituency  ... Party   ...
3        #       Constituency  ... Party          ...
4        #  Constituency              Name  Party ...
5      # Constituency                Name  Party  ...
6        #         Constituency  ... Party        ...
7       #          Constituency  ... Party        ...
8      # Constituency  ... Party                  ...
9        #    Constituency  ... Party             ...
10       #         Constituency  ... Party        ...
11       #        Constituency  ... Party         ...
12       #    Constituency  ... Party             ...
13       #            Constituency  ... Party     ...
14     #        Constituency  ... Party           ...
15     #   Constituency           Name  Party     ...
16     #  Constituency           Name  Party      ...
17     # Constituency  ... Party                  ...
18       #        Constituency  ... Party         ...
19       #          Constituency  ... Party       ...
20       #          Constituency  ... Party       ...
21     # Constituency              Name  Party    ...
22       #       Constituency  ... Party          ...
23       #       Constituency  ... Party          ...
24     #       Constituency             Name  Part...
25       #      Constituency  ... Party           ...
26     #               Constituency  ... Party    ...
27       #          Constituency  ... Party       ...
28     #                 Constituency  ... Party  ...
29     # Constituency         Name  Party         ...
30     #  ...                                 Part...
31     # Constituency  ... Party                  ...
32     # Constituency                     Name  Pa...
33     #      Constituency                  Name  ...
34     #           Constituency             Name  ...
35     # Constituency             Name  Party     ...
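
The output above has a single column 0 in which every cell holds an entire table, which is consistent with pd.read_html returning a list of DataFrames (one per table) rather than one DataFrame. A minimal sketch of the usual alternative, stacking the list with pd.concat, follows. The two small frames and their placeholder values are hypothetical stand-ins for the list that pd.read_html would return, so that the snippet runs without network access.

```python
import pandas as pd

# Hypothetical stand-ins for the list returned by pd.read_html:
# one DataFrame per <table>, all sharing the same column names.
frames = [
    pd.DataFrame({"Constituency": ["Constituency A", "Constituency B"],
                  "Name": ["Member A", "Member B"]}),
    pd.DataFrame({"Constituency": ["Constituency C"],
                  "Name": ["Member C"]}),
]

# The container is a plain Python list, so frames[0] is only the first table.
print(type(frames).__name__, len(frames))

# pd.concat stacks the tables row-wise; ignore_index=True renumbers the rows
# so that the combined frame does not repeat 0, 1, 2, ... for every table.
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

Because the frames share column names, the rows line up under the same headers; columns missing from one table would be filled with NaN.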

Tags: pandas, dataframe, web-scraping, beautifulsoup, html-parsing

Answer:

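from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/List_of_members_of_the_17th_Lok_Sabha"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

dfs = []
for table in soup.select("table.wikitable"):
    # The nearest preceding <h2> names the state or territory. get_text() works
    # whether or not the heading wraps its title in a <span>.
    name = table.find_previous("h2").get_text(strip=True).replace("[edit]", "")
    # read_html returns a list of DataFrames; a single <table> yields one entry.
    # Newer pandas versions expect a file-like object instead of a raw HTML string.
    df = pd.read_html(StringIO(str(table)))[0]
    df["Region"] = name
    # pandas yields two party columns here ("Party" and "Party.1");
    # keep the second one, which holds the party name.
    df = df.drop(columns=["#", "Party"])
    df = df.rename(columns={"Party.1": "Party"})
    dfs.append(df)

# Stack the per-region tables into a single DataFrame.
df_out = pd.concat(dfs, ignore_index=True)

# print first 20 items from the DataFrame (to_markdown needs the tabulate package):
print(df_out.head(20).to_markdown(index=False))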

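Prints:

Constituency	Name	Party	Region
Araku (ST)	Goddeti Madhavi	YSR Congress Party	Andhra Pradesh
Srikakulam	Ram Mohan Naidu Kinjarapu	Telugu Desam Party	Andhra Pradesh
Vizianagaram	Bellana Chandra Sekhar	YSR Congress Party	Andhra Pradesh
Visakhapatnam	M. V. V. Satyanarayana	YSR Congress Party	Andhra Pradesh
Anakapalli	Beesetti Venkata Satyavathi	YSR Congress Party	Andhra Pradesh
Kakinada	Vanga Geetha	YSR Congress Party	Andhra Pradesh
Amalapuram (SC)	Chinta Anuradha	YSR Congress Party	Andhra Pradesh
Rajahmundry	Margani Bharat	YSR Congress Party	Andhra Pradesh
Narasapuram	Raghu Rama Krishna Raju	YSR Congress Party	Andhra Pradesh
Eluru	Kotagiri Sridhar	YSR Congress Party	Andhra Pradesh
Machilipatnam	Vallabhaneni Balashowry	YSR Congress Party	Andhra Pradesh
Vijayawada	Kesineni Srinivas	Telugu Desam Party	Andhra Pradesh
Guntur	Galla Jayadev	Telugu Desam Party	Andhra Pradesh
Narasaraopet	Lavu Sri Krishna Devarayalu	YSR Congress Party	Andhra Pradesh
Bapatla (SC)	Nandigam Suresh	YSR Congress Party	Andhra Pradesh
Ongole	Magunta Sreenivasulu Reddy	YSR Congress Party	Andhra Pradesh
Nandyal	Pocha Brahmananda Reddy	YSR Congress Party	Andhra Pradesh
Kurnool	Sanjeev Kumar	YSR Congress Party	Andhra Pradesh
Anantapur	Talari Rangaiah	YSR Congress Party	Andhra Pradesh
Hindupur	Kuruva Gorantla Madhav	YSR Congress Party	Andhra Pradesh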
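
The question also asks for a .csv export, which the code above stops short of. A minimal sketch of that last step, assuming a combined DataFrame with the four columns shown in the printed output: the two sample rows are copied from that output, and the in-memory buffer is only there so the snippet runs without touching the disk.

```python
import io

import pandas as pd

# Small stand-in for the combined DataFrame (first two rows of the printed output).
df_out = pd.DataFrame(
    {
        "Constituency": ["Araku (ST)", "Srikakulam"],
        "Name": ["Goddeti Madhavi", "Ram Mohan Naidu Kinjarapu"],
        "Party": ["YSR Congress Party", "Telugu Desam Party"],
        "Region": ["Andhra Pradesh", "Andhra Pradesh"],
    }
)

# index=False leaves out the row numbers, which carry no information here.
buffer = io.StringIO()
df_out.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# To write a file instead, pass a path, for example the name used in the question:
# df_out.to_csv("India 17th Lok Sabha.csv", index=False)
```

Reading the file back with pd.read_csv should then give one row per member, with the Region column identifying which table each row came from.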
