Asked by: justjanga · Asked: 4/22/2023 · Updated: 4/23/2023 · Views: 37
Why is my code giving me an AttributeError?
Q:
I'm trying to walk through several levels of HTML to retrieve links related to legislation. However, once I reach the second level of links, instead of getting the list of links associated with individual bills, I get this error:
Exception has occurred: AttributeError
'NoneType' object has no attribute 'startswith'
  File "C:\Users\Justin\Desktop\ilgascrapetest1.py", line 14, in <module>
    if href.startswith('/legislation/BillStatus.asp?'):
       ^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'startswith'
Here is the code so far:
import requests
from bs4 import BeautifulSoup

url = 'https://www.ilga.gov/legislation/default.asp'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Find the House Bills section
house_bills = soup.find('a', {"name": "h_bills"}).parent

# Iterate through all links in the House Bills section
for link in house_bills.find_all('a'):
    href = link.get('href')
    if href.startswith('/legislation/BillStatus.asp?'):
        bill_url = url + href
        bill_response = requests.get(bill_url)
        bill_soup = BeautifulSoup(bill_response.content, 'html.parser')
        # Find the table cell with width
        td = bill_soup.find('td', {'width': '100%'})
        # Iterate through all the <li> elements in table
        for li in td.find_all('li'):
            print(li.text)
I'm able to retrieve the list of links from the "House Bills" table in the first page's HTML and iterate over it, but at the next level, which should give the list of links to individual bills, I get the error instead of the bill links from HB0001 through HB4042. Why am I getting this error?
A:
0 votes
VANN Universal
4/22/2023
#1
There are several <a> elements on this site that have no href, so link.get('href') returns None in those cases. You can't call startswith() on None, so you have to add a check for whether href is None:
import requests
from bs4 import BeautifulSoup

url = 'https://www.ilga.gov/legislation/default.asp'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Find the House Bills section
house_bills = soup.find('a', {"name": "h_bills"}).parent

# Iterate through all links in the House Bills section
for link in house_bills.find_all('a'):
    href = link.get('href')
    if not href:
        continue  # Ignore links without href
    if href.startswith('/legislation/BillStatus.asp?'):
        bill_url = url + href
        bill_response = requests.get(bill_url)
        bill_soup = BeautifulSoup(bill_response.content, 'html.parser')
        # Find the table cell with width
        td = bill_soup.find('td', {'width': '100%'})
        # Iterate through all the <li> elements in table
        for li in td.find_all('li'):
            print(li.text)
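A minimal sketch, using hypothetical markup, of why link.get('href') can return None: named anchors such as <a name="h_bills"> carry no href attribute, and calling startswith() on the resulting None raises exactly the AttributeError shown above.

from bs4 import BeautifulSoup

# Hypothetical markup: a named anchor (no href) followed by an ordinary link.
html = '<a name="h_bills"></a><a href="grplist.asp?num1=1">HB 1 - 100</a>'
soup = BeautifulSoup(html, 'html.parser')

for link in soup.find_all('a'):
    href = link.get('href')  # None for the named anchor, a string for the real link
    print(repr(href))
    # href.startswith('grplist.asp?') would raise AttributeError when href is None,
    # which is what the "if not href: continue" guard above prevents.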
Also, you mixed up the URLs: first you need to open "grplist.asp", and only then do the links start with "BillStatus.asp". To get only the links in the House Bills section, you need to select the div that follows the a named h_bills, not its parent. I also changed your code so that bill_url is no longer built from the full URL containing "/default.asp".
import requests
from bs4 import BeautifulSoup

url = 'https://www.ilga.gov/legislation/default.asp'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Find the House Bills section (next div after a with name "h_bills")
house_bills = soup.find('a', {"name": "h_bills"}).find_next_sibling("div")

# Iterate through all links in the House Bills section
for link in house_bills.find_all('a'):
    href = link.get('href')
    if not href:
        continue  # Ignore links without href
    if href.startswith('grplist.asp?'):
        bill_url = "https://www.ilga.gov/legislation/" + href
        bill_response = requests.get(bill_url)
        if bill_response.status_code != 200:  # Prevent crash when response is not valid
            continue
        bill_soup = BeautifulSoup(bill_response.content, 'html.parser')
        # Find the table cell with width
        td = bill_soup.find('td', {'width': '100%'})
        # Iterate through all the <li> elements in table
        for li in td.find_all('li'):
            print(li.text)
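As a side note on building bill_url: urllib.parse.urljoin can resolve a relative href against the page it came from, which also keeps "/default.asp" out of the result. A minimal sketch, with a hypothetical href value chosen for the demo:

from urllib.parse import urljoin

page_url = 'https://www.ilga.gov/legislation/default.asp'
href = 'grplist.asp?num1=1&num2=100'  # hypothetical relative href
bill_url = urljoin(page_url, href)
print(bill_url)  # https://www.ilga.gov/legislation/grplist.asp?num1=1&num2=100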
Comments
0 votes
justjanga
4/23/2023
Okay, that got rid of the error, but now it doesn't print anything.
0 votes
VANN Universal
4/23/2023
Sorry, I didn't test my answer properly. It now contains working code that retrieves all of the links.
0 votes
justjanga
4/23/2023
Ah, I didn't even think about /default.asp being part of the URL. It's working now, thank you!