Why is my code giving me an AttributeError?

Asked by justjanga · Asked 4/22/2023 · Modified 4/23/2023 · Viewed 37 times

Q:

I'm trying to walk through several levels of HTML to retrieve links related to legislation. However, once I reach the second level of links, instead of retrieving the list of links associated with the individual bills, I get this error:

Exception has occurred: AttributeError
'NoneType' object has no attribute 'startswith'
  File "C:\Users\Justin\Desktop\ilgascrapetest1.py", line 14, in <module>
    if href.startswith('/legislation/BillStatus.asp?'):
       ^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'startswith'

Here is the code so far:

import requests
from bs4 import BeautifulSoup

url = 'https://www.ilga.gov/legislation/default.asp'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Find the House Bills section
house_bills = soup.find('a', {"name": "h_bills"}).parent

# Iterate through all links in the House Bills section
for link in house_bills.find_all('a'):
    href = link.get('href')
    if href.startswith('/legislation/BillStatus.asp?'):
        bill_url = url + href
        bill_response = requests.get(bill_url)
        bill_soup = BeautifulSoup(bill_response.content, 'html.parser')

        # Find the table cell with width
        td = bill_soup.find('td', {'width': '100%'})
        
        # Iterate through all the <li> elements in table
        for li in td.find_all('li'):
            print(li.text)   

I was able to retrieve the list of links from the "House Bills" table in the first page's HTML and iterate through it, but at the next level, which should give the list of links to the individual bills, I get the error instead of the bill links from HB0001 through HB4042. Why am I getting this error?

html python-3.x beautifulsoup html-parsing

Comments

2 votes Carcigenicate 4/22/2023
That means `href` is `None`, which means `link.get('href')` returned `None`; likely meaning that the element doesn't have that attribute.
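To see what the comment describes in isolation: BeautifulSoup's `Tag.get()` behaves like `dict.get()`, returning `None` when the attribute is absent rather than raising. A minimal sketch (using a made-up anchor similar to the named anchors on the ILGA page):

```python
from bs4 import BeautifulSoup

# An <a> with a name but no href, like the section anchors on the page
soup = BeautifulSoup('<a name="h_bills">House Bills</a>', 'html.parser')
link = soup.find('a')

href = link.get('href')
print(href)  # None -- .get() returns None for a missing attribute

# Calling href.startswith(...) here would raise the exact AttributeError
# from the question, because None has no string methods.
```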

A:

0 votes VANN Universal 4/22/2023 #1

There are several `<a>` elements on this site that have no `href`, so `link.get('href')` returns `None` in those cases. You can't call `startswith()` on `None`, so you have to add a check for whether `href` is `None`:

import requests
from bs4 import BeautifulSoup

url = 'https://www.ilga.gov/legislation/default.asp'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Find the House Bills section
house_bills = soup.find('a', {"name": "h_bills"}).parent

# Iterate through all links in the House Bills section
for link in house_bills.find_all('a'):
    href = link.get('href')
    if not href:
        continue  # Ignore links without href
    if href.startswith('/legislation/BillStatus.asp?'):
        bill_url = url + href
        bill_response = requests.get(bill_url)
        bill_soup = BeautifulSoup(bill_response.content, 'html.parser')

        # Find the table cell with width
        td = bill_soup.find('td', {'width': '100%'})
        
        # Iterate through all the <li> elements in table
        for li in td.find_all('li'):
            print(li.text)   

Also, you mixed up the URLs: first you need to open the "grplist.asp" pages, and only there do the links start with "BillStatus.asp". To visit only the links in the House Bills section, you need to select the `div` following the `a` with name `h_bills`, not its parent. I also changed your code so that `bill_url` is no longer built from the full URL containing "/default.asp".

import requests
from bs4 import BeautifulSoup

url = 'https://www.ilga.gov/legislation/default.asp'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Find the House Bills section (next div after a with name "h_bills")
house_bills = soup.find('a', {"name": "h_bills"}).find_next_sibling("div")

# Iterate through all links in the House Bills section
for link in house_bills.find_all('a'):
    href = link.get('href')
    if not href:
        continue  # Ignore links without href

    if href.startswith('grplist.asp?'):
        bill_url = "https://www.ilga.gov/legislation/" + href

        bill_response = requests.get(bill_url)
        if bill_response.status_code != 200:  # Prevent crash when response is not valid
            continue

        bill_soup = BeautifulSoup(bill_response.content, 'html.parser')

        # Find the table cell with width
        td = bill_soup.find('td', {'width': '100%'})
        
        # Iterate through all the <li> elements in table
        for li in td.find_all('li'):
            print(li.text)
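As a side note on the URL handling above: instead of hard-coding the `"https://www.ilga.gov/legislation/"` prefix, the standard library's `urllib.parse.urljoin` can resolve a relative `href` against the page it was found on. A small sketch (not part of the original answer; the query strings are illustrative):

```python
from urllib.parse import urljoin

# The page the hrefs were scraped from
base = 'https://www.ilga.gov/legislation/default.asp'

# A relative href is resolved against the directory of the base page
print(urljoin(base, 'grplist.asp?ChapterID=0'))
# -> https://www.ilga.gov/legislation/grplist.asp?ChapterID=0

# A root-relative href replaces the whole path
print(urljoin(base, '/legislation/BillStatus.asp?DocNum=1'))
# -> https://www.ilga.gov/legislation/BillStatus.asp?DocNum=1
```

This avoids the original bug, where `url + href` produced an invalid address like `.../default.asp/legislation/BillStatus.asp?...`.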

Comments

0 votes justjanga 4/23/2023
OK, that got rid of the error, but now it doesn't print anything.
0 votes VANN Universal 4/23/2023
Sorry, I didn't test my answer properly. It now contains working code that gets all the links.
0 votes justjanga 4/23/2023
Ah, I hadn't even considered /default.asp being part of the url. It's working now, thanks!