Python Web Scraper not populating .txt file with any scraped data

Asked by BlackHeart · 4/1/2023 · Updated 4/1/2023 · Viewed 34 times

Q:

I'm new to Python, but I have a basic idea of what's going on. I'm trying to write a web scraper with BeautifulSoup. I'm scraping a site for a string of numbers and then writing that string to a .txt file so I can come back later and use the .txt file as a dataset.

import requests
from bs4 import BeautifulSoup

# Scrape the links to each monthly results page
url = "http://www.calotteryx.com/Fantasy-5/drawing-results-calendar.htm"
response = requests.get(url)
content = response.content
soup = BeautifulSoup(content, "html.parser")
links = []
for link in soup.find_all("td", class_="td0"):
    href = link.get("href")
    if href:
        links.append(href)

# Scrape the winning numbers for each monthly results page
winning_numbers = []
for link in links:
    response = requests.get(link)
    content = response.content
    soup = BeautifulSoup(content, "html.parser")
    for tag in soup.find_all("div", class_="ball blue5 fcblack1"):
        numbers = tag.text.strip().split()
        winning_numbers.append(numbers)

# Write the winning numbers to a file
with open("winning_numbers.txt", "w") as f:
    for numbers in winning_numbers:
        f.write(" ".join(numbers) + "\n")

That's my code. When I run it I don't get any errors, but I also end up with a blank "winning_numbers.txt" file. Can someone point me in the right direction and tell me what I'm doing wrong here?

python web-scraping beautifulsoup html-parsing

A:

1 upvote · Unmitigated · 4/1/2023 · #1

<td> elements don't have an href attribute; you want the href of the <a> element inside.
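
To see why the original loop collects nothing, here is a minimal sketch using a tiny hand-written HTML fragment (the markup below is illustrative only, not the actual structure of the lottery page):

from bs4 import BeautifulSoup

# Illustrative markup; the real page's table cells may differ
html = '<td class="td0"><a href="results-jan.htm">Jan</a></td>'
soup = BeautifulSoup(html, "html.parser")

td = soup.find("td", class_="td0")
print(td.get("href"))            # None -- the href lives on the <a>, not the <td>
print(td.find("a").get("href"))  # results-jan.htm

Since td.get("href") returns None for every cell, the if href: check never passes, links stays empty, the second loop never runs, and the script finishes without errors but writes nothing to the file.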

Also, these links are relative, so you will need to join each one to the base URL before making further requests (urllib.parse.urljoin can do this).

from urllib.parse import urljoin

# Take the href from the <a> inside each <td> and resolve it
# against the page URL to get an absolute link
for link in soup.find_all("td", class_="td0"):
    anchor = link.find('a')
    if anchor and (href := anchor.get('href')):
        links.append(urljoin(url, href))
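
For reference, urljoin resolves a relative href against the page it was found on; the relative path below is a made-up example, not a real link from the site:

from urllib.parse import urljoin

url = "http://www.calotteryx.com/Fantasy-5/drawing-results-calendar.htm"
href = "drawing-results-january-2023.htm"  # hypothetical relative href
print(urljoin(url, href))
# http://www.calotteryx.com/Fantasy-5/drawing-results-january-2023.htm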