Asked by: BlackHeart  Asked: 4/1/2023  Updated: 4/1/2023  Views: 34
Python Web Scraper not populating .txt file with any scraped data
Q:
I'm new to Python, but I have a basic grasp of what's going on. I'm trying to write a web scraper with BeautifulSoup. I'm scraping a site for strings of numbers and then writing those number strings to a .txt file, so that I can come back later and use the .txt file as a dataset.
import requests
from bs4 import BeautifulSoup

# Scrape the links to each monthly results page
url = "http://www.calotteryx.com/Fantasy-5/drawing-results-calendar.htm"
response = requests.get(url)
content = response.content
soup = BeautifulSoup(content, "html.parser")

links = []
for link in soup.find_all("td", class_="td0"):
    href = link.get("href")
    if href:
        links.append(href)

# Scrape the winning numbers for each monthly results page
winning_numbers = []
for link in links:
    response = requests.get(link)
    content = response.content
    soup = BeautifulSoup(content, "html.parser")
    for tag in soup.find_all("div", class_="ball blue5 fcblack1"):
        numbers = tag.text.strip().split()
        winning_numbers.append(numbers)

# Write the winning numbers to a file
with open("winning_numbers.txt", "w") as f:
    for numbers in winning_numbers:
        f.write(" ".join(numbers) + "\n")
This is my code. When I run it I don't get any errors, but I also end up with a blank "winning_numbers.txt" file. Can someone point me in the right direction and tell me what I'm doing wrong here?
A:
1 vote
Unmitigated
4/1/2023
#1
<td> elements do not have an href attribute; you want the href of the <a> element inside.
Also, these links are relative, so you need to join them with the base URL (you can use urllib.parse.urljoin) to make further requests.
from urllib.parse import urljoin

for link in soup.find_all("td", class_="td0"):
    anchor = link.find('a')
    if anchor and (href := anchor.get('href')):
        links.append(urljoin(url, href))
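To see why this works without hitting the network, here is a minimal, self-contained sketch of the same extraction run against hypothetical sample HTML (the hrefs and markup here are made up for illustration; the real page's structure may differ):

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hypothetical sample markup mimicking the calendar page's <td class="td0"> cells
sample_html = """
<table>
  <tr><td class="td0"><a href="results-jan-2023.htm">Jan</a></td></tr>
  <tr><td class="td0"><a href="results-feb-2023.htm">Feb</a></td></tr>
  <tr><td class="td0">no link in this cell</td></tr>
</table>
"""

url = "http://www.calotteryx.com/Fantasy-5/drawing-results-calendar.htm"
soup = BeautifulSoup(sample_html, "html.parser")

links = []
for td in soup.find_all("td", class_="td0"):
    anchor = td.find("a")  # the href lives on the <a>, not the <td>
    if anchor and anchor.get("href"):
        # urljoin resolves the relative href against the page URL
        links.append(urljoin(url, anchor.get("href")))

print(links)
```

Note that the third cell, which has no <a> inside, is skipped rather than appending None, and each relative href is resolved to an absolute URL that requests.get can fetch.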
Comments