Asked by: K Max  Asked: 10/4/2023  Last edited by: Andreas Violaris, K Max  Updated: 10/4/2023  Views: 59
Scraping a website using requests and LXML in Python
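Q:
I am trying to scrape this website to retrieve the title and body content ("Description" and "Features") as well as the PDF link. However, when I try to extract the text with the XPath /html/body/center[2]/table/tbody/tr[3]/td/font/text(), I get an empty list, even though, as you can see in the screenshot, there is a block of text after /font.
Here is my code: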
import requests
from lxml import html
from urllib.request import urlretrieve

url = "https://datasheetspdf.com/pdf/1060035/GME/PC817/1"

try:
    # Send an HTTP GET request to the URL
    response = requests.get(url)
    response.raise_for_status()

    # Parse the HTML content of the page using lxml
    page_content = html.fromstring(response.text)

    # Extract the title using XPath
    title_element = page_content.xpath("/html/body/center[2]/table/tbody/tr[2]/td/strong/font")
    title = title_element[0].text_content() if title_element else "Title not found"

    # Extract the body using XPath
    body_elements = page_content.xpath("/html/body/center[2]/table/tbody/tr[3]/td/font/text()")
    body = "\n".join(body_elements) if body_elements else "Body not found"

    # Extract the download link
    download_link_element = page_content.xpath('//a[starts-with(@href, "/pdf-file/1060035/GME/PC817/1")]')
    if download_link_element:
        download_link = download_link_element[0].attrib['href']
        download_url = f"https://datasheetspdf.com{download_link}"
    else:
        download_url = "Download link not found"

    # Download the file
    file_name = "PC817_datasheet.pdf"
    urlretrieve(download_url, file_name)

    print(f"Title: {title}")
    print(f"Body:\n{body}")
    print(f"Downloaded {file_name} successfully.")
except requests.exceptions.RequestException as e:
    print(f"Error: {e}")
except Exception as e:
    print(f"An error occurred: {e}")
Any help is appreciated.
A:
1 upvote
Barry the Platipus
10/4/2023
#1
To get that table, you can probably just use pandas:
import pandas as pd
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_columns', None)
df = pd.read_html('https://datasheetspdf.com/pdf/1060035/GME/PC817/1')[2]
print(df)
Result in terminal:
0 1
0 Part PC817
1 Description 4 PIN DIP PHOTOTRANSISTOR PHOTOCOUPLER
2 Feature Production specification 4 PIN DIP PHOTOTRANSISTOR PHOTOCOUPLER FEATURES z Current transfer ratio (CTR:50%-600% at IF=5mA,VCE=5V) z High isolation voltage between inputc and output (Viso=5000V rms) z Creepage distance>7.62mm z Pb free and ROHS compliant z UL/CUL Approved (File No. E340048) PC817 Series Description The PC817 series of devices each consist of an infrared Emitting diodes, optically coupled to a phototransistor detector. They are packaged in a 4-pin DIP package and available in Wide-lead spacing and SMD option. DIP4L APPLICATIONS z Programmable controllers z System appliances.
3 Manufacture GME
4 Datasheet Download PC817 Datasheet
The pandas documentation can be found here.
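If individual fields are needed rather than the printed table, the two-column frame can be turned into a dictionary. This is only a minimal sketch: it assumes the frame has the same label/value layout as the output above, the inline DataFrame is a stand-in so the snippet runs without network access, and the name fields is made up for illustration.

```python
import pandas as pd

# Stand-in for pd.read_html(url)[2]: column 0 holds labels, column 1 holds values.
df = pd.DataFrame(
    [
        ["Part", "PC817"],
        ["Description", "4 PIN DIP PHOTOTRANSISTOR PHOTOCOUPLER"],
        ["Manufacture", "GME"],
    ]
)

# Map each label in column 0 to the value next to it in column 1.
fields = dict(zip(df[0], df[1]))

print(fields["Part"])
print(fields["Description"])
```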
EDIT: To download the actual PDF file:
import requests
from bs4 import BeautifulSoup as bs

# Fetch the datasheet page and find the link to the intermediary PDF page
r = requests.get('https://datasheetspdf.com/pdf/1060035/GME/PC817/1')
intermediary_url = 'https://datasheetspdf.com' + bs(r.text, 'html.parser').select_one('a[href^="/pdf-file/"]').get('href')

# Read the real PDF URL from the iframe with class "pdfif" on the intermediary page
r = requests.get(intermediary_url)
true_pdf_url = bs(r.text, 'html.parser').select_one('iframe[class="pdfif"]').get('src')

# Download the PDF and write it to disk
with requests.get(true_pdf_url, stream=True) as r:
    with open('pdf_file.pdf', 'wb') as f:
        f.write(r.content)
print('done')
The file will be downloaded as pdf_file.pdf to the same folder you are running the code from.
For the Requests documentation, go here.
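As a side note, the intermediary URL above is built by string concatenation. A small sketch of an alternative using urljoin from the standard library, which also copes with an href that is already absolute; the href value here is hard-coded from the link shown in this thread instead of being scraped:

```python
from urllib.parse import urljoin

page_url = "https://datasheetspdf.com/pdf/1060035/GME/PC817/1"
href = "/pdf-file/1060035/GME/PC817/1"  # hard-coded here; normally read from the <a> tag

# Resolve the root-relative href against the page URL.
intermediary_url = urljoin(page_url, href)
print(intermediary_url)
```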
2 upvotes
Andreas Violaris
10/4/2023
#2
This web page is not well structured, so XPath may not be the best approach. I would recommend Beautiful Soup:
import requests
from bs4 import BeautifulSoup

url = "https://datasheetspdf.com/pdf/1060035/GME/PC817/1"
response = requests.get(url)
html_content = response.text
soup = BeautifulSoup(html_content, 'html.parser')

rows = soup.find_all('tr')
for row in rows:
    th = row.find('th')
    if th:
        header_text = th.get_text()
        if header_text == "Part":
            part = row.find('td').get_text()
        elif header_text == "Description":
            description = row.find('td').get_text()
        elif header_text == "Feature":
            feature = row.find('td').get_text()
        elif header_text == "Manufacture":
            manufacture = row.find('td').get_text()
        elif header_text == "Datasheet":
            datasheet = row.find('a')['href']

print("Part:", part)
print("Description:", description)
print("Feature:", feature)
print("Manufacture:", manufacture)
print("Datasheet:", f'https://datasheetspdf.com{datasheet}')
Output:
Part: PC817
Description: 4 PIN DIP PHOTOTRANSISTOR PHOTOCOUPLER
Feature: Production specification4 PIN DIP PHOTOTRANSISTOR PHOTOCOUPLERFEATURESz Current transfer ratio (CTR:50%-600% at IF=5mA,VCE=5V)z High isolation voltage between inputc and output (Viso=5000V rms)z Creepage distance>7.62mm z Pb free and ROHS compliant z UL/CUL Approved (File No. E340048)PC817 SeriesDescriptionThe PC817 series of devices each consist of an infrared Emitting diodes, optically coupled to a phototransistor detector. They are packaged in a 4-pin DIP package and available in Wide-lead spacing and SMD option.DIP4LAPPLICATIONSz Programmable controllers z System appliances.
Manufacture: GME
Datasheet: https://datasheetspdf.com/pdf-file/1060035/GME/PC817/1
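In the Feature line of the output some words run together ("specification4 PIN"). One possible fix is to pass a separator to get_text(). The sketch below uses made-up markup, since it is only an assumption that the real cell separates its lines with <br> tags:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: an assumption about how the Feature cell might be structured.
sample = (
    "<table><tr><th>Feature</th>"
    "<td>Production specification<br>4 PIN DIP<br>FEATURES</td></tr></table>"
)
cell = BeautifulSoup(sample, "html.parser").find("td")

# Default call: text nodes are concatenated with no separator.
joined = cell.get_text()

# With a separator, each text node is stripped and joined by a single space.
spaced = cell.get_text(separator=" ", strip=True)

print(joined)
print(spaced)
```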
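A possible cause of the empty list in the question, offered as an assumption that nobody in this thread confirmed: browsers add a tbody element to tables when building the DOM, so an XPath copied from the browser inspector contains /tbody, while lxml parses the HTML as served and does not add it. The sketch below shows the effect on a made-up table, not on the real page:

```python
from lxml import html

# Made-up markup with no <tbody>, the way many servers actually send tables.
SAMPLE = (
    "<html><body><table>"
    "<tr><td><font>Title</font></td></tr>"
    "<tr><td><font>Body text</font></td></tr>"
    "</table></body></html>"
)

tree = html.fromstring(SAMPLE)

# XPath as a browser inspector would report it (with tbody): no match in lxml.
with_tbody = tree.xpath("/html/body/table/tbody/tr[2]/td/font/text()")

# Same path without tbody: matches the text node.
without_tbody = tree.xpath("/html/body/table/tr[2]/td/font/text()")

print(with_tbody)
print(without_tbody)
```

If that is what happens on the real page, dropping /tbody from the paths, or using // to skip over it, would be worth trying before switching libraries.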