Scraping a website using requests and LXML in Python

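Q:

I am trying to scrape this website to retrieve the title and body content ("Description" and "Features") as well as the PDF link. However, when I try to extract the text with the XPath /html/body/center[2]/table/tbody/tr[3]/td/font/text(), I get an empty list, even though, as you can see in the following screenshot, there is a block of text after /font.

Here is my code: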

import requests
from lxml import html
from urllib.request import urlretrieve


url = "https://datasheetspdf.com/pdf/1060035/GME/PC817/1"

try:
    # Send an HTTP GET request to the URL
    response = requests.get(url)
    response.raise_for_status()

    # Parse the HTML content of the page using lxml
    page_content = html.fromstring(response.text)

    # Extract the title using XPath
    title_element = page_content.xpath("/html/body/center[2]/table/tbody/tr[2]/td/strong/font")
    title = title_element[0].text_content() if title_element else "Title not found"

    # Extract the body using XPath
    body_elements = page_content.xpath("/html/body/center[2]/table/tbody/tr[3]/td/font/text()")
    body = "\n".join(body_elements) if body_elements else "Body not found"

    # Extract the download link
    download_link_element = page_content.xpath('//a[starts-with(@href, "/pdf-file/1060035/GME/PC817/1")]')
    if download_link_element:
        download_link = download_link_element[0].attrib['href']
        download_url = f"https://datasheetspdf.com{download_link}"
    else:
        download_url = "Download link not found"

    # Download the file
    file_name = "PC817_datasheet.pdf"
    urlretrieve(download_url, file_name)
    print(f"Title: {title}")
    print(f"Body:\n{body}")
    print(f"Downloaded {file_name} successfully.")

except requests.exceptions.RequestException as e:
    print(f"Error: {e}")
except Exception as e:
    print(f"An error occurred: {e}")

I appreciate any help.

python web-scraping python-requests lxml urlretrieve

0   1
0   Part    PC817
1   Description 4 PIN DIP PHOTOTRANSISTOR PHOTOCOUPLER
2   Feature Production specification 4 PIN DIP PHOTOTRANSISTOR PHOTOCOUPLER FEATURES z Current transfer ratio （CTR：50%-600% at IF=5mA,VCE=5V） z High isolation voltage between inputc and output （Viso=5000V rms） z Creepage distance＞7.62mm z Pb free and ROHS compliant z UL/CUL Approved (File No. E340048) PC817 Series Description The PC817 series of devices each consist of an infrared Emitting diodes, optically coupled to a phototransistor detector. They are packaged in a 4-pin DIP package and available in Wide-lead spacing and SMD option. DIP4L APPLICATIONS z Programmable controllers z System appliances.
3   Manufacture GME
4   Datasheet   Download PC817 Datasheet

See the pandas documentation for details.

Edit: to download the actual PDF file:
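The code that produced the table above did not survive on this page. Its shape (integer column labels 0 and 1, one row per field) is consistent with what pandas.read_html returns for a table whose rows pair a th with a td, so the following is only a guess at the approach, not the original answer's code. To stay runnable offline it parses a small hand-written HTML string (SAMPLE_HTML, an assumption about the page's markup) instead of fetching the site. Note that read_html needs lxml, or bs4 with html5lib, installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical, simplified stand-in for the page's table; the real markup may differ.
SAMPLE_HTML = """
<table>
  <tr><th>Part</th><td>PC817</td></tr>
  <tr><th>Manufacture</th><td>GME</td></tr>
</table>
"""

# read_html returns a list with one DataFrame per <table> it finds.
# Wrapping the markup in StringIO avoids the deprecation of literal HTML strings.
tables = pd.read_html(StringIO(SAMPLE_HTML))
df = tables[0]

print(df)
```

For the live page, the same call could probably be given StringIO(requests.get(url).text) instead of the sample string.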

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://datasheetspdf.com/pdf/1060035/GME/PC817/1')
intermediary_url = 'https://datasheetspdf.com' + bs(r.text, 'html.parser').select_one('a[href^="/pdf-file/"]').get('href')
r = requests.get(intermediary_url)
true_pdf_url = bs(r.text, 'html.parser').select_one('iframe[class="pdfif"]').get('src')
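# Stream the PDF to disk in chunks instead of loading it all into memory
with requests.get(true_pdf_url, stream=True) as r:
    r.raise_for_status()
    with open('pdf_file.pdf', 'wb') as f:
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)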
print('done')

The file will be saved as pdf_file.pdf in the same folder the code is run from. See the Requests documentation for details.
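The snippet above builds the intermediary URL by prepending the site root to the href. That works here because the href starts with /, but a more general way to resolve a link against the page it came from is urllib.parse.urljoin from the standard library. A minimal sketch; the helper name resolve_link is made up for illustration:

```python
from urllib.parse import urljoin

PAGE_URL = "https://datasheetspdf.com/pdf/1060035/GME/PC817/1"


def resolve_link(page_url: str, href: str) -> str:
    """Resolve an href found on page_url into an absolute URL."""
    return urljoin(page_url, href)


# A root-relative href is resolved against the scheme and host of the page.
print(resolve_link(PAGE_URL, "/pdf-file/1060035/GME/PC817/1"))
# An href that is already absolute is returned unchanged.
print(resolve_link(PAGE_URL, "https://example.com/file.pdf"))
```

This also covers the case where the iframe's src turns out to be relative rather than absolute, which the code above does not check for.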

2 upvotes  Andreas Violaris  10/4/2023  #2

This web page is poorly structured, so XPath may not be the best approach. I would recommend Beautiful Soup:

import requests
from bs4 import BeautifulSoup

url = "https://datasheetspdf.com/pdf/1060035/GME/PC817/1"

response = requests.get(url)
html_content = response.text
soup = BeautifulSoup(html_content, 'html.parser')
rows = soup.find_all('tr')

for row in rows:
    th = row.find('th')
    if th:
        header_text = th.get_text()
        if header_text == "Part":
            part = row.find('td').get_text()
        elif header_text == "Description":
            description = row.find('td').get_text()
        elif header_text == "Feature":
            feature = row.find('td').get_text()
        elif header_text == "Manufacture":
            manufacture = row.find('td').get_text()
        elif header_text == "Datasheet":
            datasheet = row.find('a')['href']

print("Part:", part)
print("Description:", description)
print("Feature:", feature)
print("Manufacture:", manufacture)
print("Datasheet:", f'https://datasheetspdf.com{datasheet}')

Output:

Part: PC817
Description: 4 PIN DIP PHOTOTRANSISTOR PHOTOCOUPLER
Feature: Production specification4 PIN DIP PHOTOTRANSISTOR PHOTOCOUPLERFEATURESz Current transfer ratio （CTR：50%-600% at IF=5mA,VCE=5V）z High isolation voltage between inputc and output （Viso=5000V rms）z Creepage distance＞7.62mm z Pb free and ROHS compliant z UL/CUL Approved (File No. E340048)PC817 SeriesDescriptionThe PC817 series of devices each consist of an infrared Emitting diodes, optically coupled to a phototransistor detector. They are packaged in a 4-pin DIP package and available in Wide-lead spacing and SMD option.DIP4LAPPLICATIONSz Programmable controllers z System appliances.
Manufacture: GME
Datasheet: https://datasheetspdf.com/pdf-file/1060035/GME/PC817/1
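If you would rather stay with lxml, one likely reason the original XPath returns an empty list is the tbody step. Browsers insert tbody into the DOM when the source HTML omits it, so a path copied from the dev tools can include an element that lxml never sees. I have not checked the raw HTML of this page, so treat that as an assumption. The sketch below uses a hand-written sample (SAMPLE_HTML, not the real page) to show the effect and a label-based XPath that does not depend on the exact nesting; cell_text is a made-up helper name.

```python
from lxml import html

# Hypothetical, simplified stand-in for the page's markup; the real page may differ.
SAMPLE_HTML = """
<html><body>
<table>
  <tr><th>Part</th><td>PC817</td></tr>
  <tr><th>Description</th><td>4 PIN DIP PHOTOTRANSISTOR PHOTOCOUPLER</td></tr>
  <tr><th>Datasheet</th>
      <td><a href="/pdf-file/1060035/GME/PC817/1">Download PC817 Datasheet</a></td></tr>
</table>
</body></html>
"""

tree = html.fromstring(SAMPLE_HTML)

# The source has no <tbody>, and lxml does not add one, so this matches nothing.
with_tbody = tree.xpath("/html/body/table/tbody/tr")
# Dropping the tbody step (or using //) finds the rows.
without_tbody = tree.xpath("/html/body/table/tr")


def cell_text(doc, label):
    """Return the text of the <td> in the row whose <th> equals label, or None."""
    cells = doc.xpath("//tr[th[normalize-space()=$label]]/td", label=label)
    return cells[0].text_content().strip() if cells else None


# Select the href attribute of the download link directly.
pdf_hrefs = tree.xpath('//a[starts-with(@href, "/pdf-file/")]/@href')

print(len(with_tbody), len(without_tbody))
print(cell_text(tree, "Part"))
print(cell_text(tree, "Description"))
print(pdf_hrefs)
```

Looking rows up by their header text is less brittle than positional steps like center[2] or tr[3], which break as soon as the layout shifts.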
