Scraping a website using requests and LXML in Python

Asked by: K Max | Asked: 10/4/2023 | Last edited by: Andreas Violaris, K Max | Updated: 10/4/2023 | Views: 59

Q:

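I am trying to scrape this website to retrieve the title and the body content ("Description" and "Features") as well as the PDF link. However, when I try to extract the text using the XPath /html/body/center[2]/table/tbody/tr[3]/td/font/text(), I get an empty list, even though the page inspector (screenshot in the original post) shows a block of text after /font.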

Here is my code:

import requests
from lxml import html
from urllib.request import urlretrieve


url = "https://datasheetspdf.com/pdf/1060035/GME/PC817/1"

try:
    # Send an HTTP GET request to the URL
    response = requests.get(url)
    response.raise_for_status()

    # Parse the HTML content of the page using lxml
    page_content = html.fromstring(response.text)

    # Extract the title using XPath
    title_element = page_content.xpath("/html/body/center[2]/table/tbody/tr[2]/td/strong/font")
    title = title_element[0].text_content() if title_element else "Title not found"

    # Extract the body using XPath
    body_elements = page_content.xpath("/html/body/center[2]/table/tbody/tr[3]/td/font/text()")
    body = "\n".join(body_elements) if body_elements else "Body not found"

    # Extract the download link
    download_link_element = page_content.xpath('//a[starts-with(@href, "/pdf-file/1060035/GME/PC817/1")]')
    if download_link_element:
        download_link = download_link_element[0].attrib['href']
        download_url = f"https://datasheetspdf.com{download_link}"
    else:
        download_url = "Download link not found"

    # Download the file
    file_name = "PC817_datasheet.pdf"
    urlretrieve(download_url, file_name)
    print(f"Title: {title}")
    print(f"Body:\n{body}")
    print(f"Downloaded {file_name} successfully.")

except requests.exceptions.RequestException as e:
    print(f"Error: {e}")
except Exception as e:
    print(f"An error occurred: {e}")

I would appreciate any help.

Tags: python, web-scraping, python-requests, lxml, urlretrieve


A:

1 upvote | Barry the Platipus | 10/4/2023 | #1

To get that table, you can simply use pandas:

import pandas as pd
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_columns', None)

df = pd.read_html('https://datasheetspdf.com/pdf/1060035/GME/PC817/1')[2]
print(df)

Result in terminal:

0   1
0   Part    PC817
1   Description 4 PIN DIP PHOTOTRANSISTOR PHOTOCOUPLER
2   Feature Production specification 4 PIN DIP PHOTOTRANSISTOR PHOTOCOUPLER FEATURES z Current transfer ratio (CTR:50%-600% at IF=5mA,VCE=5V) z High isolation voltage between inputc and output (Viso=5000V rms) z Creepage distance>7.62mm z Pb free and ROHS compliant z UL/CUL Approved (File No. E340048) PC817 Series Description The PC817 series of devices each consist of an infrared Emitting diodes, optically coupled to a phototransistor detector. They are packaged in a 4-pin DIP package and available in Wide-lead spacing and SMD option. DIP4L APPLICATIONS z Programmable controllers z System appliances.
3   Manufacture GME
4   Datasheet   Download PC817 Datasheet

The pandas documentation can be found here.
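
Since the table is just label/value pairs, one option is to turn its two columns into a dictionary so that individual fields can be looked up by name. This is a minimal sketch, assuming the DataFrame has the same two-column shape as the output above. The frame below is built by hand from a few of those rows so that it runs without network access, and `info` is only an illustrative name.

```python
import pandas as pd

# Hand-built stand-in for pd.read_html(...)[2]; values copied from the output above.
df = pd.DataFrame([
    ["Part", "PC817"],
    ["Description", "4 PIN DIP PHOTOTRANSISTOR PHOTOCOUPLER"],
    ["Manufacture", "GME"],
])

# Column 0 holds the labels and column 1 holds the values.
info = df.set_index(0)[1].to_dict()

print(info["Part"])
print(info["Description"])
```

With the real table, the same line would also give access to `info["Feature"]` and `info["Datasheet"]`.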

EDIT: To download the actual PDF file:

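import requests
from bs4 import BeautifulSoup as bs

# The datasheet page links to an intermediary page under /pdf-file/
r = requests.get('https://datasheetspdf.com/pdf/1060035/GME/PC817/1')
r.raise_for_status()
intermediary_url = 'https://datasheetspdf.com' + bs(r.text, 'html.parser').select_one('a[href^="/pdf-file/"]').get('href')

# The intermediary page embeds the real PDF in an iframe
r = requests.get(intermediary_url)
r.raise_for_status()
true_pdf_url = bs(r.text, 'html.parser').select_one('iframe[class="pdfif"]').get('src')

# Stream the PDF to disk in chunks instead of loading it all into memory
with requests.get(true_pdf_url, stream=True) as r:
    r.raise_for_status()
    with open('pdf_file.pdf', 'wb') as f:
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)
print('done')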

The file will be downloaded as pdf_file.pdf into the same folder you run the code from. For the Requests documentation, go here.
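
Both the question and the snippet above build absolute URLs by gluing the site root onto the `href`. That works as long as every link is root-relative. A slightly more forgiving option is `urljoin` from the standard library, which resolves an `href` against the page it came from. This is a minimal sketch: `absolute_link` is a hypothetical helper name, and the second `href` is a made-up example of an already absolute link.

```python
from urllib.parse import urljoin

PAGE_URL = "https://datasheetspdf.com/pdf/1060035/GME/PC817/1"


def absolute_link(href, page_url=PAGE_URL):
    """Resolve an href found on page_url into an absolute URL."""
    return urljoin(page_url, href)


# Root-relative href, as used on the datasheet page.
relative = absolute_link("/pdf-file/1060035/GME/PC817/1")

# An href that is already absolute is returned unchanged (illustrative URL).
already_absolute = absolute_link("https://example.com/files/PC817.pdf")

print(relative)
print(already_absolute)
```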

2 upvotes | Andreas Violaris | 10/4/2023 | #2

This web page is not well structured, so XPath may not be the best approach. I would recommend Beautiful Soup:

import requests
from bs4 import BeautifulSoup

url = "https://datasheetspdf.com/pdf/1060035/GME/PC817/1"

response = requests.get(url)
html_content = response.text
soup = BeautifulSoup(html_content, 'html.parser')
rows = soup.find_all('tr')

for row in rows:
    th = row.find('th')
    if th:
        header_text = th.get_text()
        if header_text == "Part":
            part = row.find('td').get_text()
        elif header_text == "Description":
            description = row.find('td').get_text()
        elif header_text == "Feature":
            feature = row.find('td').get_text()
        elif header_text == "Manufacture":
            manufacture = row.find('td').get_text()
        elif header_text == "Datasheet":
            datasheet = row.find('a')['href']

print("Part:", part)
print("Description:", description)
print("Feature:", feature)
print("Manufacture:", manufacture)
print("Datasheet:", f'https://datasheetspdf.com{datasheet}')

Output:

Part: PC817
Description: 4 PIN DIP PHOTOTRANSISTOR PHOTOCOUPLER
Feature: Production specification4 PIN DIP PHOTOTRANSISTOR PHOTOCOUPLERFEATURESz Current transfer ratio (CTR:50%-600% at IF=5mA,VCE=5V)z High isolation voltage between inputc and output (Viso=5000V rms)z Creepage distance>7.62mm z Pb free and ROHS compliant z UL/CUL Approved (File No. E340048)PC817 SeriesDescriptionThe PC817 series of devices each consist of an infrared Emitting diodes, optically coupled to a phototransistor detector. They are packaged in a 4-pin DIP package and available in Wide-lead spacing and SMD option.DIP4LAPPLICATIONSz Programmable controllers z System appliances.
Manufacture: GME
Datasheet: https://datasheetspdf.com/pdf-file/1060035/GME/PC817/1
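
For anyone who would rather stay with lxml, two things are worth checking; which of them applies to this particular page has not been verified here. First, browsers insert a `tbody` element when they render a table, so an XPath copied from the developer tools can contain a `tbody` step that does not exist in the HTML that lxml parses. Second, text that follows a closing `</font>` tag belongs to the enclosing `td`, not to `font`, so `font/text()` will not return it. Below is a minimal sketch on made-up markup; the `SAMPLE` string is hypothetical and much simpler than the real page.

```python
from lxml import html

# Hypothetical markup: no <tbody>, and the description text comes AFTER </font>.
SAMPLE = """
<html><body>
<table>
  <tr><td><strong><font>PC817 Datasheet</font></strong></td></tr>
  <tr><td><font color="blue">Description</font> 4 PIN DIP PHOTOTRANSISTOR PHOTOCOUPLER</td></tr>
</table>
</body></html>
"""

doc = html.fromstring(SAMPLE)

# A browser-style path with tbody matches nothing in the parsed tree.
with_tbody = doc.xpath("//table/tbody/tr[2]/td/font/text()")

# Without tbody, font/text() only returns the text inside <font>.
inside_font = doc.xpath("//table/tr[2]/td/font/text()")

# The text after </font> is a text node of the <td> itself.
after_font = [t.strip() for t in doc.xpath("//table/tr[2]/td/text()")]

# text_content() gathers all text in the cell, inside and after <font>.
whole_cell = doc.xpath("//table/tr[2]/td")[0].text_content()

print(with_tbody)
print(inside_font)
print(after_font)
print(whole_cell)
```

On the real page the paths will differ, but the same checks (drop `tbody`, query the `td` rather than the `font`) are a reasonable first thing to try.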