Asked by Fariz Awi on 11/14/2023 · Last edited by Mark Rotteveel, Fariz Awi · Updated 11/15/2023 · Views: 76
Retrieving screenshots of elements behind pagination
Q:
Please look at this website. My goal is to capture a screenshot of every PDF link on the page at the given URL.
First, I tried requesting the URL, parsing the HTML text, and finding all of the PDF links:
from bs4 import BeautifulSoup as soup
import requests
from urllib.request import Request, urlopen

url = "https://aplng.com.au/document-library/"
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
page_soup = soup(webpage, "html.parser")
links = page_soup.find_all('a')
i = 0
for link in links:
    if '.pdf' in link.get('href', []):
        i += 1
        print("Found file: ", i, link.get('href', []))
This runs successfully and lists all 864 files.
Now I am trying to take a full-window Selenium screenshot of the page view that contains each link:
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup as soup
import requests
from urllib.request import Request, urlopen

url = "https://aplng.com.au/document-library/"
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
page_soup = soup(webpage, "html.parser")
links = page_soup.find_all('a')
i = 0
driver = webdriver.Chrome()
driver.maximize_window()
driver.get(url)
for link in links:
    if '.pdf' in link.get('href', []):
        i += 1
        url_pdf = link.get('href')
        element = driver.find_element(By.XPATH, '//a[@href="' + url_pdf + '"]')
        _ = element.screenshot_as_png
        driver.get_screenshot_as_file(f'screenshot_{i}.png')
        print("Found file: ", i, link.get('href', []))
driver.quit()
It fails at the first element that lives on the next page:
selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":"//a[@href="https://aplng.com.au/wp-content/uploads/2022/06/Australia-Pacific-LNG-Pty-Limited-FY2021-Tax-Contribution-Report.pdf"]"}
(Session info: chrome=117.0.5938.88); For documentation on this error, please visit: https://www.selenium.dev/documentation/webdriver/troubleshooting/errors#no-such-element-exception
My questions:
- How can the plain URL request successfully return all of the PDF links, while the webdriver cannot find them?
- All of the similar questions I have researched online suggest clicking the next-page element, but that solution seems too site-specific to me. Is there a more general solution?
- If performing the click is the only/best solution, how can I make sure I handle similar situations?

I have read about AJAX, but I do not really understand it. My knowledge of web technologies is still limited, so please feel free to be as thorough as needed.
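For reference, a minimal sketch (assuming chromedriver is installed and that the hrefs are absolute URLs, as they appear to be on this site) that compares the links found in the raw HTML with the links actually present in the rendered DOM; it reproduces the mismatch:

from urllib.request import Request, urlopen

from bs4 import BeautifulSoup as soup
from selenium import webdriver
from selenium.webdriver.common.by import By

url = "https://aplng.com.au/document-library/"

# PDF hrefs present in the raw HTML response.
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
page_soup = soup(urlopen(req).read(), "html.parser")
html_links = [a['href'] for a in page_soup.find_all('a', href=True)
              if '.pdf' in a['href']]

# PDF hrefs actually present in the DOM of the rendered first page.
driver = webdriver.Chrome()
driver.get(url)
dom_links = {a.get_attribute('href')
             for a in driver.find_elements(By.CSS_SELECTOR, 'a[href$=".pdf"]')}
driver.quit()

missing = [href for href in html_links if href not in dom_links]
print(len(html_links), "links in the HTML,", len(dom_links), "in the DOM,",
      len(missing), "missing (presumably behind pagination)")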
Answers:
-4 votes
Tech Support Pakistan
11/14/2023
#1
Here is what the code would look like to achieve the expected result:
from bs4 import BeautifulSoup as soup
from urllib.request import Request, urlopen

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = "https://aplng.com.au/document-library/"
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
page_soup = soup(webpage, "html.parser")
links = page_soup.find_all('a')

driver = webdriver.Chrome()
driver.maximize_window()
driver.get(url)
i = 0
for link in links:
    if '.pdf' in link.get('href', []):
        i += 1
        url_pdf = link.get('href')
        try:
            element = WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.XPATH, '//a[@href="' + url_pdf + '"]'))
            )
            # Scroll the link into view before taking the screenshot.
            driver.execute_script("arguments[0].scrollIntoView();", element)
            _ = element.screenshot_as_png
            driver.get_screenshot_as_file(f'screenshot_{i}.png')
            print("Found file: ", i, link.get('href', []))
        except Exception as e:
            print(f"Error taking screenshot for {url_pdf}: {e}")
driver.quit()
Comments:
1 vote
Yaroslavm
11/14/2023
This is not an answer, given that the site's pagination is not scroll-based.
1 vote
Yaroslavm
11/14/2023
#2
There is no universal solution for every case. It is a kind of reverse engineering, and every case can be different.
In your case it is easier to do by selecting the All option in the "Show on page" dropdown.
But on other resources other approaches can be applied, for example:
- Scrolling continuously to the end of the page (example 2)
- Clicking the next pagination button in a loop (see the sketch below)
- Passing a page-number parameter in the URL, if the site's pagination is implemented via parameters like example.com/?page=1 or ?p=1
- Passing a filter parameter that retrieves the required amount of data, if filter parameters are stored in the URL, e.g. example.com/?limit=1000
- Selecting an "All" option for the output (this case)
It is site-specific and depends on the pagination logic implemented on the resource.
For your case:
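As a generic illustration of the "click the next button in a loop" approach, a sketch; the .paginate_button.next selector is an assumption based on DataTables' default markup, so adjust it per site:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://aplng.com.au/document-library/")

seen = set()
while True:
    # Collect the PDF links that are in the DOM on the current page.
    for a in driver.find_elements(By.CSS_SELECTOR, 'a[href$=".pdf"]'):
        seen.add(a.get_attribute('href'))
    # DataTables adds a "disabled" class to the next button on the last page
    # (an assumption about this site's markup).
    next_button = driver.find_elements(
        By.CSS_SELECTOR, '.paginate_button.next:not(.disabled)')
    if not next_button:
        break
    next_button[0].click()

print("Collected", len(seen), "PDF links across all pages")
driver.quit()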
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# other imports

driver.get(url)
wait = WebDriverWait(driver, 10)
items_dropdown = wait.until(EC.presence_of_element_located(
    (By.CSS_SELECTOR, '.dataTables_length [class*=select2-container]')))
items_dropdown.click()
wait.until(EC.visibility_of_element_located(
    (By.XPATH, "//*[@class='select2-results']//li[text()='All']"))).click()

for link in links:
    if '.pdf' in link.get('href', []):
        # your code
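To complete the picture, one possible continuation after the All option is selected (a sketch reusing links, wait, and the imports above; the table may take a moment to redraw, so it first waits for the last scraped link to appear):

# Wait until the last scraped PDF link is rendered, i.e. the table
# has finished redrawing with all rows visible.
pdf_hrefs = [a.get('href') for a in links if '.pdf' in a.get('href', [])]
wait.until(EC.presence_of_element_located(
    (By.XPATH, '//a[@href="' + pdf_hrefs[-1] + '"]')))

for i, href in enumerate(pdf_hrefs, start=1):
    element = driver.find_element(By.XPATH, '//a[@href="' + href + '"]')
    # Bring the link into view so the full-window screenshot shows it.
    driver.execute_script("arguments[0].scrollIntoView();", element)
    driver.get_screenshot_as_file(f'screenshot_{i}.png')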
Comments: