
Retrieving elements screenshots behind pagination

Asked by Fariz Awi on 11/14/2023 · Last edited by Mark Rotteveel / Fariz Awi · Updated 11/15/2023 · Views: 76

Q:

Please take a look at this website. My goal is to retrieve a screenshot of every PDF link on the page at the given URL.

First, I request the URL, parse the HTML text, and find all the PDF links:

from bs4 import BeautifulSoup as soup
from urllib.request import Request, urlopen

url = "https://aplng.com.au/document-library/"
req = Request(url , headers={'User-Agent': 'Mozilla/5.0'})

webpage = urlopen(req).read()
page_soup = soup(webpage, "html.parser")
links = page_soup.find_all('a')
i = 0 

for link in links:
    if '.pdf' in link.get('href', ''):
        i += 1
        print("Found file: ", i, link.get('href'))

This runs successfully and lists all 864 files.

Now I try to take a full-window Selenium screenshot of the page view containing each link:

from selenium import webdriver
from bs4 import BeautifulSoup as soup
from urllib.request import Request, urlopen
from selenium.webdriver.common.by import By

url = "https://aplng.com.au/document-library/"
req = Request(url , headers={'User-Agent': 'Mozilla/5.0'})

webpage = urlopen(req).read()
page_soup = soup(webpage, "html.parser")
links = page_soup.find_all('a')
i = 0 

driver = webdriver.Chrome()
driver.maximize_window()
driver.get(url)

for link in links:
    if '.pdf' in link.get('href', ''):
        i += 1
        url_pdf = link.get('href')
        element = driver.find_element(By.XPATH, '//a[@href="' + url_pdf + '"]')
        _ = element.screenshot_as_png
        driver.get_screenshot_as_file(f'screenshot_{i}.png')
        print("Found file: ", i, url_pdf)


driver.quit()

It fails on an element that sits on the next page:

selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":"//a[@href="https://aplng.com.au/wp-content/uploads/2022/06/Australia-Pacific-LNG-Pty-Limited-FY2021-Tax-Contribution-Report.pdf"]"}
  (Session info: chrome=117.0.5938.88); For documentation on this error, please visit: https://www.selenium.dev/documentation/webdriver/troubleshooting/errors#no-such-element-exception

  1. How does the plain URL request successfully return all the PDF links, while the webdriver cannot find them?
  2. Every similar question I have researched online suggests clicking the next-page element. But that solution seems too specific to me. Is there a more general solution?
  3. If performing the click is the only/best solution, how can I make sure I mitigate similar situations?
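For context on question 1: `urlopen` downloads the raw HTML exactly as the server sends it, and on this site that markup appears to embed every row; the browser then runs the table's pagination script, which keeps only the current page's rows in the live DOM, so `find_element` cannot locate the rest. BeautifulSoup, by contrast, works on the markup text and finds anchors whether or not a browser would render them. A minimal sketch with made-up markup:

```python
from bs4 import BeautifulSoup

# Toy stand-in for the raw HTML returned by urlopen(): in a browser the
# second row might be hidden or removed by the table's JavaScript, but it
# is still present in the downloaded text, so BeautifulSoup finds it.
html = """
<table>
  <tr><td><a href="/docs/a.pdf">A</a></td></tr>
  <tr style="display:none"><td><a href="/docs/b.pdf">B</a></td></tr>
  <tr><td><a href="/about">not a PDF</a></td></tr>
</table>
"""

page_soup = BeautifulSoup(html, "html.parser")
pdf_links = [a["href"] for a in page_soup.find_all("a")
             if a["href"].endswith(".pdf")]
print(pdf_links)  # ['/docs/a.pdf', '/docs/b.pdf']
```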

I have read about AJAX, but I do not really understand it. My knowledge of web technologies is still limited, so please feel free to be as thorough as needed.

python ajax selenium-webdriver web-scraping beautifulsoup

Comments


A:

-4 votes Tech Support Pakistan 11/14/2023 #1

This is what the code would look like to achieve the expected result:

from bs4 import BeautifulSoup as soup
from urllib.request import Request, urlopen

url = "https://aplng.com.au/document-library/"
req = Request(url , headers={'User-Agent': 'Mozilla/5.0'})

webpage = urlopen(req).read()
page_soup = soup(webpage, "html.parser")
links = page_soup.find_all('a')
i = 0 

for link in links:
    if '.pdf' in link.get('href', ''):
        i += 1
        print("Found file: ", i, link.get('href'))

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.maximize_window()
driver.get(url)

for link in links:
    if '.pdf' in link.get('href', ''):
        i += 1
        url_pdf = link.get('href')
        try:
            element = WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.XPATH, '//a[@href="' + url_pdf + '"]'))
            )
            _ = element.screenshot_as_png
            driver.get_screenshot_as_file(f'screenshot_{i}.png')
            print("Found file: ", i, url_pdf)
        except Exception as e:
            print(f"Error taking screenshot for {url_pdf}: {e}")

driver.quit()

driver = webdriver.Chrome()
driver.maximize_window()
driver.get(url)

for link in links:
    if '.pdf' in link.get('href', ''):
        i += 1
        url_pdf = link.get('href')
        try:
            element = WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.XPATH, '//a[@href="' + url_pdf + '"]'))
            )
            driver.execute_script("arguments[0].scrollIntoView();", element)
            _ = element.screenshot_as_png
            driver.get_screenshot_as_file(f'screenshot_{i}.png')
            print("Found file: ", i, url_pdf)
        except Exception as e:
            print(f"Error taking screenshot for {url_pdf}: {e}")

driver.quit()

Comments

1 vote Yaroslavm 11/14/2023
This is not an answer, given that the site does not have scroll pagination.
1 vote Yaroslavm 11/14/2023 #2

There is no universal solution for every case. It is a kind of reverse engineering, and every case can be different.

In your case, it is easier to do this by selecting the `All` option in the "Show on page" dropdown, so that every row is rendered at once.

Other resources, however, may require other approaches. It is site-specific and depends on the pagination logic implemented on the resource.

For your case:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
#other imports

driver.get(url)
wait = WebDriverWait(driver, 10)
items_dropdown = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '.dataTables_length [class*=select2-container]')))
items_dropdown.click()
wait.until(EC.visibility_of_element_located((By.XPATH, "//*[@class='select2-results']//li[text()='All']"))).click()

for link in links:
    if '.pdf' in link.get('href', ''):
        # your code
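Once every row is rendered, the original loop can run unchanged. One further, purely optional refinement (a hypothetical helper, not part of the answer above): with 864 files, numbered names like `screenshot_1.png` are hard to trace back to documents, so deriving the name from the PDF URL keeps each screenshot identifiable:

```python
from urllib.parse import urlparse
from pathlib import PurePosixPath

def screenshot_name(pdf_url: str) -> str:
    """Derive a PNG file name from the PDF link it captures."""
    stem = PurePosixPath(urlparse(pdf_url).path).stem
    return f"screenshot_{stem}.png"

print(screenshot_name(
    "https://aplng.com.au/wp-content/uploads/2022/06/Report.pdf"
))  # screenshot_Report.png
```

Inside the loop, `driver.get_screenshot_as_file(screenshot_name(url_pdf))` would then replace the `f'screenshot_{i}.png'` name.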