Asked by Fariz Awi on 11/14/2023 · Last edited by Mark Rotteveel, Fariz Awi · Updated 11/15/2023 · Views: 76
Retrieving screenshots of elements behind pagination
Q:
Please look at this website. My goal is to capture a screenshot of every PDF link on the page at the given URL.
First, I tried requesting the URL, parsing the HTML text, and finding all of the PDF links:
from bs4 import BeautifulSoup as soup
import requests
from urllib.request import Request, urlopen

url = "https://aplng.com.au/document-library/"
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
page_soup = soup(webpage, "html.parser")
links = page_soup.find_all('a')
i = 0
for link in links:
    if '.pdf' in link.get('href', []):
        i += 1
        print("Found file: ", i, link.get('href', []))
This runs successfully and lists all 864 files.
Now I am trying to take a full-window Selenium screenshot of the page view that contains each link:
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup as soup
import requests
from urllib.request import Request, urlopen

url = "https://aplng.com.au/document-library/"
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
page_soup = soup(webpage, "html.parser")
links = page_soup.find_all('a')
i = 0
driver = webdriver.Chrome()
driver.maximize_window()
driver.get(url)
for link in links:
    if '.pdf' in link.get('href', []):
        i += 1
        url_pdf = link.get('href')
        element = driver.find_element(By.XPATH, '//a[@href="' + url_pdf + '"]')
        _ = element.screenshot_as_png
        driver.get_screenshot_as_file(f'screenshot_{i}.png')
        print("Found file: ", i, link.get('href', []))
driver.quit()
It fails at the first element that lives on the next page:
selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":"//a[@href="https://aplng.com.au/wp-content/uploads/2022/06/Australia-Pacific-LNG-Pty-Limited-FY2021-Tax-Contribution-Report.pdf"]"}
(Session info: chrome=117.0.5938.88); For documentation on this error, please visit: https://www.selenium.dev/documentation/webdriver/troubleshooting/errors#no-such-element-exception
My questions:
- How can the plain URL request successfully return all of the PDF links, while the webdriver cannot find them?
- All of the similar questions I have researched online suggest clicking the next-page element, but that solution seems too site-specific to me. Is there a more general solution?
- If performing the click is the only/best solution, how can I make sure I handle similar situations?

I have read about AJAX, but I do not really understand it. My knowledge of web technologies is still limited, so please feel free to be as thorough as needed.
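For reference, a minimal sketch (assuming chromedriver is installed and that the hrefs are absolute URLs, as they appear to be on this site) that compares the links found in the raw HTML with the links actually present in the rendered DOM; it reproduces the mismatch:

from urllib.request import Request, urlopen

from bs4 import BeautifulSoup as soup
from selenium import webdriver
from selenium.webdriver.common.by import By

url = "https://aplng.com.au/document-library/"

# PDF hrefs present in the raw HTML response.
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
page_soup = soup(urlopen(req).read(), "html.parser")
html_links = [a['href'] for a in page_soup.find_all('a', href=True)
              if '.pdf' in a['href']]

# PDF hrefs actually present in the DOM of the rendered first page.
driver = webdriver.Chrome()
driver.get(url)
dom_links = {a.get_attribute('href')
             for a in driver.find_elements(By.CSS_SELECTOR, 'a[href$=".pdf"]')}
driver.quit()

missing = [href for href in html_links if href not in dom_links]
print(len(html_links), "links in the HTML,", len(dom_links), "in the DOM,",
      len(missing), "missing (presumably behind pagination)")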
Answers:
-4 votes
Tech Support Pakistan
11/14/2023
#1
Here is what the code would look like to achieve the expected result:
from bs4 import BeautifulSoup as soup
from urllib.request import Request, urlopen

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = "https://aplng.com.au/document-library/"
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
page_soup = soup(webpage, "html.parser")
links = page_soup.find_all('a')

driver = webdriver.Chrome()
driver.maximize_window()
driver.get(url)
i = 0
for link in links:
    if '.pdf' in link.get('href', []):
        i += 1
        url_pdf = link.get('href')
        try:
            element = WebDriverWait(driver, 10).until(
                EC.presence_of_element_located((By.XPATH, '//a[@href="' + url_pdf + '"]'))
            )
            # Scroll the link into view before taking the screenshot.
            driver.execute_script("arguments[0].scrollIntoView();", element)
            _ = element.screenshot_as_png
            driver.get_screenshot_as_file(f'screenshot_{i}.png')
            print("Found file: ", i, link.get('href', []))
        except Exception as e:
            print(f"Error taking screenshot for {url_pdf}: {e}")
driver.quit()
Comments:
1 vote
Yaroslavm
11/14/2023
This is not an answer, given that the site's pagination is not scroll-based.
1 vote
Yaroslavm
11/14/2023
#2
There is no universal solution for every case. It is a kind of reverse engineering, and every case can be different.
In your case it is easier to do by selecting the All option in the "Show on page" dropdown.
But on other resources other approaches can be applied, for example:
- Scrolling continuously to the end of the page (example 2)
- Clicking the next pagination button in a loop (see the sketch below)
- Passing a page-number parameter in the URL, if the site's pagination is implemented via parameters like example.com/?page=1 or ?p=1
- Passing a filter parameter that retrieves the required amount of data, if filter parameters are stored in the URL, e.g. example.com/?limit=1000
- Selecting an "All" option for the output (this case)
It is site-specific and depends on the pagination logic implemented on the resource.
For your case:
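As a generic illustration of the "click the next button in a loop" approach, a sketch; the .paginate_button.next selector is an assumption based on DataTables' default markup, so adjust it per site:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://aplng.com.au/document-library/")

seen = set()
while True:
    # Collect the PDF links that are in the DOM on the current page.
    for a in driver.find_elements(By.CSS_SELECTOR, 'a[href$=".pdf"]'):
        seen.add(a.get_attribute('href'))
    # DataTables adds a "disabled" class to the next button on the last page
    # (an assumption about this site's markup).
    next_button = driver.find_elements(
        By.CSS_SELECTOR, '.paginate_button.next:not(.disabled)')
    if not next_button:
        break
    next_button[0].click()

print("Collected", len(seen), "PDF links across all pages")
driver.quit()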
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# other imports

driver.get(url)
wait = WebDriverWait(driver, 10)
items_dropdown = wait.until(EC.presence_of_element_located(
    (By.CSS_SELECTOR, '.dataTables_length [class*=select2-container]')))
items_dropdown.click()
wait.until(EC.visibility_of_element_located(
    (By.XPATH, "//*[@class='select2-results']//li[text()='All']"))).click()

for link in links:
    if '.pdf' in link.get('href', []):
        # your code
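To complete the picture, one possible continuation after the All option is selected (a sketch reusing links, wait, and the imports above; the table may take a moment to redraw, so it first waits for the last scraped link to appear):

# Wait until the last scraped PDF link is rendered, i.e. the table
# has finished redrawing with all rows visible.
pdf_hrefs = [a.get('href') for a in links if '.pdf' in a.get('href', [])]
wait.until(EC.presence_of_element_located(
    (By.XPATH, '//a[@href="' + pdf_hrefs[-1] + '"]')))

for i, href in enumerate(pdf_hrefs, start=1):
    element = driver.find_element(By.XPATH, '//a[@href="' + href + '"]')
    # Bring the link into view so the full-window screenshot shows it.
    driver.execute_script("arguments[0].scrollIntoView();", element)
    driver.get_screenshot_as_file(f'screenshot_{i}.png')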
Comments: