Asked by: tre-ananas  Asked: 11/10/2023  Updated: 11/10/2023  Views: 29
Pagination and Accepting Cookies with Selenium
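Q:
I'm trying to scrape an archive for sentiment analysis, but I can't seem to accept the cookies or click the "Next" button. When I try to accept the cookies, the error alternates between "element click intercepted: Element is not clickable at point (906, 934)" and "Message: move target out of bounds".
I'm trying to scrape publication links and date/publication info from https://www.regeringen.se/dokument-och-publikationer. The code I have so far collects everything on the first page, but the page is dynamic, so simply modifying the URL does not move to the next page. I've tried many different strategies with Selenium, but I'm new to it and keep running into the same errors.
Here is the code I've been using to try to get pagination and the cookie-accept button working: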
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Base URL for the search query
url = "https://www.regeringen.se/dokument-och-publikationer"

# Set up the Selenium WebDriver
driver = webdriver.Chrome(r'chromedriver-win64\chromedriver.exe')  # You can use other web drivers like Firefox if you prefer

# Open the URL in the web browser
driver.get(url)

# Wait for the cookie accept button to be clickable and click it using XPath
try:
    cookie_accept_button = WebDriverWait(driver, 20).until(
        EC.element_to_be_clickable((By.XPATH, '/html/body/div[2]/div/div[2]/button'))
    )

    # Click the cookie accept button
    cookie_accept_button.click()
    print("Accepted Cookies")

    # Wait for the "Next" button to be clickable
    next_button = WebDriverWait(driver, 20).until(
        EC.element_to_be_clickable((By.CSS_SELECTOR, 'a.filter-pagination[data-page="2"]'))
    )

    # Click the "Next" button
    next_button.click()
    print("Next Button Clicked")

    # You can perform further actions on the next page here
except Exception as e:
    print("Error:", e)

# Don't forget to close the WebDriver when you are done
driver.quit()
Here is the code I was using before I tried Selenium:
import time
from random import randint

import requests
from bs4 import BeautifulSoup

# Cycle through first three pages of archive and grab the links that lead to useful content

# Base URL for the search query
base_url = "https://www.regeringen.se/dokument-och-publikationer/?"
query_param = "page="
page_number = 1

# Lists to store all extracted links and dates
all_links = []
all_dates = []

while page_number <= 3:
    # Construct the URL for the current page
    url = f"{base_url}{query_param}{page_number}"

    # Send a GET request to the URL
    response = requests.get(url)

    # Parse the HTML content of the page
    soup = BeautifulSoup(response.content, "html.parser")

    # Find all <ul> tags with class "list--block cl"
    ul_tags = soup.find_all("ul", class_="list--block cl")

    # Extract and prepend "https://www.regeringen.se" to each link (excluding links that start with "/tx")
    links = []
    dates = []
    for ul_tag in ul_tags:
        # Find all <a> tags within the current <ul> tag
        a_tags = ul_tag.find_all("a", href=True)

        # Extract href attribute (link) from each <a> tag and prepend the base URL before appending to the links list
        for a in a_tags:
            link = a["href"]
            # Check that the link does not start with "/tx" before adding the base URL and appending it
            if not link.startswith("/tx"):
                full_url = "https://www.regeringen.se" + link
                links.append(full_url)

    # Find all <div> tags with class "block--timeLinks"
    date_links_divs = soup.find_all('div', class_='block--timeLinks')
    for t in date_links_divs:
        date = t.get_text(strip=True)
        dates.append(date)

    # Add the links and dates from the current page to the overall lists
    all_links.extend(links)
    all_dates.extend(dates)

    # Check if there is a "Next" button on the current page
    next_button = soup.find("a", class_="filter-pagination")

    # If there is no "Next" button, break the loop (reached the last page)
    if not next_button:
        break

    # Move to the next page
    page_number += 1

    # Introduce a random delay before the next request
    time.sleep(randint(5, 10))  # Adjust the delay time as needed
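The link-filtering step in the loop above does not need the network, so it can be checked on its own. The following is a minimal sketch that uses only the standard library (`html.parser` and `urllib.parse.urljoin`) in place of BeautifulSoup. The names `LinkCollector` and `extract_links` and the sample markup are made up for illustration; the sample is not the real HTML of regeringen.se, and unlike the original loop this sketch accepts links inside any `<ul>` rather than only those with class "list--block cl".

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "https://www.regeringen.se"


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags that sit inside a <ul>."""

    def __init__(self):
        super().__init__()
        self.ul_depth = 0  # number of <ul> tags currently open
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "ul":
            self.ul_depth += 1
        elif tag == "a" and self.ul_depth > 0:
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)

    def handle_endtag(self, tag):
        if tag == "ul" and self.ul_depth > 0:
            self.ul_depth -= 1


def extract_links(html, base_url=BASE_URL, skip_prefix="/tx"):
    """Return absolute URLs for list links, skipping hrefs that start with skip_prefix."""
    parser = LinkCollector()
    parser.feed(html)
    return [urljoin(base_url, h) for h in parser.hrefs if not h.startswith(skip_prefix)]


# Hypothetical sample markup, only to exercise the helper
sample = """
<ul class="list--block cl">
  <li><a href="/rapporter/2023/11/exempel/">Exempel</a></li>
  <li><a href="/tx/1234">Filter</a></li>
</ul>
<a href="/outside-list/">Not in a list</a>
"""

print(extract_links(sample))
```

Using `urljoin` instead of string concatenation means an href that is already absolute is returned unchanged, whereas `"https://www.regeringen.se" + link` would produce a broken URL in that case.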
A:
0 votes
Shawn
11/10/2023
#1
Modify your Selenium code as follows. It will click the Accept Cookies button and then click the Next button.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# URL taken from the question
url = "https://www.regeringen.se/dokument-och-publikationer"

# Set up the Selenium WebDriver
# (Selenium 4.10 and later no longer accept the driver path as a positional
# argument; there, pass it via selenium.webdriver.chrome.service.Service)
driver = webdriver.Chrome(r'chromedriver-win64\chromedriver.exe')  # You can use other web drivers like Firefox if you prefer
driver.maximize_window()

# Open the URL in the web browser
driver.get(url)

# Create WebDriverWait object
wait = WebDriverWait(driver, 20)

# Wait for the cookie accept button to be clickable, located using XPath
cookie_accept_button = wait.until(EC.element_to_be_clickable((By.XPATH, "//button[@class='btn c-cookie__action js-cookie-click']")))
# The line below uses JavaScript to click the cookie accept button
driver.execute_script("arguments[0].click();", cookie_accept_button)
print("Accepted Cookies")

# Wait for the "Next" button to be clickable
next_button = wait.until(EC.element_to_be_clickable((By.XPATH, "//li[@class='nav--pagination__next']//a")))
next_button.click()
print("Next Button Clicked")

# Don't forget to close the WebDriver when you are done
driver.quit()
Console output:
Accepted Cookies
Next Button Clicked
Process finished with exit code 0
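The JavaScript click in the answer avoids the "element click intercepted" error because `execute_script` triggers the click on the element itself, without checking whether something else (presumably an overlay or the banner) covers the click point. A common variation is to try the native click first and fall back to JavaScript only when the click is intercepted. The sketch below shows that pattern under stated assumptions: `click_with_fallback` is a hypothetical helper name, not part of Selenium, and the `ImportError` guard exists only so the snippet can be tried without Selenium installed.

```python
try:
    from selenium.common.exceptions import ElementClickInterceptedException
except ImportError:
    # Stand-in so the sketch runs without Selenium; when Selenium is
    # installed, the real exception class is used instead.
    class ElementClickInterceptedException(Exception):
        pass


def click_with_fallback(driver, element):
    """Click natively; if another element intercepts the click, click via JavaScript.

    Returns "native" or "javascript" to show which path was taken.
    """
    try:
        element.click()
        return "native"
    except ElementClickInterceptedException:
        # Same script as in the answer: the click is issued on the element
        # directly, so whatever covers it no longer matters.
        driver.execute_script("arguments[0].click();", element)
        return "javascript"
```

With the objects from the answer, this would be called as `click_with_fallback(driver, cookie_accept_button)`. Trying the native click first keeps the behaviour closer to what a real user does, and the return value makes it easy to log how often the fallback is needed.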