Pagination and Accepting Cookies with Selenium

Asked by tre-ananas on 11/10/2023 · Updated 11/10/2023 · Viewed 29 times

Q:

I'm trying to scrape an archive for sentiment analysis, but I can't seem to accept the cookies or click the "Next" button. When I try to accept the cookies, the error alternates between "element click intercepted: Element is not clickable at point (906, 934)" and "Message: move target out of bounds".

I'm trying to scrape publication links and date/publication metadata from https://www.regeringen.se/dokument-och-publikationer. The code I currently have collects everything on the first page, but the page is dynamic, so simply modifying the URL does not move to the next page. I've tried many different strategies with Selenium, but I'm new to it and keep running into the same errors.

Here is the code I've been using to try to get the pagination and accept-cookies buttons to work:

# Imports needed for this snippet
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Base URL for the search query
url = "https://www.regeringen.se/dokument-och-publikationer"

# Set up the Selenium WebDriver
driver = webdriver.Chrome(r'chromedriver-win64\chromedriver.exe')  # Raw string so the backslash is literal; you can use other web drivers like Firefox if you prefer

# Open the URL in the web browser
driver.get(url)

# Wait for the cookie accept button to be clickable and click it using XPath
try:
    cookie_accept_button = WebDriverWait(driver, 20).until(
        EC.element_to_be_clickable((By.XPATH, '/html/body/div[2]/div/div[2]/button'))
    )

    # Click the cookie accept button
    cookie_accept_button.click()
    print("Accepted Cookies")

    # Wait for the "Next" button to be clickable
    next_button = WebDriverWait(driver, 20).until(
        EC.element_to_be_clickable((By.CSS_SELECTOR, 'a.filter-pagination[data-page="2"]'))
    )

    # Click the "Next" button
    next_button.click()
    print("Next Button Clicked")

    # You can perform further actions on the next page here

except Exception as e:
    print("Error:", e)

# Don't forget to close the WebDriver when you are done
driver.quit()

Here is the code I was using before I tried Selenium:

# Cycle through first three pages of archive and grab the links that lead to useful content

# Imports needed for this snippet
import time
from random import randint

import requests
from bs4 import BeautifulSoup

# Base URL for the search query
base_url = "https://www.regeringen.se/dokument-och-publikationer/?"
query_param = "page="
page_number = 1

# List to store all extracted links
all_links = []
all_dates = []

while page_number <= 3: 
    # Construct the URL for the current page
    url = f"{base_url}{query_param}{page_number}"

    # Send a GET request to the URL
    response = requests.get(url)

    # Parse the HTML content of the page
    soup = BeautifulSoup(response.content, "html.parser")

    # Find all <ul> tags with class "list--block cl"
    ul_tags = soup.find_all("ul", class_="list--block cl")

    # Extract and prepend "https://www.regeringen.se/" to each link (excluding links that start with "/tx")
    links = []
    dates = []
    for ul_tag in ul_tags:
        # Find all <a> tags within the current <ul> tag
        a_tags = ul_tag.find_all("a", href=True)
        # Extract href attribute (link) from each <a> tag and prepend the base URL before appending to the links list
        for a in a_tags:
            link = a["href"]
            # Check if the link does not start with "/tx" before adding the base URL and appending it
            if not link.startswith("/tx"):
                full_url = "https://www.regeringen.se" + link
                links.append(full_url)
        # Find all <div> tags with class "block--timeLinks" within the current <ul> tag
        # (searching the whole soup here would duplicate the dates once per <ul>)
        date_links_divs = ul_tag.find_all('div', class_='block--timeLinks')
        for t in date_links_divs:
            date = t.get_text(strip=True)
            dates.append(date)

    # Add the links from the current page to the list of all links
    all_links.extend(links)
    all_dates.extend(dates)

    # Check if there is a "Next" button on the current page
    next_button = soup.find("a", class_="filter-pagination")

    # If there is no "Next" button, break the loop (reached the last page)
    if not next_button:
        break

    # Move to the next page
    page_number += 1

    # Introduce a random delay before the next request
    time.sleep(randint(5, 10))  # Adjust the delay time as needed
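The link-collection logic above (keep only hrefs that do not start with "/tx", then prepend the site root) can be checked without any network request. A minimal sketch using only the Python standard library; the HTML snippet is invented for illustration and is not the site's real markup:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collect absolute URLs from <a href=...> tags, skipping "/tx" filter links."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href and not href.startswith("/tx"):
                # For root-relative paths, urljoin behaves like the string
                # concatenation above, but it also handles absolute URLs safely
                self.links.append(urljoin("https://www.regeringen.se", href))

html = """
<ul class="list--block cl">
  <li><a href="/dokument/prop-1">Prop 1</a></li>
  <li><a href="/tx/ignore-me">Filter link</a></li>
  <li><a href="/dokument/prop-2">Prop 2</a></li>
</ul>
"""
collector = LinkCollector()
collector.feed(html)
print(collector.links)
# ['https://www.regeringen.se/dokument/prop-1', 'https://www.regeringen.se/dokument/prop-2']
```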

Tags: python, html, selenium-webdriver, cookies, beautifulsoup



A:

0 votes · Shawn · 11/10/2023 · #1

Modify your Selenium code as below; it will click the Accept Cookies button and then the Next button.

# Imports needed for this snippet
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = "https://www.regeringen.se/dokument-och-publikationer"

# Set up the Selenium WebDriver
driver = webdriver.Chrome(r'chromedriver-win64\chromedriver.exe')  # You can use other web drivers like Firefox if you prefer
driver.maximize_window()

# Open the URL in the web browser
driver.get(url)

# Create WebDriverWait object
wait = WebDriverWait(driver, 20)

# Wait for the cookie accept button to be clickable and click it using XPath
cookie_accept_button = wait.until(EC.element_to_be_clickable((By.XPATH, "//button[@class='btn c-cookie__action js-cookie-click']")))
# The line below uses JavaScript to click the cookie accept button
driver.execute_script("arguments[0].click();", cookie_accept_button)
print("Accepted Cookies")

# Wait for the "Next" button to be clickable
next_button = wait.until(EC.element_to_be_clickable((By.XPATH, "//li[@class='nav--pagination__next']//a")))
next_button.click()
print("Next Button Clicked")

# Don't forget to close the WebDriver when you are done
driver.quit()

Console output:

Accepted Cookies
Next Button Clicked

Process finished with exit code 0
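As a side note, the execute_script click above works because a JavaScript click bypasses Selenium's native visibility and overlap checks, which are what raise "element click intercepted" when the cookie banner overlays the target. The same idea can be packaged as a reusable helper; this is a sketch, with the Selenium imports deferred inside the function so the definition stands alone, and js_click is a hypothetical name rather than a Selenium API:

```python
def js_click(driver, locator, timeout=20):
    """Wait for the element, scroll it to the viewport centre, then click it
    with JavaScript so an overlay cannot intercept the native click.
    (js_click is a hypothetical helper name, not part of Selenium.)"""
    # Imports deferred so merely defining the helper does not require Selenium
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    element = WebDriverWait(driver, timeout).until(
        EC.element_to_be_clickable(locator)
    )
    # Centre the element first; off-screen targets can cause
    # "move target out of bounds"
    driver.execute_script("arguments[0].scrollIntoView({block: 'center'});", element)
    driver.execute_script("arguments[0].click();", element)
    return element
```

Usage would then look like `js_click(driver, (By.XPATH, "//li[@class='nav--pagination__next']//a"))` after `driver.get(url)`.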