Why does the break statement not work while scraping reviews with Selenium and BeautifulSoup in Python?

Asked by: Mina | Asked: 4/9/2023 | Updated: 4/9/2023 | Views: 46

Q:

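I'm scraping reviews with Selenium and BeautifulSoup in Python, but the break statement doesn't work, so the while loop keeps running even after it reaches the product's last review page. As I understand it, it should work, because there is no "next page" button on the last review page anymore. Can someone explain why break doesn't work?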

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import pandas as pd
from bs4 import BeautifulSoup

driver = webdriver.Chrome()

reviewlist= []

ASINs = ['B09SXL3HPG', 'B07TP8LLQZ']
for a in range(len(ASINs)):
    url = 'https://www.amazon.de/dp/' + str(ASINs[a])
    driver.get(url)

    #Accept cookies (only if you need it)
    if a == 0:
        accept_cookies = WebDriverWait(driver, 5).until(EC.presence_of_element_located((By.CSS_SELECTOR, "input[id='sp-cc-accept']"))).click()

    #Go to all reviews
    WebDriverWait(driver, 10).until(
        EC.element_to_be_clickable((By.CSS_SELECTOR, "a[data-hook='see-all-reviews-link-foot']"))).click()

    while True:
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        reviews = soup.find_all('div', {'data-hook': 'review'})
        for item in reviews:
            review = {
                'date': item.find('span', {'data-hook': 'review-date'}).text.strip(),
                'body': item.find('span', {'data-hook': 'review-body'}).text.strip(),
                }
            reviewlist.append(review)

        try:
            next_page_button = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, '.a-pagination .a-last'))).click()
            print('click')
            time.sleep(1)
        except Exception as ex:
            print(ex,"!!!!!!!")
            break

Tags: python selenium-webdriver beautifulsoup while-loop break

A:

1 upvote | Driftr95 | 4/9/2023 | #1

> As I understand it, it should work, because there is no "next page" button on the last review page anymore.

Except there is one, even though it's disabled:

[Image: next page]

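So the while loop probably ends up clicking it endlessly without any effect: the element is always found, no exception is raised, and break is never reached. I suggest using .a-pagination .a-last:not(.a-disabled) as the selector, rather than just .a-pagination .a-last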

Comments

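0 upvotes | Mina | 4/9/2023
It works now! Thank you! :) I had assumed that, because it says "disabled", it would be treated as a different element.
1 upvote | Driftr95 | 4/9/2023
@Mina Glad to help! Yes, a .{class} selector selects every element that includes that class. For example, .a-pagination .a-last.a-disabled would specifically select the disabled "next page" button, even if it had class="a-disabled a-last btn-grey"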
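
The matching behaviour can be checked offline with BeautifulSoup's select(), which accepts the same kind of CSS selectors. This is only a small sketch: the HTML snippet is a made-up stand-in built from the class attribute quoted above, not Amazon's actual markup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a "next page" item that carries the a-disabled class.
html = """
<ul class="a-pagination">
  <li class="a-disabled a-last btn-grey">Next page</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# A class selector matches any element whose class list contains that class,
# regardless of which other classes are present or in what order.
plain = soup.select(".a-pagination .a-last")
disabled_only = soup.select(".a-pagination .a-last.a-disabled")
enabled_only = soup.select(".a-pagination .a-last:not(.a-disabled)")

print(len(plain), len(disabled_only), len(enabled_only))  # prints: 1 1 0
```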