Selenium python: get all the <li> text of all the <ul> from a <div>

Asked by: MagTun Asked: 5/5/2021 Last edited by: MagTun Updated: 5/5/2021 Views: 1131

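Q:

I want to get all the "dutch word = english word" lists from several pages.

From examining the HTML, this means I need to get the text of every li in every ul of the child div of #mw-content-text.

Here is my code: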

from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument('headless')  # start chrome without opening window
driver = webdriver.Chrome(chrome_options=options)

listURL = [
    "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Basics_1",
    "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Basics_2",
    "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Phrases_1",
    "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Negative_1",
]


list_text = []
for url in listURL:
    driver.get(url)
    elem = driver.find_elements_by_xpath('//*[@id="mw-content-text"]/div/ul')
    for each_ul in elem:
        all_li = each_ul.find_elements_by_tag_name("li")
        for li in all_li:
            list_text.append(li.text)

print(list_text)

Here is the output:

['man = man', 'vrouw = woman', 'jongen = boy', 'ik = I', 'ben = am', 'een = a/an', 'en = and', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 
'', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']

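I don't understand why some li text is not retrieved even though the XPath is the same (I double-checked several of them with Copy XPath in the developer console).

python selenium xpath html-parsing

Comments

0 upvotes vitaliis 5/5/2021
What is li on the page?
0 upvotes MagTun 5/5/2021
@vitaliis, thanks for your time. Sorry for the mistake; I have edited the question to make the problem clearer.

A:

2 upvotes MendelG 5/5/2021 #1

Try waiting for the page to load fully before parsing it. One way is to use time.sleep():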

from time import sleep
...

for url in listURL:
    driver.get(url)
    sleep(5)
    ...
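
A fixed sleep(5) is either longer than needed or too short. As a minimal sketch of the alternative, the helper below polls a condition until it holds or a timeout expires. The name wait_until and its parameters are hypothetical and not part of Selenium; Selenium's own WebDriverWait follows the same idea.

```python
import time


def wait_until(predicate, timeout=15.0, poll=0.5):
    """Call predicate() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds
    have passed without a truthy result.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll)


# Possible use with the question's code (assumes `driver` already exists):
#   uls = wait_until(lambda: driver.find_elements_by_xpath(
#       '//*[@id="mw-content-text"]/div/ul'))
```

The loop returns as soon as the condition holds, so a page that is ready after one second does not cost the full five.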

Edit: Using BeautifulSoup:

import requests
from bs4 import BeautifulSoup


listURL = [
    "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Basics_1",
    "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Basics_2",
    "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Phrases_1",
    "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Negative_1",
]


list_text = []
for url in listURL:
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    print("Link:", url)
    
    for tag in soup.select("[id*=Lesson]:not([id*=Lessons])"):
        print(tag.text)
        print()
        print(tag.find_next("ul").text)
        print("-" * 80)
    print()

Output (truncated):

Link: https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Basics_1
Lesson 1

man = man
vrouw = woman
jongen = boy
ik = I
ben = am
een = a/an
en = and
--------------------------------------------------------------------------------
Lesson 2

meisje = girl
kind = child/kid
hij = he
ze = she (unstressed)
is = is
of = or
--------------------------------------------------------------------------------
Lesson 3

appel = apple

... And on

If you want the output as a list:

for url in listURL:
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    print("Link:", url)
    print([tag.text for tag in soup.select(".mw-parser-output > ul li")])
    print("-" * 80)

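Comments

0 upvotes MagTun 5/5/2021
Thank you, but something is really strange: when I add the sleep, even if I only add sleep(5) without changing the XPath or anything else, I get no words at all in the output (if I comment out the sleep, the few words from the question come back).
0 upvotes MendelG 5/5/2021
Maybe the page gets stuck while loading. Try printing anything after the sleep; does it output?
0 upvotes MendelG 5/5/2021
@vitaliis Why? When I run the OP's code on my machine I do get an output; shouldn't the problem be that the elements need time to load?
0 upvotes MagTun 5/5/2021
@MendelG, I tried print("test") and print(driver); both were printed.
0 upvotes MagTun 5/5/2021
@MendelG, you wrote that you "get an output". Do you get the full output, with all the words from all the pages?
1 upvote vitaliis 5/5/2021 #2

Your script looks fine, but I would add an explicit or implicit wait. Try waiting until all the elements on the page are visible: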

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = webdriver.ChromeOptions()
options.add_argument('headless')  # start chrome without opening window

driver = webdriver.Chrome(executable_path='/snap/bin/chromium.chromedriver', options=options)
listURL = [
    "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Basics_1",
    "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Basics_2",
    "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Phrases_1",
    "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Negative_1",
]


list_text = []
for url in listURL:
    driver.get(url)
    WebDriverWait(driver, 15).until(EC.visibility_of_all_elements_located((By.XPATH, '//*[@id="mw-content-text"]/div/ul')))
    elem = driver.find_elements_by_xpath('//*[@id="mw-content-text"]/div/ul')
    for each_ul in elem:
        all_li = each_ul.find_elements_by_tag_name("li")
        for li in all_li:
            list_text.append(li.text)

print(list_text)

Also, you can call driver.implicitly_wait(15) right after declaring the driver.

Output:

['man = man', 'vrouw = woman', 'jongen = boy', 'ik = I', 'ben = am', 'een = a/an', 'en = and', 'meisje = girl', 'kind = child/kid', 'hij = he', 'ze = she (unstressed)', 'is = is', 'of = or', 'appel = apple', 'melk = milk', 'drinkt = drinks (2nd and 3rd person singular)', 'drink = drink (1st person singular)', 'eet = eat(s) (singular)', 'de = the', 'sap = juice', 'water = water', 'brood = bread', 'het = it, the', 'je = you (singular informal, unstressed)', 'bent = are (2nd person singular)', 'Zijn (to be)', 'Hebben (to have)', 'Mogen (to be allowed to)', 'Willen (to want)', 'Kunnen (to be able to)', 'Zullen ("will")', 'boterham = sandwich', 'rijst = rice', 'we = we (unstressed)', 'jullie = you (plural informal)', 'eten = eat (plural)', 'drinken = drink (plural)', 'vrouwen = women', 'mannen = men', 'meisjes = girls', 'krant = newspaper', 'lezen = read (plural)', 'jongens = boys', 'menu = menu', 'dat = that', 'zijn = are (plural)', 'ze = they (unstressed)', 'heb = have (1st person singular)', 'heeft = has (3rd person singular)', 'hebt = have (2nd person singular)', 'hebben = have (plural)', 'boek = book', 'lees = read (1st person singular)', 'leest = read(s) (2nd and 3rd person singular)', 'kinderen = children', 'spreken = speak (plural)', 'spreek = speak (1st person singular)', 'spreekt = speak(s) (2nd and 3rd person singular)', 'hallo = hello', 'bedankt = thanks', 'doei = bye', 'dag = goodbye', 'tot ziens = see you later', 'hoi = hi', 'goedemorgen = good morning', 'goededag = good day', 'goedenavond = good evening', 'goedenacht = good night', 'welterusten = good night', 'ja = yes', 'dank je wel = thank you very much', 'alsjeblieft = please', 'sorry = sorry', 'het spijt me = I am sorry', 'oké = okay', 'pardon = excuse me', 'hoe gaat het = how are you', 'goed = good, fine, well', 'dank je = thank you', '(een) beetje = (a) bit of', 'Engels = English', 'Nederlands = Dutch', 'Geen: negating indefinite nouns (you can think of it as "no" things or "none of" a thing if that 
helps). Geen replaces the indefinite pronoun in question.', 'Niet: negating a verb, adjective or definite nouns. Niet comes at the end of a sentence or directly after the verb zijn.', 'nee = no', 'niet = not', 'geen = not']
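
The list above mixes "dutch = english" pairs with entries that have no " = " (verb headings and grammar notes), and the question's output also contained empty strings. As a rough post-processing sketch, the hypothetical helper split_word_pairs below (not from the thread) separates the two kinds of entry and drops the blanks.

```python
def split_word_pairs(entries, sep=" = "):
    """Split scraped entries into (dutch, english) pairs and other text.

    Empty strings are dropped; entries without `sep` are returned
    separately instead of being discarded.
    """
    pairs, others = [], []
    for entry in entries:
        entry = entry.strip()
        if not entry:
            continue
        left, found, right = entry.partition(sep)
        if found:
            pairs.append((left, right))
        else:
            others.append(entry)
    return pairs, others


sample = ['man = man', 'vrouw = woman', '', 'Zijn (to be)', 'een = a/an']
pairs, others = split_word_pairs(sample)
print(pairs)   # [('man', 'man'), ('vrouw', 'woman'), ('een', 'a/an')]
print(others)  # ['Zijn (to be)']
```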

Update: I found a more reliable approach using a CSS selector. Please try it:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = webdriver.ChromeOptions()
options.add_argument('headless')  # start chrome without opening window

driver = webdriver.Chrome(executable_path='/snap/bin/chromium.chromedriver', options=options)
driver.implicitly_wait(15)
listURL = [
    "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Basics_1",
    "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Basics_2",
    "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Phrases_1",
    "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Negative_1",
]


list_text = []
for url in listURL:
    driver.get(url)
    wait = WebDriverWait(driver, 15)
    wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div[id*='google_ads_iframe'] ")))
    wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, '.mw-parser-output>ul')))
    elem = driver.find_elements_by_css_selector('.mw-parser-output>ul')
    for each_ul in elem:
        all_li = each_ul.find_elements_by_css_selector("li")
        for li in all_li:
            list_text.append(li.text)

print(list_text)

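Update 2: After trying to understand the cause, I found that the ads take up most of the loading time. So I added wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div[id*='google_ads_iframe'] "))) to wait until all the ads have loaded.

I also changed the second wait to .mw-parser-output>ul by removing the trailing li; I don't think it is necessary. You can also try removing the second wait and see if that helps.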
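
Newer Selenium 4 releases no longer provide the find_elements_by_* helpers used in this thread; the generic driver.find_elements(by, value) call takes their place. The sketch below uses that call and adds a fallback to the textContent attribute when .text is empty. The fallback rests on an assumption not confirmed in this thread: that the empty strings come from elements Selenium treats as not displayed, for which .text returns ''. The name collect_li_texts is hypothetical.

```python
# Locator as a (strategy, value) tuple. "css selector" is the string value
# of selenium.webdriver.common.by.By.CSS_SELECTOR.
LI_LOCATOR = ("css selector", ".mw-parser-output > ul li")


def collect_li_texts(driver, urls, locator=LI_LOCATOR):
    """Visit each URL and return the non-empty text of every matching li."""
    texts = []
    for url in urls:
        driver.get(url)
        for li in driver.find_elements(*locator):
            # .text only covers rendered text; textContent is read from
            # the DOM regardless of visibility (assumed cause, see above).
            text = li.text or (li.get_attribute("textContent") or "")
            text = text.strip()
            if text:
                texts.append(text)
    return texts


# Possible use (assumes `driver` and `listURL` exist as in the answer):
#   list_text = collect_li_texts(driver, listURL)
```

This function does not wait by itself, so it is meant to be combined with one of the waits discussed in the answers.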

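Comments

0 upvotes MagTun 5/5/2021
I get this error: selenium.common.exceptions.TimeoutException: Message:
0 upvotes vitaliis 5/5/2021
What is the message?
0 upvotes MagTun 5/5/2021
The message is empty.
0 upvotes vitaliis 5/5/2021
Try using presence_of_all_elements_located and set the timeout to 10 or 15 seconds instead of 5.
1 upvote MendelG 5/5/2021
I updated my answer after noticing that the output was not a list; I ended up using basically the same CSS selector as you (I hadn't noticed your answer).
0 upvotes Prophet 5/5/2021 #3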

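After

WebDriverWait(driver, 15).until(EC.visibility_of_all_elements_located((By.XPATH, '//*[@id="mw-content-text"]/div/ul')))

you need to add some sleep (I guess time.sleep(1) will be enough), and only after that perform

elem = driver.find_elements_by_xpath('//*[@id="mw-content-text"]/div/ul')

Your problem is caused by a misunderstanding of what visibility_of_all_elements_located does.
It does not actually wait for all the elements matching the locator you pass to become visible; it has no way of knowing how many such elements to wait for.
So as soon as it detects at least one visible element matching your locator, it returns the list of detected elements and the program moves on.
For more details on these methods, see the linked answer and the official documentation.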
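
If you know roughly how many items to expect, one way to tell Selenium how many elements to wait for is a custom wait condition. WebDriverWait.until accepts any callable that takes the driver and returns a truthy value when ready. The sketch below is only an illustration: the name at_least_n_with_text is hypothetical, and the minimum count is something you have to choose per page.

```python
def at_least_n_with_text(locator, minimum):
    """Wait condition: truthy once `minimum` matching elements have text.

    `locator` is a (strategy, value) tuple such as
    (By.CSS_SELECTOR, ".mw-parser-output > ul li").
    """
    def _predicate(driver):
        elements = driver.find_elements(*locator)
        with_text = [el for el in elements if el.text.strip()]
        # Returning False makes WebDriverWait poll again.
        return with_text if len(with_text) >= minimum else False

    return _predicate


# Possible use (the minimum of 7 is an illustrative guess, not a value
# taken from this thread):
#   from selenium.webdriver.common.by import By
#   from selenium.webdriver.support.wait import WebDriverWait
#   items = WebDriverWait(driver, 15).until(
#       at_least_n_with_text((By.CSS_SELECTOR, ".mw-parser-output > ul li"), 7))
```

The sketch does not handle elements that go stale between polls; if the page re-renders while waiting, that case would need extra handling.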

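Comments

0 upvotes vitaliis 5/5/2021
From the source code: An expectation for checking that all elements are present on the DOM of a page and visible. Visibility means that the elements are not only displayed but also has a height and width that is greater than 0. locator - used to find the elements returns the list of WebElements once they are located and visible
1 upvote Prophet 5/5/2021
Did you see the link and the answer there? How do you think Selenium is supposed to know how many elements to wait for?
0 upvotes vitaliis 5/5/2021
Yes, I checked it. I agree. To me, this question needs more exploration. To understand it better, the internal method _find_elements should also be taken into account.
0 upvotes MagTun 5/5/2021
I get the same error as with @vitaliis's answer, selenium.common.exceptions.TimeoutException: Message:, and the message is empty. But in the browser opened by Selenium I can see that the page is fully loaded, including the words I want to retrieve.
0 upvotes Prophet 5/5/2021
On which line of code do you get this error?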