Asked by: MagTun  Asked: 5/5/2021  Last edited by: MagTun  Updated: 5/5/2021  Views: 1131
Selenium python: get all the <li> text of all the <ul> from a <div>
Q:
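I want to get the full word lists (dutch word = english word) from several pages.
By inspecting the HTML, this means I need to get the text of every li in every ul under the child div of #mw-content-text.
Here is my code: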
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument('headless') # start chrome without opening window
driver = webdriver.Chrome(chrome_options=options)
listURL = [
    "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Basics_1",
    "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Basics_2",
    "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Phrases_1",
    "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Negative_1",
]
list_text = []
for url in listURL:
    driver.get(url)
    elem = driver.find_elements_by_xpath('//*[@id="mw-content-text"]/div/ul')
    for each_ul in elem:
        all_li = each_ul.find_elements_by_tag_name("li")
        for li in all_li:
            list_text.append(li.text)
print(list_text)
Here is the output:
['man = man', 'vrouw = woman', 'jongen = boy', 'ik = I', 'ben = am', 'een = a/an', 'en = and', '', '', '', '', '', '', '', '', '', '', '', '', '', '',
'', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']
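I don't understand why some of the li text cannot be retrieved even though the XPath is the same (I double-checked several of them with Copy XPath in the developer console).
A:
Try waiting for the page to finish loading before parsing it. One way is to use the time.sleep() method: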
from time import sleep
...
for url in listURL:
    driver.get(url)
    sleep(5)
    ...
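A fixed sleep(5) is either longer than needed or, on a slow load, still too short. As a rough alternative, the sketch below polls a condition until it holds or a timeout expires. wait_until is a hypothetical helper written for illustration, not part of Selenium or of the answer above, and the timeout and interval defaults are assumptions.

```python
import time


def wait_until(predicate, timeout=15.0, interval=0.5):
    """Call predicate() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds
    have elapsed without success. (Hypothetical helper, for illustration.)
    """
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)
```

Inside the loop over listURL, the predicate could be any zero-argument callable that inspects the page, for example one that checks whether the first li already has non-empty text.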
Edit: using BeautifulSoup:
import requests
from bs4 import BeautifulSoup
listURL = [
    "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Basics_1",
    "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Basics_2",
    "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Phrases_1",
    "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Negative_1",
]
list_text = []
for url in listURL:
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    print("Link:", url)
    # Select the "Lesson N" headings, but not the "Lessons" section heading
    for tag in soup.select("[id*=Lesson]:not([id*=Lessons])"):
        print(tag.text)
        print()
        # The word list is the first <ul> after each lesson heading
        print(tag.find_next("ul").text)
        print("-" * 80)
    print()
Output (truncated):
Link: https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Basics_1
Lesson 1
man = man
vrouw = woman
jongen = boy
ik = I
ben = am
een = a/an
en = and
--------------------------------------------------------------------------------
Lesson 2
meisje = girl
kind = child/kid
hij = he
ze = she (unstressed)
is = is
of = or
--------------------------------------------------------------------------------
Lesson 3
appel = apple
... And on
If you want the output as a list:
for url in listURL:
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    print("Link:", url)
    print([tag.text for tag in soup.select(".mw-parser-output > ul li")])
    print("-" * 80)
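Since the goal in the question is a dutch word = english word list, the collected strings can be split into pairs afterwards. The sketch below assumes each useful entry contains the separator " = " exactly as in the output shown in this thread; parse_word_pairs is a hypothetical helper name, and entries without the separator (empty strings, grammar notes) are simply skipped.

```python
def parse_word_pairs(entries):
    """Split 'dutch = english' strings into (dutch, english) tuples.

    Entries that are empty or lack the ' = ' separator are skipped.
    """
    pairs = []
    for entry in entries:
        # partition() splits on the first ' = ' only
        dutch, sep, english = entry.partition(" = ")
        if sep and dutch.strip() and english.strip():
            pairs.append((dutch.strip(), english.strip()))
    return pairs


sample = ["man = man", "", "Zijn (to be)", "een = a/an"]
print(parse_word_pairs(sample))
```

The same helper works on list_text from the Selenium versions, since it ignores the empty strings they may contain.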
Your script seems fine, but I would add an explicit or implicit wait. Try waiting until all the elements on the page are visible:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
options = webdriver.ChromeOptions()
options.add_argument('headless') # start chrome without opening window
driver = webdriver.Chrome(executable_path='/snap/bin/chromium.chromedriver', options=options)
listURL = [
    "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Basics_1",
    "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Basics_2",
    "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Phrases_1",
    "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Negative_1",
]
list_text = []
for url in listURL:
    driver.get(url)
    WebDriverWait(driver, 15).until(EC.visibility_of_all_elements_located((By.XPATH, '//*[@id="mw-content-text"]/div/ul')))
    elem = driver.find_elements_by_xpath('//*[@id="mw-content-text"]/div/ul')
    for each_ul in elem:
        all_li = each_ul.find_elements_by_tag_name("li")
        for li in all_li:
            list_text.append(li.text)
print(list_text)
Additionally, you can call driver.implicitly_wait(15) right after declaring driver.
Output:
['man = man', 'vrouw = woman', 'jongen = boy', 'ik = I', 'ben = am', 'een = a/an', 'en = and', 'meisje = girl', 'kind = child/kid', 'hij = he', 'ze = she (unstressed)', 'is = is', 'of = or', 'appel = apple', 'melk = milk', 'drinkt = drinks (2nd and 3rd person singular)', 'drink = drink (1st person singular)', 'eet = eat(s) (singular)', 'de = the', 'sap = juice', 'water = water', 'brood = bread', 'het = it, the', 'je = you (singular informal, unstressed)', 'bent = are (2nd person singular)', 'Zijn (to be)', 'Hebben (to have)', 'Mogen (to be allowed to)', 'Willen (to want)', 'Kunnen (to be able to)', 'Zullen ("will")', 'boterham = sandwich', 'rijst = rice', 'we = we (unstressed)', 'jullie = you (plural informal)', 'eten = eat (plural)', 'drinken = drink (plural)', 'vrouwen = women', 'mannen = men', 'meisjes = girls', 'krant = newspaper', 'lezen = read (plural)', 'jongens = boys', 'menu = menu', 'dat = that', 'zijn = are (plural)', 'ze = they (unstressed)', 'heb = have (1st person singular)', 'heeft = has (3rd person singular)', 'hebt = have (2nd person singular)', 'hebben = have (plural)', 'boek = book', 'lees = read (1st person singular)', 'leest = read(s) (2nd and 3rd person singular)', 'kinderen = children', 'spreken = speak (plural)', 'spreek = speak (1st person singular)', 'spreekt = speak(s) (2nd and 3rd person singular)', 'hallo = hello', 'bedankt = thanks', 'doei = bye', 'dag = goodbye', 'tot ziens = see you later', 'hoi = hi', 'goedemorgen = good morning', 'goededag = good day', 'goedenavond = good evening', 'goedenacht = good night', 'welterusten = good night', 'ja = yes', 'dank je wel = thank you very much', 'alsjeblieft = please', 'sorry = sorry', 'het spijt me = I am sorry', 'oké = okay', 'pardon = excuse me', 'hoe gaat het = how are you', 'goed = good, fine, well', 'dank je = thank you', '(een) beetje = (a) bit of', 'Engels = English', 'Nederlands = Dutch', 'Geen: negating indefinite nouns (you can think of it as "no" things or "none of" a thing if that 
helps). Geen replaces the indefinite pronoun in question.', 'Niet: negating a verb, adjective or definite nouns. Niet comes at the end of a sentence or directly after the verb zijn.', 'nee = no', 'niet = not', 'geen = not']
Update: I found a more reliable approach using CSS selectors. Please try it:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
options = webdriver.ChromeOptions()
options.add_argument('headless') # start chrome without opening window
driver = webdriver.Chrome(executable_path='/snap/bin/chromium.chromedriver', options=options)
driver.implicitly_wait(15)
listURL = [
    "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Basics_1",
    "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Basics_2",
    "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Phrases_1",
    "https://duolingo.fandom.com/wiki/Dutch_(NL)_Skill:Negative_1",
]
list_text = []
for url in listURL:
    driver.get(url)
    wait = WebDriverWait(driver, 15)
    wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div[id*='google_ads_iframe'] ")))
    wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, '.mw-parser-output>ul')))
    elem = driver.find_elements_by_css_selector('.mw-parser-output>ul')
    for each_ul in elem:
        all_li = each_ul.find_elements_by_css_selector("li")
        for li in all_li:
            list_text.append(li.text)
print(list_text)
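Update 2: After trying to understand the cause, I found that the ads take up most of the loading time. So I added a wait until all the ads have loaded: wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div[id*='google_ads_iframe'] ")))
I also changed the second wait to .mw-parser-output>ul by removing the trailing li, which I don't think is necessary. You can also try removing the second wait and see whether that helps.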
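The snippets in this thread use Selenium 3 calls. In recent Selenium 4 releases the find_elements_by_* helpers and the executable_path argument are no longer available, so the same idea has to be written with find_elements(By..., ...). The following is a sketch under that assumption, not the answerer's code: collect_li_text and scrape are hypothetical names, and it assumes Selenium can locate a Chrome driver without an explicit path.

```python
def collect_li_text(driver, selector=".mw-parser-output > ul li"):
    """Return the .text of every element matching the CSS selector."""
    # "css selector" is the string value of selenium's By.CSS_SELECTOR
    return [item.text for item in driver.find_elements("css selector", selector)]


def scrape(urls, selector=".mw-parser-output > ul li", timeout=15):
    """Open each URL in headless Chrome and gather the matching li text."""
    # Imported lazily so collect_li_text can be used without a browser.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.wait import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")  # start chrome without opening window
    driver = webdriver.Chrome(options=options)
    try:
        words = []
        for url in urls:
            driver.get(url)
            # Wait until at least one matching li is present in the DOM
            WebDriverWait(driver, timeout).until(
                EC.presence_of_all_elements_located((By.CSS_SELECTOR, selector))
            )
            words.extend(collect_li_text(driver, selector))
        return words
    finally:
        driver.quit()  # always release the browser process
```

Calling scrape(listURL) would then return one combined list, playing the role of list_text in the code above.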
After
WebDriverWait(driver, 15).until(EC.visibility_of_all_elements_located((By.XPATH, '//*[@id="mw-content-text"]/div/ul')))
you need to add a short sleep; I think time.sleep(1) would be enough. Only after that should you run
elem = driver.find_elements_by_xpath('//*[@id="mw-content-text"]/div/ul')
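Your problem is caused by a misunderstanding of the visibility_of_all_elements_located function.
It does not actually wait for all the elements matched by the locator you pass to become visible, because it cannot know how many such elements to wait for.
So as soon as it detects at least 1 visible element matching your locator, it returns the list of detected elements and the program moves on.
For more details about these methods, see here and the official documentation.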
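Because a built-in condition can only inspect the elements that match at the moment it polls, one option is a custom condition that checks the thing the question actually cares about: that every li has non-empty text. WebDriverWait.until accepts any callable that takes the driver and returns a truthy value when done, so a sketch could look like this. all_items_have_text is a hypothetical name, and the selector is taken from the answers above.

```python
def all_items_have_text(selector):
    """Build a wait condition: true once every match has non-empty text."""
    def _condition(driver):
        # "css selector" is the string value of selenium's By.CSS_SELECTOR
        items = driver.find_elements("css selector", selector)
        texts = [item.text for item in items]
        # Keep waiting while nothing matches or any entry is still blank
        if texts and all(text.strip() for text in texts):
            return texts
        return False
    return _condition


# Intended use with a real driver (not executed here):
#   texts = WebDriverWait(driver, 15).until(
#       all_items_have_text(".mw-parser-output > ul li"))
```

One caveat: WebElement.text only reports text that is rendered as visible, so if some li elements stay hidden on the page this condition would time out rather than return. In that case the empty strings would not be a timing problem at all.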
Comments:
An expectation for checking that all elements are present on the DOM of a page and visible. Visibility means that the elements are not only displayed but also has a height and width that is greater than 0. locator - used to find the elements returns the list of WebElements once they are located and visible