Python web scraping script does not find element by XPath even though it exists

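Asked by: GoldNova Asked: 1/31/2021 Updated: 1/31/2021 Views: 155

Q:

I am currently writing a small script that, given a link to a price comparison website in my country, should extract the name, link, price and image of the cheapest product.

An example link looks like this: https://geizhals.at/?cat=monlcd19wide&xf=11939_23~11955_IPS~11963_144~14591_19201080&asuch=&bpmin=&bpmax=&v=e&hloc=at&plz=&dist=&mail=&sort=p&bl1_id=30#productlist

This is the code I have so far: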

#!/usr/bin/env python3
from urllib.request import Request, urlopen
from lxml import html

# Earlier test URLs, kept for reference; only the last assignment was ever used.
# link = 'https://geizhals.at/?cat=monlcd19wide&xf=11939_23~11955_IPS~11963_144~14591_19201080&asuch=&bpmin=&bpmax=&v=e&hloc=at&plz=&dist=&mail=&sort=p&bl1_id=30#productlist'
# link = 'https://geizhals.at/?cat=monlcd19wide&v=e&hloc=at&sort=p&bl1_id=30&xf=11939_23%7E11955_IPS%7E11963_240%7E14591_19201080'
link = 'https://geizhals.at/?cat=cpuamdam4&xf=25_6%7E5_PCIe+4.0%7E5_SMT%7E820_AM4'

def get_webSite():
    req = Request(link, headers={'User-Agent': 'Mozilla/5.0'})
    return urlopen(req).read()


webpage = get_webSite()  # raw HTML of the page, as bytes
root = html.fromstring(webpage)

price = root.xpath("//*[@id=\"product0\"]/div[6]/span/span")[0].text.strip()
name = root.xpath("//*[@id=\"product0\"]/div[2]/a/span")[0].text.strip()
link = "https://geizhals.at/" + root.xpath("//*[@id=\"product0\"]/div[2]/a/@href")[0]
picture = root.xpath("//*[@id=\"product0\"]/div[1]/a/div/picture/img/@big-image-url")[0]
# @ selects an attribute of the current element; / separates the location steps of the path
# xpath() always returns a list, so [0] takes its first match (and raises IndexError if nothing matched)

price = price.lstrip('€ ')  # strips leading euro signs and spaces
price = price.replace(',', '.')  # replaces the decimal comma with a dot
price = float(price)  # converts the price string to a float

print(f"Price : {price}")
print("Name : " + (name))
print("Link : " + (link))
print("PictureLink : " + (picture))
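
The post-processing at the end of the script can be sketched as two small helpers. This is only an illustration under assumptions: parse_price and absolute_url are hypothetical names, the sample strings are invented, and the helper assumes that a dot in a price is only ever a thousands separator. Using urljoin avoids a doubled slash when the href already starts with "/" or "./".

```python
from urllib.parse import urljoin

BASE_URL = "https://geizhals.at/"


def parse_price(text):
    """Convert a price string such as '€ 199,90' to a float."""
    cleaned = text.replace("€", "").strip()
    # Assumption: '.' is a thousands separator and ',' is the decimal mark.
    cleaned = cleaned.replace(".", "").replace(",", ".")
    return float(cleaned)


def absolute_url(href):
    """Resolve a relative href against the site root."""
    return urljoin(BASE_URL, href)


print(parse_price("€ 199,90"))              # 199.9
print(parse_price("€ 1.299,00"))            # 1299.0
print(absolute_url("./some-product.html"))  # https://geizhals.at/some-product.html
```
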

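Everything works and is printed to the console, except for the link to the picture thumbnail. I tried both the regular XPath and the full XPath, to no avail: the element is not found even though it exists.

What could be the problem?

python web-scraping xpath html-parsing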

Comments

0 votes David542 2/2/2021
Did it work?

A:

1 vote David542 1/31/2021 #1

The error in the XPath is here:

img/@big-image-url

It should be:

img[@big-image-url]

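Otherwise the / steps from img down to a node below it (here, the attribute node), whereas you want to test for the attribute on the img tag itself. Here is an example that grabs all images from the page: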

import requests
from lxml import html

res = requests.get('https://geizhals.at/?cat=monlcd19wide&xf=11939_23~11955_IPS~11963_144~14591_19201080&asuch=&bpmin=&bpmax=&v=e&hloc=at&plz=&dist=&mail=&sort=p&bl1_id=30#productlist')
root = html.fromstring(res.content)
# Collect the big-image-url attribute of every img element that carries it
print([item.attrib['big-image-url'] for item in root.xpath('//img[@big-image-url]')])
['https://gzhls.at/i/61/20/2436120-n0.jpg', 'https://gzhls.at/i/05/53/2430553-n0.jpg', 'https://gzhls.at/i/75/76/2237576-n0.jpg', 'https://gzhls.at/i/15/28/2201528-n0.jpg', 'https://gzhls.at/i/19/26/2221926-n0.jpg', 'https://gzhls.at/i/06/38/2410638-n0.jpg', 'https://gzhls.at/i/98/04/2459804-n0.jpg', 'https://gzhls.at/i/14/04/2201404-n0.jpg', 'https://gzhls.at/i/24/52/2132452-n0.jpg', 'https://gzhls.at/i/17/64/2401764-n0.jpg', 'https://gzhls.at/i/07/97/2350797-n0.jpg', 'https://gzhls.at/i/50/31/2365031-n0.jpg', 'https://gzhls.at/i/25/01/2322501-n0.jpg', 'https://gzhls.at/i/26/50/2152650-n0.jpg', 'https://gzhls.at/i/27/93/2202793-n0.jpg', 'https://gzhls.at/i/72/69/2267269-n0.jpg', 'https://gzhls.at/i/20/79/2142079-n0.jpg', 'https://gzhls.at/i/06/48/2430648-n0.jpg', 'https://gzhls.at/i/41/24/2294124-n0.jpg', 'https://gzhls.at/i/82/46/2378246-n0.jpg', 'https://gzhls.at/i/46/35/2124635-n0.jpg', 'https://gzhls.at/i/43/84/2304384-n0.jpg', 'https://gzhls.at/i/29/73/2382973-n0.jpg', 'https://gzhls.at/i/07/36/2410736-n0.jpg', 'https://gzhls.at/i/97/54/2459754-n0.jpg', 'https://gzhls.at/i/67/40/2456740-n0.jpg', 'https://gzhls.at/i/15/03/2151503-n0.jpg', 'https://gzhls.at/i/45/26/2244526-n0.jpg', 'https://gzhls.at/i/91/51/2089151-n0.jpg', 'https://gzhls.at/i/39/71/2393971-n0.jpg']

So the value has to be present in the HTML as an attribute of the img tag, namely big-image-url.
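
As a small, self-contained illustration of the two expression forms, the sketch below runs both against invented markup (not copied from geizhals.at). In lxml both forms evaluate: the attribute step returns the attribute values as strings, while the predicate returns the img elements that carry the attribute. The helper first is hypothetical; it only avoids the IndexError that [0] raises on an empty result. If the attribute step comes back empty on the real page, one possible cause (an assumption, since the raw HTML is not shown here) is that the positional path in front of it does not match the HTML the server actually sends.

```python
from lxml import html

# Hypothetical markup; only the attribute name is taken from the answer.
SAMPLE = """
<html><body>
<div id="product0">
  <div><a href="./p1.html"><img big-image-url="https://example.com/1.jpg"></a></div>
</div>
<div id="product1">
  <div><a href="./p2.html"><img src="thumb.jpg"></a></div>
</div>
</body></html>
"""

root = html.fromstring(SAMPLE)

# Attribute step: yields the attribute values as strings.
values = root.xpath('//img/@big-image-url')

# Predicate: yields the img elements that have the attribute.
elements = root.xpath('//img[@big-image-url]')


def first(results, default=None):
    """Return the first XPath result, or default when nothing matched."""
    return results[0] if results else default


print(values)
print([el.get('big-image-url') for el in elements])
# product1 has no big-image-url, so this prints None instead of raising.
print(first(root.xpath('//*[@id="product1"]//img/@big-image-url')))
```

Searching with // below the product container, rather than counting div positions, tends to be less sensitive to small layout changes.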