Python web scraping script does not find element by XPath even though it exists

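Question:

I am currently writing a small script that, given a link to a price comparison site in my country, should extract the name, link, price and image of the cheapest product.

An example link looks like this: https://geizhals.at/?cat=monlcd19wide&xf=11939_23~11955_IPS~11963_144~14591_19201080&asuch=&bpmin=&bpmax=&v=e&hloc=at&plz=&dist=&mail=&sort=p&bl1_id=30#productlist

This is the code I have so far: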

#!/usr/bin/env python3
from urllib.request import Request, urlopen
from lxml import html

# Only the last assignment takes effect; the earlier links are alternative test URLs.
link = 'https://geizhals.at/?cat=monlcd19wide&xf=11939_23~11955_IPS~11963_144~14591_19201080&asuch=&bpmin=&bpmax=&v=e&hloc=at&plz=&dist=&mail=&sort=p&bl1_id=30#productlist'
link = 'https://geizhals.at/?cat=monlcd19wide&v=e&hloc=at&sort=p&bl1_id=30&xf=11939_23%7E11955_IPS%7E11963_240%7E14591_19201080'
link = 'https://geizhals.at/?cat=cpuamdam4&xf=25_6%7E5_PCIe+4.0%7E5_SMT%7E820_AM4'


def get_webSite():
    req = Request(link, headers={'User-Agent': 'Mozilla/5.0'})
    return urlopen(req).read()


webpage = get_webSite()  # Contains all HTML from the site
root = html.fromstring(webpage)

price = root.xpath("//*[@id=\"product0\"]/div[6]/span/span")[0].text.strip()
name = root.xpath("//*[@id=\"product0\"]/div[2]/a/span")[0].text.strip()
link = "https://geizhals.at/" + root.xpath("//*[@id=\"product0\"]/div[2]/a/@href")[0]
picture = root.xpath("//*[@id=\"product0\"]/div[1]/a/div/picture/img/@big-image-url")[0]
# The @ selects an attribute of the matched element; the / slashes separate the steps of the path
# The [0] takes the first element of the list that xpath() returns, which is expected to hold exactly one item

price = price.lstrip('€ ')  # removes the euro sign and the space
price = price.replace(',', '.')  # replaces the decimal comma with a dot
price = float(price)  # converts the price string to a float

print(f"Price : {price}")
print("Name : " + name)
print("Link : " + link)
print("PictureLink : " + picture)

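Everything works and is printed to the console, except for the link to the image thumbnail. I tried both the regular XPath and the full XPath, but to no avail. No such element is found, even though it exists.

What could be the problem?

Tags: python, web-scraping, xpath, html-parsing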

Answer:

import requests
from lxml import html

res = requests.get('https://geizhals.at/?cat=monlcd19wide&xf=11939_23~11955_IPS~11963_144~14591_19201080&asuch=&bpmin=&bpmax=&v=e&hloc=at&plz=&dist=&mail=&sort=p&bl1_id=30#productlist')
root = html.fromstring(res.content)
# Select every img element that carries a big-image-url attribute and collect its value
print([item.attrib['big-image-url'] for item in root.xpath('//img[@big-image-url]')])
['https://gzhls.at/i/61/20/2436120-n0.jpg', 'https://gzhls.at/i/05/53/2430553-n0.jpg', 'https://gzhls.at/i/75/76/2237576-n0.jpg', 'https://gzhls.at/i/15/28/2201528-n0.jpg', 'https://gzhls.at/i/19/26/2221926-n0.jpg', 'https://gzhls.at/i/06/38/2410638-n0.jpg', 'https://gzhls.at/i/98/04/2459804-n0.jpg', 'https://gzhls.at/i/14/04/2201404-n0.jpg', 'https://gzhls.at/i/24/52/2132452-n0.jpg', 'https://gzhls.at/i/17/64/2401764-n0.jpg', 'https://gzhls.at/i/07/97/2350797-n0.jpg', 'https://gzhls.at/i/50/31/2365031-n0.jpg', 'https://gzhls.at/i/25/01/2322501-n0.jpg', 'https://gzhls.at/i/26/50/2152650-n0.jpg', 'https://gzhls.at/i/27/93/2202793-n0.jpg', 'https://gzhls.at/i/72/69/2267269-n0.jpg', 'https://gzhls.at/i/20/79/2142079-n0.jpg', 'https://gzhls.at/i/06/48/2430648-n0.jpg', 'https://gzhls.at/i/41/24/2294124-n0.jpg', 'https://gzhls.at/i/82/46/2378246-n0.jpg', 'https://gzhls.at/i/46/35/2124635-n0.jpg', 'https://gzhls.at/i/43/84/2304384-n0.jpg', 'https://gzhls.at/i/29/73/2382973-n0.jpg', 'https://gzhls.at/i/07/36/2410736-n0.jpg', 'https://gzhls.at/i/97/54/2459754-n0.jpg', 'https://gzhls.at/i/67/40/2456740-n0.jpg', 'https://gzhls.at/i/15/03/2151503-n0.jpg', 'https://gzhls.at/i/45/26/2244526-n0.jpg', 'https://gzhls.at/i/91/51/2089151-n0.jpg', 'https://gzhls.at/i/39/71/2393971-n0.jpg']

So the value is present in the HTML as an attribute of the img element, namely big-image-url.
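To get only the first product's image rather than every image on the page, the attribute-based selection can be scoped to the product container. The following is a minimal sketch, not a verified fix for the live site: the SAMPLE markup is made up for illustration, and the helper name first_big_image_url is hypothetical. It relies on lxml passing keyword arguments of xpath() into the expression as XPath variables, and it returns None instead of raising IndexError when nothing matches.

```python
from lxml import html

# Made-up markup standing in for the real page (an assumption, not the site's actual HTML).
SAMPLE = """
<html><body>
  <div id="product0">
    <div><a href="/p0"><span><img big-image-url="https://example.com/a.jpg" src="t0.jpg"></span></a></div>
  </div>
  <div id="product1">
    <div><a href="/p1"><img src="t1.jpg"></a></div>
  </div>
</body></html>
"""


def first_big_image_url(root, product_id):
    """Return the big-image-url of the first matching img inside the product, or None."""
    # '//' after the container matches img at any depth, so the exact nesting
    # (div/a/div/picture/...) does not have to be known in advance.
    matches = root.xpath(
        "//*[@id=$pid]//img[@big-image-url]/@big-image-url",
        pid=product_id,  # bound to the $pid XPath variable
    )
    return str(matches[0]) if matches else None


root = html.fromstring(SAMPLE)
print(first_big_image_url(root, "product0"))
print(first_big_image_url(root, "product1"))
```

In the question's script, picture = first_big_image_url(root, "product0") would take the place of the absolute path, assuming the served HTML keeps the img somewhere inside the product0 element.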
