Why is "requests-html" not rendering all HTML content?

Asked by: Ahmad Abdelbaset Asked: 5/21/2023 Last edited by: Ahmad Abdelbaset Updated: 6/18/2023 Views: 99

Q:

I am trying to scrape data, but the script does not load all of the HTML content, even though I changed the render time. Here is the code:

from requests_html import HTMLSession, AsyncHTMLSession

url = 'https://www.aliexpress.com/w/wholesale-test.html?catId=0&initiative_id=SB_20230516115154&SearchText=test&spm=a2g0o.home.1000002.0'


def create_session(url):
    session = HTMLSession()
    request = session.get(url)
    print("Before   ",len(request.html.html),"\n\n")
    request.html.render(sleep=5, timeout=20)  # The site is dynamic, so wait for the page to load before querying it
    prod = request.html.find('#root > div > div > div.right--container--1WU9aL4.right--hasPadding--52H__oG > div > div.content--container--2dDeH1y > div.list--gallery--34TropR > a:nth-child(1) > div.manhattan--content--1KpBbUi')
    print("After   ",len(request.html.html),"\n\n")
    print("output:",prod)
    session.close()

create_session(url)

When I run the code for the first time, the output is:

Before  55448

After   542927

output: [<Element 'div' class=('manhattan--content--1KpBbUi',)>]

When I run the program again (without changing anything in the code), I get:

Before  55448  
 
After   251734   

output: []

When I change the sleep time from 5 to 100, that is, from request.html.render(sleep=5,timeout=20) to request.html.render(sleep=100,timeout=20), I get similar output:

Before  55448   

After   242881   

output: []

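The rendered page is much shorter on these runs than on the first one (about 250,000 characters instead of 542,927), and the selector no longer matches anything, so it does not appear to render all of the HTML content.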

Tags: python web-scraping beautifulsoup python-requests html-rendering

A: No answers yet