Parser optimization using asynchrony in Python

Asked by AnLucKa · Asked 10/25/2023 · Modified 10/25/2023 · Viewed 12 times

Q:

My task is to scrape data from a website every second and check whether a new announcement has appeared. I also have the requirement that the scraper must execute JavaScript. I first tried to implement this with requests_html, but then realized the site uses HTTP/2, which that library does not support. So I decided to use pyppeteer for scraping. Here is my code:

import asyncio
import datetime
from csv import DictWriter
from time import time

from pyppeteer import launch
from fake_useragent import UserAgent
from bs4 import BeautifulSoup


async def create_browser():
    browser = await launch({'headless': True,
                            'executablePath': '/usr/bin/google-chrome-stable'})
    return browser


async def scrape_page(browser, ua):
    page = await browser.newPage()
    await page.setUserAgent(ua.random)
    await page.goto(f'https://announcements.bybit.com/en-US/?category=&page=1&{int(time())}')
    visit_time = datetime.datetime.now()
    html_content = await page.content()
    return html_content, visit_time


class Parsing:
    def __init__(self, domain):
        self.domain = domain
        self.last_news_title = ""

    def __call__(self, html_code, visit_time):
        soup = BeautifulSoup(html_code, 'lxml')
        news_data = soup.select_one('a.no-style span:only-child')
        news_title = " ".join(news_data.text.strip().split())
        if news_title != self.last_news_title:
            self.last_news_title = news_title
            link = f"{self.domain}{news_data.find_parent('a')['href']}"
            d = {'time': visit_time, 'title': self.last_news_title, 'link': link}
            # newline='' is recommended by the csv docs to avoid blank rows
            with open('data/news.csv', 'a', newline='') as file:
                writer = DictWriter(file, fieldnames=list(d.keys()), dialect='excel')
                writer.writerow(d)
            print("New item recorded")
        else:
            print("No new items")


async def job(parsing, browser, ua):
    html_code, visit_time = await scrape_page(browser, ua)
    parsing(html_code, visit_time)


async def main():
    start = time()
    parsing = Parsing("https://announcements.bybit.com")
    our_browser = await create_browser()
    user_agent = UserAgent()
    try:
        while True:
            await job(parsing, our_browser, user_agent)
            await asyncio.sleep(1)
            print(time()-start)
    finally:
        # Close the browser on any exit (including Ctrl+C) instead of
        # silently swallowing every exception with a bare except.
        await our_browser.close()


if __name__ == '__main__':
    asyncio.run(main())

As I measured with time(), each check of the site for new data takes 3–4 seconds, which seems too long to me. I also implemented the same logic with Selenium (which, as far as I know, is synchronous and does not support async), and it showed roughly the same timing. Is my code actually asynchronous?

python-3.x parsing selenium-webdriver python-asyncio pyppeteer

Comments

0 votes · Paul Cornelius · 10/25/2023

Your code is not asynchronous. asyncio.run automatically creates a single task; that task is main. You never create a second task, so there is no multitasking. See the docs for asyncio.create_task, asyncio.TaskGroup, or loop.create_task (where loop is the running asyncio event loop).
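To illustrate the commenter's point, here is a minimal, self-contained sketch (not the asker's scraper; `asyncio.sleep` stands in for the real page fetch) showing why awaiting one coroutine at a time gives no speedup, while `asyncio.create_task` lets the waits overlap:

```python
import asyncio
import time


async def fake_fetch(delay: float) -> str:
    # Stand-in for a page fetch: each "request" takes `delay` seconds.
    await asyncio.sleep(delay)
    return "html"


async def sequential(n: int, delay: float) -> float:
    # One task (the one asyncio.run created); each await finishes
    # before the next fetch starts, so the waits add up.
    start = time.monotonic()
    for _ in range(n):
        await fake_fetch(delay)
    return time.monotonic() - start


async def concurrent(n: int, delay: float) -> float:
    # create_task schedules every fetch on the event loop up front,
    # so their waits overlap and total time is roughly one delay.
    start = time.monotonic()
    tasks = [asyncio.create_task(fake_fetch(delay)) for _ in range(n)]
    await asyncio.gather(*tasks)
    return time.monotonic() - start


seq = asyncio.run(sequential(3, 0.2))   # roughly 3 * 0.2 s
conc = asyncio.run(concurrent(3, 0.2))  # roughly 0.2 s
print(f"sequential: {seq:.2f}s, concurrent: {conc:.2f}s")
```

The asker's loop has the same shape as `sequential`: `job` is awaited to completion before the next iteration, so the 3–4 seconds per check is dominated by the page load itself, not by a lack of async syntax.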

A: No answers yet