Asked by: AnLucKa  Asked: 10/25/2023  Updated: 10/25/2023  Views: 12
Parser optimization using asynchrony in Python
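Q:
My task is to parse data from a website every second and check whether any new news has appeared. A further requirement is that the parser must execute JavaScript. I tried to do this with requests_html, but then realized that the site uses HTTP/2, which that library does not support. I decided to use pyppeteer for the scraping. Here is my code: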
import asyncio
import datetime
from csv import DictWriter
from time import time

from pyppeteer import launch
from fake_useragent import UserAgent
from bs4 import BeautifulSoup


async def create_browser():
    # Launch one headless Chrome instance that is reused for every check.
    browser = await launch({'headless': True,
                            'executablePath': '/usr/bin/google-chrome-stable'})
    return browser


async def scrape_page(browser, ua):
    # Open a new tab with a random user agent and load the announcements page.
    page = await browser.newPage()
    await page.setUserAgent(ua.random)
    # The timestamp in the query string makes every URL unique.
    await page.goto(f'https://announcements.bybit.com/en-US/?category=&page=1&{int(time())}')
    visit_time = datetime.datetime.now()
    html_content = await page.content()
    return html_content, visit_time


class Parsing:
    def __init__(self, domain):
        self.domain = domain
        self.last_news_title = ""

    def __call__(self, html_code, visit_time):
        soup = BeautifulSoup(html_code, 'lxml')
        # First announcement title on the page.
        news_data = soup.select_one('a.no-style span:only-child')
        news_title = " ".join(news_data.text.strip().split())
        if news_title != self.last_news_title:
            self.last_news_title = news_title
            link = f"{self.domain}{news_data.find_parent('a')['href']}"
            d = {'time': visit_time, 'title': self.last_news_title, 'link': link}
            # Append the new item to the CSV file.
            with open('data/news.csv', 'a') as file:
                writer = DictWriter(file, fieldnames=list(d.keys()), dialect='excel')
                writer.writerow(d)
            print("New news item written")
        else:
            print("No new news")


async def job(parsing, browser, ua):
    html_code, visit_time = await scrape_page(browser, ua)
    parsing(html_code, visit_time)


async def main():
    start = time()
    parsing = Parsing("https://announcements.bybit.com")
    our_browser = await create_browser()
    user_agent = UserAgent()
    try:
        while True:
            await job(parsing, our_browser, user_agent)
            await asyncio.sleep(1)
            # Seconds elapsed since the script started.
            print(time() - start)
    except:
        await our_browser.close()


if __name__ == '__main__':
    asyncio.run(main())
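As I can see from the time() output, each check for new data on the site takes 3-4 seconds, which seems too long to me. I also implemented similar logic with Selenium (which, as far as I know, is synchronous and does not support async), and it showed roughly the same result. Is my code actually asynchronous?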
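To make the question concrete, here is a minimal sketch of the difference I have in mind. It is not measured against the real site: `fake_fetch` is a hypothetical stand-in that only sleeps in place of `page.goto()` and `page.content()`, and `sequential`, `concurrent` and `compare` are illustrative names. It compares awaiting one coroutine at a time in a loop, as my `while True` loop does, with starting several at once through `asyncio.gather`.

```python
import asyncio
from time import perf_counter


async def fake_fetch(delay: float) -> float:
    """Hypothetical stand-in for loading a page; it only sleeps."""
    await asyncio.sleep(delay)
    return delay


async def sequential(n: int, delay: float) -> float:
    """Await each fetch before starting the next one."""
    start = perf_counter()
    for _ in range(n):
        await fake_fetch(delay)
    return perf_counter() - start


async def concurrent(n: int, delay: float) -> float:
    """Start all fetches together and wait for all of them."""
    start = perf_counter()
    await asyncio.gather(*(fake_fetch(delay) for _ in range(n)))
    return perf_counter() - start


async def compare(n: int = 3, delay: float = 0.2) -> tuple[float, float]:
    """Return (sequential seconds, concurrent seconds) for n fake fetches."""
    return await sequential(n, delay), await concurrent(n, delay)


if __name__ == "__main__":
    seq, conc = asyncio.run(compare())
    print(f"sequential: {seq:.2f}s, concurrent: {conc:.2f}s")
```

In this sketch the sequential version takes roughly `n * delay`, while the concurrent one takes roughly a single `delay`. If the same holds for my script, then awaiting a single `job` per iteration would not make one page load any faster than the Selenium version, which might explain the similar timings.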
A: No answers yet