Asked by: Andriy Stolyar  Asked: 7/6/2016  Last edited: Vadim Kotov, Andriy Stolyar  Updated: 1/21/2020  Views: 1240
How to parse only a specific category of a website using the newspaper library?
Q:
I use Python 3 and the newspaper library. It is said that this library can create a Source object, which is an abstraction of a news website. But what if I only need the abstraction of a certain category?

For example, when I use this URL I want to get all the articles from the 'technology' category. Instead, I get ones from 'politics'.

I presume that when creating the Source object, newspaper only uses the domain name, which in my case is www.kyivpost.com.

Is there a way to make it work with URLs like http://www.kyivpost.com/technology/ ?
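A minimal sketch of the kind of call I mean (assuming newspaper3k's build() helper, which is how the Source object is normally created):
'''
import newspaper

# Build a Source for the site; newspaper appears to work from the domain,
# so the /technology/ path does not restrict the resulting article set.
paper = newspaper.build('http://www.kyivpost.com/technology/', memoize_articles=False)

for article in paper.articles[:10]:
    print(article.url)  # mostly politics articles, not technology
'''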
A:
0 votes
Joe Woods
8/11/2016
#1
newspaper will use a site's RSS feed if one is available; KyivPost only has a single RSS feed, and it mostly publishes articles about politics, which is why your result set is mostly politics.
You may have better luck using BeautifulSoup to pull the article URLs from the technology page specifically and feeding them to newspaper directly.
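A rough sketch of that approach (the section URL and the filter-results-archive class are assumptions about KyivPost's markup, borrowed from the answer below, and may need adjusting):
'''
import requests
from bs4 import BeautifulSoup
from newspaper import Article

# Collect article links from the technology section page itself.
page = requests.get('https://www.kyivpost.com/technology')
soup = BeautifulSoup(page.text, 'html.parser')
container = soup.find(class_='filter-results-archive')
urls = {a['href'] for a in container.find_all('a', href=True) if 'technology' in a['href']}

# Feed each URL straight to newspaper instead of building a Source for the whole site.
for url in urls:
    article = Article(url)
    article.download()
    article.parse()
    print(article.title)
'''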
0 votes
Prakhar Jhudele
1/21/2020
#2
I know this is a bit old, but if anyone is still looking for something like this: you can first get all the anchor-tag elements, filter the links with a regular expression, and then request each article link and extract the data you need. I'm pasting sample code below; change the soup selectors as needed for your page -
'''
"""
Created on Tue Jan 21 10:10:02 2020
@author: prakh
"""
import requests
#import csv
from bs4 import BeautifulSoup
import re
from functools import partial
from operator import is_not
from dateutil import parser
import pandas as pd
from datetime import timedelta, date
final_url = 'https://www.kyivpost.com/technology'
links = []
news_data = []
filter_null = partial(filter, partial(is_not, None))
try:
page = requests.get(final_url)
soup = BeautifulSoup(page.text, 'html.parser')
last_links = soup.find(class_='filter-results-archive')
artist_name_list_items = last_links.find_all('a')
for artist_name in artist_name_list_items:
links.append(artist_name.get('href'))
L =list(filter_null(links))
regex = re.compile(r'technology')
selected_files = list(filter(regex.match, L))
# print(selected_files)
# print(list(page))
except Exception as e:
print(e)
print("continuing....")
# continue
for url in selected_files:
news_category = url.split('/')[-2]
try:
data = requests.get(url)
soup = BeautifulSoup(data.content, 'html.parser')
last_links2 = soup.find(id='printableAreaContent')
last_links3 = last_links2.find_all('p')
# metadate = soup.find('meta', attrs={'name': 'publish-date'})['content']
#print(metadate)
# metadate = parser.parse(metadate).strftime('%m-%d-%Y')
# metaauthor = soup.find('meta', attrs={'name': 'twitter:creator'})['content']
news_articles = [{'news_headline': soup.find('h1',
attrs={"class": "post-title"}).string,
'news_article': last_links3,
# 'news_author': metaauthor,
# 'news_date': metadate,
'news_category': news_category}
]
news_data.extend(news_articles)
# print(list(page))
except Exception as e:
print(e)
print("continuing....")
continue
df = pd.DataFrame(news_data)
'''
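If this runs, df holds one row per technology article (headline, the list of paragraph tags, and category); something like df.to_csv('kyivpost_technology.csv', index=False) would then persist it. Note that the selectors ('filter-results-archive', 'printableAreaContent', 'post-title') are specific to KyivPost's page structure at the time and may have changed.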