How to iteratively retrieve the right information from beautiful soup elements?

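Asked by: Nick Asked: 7/18/2023 Last edited by: YounexNick Updated: 7/18/2023 Views: 48

Q:

I am trying to retrieve information from EZB press releases, and I use BeautifulSoup for this. Because the structure (HTML) of the press releases changes over time, it is hard to retrieve the date of a press release with a single selector. I therefore tried to combine "try and except" with "if/else statements" to retrieve the date from all of the HTML files. Unfortunately, my code does not work the way I want: I do not get a date for every press release.

Does anyone know how to iterate over multiple soup elements and pick the right one to extract the date from the corresponding HTML file?

My code is attached: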

from pandas.core.internals.managers import ensure_block_shape
import bs4, requests

pr_list = []

def parseContent(Urls):
  for x in Urls:
   res = requests.get(x)
   article = bs4.BeautifulSoup(res.text, 'html.parser')
   try:
    date = article.select('#main-wrapper > main > div.section > p.ecb-publicationDate')
    if date:
      for x in date:
        date = x.text.strip()   
    date = article.select('#main-wrapper > main > div.ecb-pressContentPubDate')
    if date:
      for x in date:
          date = x.text.strip()     
    else:
      date = article.select('#main-wrapper > main > div.title > ul > li.ecb-publicationDate')
      for x in date:
          date = x.text.strip()
   except:
    date = None
   try:
    title = article.select('#main-wrapper > main > div.title > h1')
    for x in title:
      title = x.text.strip()
   except:
    title = None
   try:
    body = article.select("#main-wrapper > main > div.section")
    for x in body:
      body = x.text.strip()
   except:
    body = None
   row = [date,title,body]
   pr_list.append(row)
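python parsing beautifulsoup text

Comments

0 votes John Gordon 7/18/2023
It would help if you provided a specific example of an element that this code fails to recognize.
0 votes Nick 7/18/2023
The links can be found on my GitHub: github.com/nickdoesthetrick/ezb/blob/...

A:

1 vote larsks 7/18/2023 #1

Store the match expressions in a list, then loop over them until one succeeds: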

import bs4
import requests


date_expressions = [
    "#main-wrapper > main > div.section > p.ecb-publicationDate",
    "#main-wrapper > main > div.ecb-pressContentPubDate",
    "#main-wrapper > main > div.title > ul > li.ecb-publicationDate",
]

title_expressions = [
    "#main-wrapper > main > div.title > h1",
]

body_expressions = [
    "#main-wrapper > main > div.section",
]


def try_several_expressions(article, expressions):
    """Try to match an element using the given list of expressions.

    Raise ValueError if we failed to find any matches or if we find
    multiple matches.
    """

    for expr in expressions:
        res = article.select(expr)
        if res:
            break
    else:
        raise ValueError("failed to match any expressions")

    if len(res) > 1:
        raise ValueError("failed to match a unique value")

    return res[0]


def parseContent(urls):
    pr_list = []
    for url in urls:
        res = requests.get(url)
        article = bs4.BeautifulSoup(res.text, "html.parser")
        date = try_several_expressions(article, date_expressions).text
        title = try_several_expressions(article, title_expressions).text
        body = try_several_expressions(article, body_expressions).text

        row = [date, title, body]
        pr_list.append(row)

    return pr_list

Assuming you meant "ECB" rather than "EZB", I tested this against https://www.ecb.europa.eu/press/pr/date/2023/html/ecb.pr230710~77cf718c59.en.html and it appears to work as expected.
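
A minimal offline sketch of the fallback idea above, assuming three made-up HTML snippets that mimic the three date layouts (they are illustrative stand-ins, not real ECB markup). The function body repeats try_several_expressions from this answer so the snippet runs on its own, without network access:

```python
import bs4

# Hypothetical, simplified pages: one per date layout (not real ECB markup).
SAMPLE_PAGES = {
    "layout_a": '<div id="main-wrapper"><main><div class="section">'
                '<p class="ecb-publicationDate">10 July 2023</p></div></main></div>',
    "layout_b": '<div id="main-wrapper"><main>'
                '<div class="ecb-pressContentPubDate">12 September 2012</div></main></div>',
    "layout_c": '<div id="main-wrapper"><main><div class="title"><ul>'
                '<li class="ecb-publicationDate">24 February 2020</li></ul></div></main></div>',
}

date_expressions = [
    "#main-wrapper > main > div.section > p.ecb-publicationDate",
    "#main-wrapper > main > div.ecb-pressContentPubDate",
    "#main-wrapper > main > div.title > ul > li.ecb-publicationDate",
]


def try_several_expressions(article, expressions):
    """Return the unique element matched by the first expression that matches."""
    for expr in expressions:
        res = article.select(expr)
        if res:
            break
    else:
        # The for loop finished without hitting break: nothing matched.
        raise ValueError("failed to match any expressions")

    if len(res) > 1:
        raise ValueError("failed to match a unique value")

    return res[0]


dates = {}
for name, html in SAMPLE_PAGES.items():
    article = bs4.BeautifulSoup(html, "html.parser")
    dates[name] = try_several_expressions(article, date_expressions).text.strip()

print(dates)
```

Each sample page is matched by a different selector in the list, yet the calling code stays the same. A page that matches none of the selectors raises ValueError instead of silently producing an empty date.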


If I make one of the changes I suggested in the comments (removing the if len(res) > 1 check), then try_several_expressions looks like this:

def try_several_expressions(article, expressions):
    """Try to match an element using the given list of expressions.

    Raise ValueError if we failed to find any matches.
    """

    for expr in expressions:
        res = article.select(expr)
        if res:
            break
    else:
        raise ValueError("failed to match any expressions")

    # Always return the first matched element
    return res[0]

With that change, the script works for every URL in the list except https://www.ecb.europa.eu/press/pr/date/2020/html/ecb.pr2002242~8842dcb418.en.html, which has no content.

If you put a try/except block in parseContent, you can simply ignore that failure:

def parseContent(urls):
    pr_list = []
    for url in urls:
        res = requests.get(url)
        article = bs4.BeautifulSoup(res.text, "html.parser")
        try:
            date = try_several_expressions(article, date_expressions).text.strip()
            title = try_several_expressions(article, title_expressions).text.strip()
            body = try_several_expressions(article, body_expressions).text
        except ValueError:
            print(f'failed to parse: {url}')
            continue

        row = [date, title, body]
        pr_list.append(row)

    return pr_list

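Comments

0 votes larsks 7/18/2023
This code appears to work for all three of those URLs.
0 votes Nick 7/18/2023
Hi, it seems to work for my small subset of links. If I feed in all the URLs (the larger set), I get "ValueError: failed to match a unique value". Is there a good way to share the full list of links with you?
0 votes larsks 7/18/2023
That is not necessary. If you get the "failed to match a unique value" error, it means your selector is matching more than one element. Just stick a breakpoint() before the raise, then try match expressions until you find one that works. Alternatively, remove the if len(res) > 1 check and always return the first match (and cross your fingers that it is the one you want).
0 votes Nick 7/18/2023
Hi larsks, when I add breakpoint(), it asks me for a match expression several times. If I remove "if len(res) > 1", I also get an error and the loop stops. Is there a way to share my code here, as a Google Colab notebook or via GitHub?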
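
As a non-interactive alternative to breakpoint() for the "failed to match a unique value" case discussed above, a small helper can report how many elements each selector matches on a given page. This is only a sketch: count_matches is a hypothetical helper name, and the HTML is a made-up page in which the body selector matches twice.

```python
import bs4


def count_matches(article, expressions):
    """Map each CSS selector to the number of elements it matches."""
    return {expr: len(article.select(expr)) for expr in expressions}


# Hypothetical page (not real ECB markup) with two div.section elements.
html = """
<div id="main-wrapper"><main>
  <div class="title"><h1>Example title</h1></div>
  <div class="section">First block</div>
  <div class="section">Second block</div>
</main></div>
"""

article = bs4.BeautifulSoup(html, "html.parser")
counts = count_matches(article, [
    "#main-wrapper > main > div.title > h1",
    "#main-wrapper > main > div.section",
])

# A count above 1 marks a selector that would trigger the uniqueness error.
for expr, n in counts.items():
    print(n, expr)
```

Running this over the URLs that fail shows which selector is ambiguous, which helps decide whether to tighten it or to accept the first match.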
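1 vote Zero 7/18/2023 #2

I improved the code as follows:

  • Removed the unnecessary try-except blocks
  • Reduced the complex logic and selectors, replacing them with static selectors and a regex-based dynamic selector.
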
from bs4 import BeautifulSoup
from pprint import pprint
import re
import requests

pr_list = []

urls = [
    'https://www.ecb.europa.eu/press/pr/date/2023/html/ecb.pr230710~77cf718c59.en.html',
    'https://www.ecb.europa.eu/press/pr/date/2012/html/pr120912_1.en.html'
]

def parse_content(urls):
    for url in urls:
        print(url)
        res = requests.get(url)
        page = BeautifulSoup(res.text, 'html.parser')

        # initializing default values
        row = [None, None, None]
        main = page.find('main')

        # for dates: an element inside <main> whose class contains "Date"
        # and whose text looks like "10 July 2023"
        date_pattern = re.compile(r'\d+ (January|February|March|April|May|June|July|August|September|October|November|December) \d{4}')
        date_el = main.find(attrs={'class': re.compile('Date')}, string=date_pattern) if main else None
        if date_el:
            row[0] = date_el.text.strip()

        # getting title
        title_div = page.find('div', {'class': 'title'})
        title_el = title_div.find('h1') if title_div else None
        row[1] = title_el.text.strip() if title_el else None

        # getting body
        body_el = main.find('div', {'class': 'section'}) if main else None
        row[2] = body_el.text.strip() if body_el else None

        
        pr_list.append(row)


parse_content(urls)
pprint(pr_list)

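Note that I used a regular expression to find the date because, in the examples you provided, the dates follow this pattern and sit inside the main tag in elements whose class name contains Date.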
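
If the dates are needed for sorting or filtering, the same pattern can be reused to turn the matched text into a date object. This is a sketch under assumptions: extract_date is a hypothetical helper, month names are English, and the process runs with Python's default locale (the %B directive is locale-dependent).

```python
import re
from datetime import datetime

MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")
# Day, English month name, four-digit year, e.g. "10 July 2023".
DATE_RE = re.compile(rf"\b\d{{1,2}} (?:{MONTHS}) \d{{4}}")


def extract_date(text):
    """Return the first date found in text as a datetime.date, else None."""
    match = DATE_RE.search(text)
    if match is None:
        return None
    try:
        return datetime.strptime(match.group(0), "%d %B %Y").date()
    except ValueError:
        # Pattern matched but the value is not a real date (e.g. "32 July 2023").
        return None


print(extract_date("10 July 2023"))
print(extract_date("10 July 2023Europeans invited to express preferences"))
print(extract_date("no date here"))
```

The second call shows that the date is still found when it is fused to the following text, as happens at the start of the body text when the date element sits inside div.section.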

The output is:

https://www.ecb.europa.eu/press/pr/date/2023/html/ecb.pr230710~77cf718c59.en.html
https://www.ecb.europa.eu/press/pr/date/2012/html/pr120912_1.en.html
[['10 July 2023',
  'ECB surveys Europeans on new themes for euro banknotes',
  '10 July 2023Europeans invited to express preferences on shortlisted themes '
  'in public survey open until 31\xa0August 2023ECB’s Governing Council '
  'expected to choose future theme by 2024, and final designs in 2026The '
  'European Central Bank (ECB) is asking European citizens about their views '
  'on the proposed themes for the next series of euro banknotes. From 10 July '
  'until 31 August 2023 everybody in the euro area can respond to a survey on '
  'the ECB’s website. In addition, to ensure opinions from across the euro '
  'area are equally represented, the ECB has contracted an independent '
  'research company to ask a representative sample of people in the euro area '
  'the same questions as those in its own survey.ECB President Christine '
  'Lagarde invites everybody to participate in the survey. She said “There is '
  'a strong link between our single currency and our shared European identity, '
  'and our new series of banknotes should emphasise this. We want Europeans to '
  'identify with the design of euro banknotes, which is why they will play an '
  'active role in selecting the new theme.”Developing our future euro '
  'banknotes“We are working on a new series of high-tech banknotes with a view '
  'to preventing counterfeiting and reducing environmental impact,” said '
  'Executive Board member Fabio Panetta. “We are committed to cash and to '
  'ensuring that paying with public money is always an option.”It is the duty '
  'of the ECB and the euro area national central banks to ensure euro '
  'banknotes remain an innovative, secure and efficient means of payment. '
  'Developing new series of banknotes is a standard practice for all central '
  'banks. In a world where reproduction technologies are rapidly evolving and '
  'where counterfeiters can easily access information and materials, it is '
  'necessary to issue new banknotes on a regular basis. Beyond security '
  'considerations, the ECB is committed to reducing the environmental impact '
  'of euro banknotes throughout their life cycle, while also making them more '
  'relatable and inclusive for Europeans of all ages and backgrounds, '
  'including vulnerable groups such as people with visual '
  'impairment.Shortlisted themes for future banknotesThe seven themes '
  'shortlisted by the ECB’s Governing Council are listed below.[1]Birds: free, '
  'resilient, inspiringBirds know nothing of national borders and symbolise '
  'freedom of movement. Their nests remind us of our own desire to build '
  'places and societies that nurture and protect the future. They remind us '
  'that we share our continent with all the lifeforms that sustain our common '
  'existence.European cultureEurope’s rich cultural heritage and dynamic '
  'cultural and creative sectors strengthen the European identity, forging a '
  'shared sense of belonging. Culture promotes common values, inclusion and '
  'dialogue in Europe and across the globe. It brings people together.European '
  'values mirrored in natureEurope is a living place, but also an idea. The '
  'European Union is an organisation, but also a set of values. The theme '
  'highlights the role of European values (human dignity, freedom, democracy, '
  'equality, the rule of law and human rights) as the building blocks of '
  'Europe and links these values to our respect for nature and the '
  'preservation of the environment.The future is yoursThe ideas and '
  'innovations that will shape the future of Europe lie deep within every '
  'European. The images created for this theme represent the bearers of the '
  'collective imagination through which people will create this shared future. '
  'This theme signifies the boundless potential of Europeans.Hands: together '
  'we build EuropeHands are familiar to all of us but no two pairs are the '
  'same. Hands built Europe, its physical infrastructure, its artistic '
  'heritage and its achievements. Hands build, weave, heal, teach, connect and '
  'guide us. Hands tell stories of labour, age and relationships, of heritage, '
  'history, and culture. This theme celebrates the hands that have built '
  'Europe and continue to do so every day.\xa0Our Europe, ourselvesWe grow up '
  'as individuals but also as part of a community, through our relationships '
  'with one another. We have our own stories and identities, but we also share '
  'a common identity as Europeans. This theme evokes the freedom, values and '
  "openness of people in Europe.Rivers: the waters of life in EuropeEurope's "
  'rivers cross borders. They connect us to each other and to nature. They '
  'represent the ebb and flow of a dynamic, ever-changing continent. They '
  'nurture us and remind us of the deep sources of our common life, and we '
  'must nurture them in turn.The shortlist of themes takes into account the '
  'suggestions made by a multidisciplinary advisory group, with members from '
  'all euro area countries.Timeline for the new designsThe outcome of the '
  'surveys will be used by the ECB to select the theme for the next generation '
  'of banknotes by 2024. After that a design competition will take place. '
  'European citizens will again have the chance to express their preferences '
  'on the design options resulting from that competition. The ECB is expected '
  'to take the decision on the future design, and on when to produce and issue '
  'the new banknotes, in 2026.For media queries, please contact Belén Pérez '
  'Esteve, tel.: +49 173 533 4269.'],
 ['12 September 2012',
  'ECB extends the swap facility agreement \u2028with the Bank of England',
  'The Governing Council of the European Central Bank (ECB) has decided, in '
  'agreement with the Bank of England, to extend the liquidity swap '
  'arrangement with the Bank of England up to \u2028'
  '30 September 2013. The swap facility agreement established on 17 December '
  '2010 had been authorised until the end of September 2011 and then extended '
  'until 28 September 2012.\n'
  'The related announcement by the Bank of England is available at their '
  'website http://www.bankofengland.co.uk.']]