Extracting filenames from a string of HTML with Python 2.7

Asked by: crmpicco Asked: 2/4/2020 Updated: 2/4/2020 Views: 62

Q:

I am scraping a page with BeautifulSoup:

from bs4 import BeautifulSoup
import requests
import re
page = requests.get("http://www.crmpicco.co.uk/?page_id=82&lottoId=27")

soup = BeautifulSoup(page.content, 'html.parser')
entry_content = soup.find_all('div', class_='entry-content')

print(entry_content[1])

This gives me this string:

<div class="entry-content"><span class="red">Week 27: </span><br/><br/>Saturday 1st February 2020<br/>(in red)<br/><br/> <img height="50" src="http://www.crmpicco.co.uk/wp-content/themes/2010/images/lotto_balls/17.gif" vspace="12" width="70"/> <img height="50" src="http://www.crmpicco.co.uk/wp-content/themes/2010/images/balls/21.gif" vspace="12" width="70"/> <img height="50" src="http://www.crmpicco.co.uk/wp-content/themes/2010/images/balls/31.gif" vspace="12" width="70"/> <img height="50" src="http://www.crmpicco.co.uk/wp-content/themes/2010/images/balls/47.gif" vspace="12" width="70"/> <img height="50" src="http://www.crmpicco.co.uk/wp-content/themes/lotto2010/images/balls/bonus43.gif" vspace="12" width="70"/><br/><br/>Wednesday 5th February 2020<br/><br/><strong><span class="red">RESULTS NOT AVAILABLE</span></strong><br/><br/><br/><br/><a href="?page_id=82">Click here</a> to see other results.<br/> </div>

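I want to get the filename of each gif path in the string, and I (think) the findall method in the regex module is the way to do this, but I haven't had much success.

What is the best approach? Can it be done in a single call with BeautifulSoup?

Tags: regex python-2.7 beautifulsoup html-parsing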

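Comments

0 votes Abdul Niyas P M 2/4/2020
What is your expected output?
0 votes crmpicco 2/4/2020
@AbdulNiyasPM Ideally, I'd have an array of the filenames (12, 56, 18, 72, 16, etc.). I was trying re.findall(r'src="(.*)/>', entry_content[1]) to get hold of the full paths and then work from there. Disclaimer: Python nubz0r. ;)
0 votes isopach 2/4/2020
There are no divs with entry-content at the URL you provided. Are you sure it is correct?
0 votes crmpicco 2/4/2020
@isopach No, you're right. I changed the URL, but there are some divs with the class entry-content. You can see in the string example above that the div has a class of entry-content.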
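
A note on the regex attempt in the comment above: `(.*)` is greedy, so on a single-line string it runs from the first `src="` to the last `/>`, and `entry_content[1]` is a bs4 Tag rather than a string, so it would need `str()` before being passed to `re.findall`. A minimal sketch of a tighter pattern follows, assuming every filename of interest ends in `.gif` and sits in a double-quoted `src` attribute. The names `GIF_NAME` and `gif_names` are made up for illustration, and the HTML is a shortened copy of the string in the question.

```python
import re

# Shortened copy of the HTML string from the question (two of the five images).
html = (
    '<div class="entry-content">'
    '<img height="50" src="http://www.crmpicco.co.uk/wp-content/themes/2010/images/lotto_balls/17.gif" vspace="12" width="70"/> '
    '<img height="50" src="http://www.crmpicco.co.uk/wp-content/themes/lotto2010/images/balls/bonus43.gif" vspace="12" width="70"/>'
    '</div>'
)

# [^"]* and [^"/]+ cannot cross a double quote, so each match stays inside
# one src value. The group captures the last path segment without ".gif".
GIF_NAME = re.compile(r'src="[^"]*/([^"/]+)\.gif"')


def gif_names(markup):
    """Return the extension-less gif filenames found in src attributes."""
    return GIF_NAME.findall(markup)


print(gif_names(html))
```

With a Tag from BeautifulSoup this would be called as `gif_names(str(entry_content[1]))`. A regex like this is fragile against single-quoted or unquoted attributes, which is one reason the answers below lean on an HTML parser instead.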

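Answers:

0 votes Abdul Niyas P M 2/4/2020 #1

I'd suggest using the HTMLParser class from the standard library (python2/python3) rather than regex. It has a handle_starttag method, which is called to handle the start of a tag.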

>>> from __future__ import print_function  # print(*args, sep=...) needs this on python 2
>>> import os.path
>>>
>>> # "entry_content" is a list of bs4 Tag objects, so convert each to a string before joining.
>>> source = "\n".join(str(div) for div in entry_content)
>>>
>>> try:
...     from HTMLParser import HTMLParser # python 2
... except ImportError:
...     from html.parser import HTMLParser # python 3
...
>>> class SrcParser(HTMLParser):
...     def __init__(self, *args, **kwargs):
...         self.links = []
...         self._basename = kwargs.pop('only_basename', False)
...         # HTMLParser is an old-style class on python 2, so super() cannot be used here.
...         HTMLParser.__init__(self, *args, **kwargs)
...
...     def handle_starttag(self, tag, attrs):
...         for attr, val in attrs:
...             # val is None for attributes that have no value
...             if attr == 'src' and val and val.endswith(".gif"):
...                 if self._basename:
...                     val = os.path.splitext(os.path.basename(val))[0]
...                 self.links.append(val)
...
>>> source_parser = SrcParser()
>>> source_parser.feed(source)
>>> print(*source_parser.links, sep='\n')
http://www.crmpicco.co.uk/wp-content/themes/2010/images/lotto_balls/17.gif
http://www.crmpicco.co.uk/wp-content/themes/2010/images/balls/21.gif
http://www.crmpicco.co.uk/wp-content/themes/2010/images/balls/31.gif
http://www.crmpicco.co.uk/wp-content/themes/2010/images/balls/47.gif
http://www.crmpicco.co.uk/wp-content/themes/lotto2010/images/balls/bonus43.gif
>>>
>>> source_parser = SrcParser(only_basename=True)
>>> source_parser.feed(source)
>>> print(*source_parser.links, sep='\n')
17
21
31
47
bonus43
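
The `os.path.splitext(os.path.basename(val))[0]` step above works for the URLs in the question, but it would keep any query string that follows the filename. A small sketch of a slightly more defensive variant is below, assuming only the URL path should be considered. `url_stem` is a hypothetical helper name, and the `?ver=2` URL is invented purely to illustrate the query-string case.

```python
import posixpath

try:
    from urlparse import urlparse  # Python 2
except ImportError:
    from urllib.parse import urlparse  # Python 3


def url_stem(url):
    """Return the last path segment of a URL without its extension."""
    path = urlparse(url).path  # drops any ?query or #fragment part
    return posixpath.splitext(posixpath.basename(path))[0]


# URL taken from the question
print(url_stem("http://www.crmpicco.co.uk/wp-content/themes/2010/images/balls/21.gif"))
# Hypothetical URL with a query string, not from the question
print(url_stem("http://www.crmpicco.co.uk/wp-content/themes/2010/images/balls/21.gif?ver=2"))
```

`posixpath` is used rather than `os.path` so that the forward-slash handling does not depend on the operating system the script runs on.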
0 votes isopach 2/4/2020 #2

I couldn't find any entry-content divs on your page, but this should work. Replace col-md-4 with entry-content.

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import requests
import re


page = requests.get("http://www.crmpicco.co.uk/?page_id=82&lottoId=27")

soup = BeautifulSoup(page.content, 'html.parser')

for entry_content in soup.find_all('div',class_='col-md-4'):
    print(entry_content.img['src'].rsplit('/', 1)[-1].split('.')[0])

Output:

zce
691505
gaiq
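
Note that `entry_content.img` returns only the first `<img>` inside each div, while the div in the question holds five. A sketch of a BeautifulSoup-only variant that collects every gif in every `entry-content` div is below. It assumes the bs4 package is installed, and the HTML is a shortened copy of the string in the question. The variable `names` is chosen for illustration.

```python
from bs4 import BeautifulSoup

# Shortened copy of the HTML string from the question (three of the five images).
html = '''<div class="entry-content"><span class="red">Week 27: </span><br/>
<img height="50" src="http://www.crmpicco.co.uk/wp-content/themes/2010/images/lotto_balls/17.gif" vspace="12" width="70"/>
<img height="50" src="http://www.crmpicco.co.uk/wp-content/themes/2010/images/balls/21.gif" vspace="12" width="70"/>
<img height="50" src="http://www.crmpicco.co.uk/wp-content/themes/lotto2010/images/balls/bonus43.gif" vspace="12" width="70"/>
</div>'''

soup = BeautifulSoup(html, 'html.parser')

names = []
for div in soup.find_all('div', class_='entry-content'):
    # src=True keeps only <img> tags that actually carry a src attribute
    for img in div.find_all('img', src=True):
        filename = img['src'].rsplit('/', 1)[-1]      # last path segment
        if filename.endswith('.gif'):
            names.append(filename.rsplit('.', 1)[0])  # drop the extension

print(names)
```

This is two nested `find_all` calls rather than a single call, but it stays entirely within BeautifulSoup and needs no regex.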
0 votes dabingsou 2/4/2020 #3

I recommend another solution that is compatible with both Python 2 and Python 3 and is well suited to extracting data from XML.

from simplified_scrapy.simplified_doc import SimplifiedDoc
html = '''
<div class="entry-content"><span class="red">Week 27: </span><br/><br/>Saturday 1st February 2020<br/>(in red)<br/><br/> <img height="50" src="http://www.crmpicco.co.uk/wp-content/themes/2010/images/lotto_balls/17.gif" vspace="12" width="70"/> <img height="50" src="http://www.crmpicco.co.uk/wp-content/themes/2010/images/balls/21.gif" vspace="12" width="70"/> <img height="50" src="http://www.crmpicco.co.uk/wp-content/themes/2010/images/balls/31.gif" vspace="12" width="70"/> <img height="50" src="http://www.crmpicco.co.uk/wp-content/themes/2010/images/balls/47.gif" vspace="12" width="70"/> <img height="50" src="http://www.crmpicco.co.uk/wp-content/themes/lotto2010/images/balls/bonus43.gif" vspace="12" width="70"/><br/><br/>Wednesday 5th February 2020<br/><br/><strong><span class="red">RESULTS NOT AVAILABLE</span></strong><br/><br/><br/><br/><a href="?page_id=82">Click here</a> to see other results.<br/> </div>
'''
doc = SimplifiedDoc(html)
div = doc.select('div.entry-content')
srcs = div.selects('img>src()')
print (srcs)
print ([src.rsplit('/', 1)[-1].split('.')[0] for src in srcs])

Result:

['http://www.crmpicco.co.uk/wp-content/themes/2010/images/lotto_balls/17.gif', 'http://www.crmpicco.co.uk/wp-content/themes/2010/images/balls/21.gif', 'http://www.crmpicco.co.uk/wp-content/themes/2010/images/balls/31.gif', 'http://www.crmpicco.co.uk/wp-content/themes/2010/images/balls/47.gif', 'http://www.crmpicco.co.uk/wp-content/themes/lotto2010/images/balls/bonus43.gif']
['17', '21', '31', '47', 'bonus43']

Here are more examples: https://github.com/yiyedata/simplified-scrapy-demo/blob/master/doc_examples/