Asked by: Dave Asked: 10/5/2020 Updated: 10/7/2020 Views: 814
In BeautifulSoup, what's the proper way to use a strainer with lxml parsing?
Q:
I'm using Beautiful Soup 4 and Python 3.8. I only want to parse certain elements of an HTML page, so I decided to use a strainer like this ...
req = urllib2.Request(full_url, headers=settings.HDR)
html = urllib2.urlopen(req).read()
soup = BeautifulSoup(html, features="lxml", parse_only=DictionaryService.idiom_match_strainer)
...
@staticmethod
def idiom_match_strainer(elem, attrs):
    if elem == 'ul' and 'class' in attrs and attrs['class'] == 'idiKw':
        return True
    return False
Unfortunately, when I try to parse any URL (https://idioms.thefreedictionary.com/testing is one example), I get the following error
Internal Server Error: /ajax/get_hints
Traceback (most recent call last):
  File "/Users/davea/Documents/workspace/dictionary_project/venv/lib/python3.8/site-packages/django/core/handlers/exception.py", line 34, in inner
    response = get_response(request)
  File "/Users/davea/Documents/workspace/dictionary_project/venv/lib/python3.8/site-packages/django/core/handlers/base.py", line 126, in _get_response
    response = self.process_exception_by_middleware(e, request)
  File "/Users/davea/Documents/workspace/dictionary_project/venv/lib/python3.8/site-packages/django/core/handlers/base.py", line 124, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/Users/davea/Documents/workspace/dictionary_project/dictionary/views.py", line 194, in get_hints
    objects = s.get_hints(article)
  File "/Users/davea/Documents/workspace/dictionary_project/dictionary/services/article_service.py", line 398, in get_hints
    idioms = DictionaryService.get_idioms(word)
  File "/Users/davea/Documents/workspace/dictionary_project/dictionary/services/dictionary_service.py", line 75, in get_idioms
    soup = BeautifulSoup(html, features="lxml", parse_only=DictionaryService.idiom_match_strainer)
  File "/Users/davea/Documents/workspace/dictionary_project/venv/lib/python3.8/site-packages/bs4/__init__.py", line 281, in __init__
    self._feed()
  File "/Users/davea/Documents/workspace/dictionary_project/venv/lib/python3.8/site-packages/bs4/__init__.py", line 342, in _feed
    self.builder.feed(self.markup)
  File "/Users/davea/Documents/workspace/dictionary_project/venv/lib/python3.8/site-packages/bs4/builder/_lxml.py", line 287, in feed
    self.parser.feed(markup)
  File "src/lxml/parser.pxi", line 1242, in lxml.etree._FeedParser.feed
  File "src/lxml/parser.pxi", line 1364, in lxml.etree._FeedParser.feed
  File "src/lxml/parsertarget.pxi", line 148, in lxml.etree._TargetParserContext._handleParseResult
  File "src/lxml/parsertarget.pxi", line 136, in lxml.etree._TargetParserContext._handleParseResult
  File "src/lxml/etree.pyx", line 314, in lxml.etree._ExceptionContext._raise_if_stored
  File "src/lxml/saxparser.pxi", line 389, in lxml.etree._handleSaxTargetStartNoNs
  File "src/lxml/saxparser.pxi", line 404, in lxml.etree._callTargetSaxStart
  File "src/lxml/parsertarget.pxi", line 80, in lxml.etree._PythonSaxParserTarget._handleSaxStart
  File "/Users/davea/Documents/workspace/dictionary_project/venv/lib/python3.8/site-packages/bs4/builder/_lxml.py", line 220, in start
    self.soup.handle_starttag(name, namespace, nsprefix, attrs)
  File "/Users/davea/Documents/workspace/dictionary_project/venv/lib/python3.8/site-packages/bs4/__init__.py", line 582, in handle_starttag
    and (self.parse_only.text
AttributeError: 'function' object has no attribute 'text'
Is there some other way I should be using the strainer?
A:
2 votes
Martin Honnen
10/7/2020
#1
It should be enough to use the SoupStrainer class from the bs4 package:
from bs4 import BeautifulSoup
from bs4 import SoupStrainer
html = '<html><body><section><ul class="foo"><li>a<li>b</ul><ul><li>1<li>2</ul></section><ul class="foo"><li>c<li>d</ul></body></html>'
soup = BeautifulSoup(html, features="lxml", parse_only=SoupStrainer('ul', class_='foo'))
print(soup.prettify())
gives
<ul class="foo">
 <li>
  a
 </li>
 <li>
  b
 </li>
</ul>
<ul class="foo">
 <li>
  c
 </li>
 <li>
  d
 </li>
</ul>
So I think for your call you want parse_only=SoupStrainer('ul', class_='idiKw').
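Applied to the code in the question, that might look roughly like the sketch below. The helper name extract_idioms, the IDIOM_STRAINER constant, and the parser argument are made up for illustration. It also assumes the idioms sit in li elements inside ul class="idiKw", which the question does not confirm. The download step is left out, so you would pass in the html you already read from the response.

```python
from bs4 import BeautifulSoup, SoupStrainer

# parse_only expects a SoupStrainer instance rather than a bare function;
# this one keeps only <ul class="idiKw"> elements and their contents.
IDIOM_STRAINER = SoupStrainer('ul', class_='idiKw')


def extract_idioms(html, parser="lxml"):
    """Hypothetical helper: text of every <li> kept by the strainer."""
    soup = BeautifulSoup(html, features=parser, parse_only=IDIOM_STRAINER)
    # Only the strained <ul> elements survive parsing, so every <li>
    # left in the soup belongs to one of them.
    return [li.get_text(strip=True) for li in soup.find_all('li')]


if __name__ == "__main__":
    # Made-up sample markup, not taken from the real page
    sample = ('<div><ul class="idiKw"><li>test the waters</li></ul>'
              '<ul class="other"><li>ignored</li></ul></div>')
    print(extract_idioms(sample))
```

Building the strainer once and reusing it keeps the matching rule in one place, which is roughly the role idiom_match_strainer was meant to play in the original code.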