Q:

OK, so I'm working on a regular expression to pull all the header information out of a web page.

I compiled the regular expression:

regex = re.compile(r'''
    <h[0-9]>\s?
    (<a[ ]href="[A-Za-z0-9.]*">)?\s?
    [A-Za-z0-9.,:'"=/?;\s]*\s?
    [A-Za-z0-9.,:'"=/?;\s]?
''',  re.X)

When I run it in a Python regex tester, it works wonderfully.

Sample data:

<body>
    <h1>Dog </h1>
    <h2>Cat </h2>
    <h3>Fancy </h3>
    <h1>Tall cup of lemons</h1>
    <h1><a href="dog.com">Dog thing</a></h1>
</body>

Now, in REDemo, it works wonderfully.

When I put it into my Python code, however, it only prints <a href="dog.com">

Here is my Python code. I'm not sure whether I'm doing something wrong or whether something got lost in translation. I appreciate your help.

stories=[]
response = urllib2.urlopen('http://apricotclub.org/duh.html')
html = response.read().lower()
p = re.compile('<h[0-9]>\\s?(<a href=\"[A-Za-z0-9.]*\">)?\\s?[A-Za-z0-9.,:\'\"=/?;\\s]*\\s?[A-Za-z0-9.,:\'\"=/?;\\s]?')
stories=re.findall(p, html)
for i in stories:
    if len(i) >= 5:
        print i

I should also note that when I take (<a href=\"[A-Za-z0-9.]*\">)? out of the regular expression, it works fine for non-link <hN> lines.

python html regex

Q: How do I parse HTML with regular expressions?

A: Please don't.

Use BeautifulSoup, html5lib, or lxml.html. Please.
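To illustrate what parsing instead of pattern-matching can look like, here is a minimal sketch for Python 3. It deliberately swaps in the standard-library `html.parser` module for the libraries recommended above, only so that it runs without installing anything. `HeaderCollector` and `extract_headers` are hypothetical names introduced for this example, and the sample is the data from the question.

```python
from html.parser import HTMLParser

HEADER_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeaderCollector(HTMLParser):
    """Collect (tag, text, href) for every <h1>..<h6> element."""

    def __init__(self):
        super().__init__()
        self.headers = []   # finished (tag, text, href) tuples
        self._tag = None    # header tag currently open, if any
        self._text = []     # text fragments seen inside that header
        self._href = None   # href of a link inside that header, if any

    def handle_starttag(self, tag, attrs):
        if tag in HEADER_TAGS:
            # Entering a header: reset the per-header state.
            self._tag, self._text, self._href = tag, [], None
        elif tag == "a" and self._tag is not None:
            # A link nested inside the open header.
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        if self._tag is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == self._tag:
            # Leaving the header: store what was gathered.
            text = "".join(self._text).strip()
            self.headers.append((tag, text, self._href))
            self._tag = None


def extract_headers(html):
    """Return a list of (tag, text, href) tuples, in document order."""
    parser = HeaderCollector()
    parser.feed(html)
    parser.close()
    return parser.headers


sample = """
<body>
    <h1>Dog </h1>
    <h2>Cat </h2>
    <h3>Fancy </h3>
    <h1>Tall cup of lemons</h1>
    <h1><a href="dog.com">Dog thing</a></h1>
</body>
"""

for tag, text, href in extract_headers(sample):
    print(tag, text, href)
```

The parser tracks which header is open, so the link and its text are picked up without any special-casing in a pattern. A dedicated library such as the ones named above will generally cope better with malformed real-world markup than this sketch does.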

import re

html = '''
<body>

<h1>Dog </h1>
<h2>Cat </h2>
<h3>Fancy </h3>
<h1>Tall cup of lemons</h1>
<h1><a href="dog.com">Dog thing</a></h1>
</body>
'''

p = re.compile(r'''
    <(?P<header>h[0-9])>             # store header tag for later use
    \s*                              # zero or more whitespace
    (<a\shref="(?P<href>.*?)">)?     # optional link tag. store href portion
    \s*
    (?P<title>.*?)                   # title
    \s*
    (</a>)?                          # optional closing link tag
    \s*
    </(?P=header)>                   # must match opening header tag
''', re.IGNORECASE | re.VERBOSE)

stories = p.finditer(html)

for match in stories:
    # Parenthesized form works as a statement in Python 2 and a call in Python 3.
    print('%(title)s [%(href)s]' % match.groupdict())
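The symptom in the question (only `<a href="dog.com">` is printed) is consistent with the documented behavior of `re.findall`: when the pattern contains a single capturing group, it returns the text of that group rather than the whole match. The `finditer` code above sidesteps this because each match object exposes the full match and every group. The sketch below, written for Python 3, reproduces the effect on a shortened version of the question's pattern and data; `CAPTURING`, `NON_CAPTURING`, and the two-line `html` string are names and data made up for the illustration.

```python
import re

html = '<h1>Dog </h1>\n<h1><a href="dog.com">Dog thing</a></h1>'

# Pattern with one capturing group, as in the question.
CAPTURING = re.compile(r'''
    <h[0-9]>\s?
    (<a[ ]href="[A-Za-z0-9.]*">)?\s?      # capturing group
    [A-Za-z0-9.,:'"=/?;\s]*
''', re.X)

# Same pattern, but (?:...) groups without capturing.
NON_CAPTURING = re.compile(r'''
    <h[0-9]>\s?
    (?:<a[ ]href="[A-Za-z0-9.]*">)?\s?    # non-capturing group
    [A-Za-z0-9.,:'"=/?;\s]*
''', re.X)

# findall returns only the group's text; it is '' where the group did not match.
groups_only = CAPTURING.findall(html)

# With no capturing groups, findall returns the whole match.
whole_matches = NON_CAPTURING.findall(html)

# finditer gives match objects, so group(0) is always the whole match.
via_finditer = [m.group(0) for m in CAPTURING.finditer(html)]

print(groups_only)
print(whole_matches)
print(via_finditer)
```

With the question's `len(i) >= 5` filter, the empty string produced for the non-link header is dropped, which leaves only the link tag.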


from BeautifulSoup import BeautifulSoup


H_TAGS = ['h1', 'h2', 'h3', 'h4', 'h5', 'h6']

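def extract_data():
   """Extract the text of every header in an HTML page."""
   # Read the page; the with-block closes the file when done.
   with open('foo.html', 'r') as f:
      html = f.read()
   soup = BeautifulSoup(html)
   # One result list per header level, skipping levels with no matches.
   headers = [soup.findAll(h) for h in H_TAGS if soup.findAll(h)]
   lst = []
   for x in headers:
      for y in x:
         if y.string:
            # Header contains plain text only.
            lst.append(y.string)
         else:
            # Header wraps another tag (e.g. <a>); take that tag's text.
            lst.append(y.contents[0].string)
   return lst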

The above function returns:

>>> [u'Dog ', u'Tall cup of lemons', u'Dog thing', u'Cat ', u'Fancy ']

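You can put any number of header tags in the H_TAGS list; I have included every header level (h1 through h6). The results are grouped by header level in the order the tags appear in H_TAGS, not in document order. If you can solve your problem easily with BeautifulSoup, it is better to use it. :)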

Issue with Regular expressions in python
