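Can you provide examples of parsing HTML?

Q:

How do you parse HTML with a variety of languages and parsing libraries?

When answering:

Individual comments will be linked to in answers to questions about how to parse HTML with regexes, as a way of showing the right way to do things.

For the sake of consistency, I ask that the example parse an HTML file for the href in anchor tags. To make it easy to search this question, I ask that you follow this format:

Language: [language name]

Library: [library name]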

[example code]

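Please make the library a link to the documentation for the library. If you want to provide an example other than extracting links, please also include:

Purpose: [what the parse does]

language-agnostic html html-parsing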

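0 votes dfa 4/22/2009

Repeating the HTML builder code for each example is pointless.

0 votes dfa 4/22/2009

Why do you clutter the Perl code with pointless/useless use directives? (warnings and strict)

4 votes Chas. Owens 4/22/2009

Self-contained working examples are better. All Perl code should include strict and warnings; they are not pointless, they are part of Modern Perl. I shudder to think what your code looks like if you think they are "pointless" and "useless".

0 votes dfa 4/22/2009

In my code I always use warnings and strict; in this context they are pointless. Most of these examples are not "self-contained" (e.g. the jQuery, Ruby, and other answers), so why bother for the Perl-based solutions?

0 votes Chas. Owens 4/22/2009

Because you can, and the JavaScript examples are self-contained in their environment. I didn't change the nokogiri example because I can't get nokogiri to install on my machine, and I don't like changing code I don't understand. But I will change it; for one thing, it doesn't look like it solves the example. As for using strict, it is a crime to model unsafe code for people who are learning. They need all the reinforcement they can get.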

A:

12 votes Chas. Owens 4/21/2009 #1

Language: Python
Library: HTMLParser

#!/usr/bin/python

from HTMLParser import HTMLParser

class FindLinks(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)

    def handle_starttag(self, tag, attrs):
        at = dict(attrs)
        if tag == 'a' and 'href' in at:
            print at['href']


find = FindLinks()

html = "<html><body>"
for link in ("foo", "bar", "baz"):
    html += '<a href="http://%s.com">%s</a>' % (link, link)
html += "</body></html>"

find.feed(html)
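
The answer above targets Python 2. For readers on Python 3, where the module lives at html.parser and print is a function, the same approach might look like the sketch below. Collecting results in a list named hrefs is my own addition, not part of the original answer; it just makes the output easier to reuse.

```python
from html.parser import HTMLParser


class FindLinks(HTMLParser):
    """Collect the href of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.hrefs = []  # hypothetical attribute, added for this sketch

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased
        at = dict(attrs)
        if tag == "a" and "href" in at:
            self.hrefs.append(at["href"])


html = "<html><body>"
for link in ("foo", "bar", "baz"):
    html += '<a href="http://%s.com">%s</a>' % (link, link)
html += "</body></html>"

find = FindLinks()
find.feed(html)
for href in find.hrefs:
    print(href)
```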

11 votes 3 revs, 2 users 98% Chas. Owens #2

Language: Perl
Library: HTML::Parser

#!/usr/bin/perl

use strict;
use warnings;

use HTML::Parser;

my $find_links = HTML::Parser->new(
    start_h => [
        sub {
            my ($tag, $attr) = @_;
            if ($tag eq 'a' and exists $attr->{href}) {
                print "$attr->{href}\n";
            }
        }, 
        "tag, attr"
    ]
);

my $html = join '',
    "<html><body>",
    (map { qq(<a href="http://$_.com">$_</a>) } qw/foo bar baz/),
    "</body></html>";

$find_links->parse($html);

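0 votes Tanktalus 4/22/2009

Using LWP::Simple to download this page (as I do in my Perl example below) showed that you were finding a's that have no href (but do have a name), so we just want to check that there is an href before printing it.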
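
To make that point concrete, here is a small Python 3 sketch (not from the thread) that feeds a parser some anchors that only carry a name attribute. The class name AnchorAudit and the sample markup are made up for illustration; the idea is simply that checking for href before using it keeps named anchors out of the results.

```python
from html.parser import HTMLParser


class AnchorAudit(HTMLParser):
    """Separate <a> tags that link somewhere from those that do not."""

    def __init__(self):
        super().__init__()
        self.with_href = []
        self.without_href = 0

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Only anchors with a non-empty href count as links
            self.with_href.append(href)
        else:
            # e.g. <a name="top">, a named anchor rather than a link
            self.without_href += 1


sample = (
    '<a name="top"></a>'
    '<a href="http://foo.com">foo</a>'
    '<a name="end">end</a>'
)
audit = AnchorAudit()
audit.feed(sample)
print(audit.with_href, audit.without_href)
```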

14 votes 2 revs, 2 users 86% Pesto #3

Language: Ruby
Library: Hpricot

#!/usr/bin/ruby

require 'hpricot'

html = '<html><body>'
['foo', 'bar', 'baz'].each {|link| html += "<a href=\"http://#{link}.com\">#{link}</a>" }
html += '</body></html>'

doc = Hpricot(html)
doc.search('//a').each {|elm| puts elm.attributes['href'] }

5 votes 3 revs, 2 users 91% Tanktalus #4

Language: Perl
Library: XML::Twig

#!/usr/bin/perl
use strict;
use warnings;
use Encode ':all';

use LWP::Simple;
use XML::Twig;

#my $url = 'http://stackoverflow.com/questions/773340/can-you-provide-an-example-of-parsing-html-with-your-favorite-parser';
my $url = 'http://www.google.com';
my $content = get($url);
die "Couldn't fetch!" unless defined $content;

my $twig = XML::Twig->new();
$twig->parse_html($content);

my @hrefs = map {
    $_->att('href');
} $twig->get_xpath('//*[@href]');

print "$_\n" for @hrefs;

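Caveat: You can get wide-character errors with pages like this one (changing the url to the one commented out will produce this error), but the HTML::Parser solution above doesn't share this problem.

0 votes Chas. Owens 4/22/2009

Nice, I use XML::Twig all the time and never realized there was a parse_html method.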

15 votes 3 revs, 2 users 73% depesz #5

Language: shell
Library: lynx (well, it's not a library, but in shell every program is kind of a library)

lynx -dump -listonly http://news.google.com/

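0 votes Chas. Owens 4/22/2009

+1 for trying, +1 for a working solution, -1 for the solution not being generalizable to other tasks: net +1

7 votes 4/22/2009

Well, the task was pretty well defined - it had to extract links from "a" tags. :)

0 votes Chas. Owens 4/22/2009

Yes, but it is defined as an example to show how to parse; I could just as easily have asked you to print all of the contents of <td> tags that have the class "phonenum".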
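
As an illustration of that alternative task, a general-purpose parser can be pointed at <td class="phonenum"> cells just as easily as at links. The following Python 3 sketch is my own and assumes made-up sample markup and a hypothetical class name PhoneCells; it also assumes table cells are not nested inside each other.

```python
from html.parser import HTMLParser


class PhoneCells(HTMLParser):
    """Collect the text of every <td> whose class list includes "phonenum"."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        # A class attribute may hold several space-separated names
        classes = (dict(attrs).get("class") or "").split()
        if tag == "td" and "phonenum" in classes:
            self._in_cell = True
            self.cells.append("")

    def handle_data(self, data):
        if self._in_cell:
            self.cells[-1] += data

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False


table = (
    '<table><tr><td class="name">Ann</td>'
    '<td class="phonenum">555-0100</td></tr>'
    '<tr><td class="name">Bob</td>'
    '<td class="phonenum mobile">555-0199</td></tr></table>'
)
parser = PhoneCells()
parser.feed(table)
parser.close()
print(parser.cells)
```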

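3 votes Tanktalus 4/22/2009

I agree that this does not help with the general question, but the specific question is likely to be a popular one, so it seems to me to be reasonable to include for a specific domain of the general problem.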

22 votes 5 revs, 4 users 82% Paolo Bergantino #6

Language: Python
Library: BeautifulSoup

from BeautifulSoup import BeautifulSoup

html = "<html><body>"
for link in ("foo", "bar", "baz"):
    html += '<a href="http://%s.com">%s</a>' % (link, link)
html += "</body></html>"

soup = BeautifulSoup(html)
links = soup.findAll('a', href=True) # find <a> with a defined href attribute
print links

Output:

[<a href="http://foo.com">foo</a>,
 <a href="http://bar.com">bar</a>,
 <a href="http://baz.com">baz</a>]

Also possible:

for link in links:
    print link['href']

Output:

http://foo.com
http://bar.com
http://baz.com

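0 votes Chas. Owens 4/22/2009

This is nice, but does BeautifulSoup give a way to look into the tags to get the attributes? *goes to look at the docs*

1 vote Paolo Bergantino 4/22/2009

The output in the first example is just the text representation of the matched links; they are actually objects that you can do all sorts of fun stuff with.

1 vote Chas. Owens 4/22/2009

Yeah, I just read the documentation; you just beat me to fixing the code. I did add the try/catch to prevent it from blowing up when href isn't there. Apparently "'href' in link" doesn't work.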

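5 votes 3 revs, 2 users 67% runrig #7

Language: Perl
Library: HTML::Parser
Purpose: How can I remove unused, nested HTML span tags with a Perl regex?

9 votes user80168 #8

Language: Perl
Library: HTML::LinkExtor

The beauty of Perl is that you have modules for very specific tasks. Like link extraction.

Whole program: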

#!/usr/bin/perl -w
use strict;

use HTML::LinkExtor;
use LWP::Simple;

my $url     = 'http://www.google.com/';
my $content = get( $url );

my $p       = HTML::LinkExtor->new( \&process_link, $url, );
$p->parse( $content );

exit;

sub process_link {
    my ( $tag, %attr ) = @_;

    return unless $tag eq 'a';
    return unless defined $attr{ 'href' };

    print "- $attr{'href'}\n";
    return;
}

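Explanation:

use strict - turns on "strict" mode - eases potential debugging, not fully relevant to the example
use HTML::LinkExtor - loads the interesting module
use LWP::Simple - just a simple way to get some html for tests
my $url = 'http://www.google.com/' - which page we will be extracting urls from
my $content = get( $url ) - fetches the page html
my $p = HTML::LinkExtor->new( \&process_link, $url ) - creates the LinkExtor object, giving it a reference to the function that will be used as a callback for every url, and $url to use as the BASEURL for relative urls
$p->parse( $content ) - pretty obvious, I guess
exit - end of program
sub process_link - beginning of the function process_link
my ( $tag, %attr ) - gets the arguments, which are the tag name and its attributes
return unless $tag eq 'a' - skips processing if the tag is not <a>
return unless defined $attr{'href'} - skips processing if the <a> tag doesn't have an href attribute
print "- $attr{'href'}\n"; - pretty obvious, I guess :)
return; - finishes the function

That's all.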
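
The BASEURL step is the part that is easiest to overlook: relative hrefs only become usable once they are resolved against the page they came from. As a rough Python 3 analogue (not the Perl module's implementation), the sketch below resolves each href with urllib.parse.urljoin. The class name AbsoluteLinks, the base URL, and the sample markup are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AbsoluteLinks(HTMLParser):
    """Collect hrefs from <a> tags, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return  # skip everything that is not an anchor
        href = dict(attrs).get("href")
        if href is None:
            return  # skip anchors without an href attribute
        # urljoin leaves absolute URLs alone and resolves relative ones
        self.links.append(urljoin(self.base_url, href))


page = (
    '<a href="/about">About</a>'
    '<a href="docs/intro.html">Docs</a>'
    '<a href="http://other.example/x">Elsewhere</a>'
    '<a name="top">no link here</a>'
)
extractor = AbsoluteLinks("http://www.example.com/help/")
extractor.feed(page)
for link in extractor.links:
    print("- " + link)
```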

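0 votes Chas. Owens 4/22/2009

Nice, but I think you are missing the point of the question: the example is there so that the code will be similar, not because I want the links. Think in more general terms. The goal is to provide people with the tools to use parsers instead of regexes.

5 votes 4/22/2009

I may be missing something, but in the problem description I read: "For the sake of consistency, I ask that the example be parsing an HTML file for the href in anchor tags." If you had asked for parsing <td> tags, for example, I would probably use HTML::TableExtract. Basically, a specialized tool beats (in my opinion) a general tool.

0 votes Chas. Owens 4/22/2009

Fine, find all span tags that have the class "to_understand_intent" that are within div tags whose class is "learn". Specialized tools are great, but they are just that: specialized. You will need to know the general tools one day. And this is a question about the general tools, not specialized libraries that use those tools.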
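
For what it's worth, that request is still manageable with an event-based parser once you track nesting yourself. The Python 3 sketch below is one possible way to do it, not something posted in the thread; the class name IntentSpans and the sample document are made up, and it assumes reasonably well-nested div and span tags.

```python
from html.parser import HTMLParser


class IntentSpans(HTMLParser):
    """Text of <span class="to_understand_intent"> found inside <div class="learn">."""

    def __init__(self):
        super().__init__()
        self.texts = []
        self._div_stack = []  # one bool per open <div>: does it have class "learn"?
        self._span_depth = 0  # greater than 0 while inside a matching <span>

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div":
            self._div_stack.append("learn" in classes)
        elif tag == "span":
            if self._span_depth:
                self._span_depth += 1  # a span nested inside a match
            elif any(self._div_stack) and "to_understand_intent" in classes:
                self._span_depth = 1
                self.texts.append("")

    def handle_endtag(self, tag):
        if tag == "div" and self._div_stack:
            self._div_stack.pop()
        elif tag == "span" and self._span_depth:
            self._span_depth -= 1

    def handle_data(self, data):
        if self._span_depth:
            self.texts[-1] += data


doc = (
    '<div class="learn"><p><span class="to_understand_intent">inside</span></p></div>'
    '<div class="other"><span class="to_understand_intent">outside</span></div>'
    '<div class="learn"><span class="note">skip</span>'
    '<div><span class="to_understand_intent big">nested</span></div></div>'
)
finder = IntentSpans()
finder.feed(doc)
finder.close()
print(finder.texts)
```

Libraries with selector support (several appear in the answers here) would express the same query in one line; the point of the sketch is only that the general tool can do it.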

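4 votes 4/22/2009

For this new request, of course HTML::Parser will be much better. But just saying "use HTML::Parser" is simply wrong. One should use the proper tool for a given task. For extracting hrefs, I would say that using HTML::Parser is overkill. The same goes for extracting <td>s. Asking "give me a general way to parse..." is wrong, as it assumes there exists one tool (in a language) that is right for all cases. Personally, I parse HTML in at least six different ways, depending on what I need to do.

0 votes Chas. Owens 4/22/2009

Look at the task again. The task is not to get links from an HTML page; it is to demonstrate how your favorite parser works, using getting the links in an HTML page as an example. It was chosen because it is a simple task that involves finding the right tag and looking at a piece of data inside of it. It was also chosen because it is a common task. Because it is a common task, Perl has automated it for you, but that doesn't mean this question is asking you to give the automated answer.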

25 votes 2 revs, 2 users 95% alexn #9

Language: C#
Library: HtmlAgilityPack

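using System;
using HtmlAgilityPack;

class Program
{
    static void Main(string[] args)
    {
        var web = new HtmlWeb();
        var doc = web.Load("http://www.stackoverflow.com");

        // Select every <a> element that carries an href attribute
        var nodes = doc.DocumentNode.SelectNodes("//a[@href]");

        foreach (var node in nodes)
        {
            // Print the link target rather than the anchor's inner markup
            Console.WriteLine(node.Attributes["href"].Value);
        }
    }
}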

4 votes 3 revs, 3 users 69% Ward Werbrouck #10

Language: JavaScript
Library: DOM

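var links = document.links;
// Iterate by index: for...in would also visit non-element properties such as length
for (var i = 0; i < links.length; i++) {
    var href = links[i].href;
    if (href != null) console.debug(href);
}

(Using Firebug's console.debug for output...)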

29 votes 5 revs, 3 users 84% Ward Werbrouck #11

Language: JavaScript
Library: jQuery

$.each($('a[href]'), function(){
    console.debug(this.href);
});

(Using Firebug's console.debug for output...)

And loading any html page:

$.get('http://stackoverflow.com/', function(page){
     $(page).find('a[href]').each(function(){
        console.debug(this.href);
    });
});

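Used another each function for this one; I think it's cleaner when chaining methods.

0 votes Ward Werbrouck 4/22/2009

Umm, yeah, if you look at it like that. :) But using JavaScript/jQuery to parse HTML just feels very natural; it's perfectly suited for stuff like that.

0 votes Chas. Owens 4/22/2009

Using the browser as the parser is the ultimate parser. The DOM in a given browser is the document tree.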

4 votes 2 revs, 2 users 90% zigzag #12

Language: C#
Library: System.Xml (standard .NET)

using System;
using System.Collections.Generic;
using System.Xml;

class Program
{
    public static void Main(string[] args)
    {
        List<string> matches = new List<string>();

        XmlDocument xd = new XmlDocument();
        xd.LoadXml("<html>...</html>");

        // Walk the tree from the first top-level node and collect every href
        FindHrefs(xd.FirstChild, matches);

        foreach (string href in matches)
            Console.WriteLine(href);
    }

    static void FindHrefs(XmlNode xn, List<string> matches)
    {
        if (xn.Attributes != null && xn.Attributes["href"] != null)
            matches.Add(xn.Attributes["href"].InnerXml);

        foreach (XmlNode child in xn.ChildNodes)
            FindHrefs(child, matches);
    }
}

0 votes Chas. Owens 4/22/2009

Will this work if the HTML is not valid XML (e.g. unclosed img tags)?
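
The concern is easy to demonstrate in general terms. The Python 3 sketch below does not test the .NET classes used in the answer; it only shows, as an analogy, that a strict XML parser (xml.etree.ElementTree here) rejects markup with an unclosed img tag while an HTML-aware parser (html.parser) still finds the link. The names Hrefs, snippet, and xml_ok are made up for the example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Perfectly ordinary HTML, but not well-formed XML: <img> is never closed
snippet = '<html><body><img src="logo.png"><a href="http://foo.com">foo</a></body></html>'

# A strict XML parser refuses the document outright
try:
    ET.fromstring(snippet)
    xml_ok = True
except ET.ParseError:
    xml_ok = False


class Hrefs(HTMLParser):
    """Collect href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        if tag == "a" and href is not None:
            self.found.append(href)


# An HTML parser treats <img> as a normal void element and carries on
lenient = Hrefs()
lenient.feed(snippet)
print(xml_ok, lenient.found)
```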

3 votes 3 revs, 3 users 80% Adam #13

Language: Python
Library: lxml.html

import lxml.html

html = "<html><body>"
for link in ("foo", "bar", "baz"):
    html += '<a href="http://%s.com">%s</a>' % (link, link)
html += "</body></html>"

tree = lxml.html.document_fromstring(html)
for element, attribute, link, pos in tree.iterlinks():
    if attribute == "href":
        print link

lxml also has a CSS selector class for traversing the DOM, which can make using it very similar to using jQuery:

for a in tree.cssselect('a[href]'):
    print a.get('href')

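0 votes Chas. Owens 4/22/2009

Hmm, I am getting "ImportError: No module named html" when I try to run this; is there something I need other than python-lxml?

0 votes Chas. Owens 4/22/2009

Ah, I have version 1.3.6, and lxml.html comes with 2.0 and later.

0 votes Adam 4/22/2009

Indeed. I can provide an example of using lxml.etree to do the job as well, if you like? lxml.html is a bit more tolerant of broken HTML.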

5 votes 2 revs laz #14

Language: Java
Library: XOM, TagSoup

I've included intentionally malformed and inconsistent XML in this sample.

import java.io.IOException;

import nu.xom.Builder;
import nu.xom.Document;
import nu.xom.Element;
import nu.xom.Node;
import nu.xom.Nodes;
import nu.xom.ParsingException;
import nu.xom.ValidityException;

import org.ccil.cowan.tagsoup.Parser;
import org.xml.sax.SAXException;

public class HtmlTest {
    public static void main(final String[] args) throws SAXException, ValidityException, ParsingException, IOException {
        final Parser parser = new Parser();
        parser.setFeature(Parser.namespacesFeature, false);
        final Builder builder = new Builder(parser);
        final Document document = builder.build("<html><body><ul><li><a href=\"http://google.com\">google</li><li><a HREF=\"http://reddit.org\" target=\"_blank\">reddit</a></li><li><a name=\"nothing\">nothing</a><li></ul></body></html>", null);
        final Element root = document.getRootElement();
        final Nodes links = root.query("//a[@href]");
        for (int linkNumber = 0; linkNumber < links.size(); ++linkNumber) {
            final Node node = links.get(linkNumber);
            System.out.println(((Element) node).getAttributeValue("href"));
        }
    }
}

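By default, TagSoup adds an XML namespace referencing XHTML to the document. I chose to suppress that in this sample. Using the default behavior would require the call to root.query to include a namespace, like so: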

root.query("//xhtml:a[@href]", new nu.xom.XPathContext("xhtml", root.getNamespaceURI()));

0 votes laz 4/22/2009

I believe either will work fine. TagSoup was made to parse anything you could throw at it.

20 votes 4 revs, 3 users 91% draegtun #15

Language: Perl
Library: pQuery

use strict;
use warnings;
use pQuery;

my $html = join '',
    "<html><body>",
    (map { qq(<a href="http://$_.com">$_</a>) } qw/foo bar baz/),
    "</body></html>";

pQuery( $html )->find( 'a' )->each(
    sub {  
        my $at = $_->getAttribute( 'href' ); 
        print "$at\n" if defined $at;
    }
);

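1 vote 4/22/2009

This is brilliant. Never knew of pQuery, but it looks very cool.

0 votes Ward Werbrouck 4/22/2009

Can you search for 'a[@href]' or 'a[href]' like in jQuery? It would simplify the code and surely be faster.

1 vote draegtun 4/22/2009

Here are some other Stack Overflow questions with pQuery answers... stackoverflow.com/questions/713827/... stackoverflow.com/questions/574199/... stackoverflow.com/questions/254345/... stackoverflow.com/questions/221091/...

0 votes draegtun 4/22/2009

@code-is-art: Unfortunately not yet... to quote the author from the docs: "The selector syntax is still very limited. (Single tags, IDs and classes only)". Check the tests, because pQuery does have features that aren't in the documentation, e.g. say 'Number of <td> with "blah" content - ', pQuery('td:contains(blah)')->size;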

8 votes 2 revs, 2 users 81% Dan Glegg #16

Language: Ruby
Library: Nokogiri

#!/usr/bin/env ruby
require 'nokogiri'
require 'open-uri'

document = Nokogiri::HTML(open("http://google.com"))
document.css("html head title").first.content
=> "Google"
document.xpath("//title").first.content
=> "Google"

3 votes 5 revs, 3 users 51% Chas. Owens #17

Language: Perl
Library: HTML::TreeBuilder

use strict;
use HTML::TreeBuilder;
use LWP::Simple;

my $content = get 'http://www.stackoverflow.com';
my $document = HTML::TreeBuilder->new->parse($content)->eof;

for my $a ($document->find('a')) {
    print $a->attr('href'), "\n" if $a->attr('href');
}

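0 votes Chas. Owens 4/22/2009

This is also incorrect: you must call $document->eof; if you use $document->parse($html); and it prints empty lines when href is not set.

0 votes dfa 4/22/2009

Reverted to my original code; ->eof() is useless in this example; checking for href presence is also pointless in this example.

0 votes Chas. Owens 4/23/2009

Is there a reason you don't want to use new_from_content?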

4 votes 4 revs, 2 users 81% Ward Werbrouck #18

Language: PHP
Library: SimpleXML (and DOM)

<?php
$page = new DOMDocument();
$page->strictErrorChecking = false;
$page->loadHTMLFile('http://stackoverflow.com/questions/773340');
$xml = simplexml_import_dom($page);

$links = $xml->xpath('//a[@href]');
foreach($links as $link)
    echo $link['href']."\n";

3 votes 5 revs, 2 users 92% Alex Reynolds #19

Language: Objective-C
Library: libxml2 + Matt Gallagher's libxml2 wrappers + Ben Copsey's ASIHTTPRequest

ASIHTTPRequest *request = [[ASIHTTPRequest alloc] initWithURL:[NSURL URLWithString:@"http://stackoverflow.com/questions/773340"]];
[request start];
NSError *error = [request error];
if (!error) {
    NSData *response = [request responseData];
    NSLog(@"Data: %@", [[self query:@"//a[@href]" withResponse:response] description]);
    [request release];
}
else 
    @throw [NSException exceptionWithName:@"kMyHTTPRequestFailed" reason:@"Request failed!" userInfo:nil];

...

- (id) query:(NSString *)xpathQuery withResponse:(NSData *)resp {
    NSArray *nodes = PerformHTMLXPathQuery(resp, xpathQuery);
    if (nodes != nil)
        return nodes;
    return nil;
}

8 votes 3 revs, 2 users 96% dmitry_vk #20

Language: Common Lisp
Library: Closure Html, Closure Xml, CL-WHO

(shown using the DOM API, without using the XPATH or STP API)

(defvar *html*
  (who:with-html-output-to-string (stream)
    (:html
     (:body (loop
               for site in (list "foo" "bar" "baz")
               do (who:htm (:a :href (format nil "http://~A.com/" site))))))))

(defvar *dom*
  (chtml:parse *html* (cxml-dom:make-dom-builder)))

(loop
   for tag across (dom:get-elements-by-tag-name *dom* "a")
   collect (dom:get-attribute tag "href"))
=> 
("http://foo.com/" "http://bar.com/" "http://baz.com/")

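0 votes Chas. Owens 4/28/2009

Does collect or dom:get-attribute correctly handle tags that do not have href set?

2 votes dmitry_vk 4/28/2009

Depends on the definition of correctness. In the example shown, empty strings will be collected for "a" tags without an "href" attribute. If the loop is rewritten as (loop for tag across (dom:get-elements-by-tag-name *dom* "a") when (string/= (dom:get-attribute tag "href") "") collect (dom:get-attribute tag "href")), then only non-empty "href"s will be collected.

0 votes dmitry_vk 4/28/2009

Actually, that's not when (string/= (dom:get-attribute tag "href") "") but when (dom:has-attribute tag "href").

0 votes davorb 2/17/2013

How would you do it without the loop macro?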

6 votes Michał Marczyk #21

Language: Clojure
Library: Enlive (a selector-based (à la CSS) templating and transformation system for Clojure)

Selector expression:

(def test-select
     (html/select (html/html-resource (java.io.StringReader. test-html)) [:a]))

Now we can do the following at the REPL (I've added line breaks in test-select):

user> test-select
({:tag :a, :attrs {:href "http://foo.com/"}, :content ["foo"]}
 {:tag :a, :attrs {:href "http://bar.com/"}, :content ["bar"]}
 {:tag :a, :attrs {:href "http://baz.com/"}, :content ["baz"]})
user> (map #(get-in % [:attrs :href]) test-select)
("http://foo.com/" "http://bar.com/" "http://baz.com/")

You'll need the following to try this out:

Preamble:

(require '[net.cgrand.enlive-html :as html])

Test HTML:

(def test-html
     (apply str (concat ["<html><body>"]
                        (for [link ["foo" "bar" "baz"]]
                          (str "<a href=\"http://" link ".com/\">" link "</a>"))
                        ["</body></html>"])))

0 votes Michał Marczyk 3/6/2010

Not sure if I'd call Enlive a "parser", but I'd certainly use it in place of one, so here's an example.

1 vote Entea #22

Language: PHP
Library: DOM

<?php
$doc = new DOMDocument();
$doc->strictErrorChecking = false;
$doc->loadHTMLFile('http://stackoverflow.com/questions/773340');
$xpath = new DOMXpath($doc);

$links = $xpath->query('//a[@href]');
for ($i = 0; $i < $links->length; $i++)
    echo $links->item($i)->getAttribute('href'), "\n";

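Sometimes it's useful to put the @ symbol before $doc->loadHTMLFile to suppress invalid html parsing warnings.

0 votes Ward Werbrouck 10/29/2010

Almost the same as my PHP version (stackoverflow.com/questions/773340/...). You don't need the getAttribute call.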

1 vote seagulf #23

Language: Python
Library: HTQL

import htql; 

page="<a href=a.html>1</a><a href=b.html>2</a><a href=c.html>3</a>";
query="<a>:href,tx";

for url, text in htql.HTQL(page, query): 
    print url, text;

Simple and intuitive.

1 vote 3 revs laz #24

Language: Java
Library: jsoup

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class HtmlTest {
    // Parsing from a string declares no checked exceptions, so no throws clause is needed
    public static void main(final String[] args) {
        final Document document = Jsoup.parse("<html><body><ul><li><a href=\"http://google.com\">google</li><li><a HREF=\"http://reddit.org\" target=\"_blank\">reddit</a></li><li><a name=\"nothing\">nothing</a><li></ul></body></html>");
        final Elements links = document.select("a[href]");
        for (final Element element : links) {
            System.out.println(element.attr("href"));
        }
    }
}

4 votes 2 revs, 2 users 59% Ryan Culpepper #25

Language: Racket

Library: (planet ashinn/html-parser:1) and (planet clements/sxml2:1)

(require net/url
         (planet ashinn/html-parser:1)
         (planet clements/sxml2:1))

(define the-url (string->url "http://stackoverflow.com/"))
(define doc (call/input-url the-url get-pure-port html->sxml))
(define links ((sxpath "//a/@href/text()") doc))

The above example using packages from the new package system: html-parsing and sxml

(require net/url
         html-parsing
         sxml)

(define the-url (string->url "http://stackoverflow.com/"))
(define doc (call/input-url the-url get-pure-port html->xexp))
(define links ((sxpath "//a/@href/text()") doc))

Note: Install the required packages with 'raco' from a command line using:

raco pkg install html-parsing

and:

raco pkg install sxml

1 vote 2 revs the Tin Man #26

Language: Ruby
Library: Nokogiri

#!/usr/bin/env ruby

require "nokogiri"
require "open-uri"

doc = Nokogiri::HTML(open('http://www.example.com'))
hrefs = doc.search('a').map{ |n| n['href'] }

puts hrefs

Which outputs:

/
/domains/
/numbers/
/protocols/
/about/
/go/rfc2606
/about/
/about/presentations/
/about/performance/
/reports/
/domains/
/domains/root/
/domains/int/
/domains/arpa/
/domains/idn-tables/
/protocols/
/numbers/
/abuse/
http://www.icann.org/
mailto:[email protected]?subject=General%20website%20feedback

Here's a minor spin on the one above, resulting in output that is usable for a report. I only return the first and last elements in the list of hrefs:

#!/usr/bin/env ruby

require "nokogiri"
require "open-uri"

doc = Nokogiri::HTML(open('http://nokogiri.org'))
hrefs = doc.search('a[href]').map{ |n| n['href'] }

puts hrefs
  .each_with_index                     # add an array index
  .minmax{ |a,b| a.last <=> b.last }   # find the first and last element
  .map{ |h,i| '%3d %s' % [1 + i, h ] } # format the output

  1 http://github.com/tenderlove/nokogiri
100 http://yokolet.blogspot.com

0 votes jGc #27

Using phantomjs, save this file as extract-links.js:

var page = new WebPage(),
    url = 'http://www.udacity.com';

page.open(url, function (status) {
    if (status !== 'success') {
        console.log('Unable to access network');
    } else {
        var results = page.evaluate(function() {
            var list = document.querySelectorAll('a'), links = [], i;
            for (i = 0; i < list.length; i++) {
                links.push(list[i].href);
            }
            return links;
        });
        console.log(results.join('\n'));
    }
    phantom.exit();
});

Run:

$ ../path/to/bin/phantomjs extract-links.js

0 votes chewymole #28

Language: Coldfusion 9.0.1+

Library: jSoup

<cfscript>
function parseURL(required string url){
var res = [];
var javaLoader = createObject("javaloader.JavaLoader").init([expandPath("./jsoup-1.7.3.jar")]);
var jSoupClass = javaLoader.create("org.jsoup.Jsoup");
//var dom = jSoupClass.parse(html); // if you already have some html to parse.
var dom = jSoupClass.connect( arguments.url ).get();
var links = dom.select("a");
for(var a=1;a LTE arrayLen(links);a++){ // LTE so the last link is not skipped
    var s={};s.href= links[a].attr('href'); s.text= links[a].text(); 
    if(s.href contains "http://" || s.href contains "https://") arrayAppend(res,s); 
}
return res; 
}   

//writeoutput(writedump(parseURL(url)));
</cfscript>
<cfdump var="#parseURL("http://stackoverflow.com/questions/773340/can-you-provide-examples-of-parsing-html")#">

Returns an array of structures; each struct contains an HREF and a TEXT entry.
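
The same shape of result (a list of records holding each link's href and text) can be sketched with Python 3's standard library. This is my own illustration rather than a port of the ColdFusion code: the class name LinkRecords and the sample markup are hypothetical, and it uses a prefix check (startswith) where the function above uses a "contains" check.

```python
from html.parser import HTMLParser


class LinkRecords(HTMLParser):
    """Build a list of {"href": ..., "text": ...} dicts for absolute http(s) links."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._current = None  # the record being filled while inside an <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            self._current = {"href": href, "text": ""}

    def handle_data(self, data):
        # Text inside nested tags such as <b> also belongs to the link text
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            record, self._current = self._current, None
            # Keep only absolute http/https links
            if record["href"].startswith(("http://", "https://")):
                self.records.append(record)


markup = (
    '<a href="https://example.com/a">First <b>link</b></a>'
    '<a href="/local">Local</a>'
    '<a href="http://example.org">Second</a>'
)
collector = LinkRecords()
collector.feed(markup)
collector.close()
print(collector.records)
```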

0 votes 2 revs GabaGabaDev #29

Language: JavaScript/Node.js

Library: Request and Cheerio

var request = require('request');
var cheerio = require('cheerio');

var url = "https://news.ycombinator.com/";
request(url, function (error, response, html) {
    if (!error && response.statusCode == 200) {
        var $ = cheerio.load(html);
        var anchorTags = $('a');

        anchorTags.each(function(i,element){
            console.log(element["attribs"]["href"]);
        });
    }
});

The request library downloads the html document, and Cheerio lets you use jQuery CSS selectors to target the html document.
