How to parse HTML5 tags in Perl?

Asked by: Nikhil Ranjan Asked: 3/8/2016 Last edited by: Nikhil Ranjan Updated: 3/9/2016 Views: 359

Q:

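I want to parse HTML5 tags. When I parse the file, the parser complains about the <section> tag. I don't want it to report an error.

The error is a missing </section> tag.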

My input is:

<?xml version="1.0" encoding="utf-8"?><html xmlns:svg="http://www.w3.org/2000/svg" xmlns="http://www.w3.org/1999/xhtml" xmlns:m="http://www.w3.org/1998/Math/MathML" xml:lang="en" lang="en">
<head>
<link rel="stylesheet" type="text/css" title="day" href="../css/main.css"/>
<title>Electric Potential and Electric Potential Energy</title>
<meta charset="UTF-8"/>
<meta name="dcterms.conformsTo" content="PXE 1.39 ProductLevelReuse"/>
<meta name="generator" content="PXE Tools version 1.39.69"/>
</head>
<body>
<section class="chapter" ><header><h1 class="title"><span class="number">20</span> Electric Potential and Electric Potential Energy</h1></header>
<section class="frontmatter">
<section class="listgroup"><header><h1 class="title">Big Ideas</h1></header>
<ol>
<li><p>Electric potential energy is similar to gravitational potential energy.</p></li>
</ol>
</section>
</section>
</body>
</html>

My code is:

use warnings ;
use strict;
use HTML::Tidy;
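my $file_name = "d:/perl/test.xhtml";
# Use low-precedence 'or' so that die fires when open itself fails;
# with '||' the test applied to the file name string instead.
open my $xhtml_file, '<:encoding(UTF-8)', $file_name
    or die "no htm file found $!";
# Slurp the whole file; 'local' restores $/ automatically.
my $contents = do { local $/; <$xhtml_file> };
close $xhtml_file;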

my $tidy = HTML::Tidy->new();
$tidy->ignore(
                text => qr/DOCTYPE/,
                text => qr/html/,
                text => qr/meta/,
                text => qr/header/
);
$tidy->parse( "foo.html", $contents );
for my $message ( $tidy->messages )
    {
        print $message->as_string, "\n";
    }

The error log is:

foo.html (10:1) Error: <section> is not recognized!
foo.html (10:1) Warning: discarding unexpected <section>
foo.html (11:1) Error: <section> is not recognized!
foo.html (11:1) Warning: discarding unexpected <section>
foo.html (12:1) Error: <section> is not recognized!
foo.html (12:1) Warning: discarding unexpected <section>
foo.html (16:1) Warning: discarding unexpected </section>
foo.html (17:1) Warning: discarding unexpected </section>

How can I resolve this?

html perl html-parsing

Comments

0 upvotes simbabque 3/8/2016
What do you want to do with it after parsing?
5 upvotes choroba 3/8/2016
Cross-posted.
2 upvotes ikegami 3/8/2016
For starters, HTML::Tidy is for parsing HTML, not XHTML ("the XML serialization of HTML").
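3 upvotes ikegami 3/8/2016
Second, the error about the missing </section> is legitimate. That XML is not well-formed, because you have an unclosed "section" element. If you had used an XML parser, it would definitely have thrown an error. If you don't want that to happen, you should fix the error.
2 upvotes dgw 3/8/2016
From www.tidyp.com: tidyp is a program that can validate your HTML, as well as modify it to be more clean and standard. tidyp does not validate HTML 5.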
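
To illustrate the point about the unclosed element, here is a minimal sketch that tallies opening and closing tags for one element name. The helper name tag_balance is hypothetical, the markup is a trimmed-down copy of the <body> from the question, and the regex tally is only a rough check (it would be fooled by a ">" inside an attribute value, for example); it is not a substitute for a real XML parser.

```perl
use strict;
use warnings;

# Hypothetical helper: count opening and closing tags for one element name.
# Self-closing tags such as <section/> are counted as neither.
sub tag_balance {
    my ($markup, $name) = @_;
    my $open  = () = $markup =~ /<\Q$name\E\b[^>]*(?<!\/)>/g;
    my $close = () = $markup =~ /<\/\Q$name\E\s*>/g;
    return ($open, $close);
}

# Trimmed-down copy of the <body> from the question (text shortened).
my $body = <<'END';
<body>
<section class="chapter" ><header><h1 class="title">Title</h1></header>
<section class="frontmatter">
<section class="listgroup"><header><h1 class="title">Big Ideas</h1></header>
<ol><li><p>Item</p></li></ol>
</section>
</section>
</body>
END

my ($open, $close) = tag_balance($body, 'section');
printf "section: %d opened, %d closed\n", $open, $close;
# prints "section: 3 opened, 2 closed"
```

The mismatch (three opened, two closed) matches the observation that the outer "chapter" section is never closed, which is independent of whether the parser knows HTML5 element names.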

A:

0 upvotes dasgar 3/9/2016 #1

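According to its documentation, the HTML::Valid module is based on www.html-tidy.org and does support HTML5. It looks like it will give you the line and column numbers that you mention in this post on PerlMonks.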
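
As a rough sketch of what that could look like, assuming HTML::Valid is installed from CPAN and that its run method returns the tidied output together with a string of diagnostics (as the module's synopsis suggests; check the current documentation before relying on it). The sample markup is a shortened, well-formed HTML5 variant of the input from the question, not the original file.

```perl
use strict;
use warnings;
use HTML::Valid;

# Shortened HTML5 sample with a properly closed <section>.
my $html = <<'END';
<!DOCTYPE html>
<html lang="en">
<head><meta charset="UTF-8"><title>Test</title></head>
<body>
<section class="chapter"><header><h1 class="title">Big Ideas</h1></header>
<p>Electric potential energy is similar to gravitational potential energy.</p>
</section>
</body>
</html>
END

my $htv = HTML::Valid->new();

# Assumed call shape: run() takes the markup as a string and returns
# the cleaned-up markup plus any diagnostics as text.
my ($output, $errors) = $htv->run($html);

print defined $errors && length $errors ? $errors : "no diagnostics\n";
```

Because the underlying library is HTML5-aware, <section> and <header> should no longer be reported as unrecognized; an unclosed <section> in the input would still be worth fixing at the source.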