How can I extract the text from inside multiple sets of HTML tags?

Asked by: Rick Asked: 10/14/2016 Last edited by: Rick Updated: 10/16/2016 Views: 132

Q:

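I have a batch of text files from which I am trying to strip HTML tags. The text I want to keep in each file is between <TEXT> and </TEXT>. In some of these files, I also want to keep the text between a second instance of <TEXT> and </TEXT> in the lower half of the document.

HTML::Restrict does a great job of keeping all the relevant text in the first instance, but it does not seem to keep the text between the second instance of <TEXT> and </TEXT>.

My code is: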

$hr = HTML::Restrict->new() ;
$processed = $hr->process($doc) ;

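I cannot find any option in the HTML::Restrict module that I could adjust to make sure the second part of the text file is kept. Is there such an option, or is there a better way to accomplish this task? I have tried a few regular expressions, but so far I have run into similar problems with them.

Below is the original file. The resulting output is everything between the first instance of <TEXT> (right next to "UNITED STATES") and the first instance of </TEXT>, in the third gray box at the bottom.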

-----BEGIN PRIVACY-ENHANCED MESSAGE-----
Proc-Type: 2001,MIC-CLEAR
Originator-Name: [email protected]
Originator-Key-Asymmetric:
 MFgwCgYEVQgBAQICAf8DSgAwRwJAW2sNKK9AVtBzYZmr6aGjlWyK3XmZv3dTINen
 TWSM7vrzLADbmYQaionwg5sDW3P6oaM5D3tdezXMm7z1T+B+twIDAQAB
MIC-Info: RSA-MD5,RSA,
 VlTZCBM7TRNLONv/I0OgPsjKD23uR2Zn9/jJ4XrBQY8DlPxfH2+iX+W5TZjhZEQY
 shGRyuAw29phAaxb1IPhgQ==

<SEC-DOCUMENT>0001157523-06-001366.txt : 20060209
<SEC-HEADER>0001157523-06-001366.hdr.sgml : 20060209
<ACCEPTANCE-DATETIME>20060209161745
ACCESSION NUMBER:       0001157523-06-001366
CONFORMED SUBMISSION TYPE:  8-K
PUBLIC DOCUMENT COUNT:      2
CONFORMED PERIOD OF REPORT: 20060209
ITEM INFORMATION:       Results of Operations and Financial Condition
ITEM INFORMATION:       Financial Statements and Exhibits
FILED AS OF DATE:       20060209
DATE AS OF CHANGE:      20060209

FILER:

    COMPANY DATA:   
        COMPANY CONFORMED NAME:         ANALOG DEVICES INC
        CENTRAL INDEX KEY:          0000006281
        STANDARD INDUSTRIAL CLASSIFICATION: SEMICONDUCTORS & RELATED DEVICES [3674]
        IRS NUMBER:             042348234
        STATE OF INCORPORATION:         MA
        FISCAL YEAR END:            1205

    FILING VALUES:
        FORM TYPE:      8-K
        SEC ACT:        1934 Act"
        SEC FILE NUMBER:    001-07819
        FILM NUMBER:        06593279

    BUSINESS ADDRESS:   
        STREET 1:       ONE TECHNOLOGY WAY
        CITY:           NORWOOD
        STATE:          MA
        ZIP:            02062
        BUSINESS PHONE:     7813294700

    MAIL ADDRESS:   
        STREET 1:       ONE TECHNOLOGY WAY
        CITY:           NORWOOD
        STATE:          MA
        ZIP:            02062
</SEC-HEADER>
<DOCUMENT>
<TYPE>8-K
<SEQUENCE>1
<FILENAME>a5077045.txt
<DESCRIPTION>ANALOG DEVICES, INC., 8-K
<TEXT>

                                  UNITED STATES
                       SECURITIES AND EXCHANGE COMMISSION
                             Washington, D.C. 20549

                                    FORM 8-K

                                 CURRENT REPORT
     Pursuant to Section 13 OR 15(d) of The Securities Exchange Act of 1934


Date of Report (Date of earliest event reported):  February 9, 2006

                              Analog Devices, Inc.
- --------------------------------------------------------------------------------
             (Exact name of registrant as specified in its charter)

      Massachusetts               1-7819                  04-2348234
- --------------------------------------------------------------------------------
 (State or other juris-         (Commission              (IRS Employer
diction of incorporation       File Number)           Identification No.)


     One Technology Way, Norwood, MA                          02062
- --------------------------------------------------------------------------------
(Address of principal executive offices)                    (Zip Code)


Registrant's telephone number, including area code:  (781) 329-4700


- --------------------------------------------------------------------------------
          (Former name or former address, if changed since last report)


Check the appropriate box below if the Form 8-K filing is intended to
simultaneously satisfy the filing obligation of the registrant under any of the
following provisions (see General Instruction A.2. below):

|_|  Written communications pursuant to Rule 425 under the Securities Act (17
     CFR 230.425)

|_|  Soliciting material pursuant to Rule 14a-12 under the Exchange Act (17 CFR
     240.14a-12)

|_|  Pre-commencement communications pursuant to Rule 14d-2(b) under the
     Exchange Act (17 CFR 240.14d-2(b))

|_|  Pre-commencement communications pursuant to Rule 13e-4(c) under the
     Exchange Act (17 CFR 240.13e-4(c))


<PAGE>


Item 2.02.  Results of Operations and Financial Condition

     On February 9, 2006, Analog Devices, Inc. announced its financial results
for the quarter ended January 28, 2006. The full text of the press release
issued in connection with the announcement is attached as Exhibit 99.1 to this
Current Report on Form 8-K.

     The information in this Form 8-K and the exhibit attached hereto shall not
be deemed "filed" for purposes of Section 18 of the Securities Exchange Act of
1934 (the "Exchange Act") or otherwise subject to the liabilities of that
section, nor shall it be deemed incorporated by reference in any filing under
the Securities Act of 1933 or the Exchange Act, except as expressly set forth by
specific reference in such a filing.



                                  EXHIBIT INDEX

Exhibit No.                Description
- -----------                -----------

99.1                       Press release dated February 9, 2006 issued by Analog
                           Devices, Inc.
</TEXT>
</DOCUMENT>
<DOCUMENT>
<TYPE>EX-99.1
<SEQUENCE>2
<FILENAME>a5077045ex99_1.txt
<DESCRIPTION>EXHIBIT 99.1
<TEXT>
                                                                    Exhibit 99.1


                     Analog Devices Reports Results for the
                       First Quarter of Fiscal Year 2006

    NORWOOD, Mass.--(BUSINESS WIRE)--Feb. 9, 2006--Analog Devices,
Inc. (NYSE: ADI):

    --  Board of Directors declares dividend of $0.12 per share for
        the quarter.

    --  Financial results for the first quarter and guidance for the
        second quarter to be discussed on conference call today at
        4:30 pm.

    Analog Devices, Inc. (NYSE: ADI), a global leader in
high-performance semiconductors for signal processing applications,
today announced revenue of $621.3 million for the first quarter of
fiscal 2006, an increase of 7% compared to the same period one year
ago and approximately even with the immediately prior quarter's $622.1
million in revenue.



    CONTACT: Analog Devices, Inc.
             Maria Tagliaferro,781-461-3282
             Director of Corporate Communications,
             781-461-3491 (fax)
             [email protected]
</TEXT>
</DOCUMENT>
</SEC-DOCUMENT>
-----END PRIVACY-ENHANCED MESSAGE-----
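perl html parsing

Comments

3 upvotes simbabque 10/14/2016
Edit your question and include sample input and output.
0 upvotes Rick 10/15/2016
@simbabque Sure, but when I try to paste the original file into the question, how do I keep the sample HTML tags "<>" from hiding content in my post?
0 upvotes ThisSuitIsBlackNot 10/15/2016
@Rick Paste your code, highlight it, and click the {} button in the editor (or press Ctrl+K). That treats it as a code block (you have already done this once). See also the formatting help, and note that you can see what the raw Markdown looks like by clicking the edit button.
0 upvotes Rick 10/15/2016
@ThisSuitIsBlackNot I believe the copy-and-paste job above conveys the essence of the input file. The output is everything between the first instance of <TEXT> and </TEXT>.
0 upvotes ThisSuitIsBlackNot 10/15/2016
You formatted it as a block quote, not as code. It is still hard to read because some parts now have horizontal scrollbars. Please format it as code, as I explained in my previous comment.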

Answers:

1 upvote Justin Schell 10/15/2016 #1

This should give you all the matches (I tested it myself):

my @text = $doc =~ /<TEXT>(.*?)<\/TEXT>/gs;
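
A minimal, self-contained sketch of how this match behaves, assuming a shortened stand-in for the filing (the contents of $doc below are made up for illustration, not the real file). The /s modifier lets . match newlines so a block can span several lines, /g in list context returns every captured group, and the non-greedy .*? stops at the nearest </TEXT> instead of running on to the last one.

```perl
use strict;
use warnings;

# Hypothetical, shortened stand-in for the filing: two <TEXT> blocks.
my $doc = <<'END';
<DOCUMENT>
<TYPE>8-K
<TEXT>
first block
</TEXT>
</DOCUMENT>
<DOCUMENT>
<TYPE>EX-99.1
<TEXT>
second block
</TEXT>
</DOCUMENT>
END

# List context plus /g: one element per captured group, i.e. per <TEXT> block.
my @text = $doc =~ /<TEXT>(.*?)<\/TEXT>/gs;

print scalar(@text), " block(s) found\n";
print "---$_---\n" for @text;
```

Note that the match is case-sensitive as written; if some files use lowercase tags, adding the /i modifier would be one way to cover both.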

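Comments

0 upvotes Rick 10/15/2016
I must be making a mistake here, because I cannot get this to work for me. At least, when I try to print the contents of @text, nothing is printed.
0 upvotes Rick 10/15/2016
@text = $doc =~ /<TEXT>(.*?)<\/TEXT>/gs ; foreach (@text) { print "$_\n" ; }
0 upvotes Justin Schell 10/15/2016
Did you declare @text before that? If not, you need my @text = .... If so, are you sure the HTML content is in $doc?
0 upvotes Rick 10/15/2016
You are right, the problem was that I was losing the <TEXT> tags in $doc before creating that array. The problem is solved now. What is the best way to rejoin the elements of the @text array so that I can print them to an output file?
0 upvotes Justin Schell 10/15/2016
If you just need one unified string, you can do something like my $all_text = join("\n", @text);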
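
The join suggested above could be combined with writing the output file roughly like this. It is only a sketch: the @text contents and the file name output.txt are placeholders, and UTF-8 is an assumed encoding that may not match the real files.

```perl
use strict;
use warnings;

# Hypothetical extracted blocks standing in for the real @text array.
my @text = ("first block\n", "second block\n");

# Join with a newline so a blank line separates the blocks in the output.
my $all_text = join("\n", @text);

# 'output.txt' is a placeholder name; the encoding layer is an assumption.
my $out_file = 'output.txt';
open(my $out, '>:encoding(UTF-8)', $out_file)
    or die "Cannot open $out_file: $!";
print {$out} $all_text;
close($out) or die "Cannot close $out_file: $!";
```

Checking the return value of close as well as open means a failed write (for example, a full disk) is reported rather than silently ignored.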
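2 upvotes Sinan Ünür 10/16/2016 #2

Since you do not actually have an HTML document, you need a parser that will not be thrown off by all sorts of nonsense.

In the example below, I put the sample text above in the __DATA__ section of the script for convenience. In the real world, you should open the file with the appropriate encoding.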

#!/usr/bin/env perl

use strict;
use warnings;

use HTML::TokeParser::Simple;

my $parser = HTML::TokeParser::Simple->new(handle => \*DATA);

my @text;

while (my $token = $parser->get_token) {
    if ($token->is_start_tag('text')) {
        push @text, $parser->get_text('/text');
    }
}

print "[[[>>>$_<<<]]]\n\n" for @text;

__DATA__

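Comments

0 upvotes Rick 10/17/2016
I get the error message "Can't call method "get_token" on an undefined value". I have verified that my $text variable contains the raw data I want to clean up. Any idea what the problem might be?
0 upvotes Sinan Ünür 10/17/2016
That means constructing $parser failed. If you have the contents of the file in a scalar, use my $parser = HTML::TokeParser::Simple->new(string => $html_string); See the docs.
1 upvote Rick 10/17/2016
Yes, I see that your original code works when I refer to a file instead of a scalar. I think my problem is solved now. Thanks!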
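
The string-based constructor mentioned in the comments can be sketched as follows, assuming the file contents are already held in a scalar. Here $html_string is a shortened, made-up stand-in for the filing, and HTML::TokeParser::Simple has to be installed from CPAN for the script to run.

```perl
#!/usr/bin/env perl

use strict;
use warnings;

use HTML::TokeParser::Simple;

# Hypothetical, shortened stand-in for the slurped file contents.
my $html_string = <<'END';
<DOCUMENT>
<TYPE>8-K
<TEXT>
first block
</TEXT>
</DOCUMENT>
<DOCUMENT>
<TYPE>EX-99.1
<TEXT>
second block
</TEXT>
</DOCUMENT>
END

# string => parses a scalar; handle => (used in the answer) reads a filehandle.
my $parser = HTML::TokeParser::Simple->new(string => $html_string)
    or die "Could not construct the parser";

my @text;

while (my $token = $parser->get_token) {
    # The answer's code matches <TEXT> with the lowercase name 'text'.
    if ($token->is_start_tag('text')) {
        # Collect all text up to the next </TEXT>, skipping tags in between.
        push @text, $parser->get_text('/text');
    }
}

print "[[[>>>$_<<<]]]\n\n" for @text;
```

Passing a scalar under handle => (or a filehandle under string =>) is the kind of mismatch that leaves $parser undefined and produces the error quoted above, so the key has to match what is actually being passed in.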