Rick Rick - 1 month ago 11
Perl Question

How can I extract the text from inside multiple sets of HTML tags?

I have a batch of text files from which I am trying to remove HTML tags. The text that I want preserved in each file is between <TEXT> and </TEXT>. In some of these files, there is a second instance of <TEXT> and </TEXT> in the bottom half of the document that I want preserved as well.

HTML::Restrict works great for preserving all relevant text in the first instance, but it doesn't seem to preserve the text between the second instance of <TEXT> and </TEXT>.

My code is:

use HTML::Restrict;

my $hr = HTML::Restrict->new();
my $processed = $hr->process($doc);    # $doc holds the contents of one file


I can't discern any options within the HTML::Restrict module that I can tweak to ensure that the second part of the text file is preserved. Do such options exist, or is there a better way to accomplish this task? I've tried some regex, but so far I've run into a similar problem with that as well.

Below is the original file. The resulting output is everything between the first instance of <TEXT> (immediately above "UNITED STATES") and the first instance of </TEXT> in the third grey box from the bottom.

-----BEGIN PRIVACY-ENHANCED MESSAGE-----
Proc-Type: 2001,MIC-CLEAR
Originator-Name: webmaster@www.sec.gov
Originator-Key-Asymmetric:
MFgwCgYEVQgBAQICAf8DSgAwRwJAW2sNKK9AVtBzYZmr6aGjlWyK3XmZv3dTINen
TWSM7vrzLADbmYQaionwg5sDW3P6oaM5D3tdezXMm7z1T+B+twIDAQAB
MIC-Info: RSA-MD5,RSA,
VlTZCBM7TRNLONv/I0OgPsjKD23uR2Zn9/jJ4XrBQY8DlPxfH2+iX+W5TZjhZEQY
shGRyuAw29phAaxb1IPhgQ==

<SEC-DOCUMENT>0001157523-06-001366.txt : 20060209
<SEC-HEADER>0001157523-06-001366.hdr.sgml : 20060209
<ACCEPTANCE-DATETIME>20060209161745
ACCESSION NUMBER: 0001157523-06-001366
CONFORMED SUBMISSION TYPE: 8-K
PUBLIC DOCUMENT COUNT: 2
CONFORMED PERIOD OF REPORT: 20060209
ITEM INFORMATION: Results of Operations and Financial Condition
ITEM INFORMATION: Financial Statements and Exhibits
FILED AS OF DATE: 20060209
DATE AS OF CHANGE: 20060209

FILER:

COMPANY DATA:
COMPANY CONFORMED NAME: ANALOG DEVICES INC
CENTRAL INDEX KEY: 0000006281
STANDARD INDUSTRIAL CLASSIFICATION: SEMICONDUCTORS & RELATED DEVICES [3674]
IRS NUMBER: 042348234
STATE OF INCORPORATION: MA
FISCAL YEAR END: 1205

FILING VALUES:
FORM TYPE: 8-K
SEC ACT: 1934 Act"
SEC FILE NUMBER: 001-07819
FILM NUMBER: 06593279

BUSINESS ADDRESS:
STREET 1: ONE TECHNOLOGY WAY
CITY: NORWOOD
STATE: MA
ZIP: 02062
BUSINESS PHONE: 7813294700

MAIL ADDRESS:
STREET 1: ONE TECHNOLOGY WAY
CITY: NORWOOD
STATE: MA
ZIP: 02062
</SEC-HEADER>
<DOCUMENT>
<TYPE>8-K
<SEQUENCE>1
<FILENAME>a5077045.txt
<DESCRIPTION>ANALOG DEVICES, INC., 8-K
<TEXT>

UNITED STATES
SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549

FORM 8-K

CURRENT REPORT
Pursuant to Section 13 OR 15(d) of The Securities Exchange Act of 1934


Date of Report (Date of earliest event reported): February 9, 2006

Analog Devices, Inc.
- --------------------------------------------------------------------------------
(Exact name of registrant as specified in its charter)

Massachusetts 1-7819 04-2348234
- --------------------------------------------------------------------------------
(State or other juris- (Commission (IRS Employer
diction of incorporation File Number) Identification No.)


One Technology Way, Norwood, MA 02062
- --------------------------------------------------------------------------------
(Address of principal executive offices) (Zip Code)


Registrant's telephone number, including area code: (781) 329-4700


- --------------------------------------------------------------------------------
(Former name or former address, if changed since last report)


Check the appropriate box below if the Form 8-K filing is intended to
simultaneously satisfy the filing obligation of the registrant under any of the
following provisions (see General Instruction A.2. below):

|_| Written communications pursuant to Rule 425 under the Securities Act (17
CFR 230.425)

|_| Soliciting material pursuant to Rule 14a-12 under the Exchange Act (17 CFR
240.14a-12)

|_| Pre-commencement communications pursuant to Rule 14d-2(b) under the
Exchange Act (17 CFR 240.14d-2(b))

|_| Pre-commencement communications pursuant to Rule 13e-4(c) under the
Exchange Act (17 CFR 240.13e-4(c))


<PAGE>


Item 2.02. Results of Operations and Financial Condition

On February 9, 2006, Analog Devices, Inc. announced its financial results
for the quarter ended January 28, 2006. The full text of the press release
issued in connection with the announcement is attached as Exhibit 99.1 to this
Current Report on Form 8-K.

The information in this Form 8-K and the exhibit attached hereto shall not
be deemed "filed" for purposes of Section 18 of the Securities Exchange Act of
1934 (the "Exchange Act") or otherwise subject to the liabilities of that
section, nor shall it be deemed incorporated by reference in any filing under
the Securities Act of 1933 or the Exchange Act, except as expressly set forth by
specific reference in such a filing.



EXHIBIT INDEX

Exhibit No. Description
- ----------- -----------

99.1 Press release dated February 9, 2006 issued by Analog
Devices, Inc.
</TEXT>
</DOCUMENT>
<DOCUMENT>
<TYPE>EX-99.1
<SEQUENCE>2
<FILENAME>a5077045ex99_1.txt
<DESCRIPTION>EXHIBIT 99.1
<TEXT>
Exhibit 99.1


Analog Devices Reports Results for the
First Quarter of Fiscal Year 2006

NORWOOD, Mass.--(BUSINESS WIRE)--Feb. 9, 2006--Analog Devices,
Inc. (NYSE: ADI):

-- Board of Directors declares dividend of $0.12 per share for
the quarter.

-- Financial results for the first quarter and guidance for the
second quarter to be discussed on conference call today at
4:30 pm.

Analog Devices, Inc. (NYSE: ADI), a global leader in
high-performance semiconductors for signal processing applications,
today announced revenue of $621.3 million for the first quarter of
fiscal 2006, an increase of 7% compared to the same period one year
ago and approximately even with the immediately prior quarter's $622.1
million in revenue.



CONTACT: Analog Devices, Inc.
Maria Tagliaferro,781-461-3282
Director of Corporate Communications,
781-461-3491 (fax)
investor.relations@analog.com
</TEXT>
</DOCUMENT>
</SEC-DOCUMENT>
-----END PRIVACY-ENHANCED MESSAGE-----

Answer

Since you don't really have an HTML document, you want a parser that is not thrown off by the non-HTML content surrounding the tags.

In the example below, I put the sample text above in the __DATA__ section of my script for convenience. In the real world, you should open the file with the appropriate encoding.

#!/usr/bin/env perl

use strict;
use warnings;

use HTML::TokeParser::Simple;

my $parser = HTML::TokeParser::Simple->new(handle => \*DATA);

my @text;

while (my $token = $parser->get_token) {
    if ($token->is_start_tag('text')) {
        push @text, $parser->get_text('/text');
    }
}

print "[[[>>>$_<<<]]]\n\n" for @text;

__DATA__
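
For the real-world case mentioned above, where the input comes from files rather than the __DATA__ section, the same loop can be wrapped in a small helper. This is only a sketch: the name extract_text_blocks and the UTF-8 default are assumptions for illustration, not part of the original answer, and the encoding should be changed to whatever the files actually use. It relies on the handle => constructor argument of HTML::TokeParser::Simple and on the // operator from Perl 5.10 or later.

```perl
#!/usr/bin/env perl

use strict;
use warnings;

use HTML::TokeParser::Simple;

# Hypothetical helper: return the contents of every <TEXT>...</TEXT>
# block found in the file at $path, in document order.
sub extract_text_blocks {
    my ($path, $encoding) = @_;
    $encoding //= 'UTF-8';    # assumption; use the files' real encoding

    open my $fh, "<:encoding($encoding)", $path
        or die "Cannot open '$path': $!";

    my $parser = HTML::TokeParser::Simple->new(handle => $fh);

    my @blocks;
    while (my $token = $parser->get_token) {
        # Skip everything until an opening <TEXT> tag is seen.
        next unless $token->is_start_tag('text');

        # Collect the text up to the matching </TEXT>; tags inside
        # the block (such as <PAGE>) are dropped by get_text.
        push @blocks, $parser->get_text('/text');
    }

    close $fh;
    return @blocks;
}

# When run as a script, print the blocks of each file named on the
# command line.
unless (caller) {
    for my $path (@ARGV) {
        my @blocks = extract_text_blocks($path);
        for my $i (0 .. $#blocks) {
            print "$path, block ", $i + 1, ":\n$blocks[$i]\n\n";
        }
    }
}
```

Because the helper returns a list, a file with one <TEXT> block and a file with two are handled the same way; the caller can join the blocks or write them out separately.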