luikn - 7 months ago
Perl Question

Convert .sgm to .txt

I have some files in .sgm format and I have to evaluate them (apply a language model and obtain the perplexity of the text).

The main problem is that I need these files as plain text, i.e. in .txt format. However, I have searched the internet for an online converter or some kind of script that does this and could not find one.

Besides this, a teacher of mine sent me this Perl command:

perl -ne 'print $1."\n" if /<seg[^>]+>\s*(.*\S)\s*<.seg>/i;' < file.sgm > file


I have never worked with Perl and honestly have no idea how to use it. I think I have Perl installed:

$ perl -v

This is perl 5, version 18, subversion 2 (v5.18.2) built for darwin-thread-multi-2level
(with 2 registered patches, see perl -V for more detail)

Copyright 1987-2013, Larry Wall

Perl may be copied only under the terms of either the Artistic License or the
GNU General Public License, which may be found in the Perl 5 source kit.

Complete documentation for Perl, including FAQ lists, should be found on
this system using "man perl" or "perldoc perl". If you have access to the
Internet, point your browser at http://www.perl.org/, the Perl Home Page.


By the way, I am using Mac OS X.

Sample .sgm file:

<srcset setid="newsdiscusstest2015" srclang="any">
<doc sysid="ref" docid="39-Guardian" genre="newsdiscuss" origlang="en">
<p>
<seg id="1">This is perfectly illustrated by the UKIP numbties banning people with HIV.</seg>
<seg id="2">You mean Nigel Farage saying the NHS should not be used to pay for people coming to the UK as health tourists, and saying yes when the interviewer specifically asked if, with the aforementioned in mind, people with HIV were included in not being welcome.</seg>
<seg id="3">You raise a straw man and then knock it down with thinly veiled homophobia.</seg>


Output .txt file:


This is perfectly illustrated by the UKIP numbties banning
people with HIV. You mean Nigel Farage saying the
NHS should not be used to pay for people coming to the UK as health
tourists, and saying yes when the interviewer specifically asked if,
with the aforementioned in mind, people with HIV were included in not
being welcome. You raise a straw man and then knock
it down with thinly veiled homophobia.

Answer

You can try using this script to strip the SGML tags from the file:

#!/usr/bin/env perl
use strict;
use warnings;

use HTML::Parser;

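# Path of the SGML file to convert, taken from the command line.
my $file = $ARGV[0] or die "Usage: $0 file.sgm\n";

# Ignore every event (tags, comments, declarations) except plain text,
# which is printed unchanged.
HTML::Parser->new(default_h => [""],
    text_h => [ sub { print shift }, 'text' ]
  )->parse_file($file) or die "Failed to parse $file: $!";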

Save it as strip_sgml.pl, make it executable with chmod +x strip_sgml.pl, and use it as follows:

./strip_sgml.pl file.sgm > file.txt
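
If HTML::Parser is not available, the regex approach from the question may be enough. The following is only a minimal sketch: it assumes every `<seg>` element opens and closes on the same line, as in the sample, and the file names sample.sgm and sample.txt are placeholders chosen for illustration. It writes one segment per line and skips everything outside `<seg>` elements.

```shell
# Hypothetical two-segment input, modelled on the sample in the question.
cat > sample.sgm <<'EOF'
<srcset setid="newsdiscusstest2015" srclang="any">
<doc sysid="ref" docid="39-Guardian" genre="newsdiscuss" origlang="en">
<p>
<seg id="1">This is perfectly illustrated by the UKIP numbties banning people with HIV.</seg>
<seg id="3">You raise a straw man and then knock it down with thinly veiled homophobia.</seg>
EOF

# -n runs the code once per input line without printing the line itself;
# -e supplies the code on the command line.
# The regex captures the text between <seg ...> and </seg>, trimming
# surrounding whitespace, and the capture is printed followed by a newline.
perl -ne 'print "$1\n" if m{<seg[^>]*>\s*(.*\S)\s*</seg>}i' sample.sgm > sample.txt

cat sample.txt
```

Unlike the HTML::Parser script, this drops any text that is not inside a `<seg>` element, and it would miss a segment that spans several lines.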