jan jan - 6 months ago 10
Perl Question

Perl script search/replace and transform results



I am running a simple Perl script that, for every line starting with \txt, adds a copy of that line starting with \xtx. So far so good.



use strict;
use warnings;

$^I = '.bak';

while ( <> ) {

s/(\\txt )(.*)/$1$2\n\\xtx $2/g;

print;
}


Now I would like to "scrub" all the new lines starting with \xtx:


  1. Delete every character that is not a letter or whitespace (digits, punctuation, brackets), while keeping letters with diacritics.

  2. Convert everything to lower case.



And that's where my rudimentary programming skills end.

My text file looks like this:

\txt Text (.) with [ symbols and Num[bers (.2) and cháractẽrs with diacrítics
\abc More text ...


My script so far produces:

\txt Text (.) with [ symbols and Num[bers (.2) and cháractẽrs with diacrítics
\xtx Text (.) with [ symbols and Num[bers (.2) and cháractẽrs with diacrítics
\abc More text ...


And I would like to achieve:

\txt Text (.) with [ symbols and Num[bers (.2) and cháractẽrs with diacrítics
\xtx text with symbols and numbers and cháractẽrs with diacrítics
\abc More text ...


Any help much appreciated!

EDIT:

Here's a real example string:

\_sh v3.0 400 Text3

\ref 2013-05-01_08.36.14 001
\txt Djawy (.) de osẽ[ma (.2) EDJu::
\fts Te equivocaste, saliste,
\fte

\ELANParticipant #TBGD
\ELANBegin 00:00:05.367
\ELANEnd 00:00:06.521
\dt 26/May/2016

\ref 2013-05-01_08.36.14 002
\txt [A;;;;;;;;;;;;;
\fts A;;;;;;;;;;;;;
\fte
...


... everything should stay as is, except for the lines starting with \txt ...
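
A note on the "non-word characters" requirement above: Perl's \W class would not be enough here, because \w also matches digits and the underscore, while the desired output drops the "2". The Unicode property \pL (any letter, including letters with diacritics) is closer to what is asked for. The following is only a small sketch of that difference; the helper names strip_nonword and strip_nonletter are made up for illustration.

```perl
use strict;
use warnings;
use utf8;    # the string literal below contains non-ASCII characters

# Hypothetical helper: remove everything that is not \w or whitespace.
# Digits and underscores survive, because \w matches them.
sub strip_nonword {
    my ($s) = @_;
    $s =~ s/[^\w\s]+//g;
    return $s;
}

# Hypothetical helper: remove everything that is not a Unicode letter
# or whitespace. Digits and underscores are removed as well.
sub strip_nonletter {
    my ($s) = @_;
    $s =~ s/[^\pL\s]+//g;
    return $s;
}

binmode(STDOUT, ':encoding(UTF-8)');

my $text = 'Num[bers (.2) and cháractẽrs';
print lc(strip_nonword($text)),   "\n";    # the digit 2 is still there
print lc(strip_nonletter($text)), "\n";    # only letters and spaces remain
```

Note that removing "(.2)" leaves two adjacent spaces behind, so a follow-up step that collapses whitespace is still needed.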

sln sln
Answer

You could try this conversion:

Perl

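use strict;
use warnings;

# Decode input and encode output as UTF-8 so that \pL sees accented
# letters as single characters.
binmode(DATA,   ':encoding(UTF-8)');
binmode(STDOUT, ':encoding(UTF-8)');

while (<DATA>) {
    # Replace each \txt line with itself plus a scrubbed \xtx copy.
    s/^(\\txt )(.*)/GetConvertedLine($1, $2)/e;
    print;
}

sub GetConvertedLine {
    my ($txt, $body) = @_;
    my $newbody = $body;
    $newbody =~ s/[^\pL\s]+//g;    # drop everything except letters and whitespace
    $newbody =~ s/\s+/ /g;         # collapse runs of whitespace into one space
    $newbody = lc($newbody);
    return $txt . $body . "\n" . "\\xtx " . $newbody;
}

__DATA__
\txt Text (.) with [ symbols and Num[bers (.2) and cháractẽrs with diacrítics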

Output

\txt Text (.) with [ symbols and Num[bers (.2) and cháractẽrs with diacrítics
\xtx text with symbols and numbers and cháractẽrs with diacrítics
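
The script above reads its sample from DATA, whereas the question edits files in place through $^I and <>. A possible adaptation is sketched below, assuming the input files are UTF-8. The helper names scrub and add_xtx are hypothetical. Two things go beyond the answer and are assumptions: \pM is added to the character class so that diacritics stored as separate combining marks are kept, and a stray leading or trailing space is trimmed.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: return a scrubbed, lower-cased copy of decoded text.
sub scrub {
    my ($body) = @_;
    $body =~ s/[^\pL\pM\s]+//g;    # keep letters, combining marks, whitespace
    $body =~ s/\s+/ /g;            # collapse runs of whitespace
    $body =~ s/^ | $//g;           # trim a leading or trailing space
    return lc $body;
}

# Hypothetical helper: take one raw UTF-8 line and, if it starts with \txt,
# return it followed by a scrubbed \xtx line. Other lines pass through.
sub add_xtx {
    my ($raw) = @_;
    my $line = decode('UTF-8', $raw);
    $line =~ s{^(\\txt )(.*)}{
        my ($tag, $body) = ($1, $2);    # copy the captures before scrub runs
        "$tag$body\n\\xtx " . scrub($body)
    }e;
    return encode('UTF-8', $line);
}

# Edit in place (keeping .bak backups) only when files are named
# on the command line.
if (@ARGV) {
    $^I = '.bak';
    print add_xtx($_) while <>;
}
```

Run it as, for example, perl add_xtx.pl file.txt (the script name is made up); the original is kept as file.txt.bak, and every line that does not start with \txt is written back unchanged.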