jan jan - 6 months ago 8
Perl Question

Replace repetitions of specific text with perl script

I have a simple perl script that does a number of text replacements along the following lines:

#!/usr/bin/perl
{
open(my $in, "<", "Texts.txt") or die "No input: $!";
open(my $out, ">", "TeXed/Texts.tex") or die "No output directory: $!";
LINE: while (<$in>) {
s/(txt@)(.*)(?<!\t|\[)\[(.*)/\1\2\\ovl{}\3/g;#
# there are a bunch of other replacements like the above
print $out $_ ;
}
}


So far so good. The text I am running this script on is organized into blocks (not always of the same length). Each block starts with the same identifier (txt@) and then a unique label. Every label starts with a #.

What I would like to achieve is to delete all repeated labels – essentially I only want to keep every first instance of a label and replace/delete all subsequent ones until the label changes. In the below example the ones to be replaced/deleted are in bold.

txt@#Label1 some text

some more text

some more text

txt@#Label1 some other text

some more text

some more text

some more text

txt@#Label1 some random text

some more text

some more text

txt@#Label2 some text

some more text

some more text

some more text

txt@#Label1 some text

some more text

some more text

txt@#Label3 some text

some more text

some more text

txt@#Label3 some text

some more text

some more text

txt@#Label1 some text

some more text

some more text

etc.

Sorry for the long example – I couldn't come up with a better way of explaining this.

So I want to delete all repeated Label1, Label2, etc., but without modifying the rest of the text (some text, some more text) both on the same line but also on subsequent lines. The number of subsequent lines is not always the same (so it's not every n-th line that has to be replaced).

Is that possible with perl? Or any other way? (I'm not married to perl, if it is easier with another language I'd be happy to try that – I'm not a programmer though so detailed instructions would be highly appreciated).

Answer

Read all blocks into an array, one block per element. Then iterate through it, keeping track of the 'current label' -- the latest one picked up. Once a line with a label comes up compare it -- if it is the same it repeats so delete it, otherwise replace and we have the new 'current' one.

use warnings;
use strict;

open my $fh_in, '<', 'text_label.txt';
open my $fh_out, '>', 'new_text_label.txt';

my @blocks;    
# Set record separator to empty line to read blocks
READ: { 
    local $/ = "\n\n";
    while (my $bl = <$fh_in>) { push @blocks, $bl }   
};

# Our current (running) label
my $curr_label = '';

foreach my $bl (@blocks) 
{
    # If line with label fetch it otherwise (print and) skip
    my ($label) = $bl =~ m/txt@#(\w+)/;
    if (not $label) {
        # process non-label line as needed
        print $fh_out $bl;
        next;
    }       
    # Delete if repeated (matching the current), reset if new
    if ($curr_label eq $label) {
        $bl =~ s/(txt@)(?:#\w+)(.*)/$1$2/;
    }   
    else {
        $curr_label = $label;
    }   
    # process label-line as needed
    print $fh_out $bl;
}

This yields the required file.

The processing of the label-line can be organized differently so that it may be easier to do other processing.

foreach my $bl (@blocks) 
{
     # Process the label-line: delete or replace the label
     if (my $label = $bl =~ m/txt@#(\w+)/) {
        # Delete if repeated (matching the current), reset if new
        if ($curr_label eq $label) {
            $bl =~ s/(txt@)(?:#\w+)(.*)/$1$2/;
        }   
        else {
            $curr_label = $label;
        }
     }
     # Label fixed as needed. Process normally ...
}
Comments