Floran Gmehlin - 11 months ago
Linux Question

Delete duplicate lines in two sentence aligned files, Linux

I have a parallel corpus in two files (one in German, the other in English) where the sentences are aligned. That is, each line of one file has its translation on the same line of the other file.

However, in the German corpus, some sentences are still in English (or they are just weird tags), for example:

file.en, line 500: The house is small
file.de, line 500: Das Haus ist klein

file.en, line 501: The cat is big
file.de, line 501: The cat is big

file.en, line 444: EMEA/CVMP/424/01
file.de, line 444: EMEA/CVMP/424/01

As I need to preserve the order of the sentences, I would like to detect such duplicates (string1 == string2) and remove them from both files, so that the sentences are still aligned afterward.

I have seen some solutions with
, but none that match my problem.

Any thoughts?

NOTE: The files are several million lines long.

123

You could use a small Perl script, which never needs to hold more than the current pair of lines in memory.

It compares the two files line by line and writes a pair out only when the two lines differ.

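use warnings;
use strict;

# Open both inputs and both outputs; stop if any of them cannot be opened.
open(my $fh1, '<', 'file')     or die "Cannot open file: $!";
open(my $fh2, '<', 'file2')    or die "Cannot open file2: $!";
open(my $fh3, '>', 'outfile')  or die "Cannot open outfile: $!";
open(my $fh4, '>', 'outfile2') or die "Cannot open outfile2: $!";

# Read the two files in lockstep, one line from each per iteration.
while (my $line = <$fh1>) {
        my $line2 = <$fh2>;
        last unless defined $line2;    # file2 ended before file
        # Keep the pair only when the two lines differ.
        if ($line ne $line2) {
                print $fh3 $line;
                print $fh4 $line2;
        }
}

close($_) for $fh1, $fh2, $fh3, $fh4;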

Use it as:

perl script.pl
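
For anyone who would rather stay in the shell, the same lockstep comparison can probably be sketched with awk. This is not part of the answer above, only an illustration: the function name filter_pairs and its four-argument order are made up here, and it assumes a POSIX awk and file names without backslashes. Like the Perl script, it holds only one pair of lines in memory at a time.

```shell
#!/bin/sh
# Hypothetical helper (an assumption, not from the answer above):
#   filter_pairs SRC TGT OUT_SRC OUT_TGT
# Copies SRC and TGT to OUT_SRC and OUT_TGT, skipping every line number
# on which the two inputs are identical, so the outputs stay aligned.
filter_pairs() {
    # Start with empty outputs so they exist even if every pair is dropped.
    : > "$3" && : > "$4" || return 1
    awk -v src="$1" -v out_src="$3" -v out_tgt="$4" '
        {
            # awk iterates over TGT; fetch the matching SRC line by hand.
            if ((getline s < src) <= 0) exit    # SRC ended early or is unreadable
            # Appending "" forces a string comparison, never a numeric one.
            if ((s "") != ($0 "")) {
                print s  >> out_src
                print $0 >> out_tgt
            }
        }
    ' "$2"
}
```

With the file names from the question it would be called as filter_pairs file.en file.de clean.en clean.de, where clean.en and clean.de are placeholder output names. Running wc -l on both outputs afterward is a cheap check that they still have the same number of lines.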