Philippe Fisher - 1 month ago
Linux Question

Get list of all duplicates based on first column within large text/csv file in linux/ubuntu

I am trying to extract all the duplicates based on the first column/index of my very large text/csv file (7+ GB / 100+ million lines). The format is like so:

foo0:bar0
foo1:bar1
foo2:bar2


The first column is any lowercase UTF-8 string and the second column is any UTF-8 string. I have been able to sort my file based on the first column, and only the first column, with:

sort -t':' -k1,1 filename.txt > output_sorted.txt


I have also been able to drop all duplicates with:

sort -t':' -u -k1,1 filename.txt > output_uniq_sorted.txt


These operations take 4-8 min.
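As a sketch of what the two commands do on a tiny hypothetical sample (sample.txt and its contents are made up here just for illustration): plain sorting groups equal keys together, while -u keeps only one line per key. Since only the key is compared, which of the duplicate lines survives is not guaranteed.

$ cat sample.txt
foo1:barC
foo0:barB
foo1:barA
$ sort -t':' -k1,1 sample.txt
foo0:barB
foo1:barA
foo1:barC
$ sort -t':' -u -k1,1 sample.txt
foo0:barB
foo1:barA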

I am now trying to extract all duplicates based on the first column, and only the first column, to make sure all entries in the second column are matching.

I think I can achieve this with awk using the following code:

BEGIN { FS = ":" }
{
    count[$1]++

    # Remember the first line seen for each key.
    if (count[$1] == 1) {
        first[$1] = $0
    }

    # On the second occurrence, emit the stored first line...
    if (count[$1] == 2) {
        print first[$1]
    }

    # ...and print every occurrence after the first.
    if (count[$1] > 1) {
        print $0
    }
}


Running it with:

awk -f awk.dups input_sorted.txt > output_dup.txt
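On a small hypothetical sample (the file contents below are made up for illustration), every line whose first column occurs more than once gets printed, including the first occurrence of that key:

$ cat input_sorted.txt
fooA:bar1
fooB:bar2
fooB:bar3
fooC:bar4
$ awk -f awk.dups input_sorted.txt
fooB:bar2
fooB:bar3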


Now the problem is that this takes way too long (3+ hours and still not done). I know uniq can get all duplicates with something like:

uniq -D sorted_file.txt > output_dup.txt


The problem is specifying the delimiter and using only the first column. I know uniq has a -f N option to skip the first N fields. Is there a way to get these results without having to change/process my data? Is there another tool that could accomplish this? I have already used Python + pandas with read_csv to get the duplicates, but this leads to errors (segmentation fault), and it is not efficient since I shouldn't have to load all the data into memory when the data is already sorted. I have decent hardware:


  • i7-4700HQ

  • 16GB RAM

  • 256GB Samsung 850 Pro SSD



Anything that can help is welcome,
Thanks.

SOLUTION FROM BELOW

Using:

awk -F: '{if(p!=$1){p=$1; c=0; p0=$0} else c++} c==1{print p0} c'


with the time command, I get the following performance:

real 0m46.058s
user 0m40.352s
sys 0m2.984s

Answer

If your file is already sorted, you don't need to store more than one line. Try this:

$ awk -F: '{if(p!=$1){p=$1; c=0; p0=$0} else c++} c==1{print p0} c' sorted.input
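For anyone unpacking the one-liner, here is the same logic written out with comments (a readability sketch, equivalent behavior):

awk -F: '
{
    if (p != $1) {       # new key: remember it, reset the counter,
        p  = $1          # and stash this first line for the key
        c  = 0
        p0 = $0
    } else {
        c++              # another line with the same key
    }
}
c == 1 { print p0 }      # second occurrence of a key: emit the stored first line
c      { print }         # emit every occurrence after the first
' sorted.input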

If you try this, please post the timings...