Philippe Fisher Philippe Fisher - 1 year ago 79
Linux Question

Get list of all duplicates based on first column within large text/csv file in linux/ubuntu

I am trying to extract all the duplicates based on the first column/index of my very large text/csv file (7+ GB / 100+ Million lines). Format is like so:


first column is any lowercase utf-8 string and the second column is any utf-8 string. I have been able to sort my file based on the first column and only the first column with:

sort -t':' -k1,1 filename.txt > output_sorted.txt

I have also been able to drop all duplicates with:

sort -t':' -u -k1,1 filename.txt > output_uniq_sorted.txt

These operations take 4-8 min.

I am now trying to extract all duplicates based on the first column and only the first column, to make sure all entries in the second columns are matching.

I think I can achieve this with
with this code:

BEGIN { FS = ":" }

if (count[$1] == 1){
first[$1] = $0;

if (count[$1] == 2){
print first[$1];

if (count[$1] > 1){
print $0;

running it with:

awk -f awk.dups input_sorted.txt > output_dup.txt

Now the problem is this takes way to long 3+hours and not yet done. I know
can get all duplicates with something like:

uniq -D sorted_file.txt > output_dup.txt

The problem is specifying the delimiter and only using the first column. I know
has a
-f N
to skip the first
fields. Is there a way to get these results without having to change/process my data? Is there another tool the could accomplish this? I have already used python + pandas with read_csv and getting the duplicates but this leads to errors (segmentation fault) and this is not efficient since I shouldn't have to load all the data in memory since the data is sorted. I have decent hardware

  • i7-4700HQ

  • 16GB ram

  • 256GB ssd samsung 850 pro

Anything that can help is welcome,



awk -F: '{if(p!=$1){p=$1; c=0; p0=$0} else c++} c==1{print p0} c'

with the command
I get the following performance.

real 0m46.058s
user 0m40.352s
sys 0m2.984s

Answer Source

If your file is already sorted you don't need to store more than one line, try this

$ awk -F: '{if(p!=$1){p=$1; c=0; p0=$0} else c++} c==1{print p0} c' sorted.input

If you try this please post the timings...

Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download