Jake Jake - 1 year ago 43
Perl Question

What is the most efficient way to find rows of a CSV not containing duplicate entries across the fields of that row (excluding blank entires)?

I'm looking to find all of the rows of a CSV file that contain the duplicate data in two or more of the fields of that row (i.e. find all of the rows without unique data in each field.)

For example, I have the following CSV file:


I would want the following rows to be printed:


These rows are printed because a the data in one field of these rows occurs in another field. Note that "Mary,,,21" was not printed even though it has duplicate empty fields.

I could write a Python script and keep a count of the times each entry occurs in each row, but it seems like there must be a better way to do this.

Answer Source

Using perl:

perl -F, -lane 'my %s; print if grep { $s{$_}++ } @F'


  • -F, to set field seperator to ,
  • -l to automatically handle linefeeds
  • -a to autosplit
  • -n to wrap it in a while ( <> ) { loop.
  • -e to specify code to exec.

Incoming data is autosplit on , into @F and we use a %s hash to spot if there's a dupe.

If - based on your comment - you need to skip empty fields (which this would count as dupes):

perl -F, -lane 'my %s; print if grep { /./ ? $s{$_}++ : () } @F'

This includes a ternary operator to test if a field is empty.

Testing with Windows (which isn't quite the same, because of quotes):

C:\Users\me>perl -F, -lane "my %s; print qq{line matches:$_} if grep { /./ ? $s{$_}++ : () } @F"
line matches:John,Smith,Smith,21
line matches:John,42,42,42