JMKS JMKS - 4 months ago 7
Perl Question

Kill doubles within one line, from CSV file

Input file is from group_contacenated SQL query output, where there were some duplicate values. It is already used with DISTINCT, but it was not enough, as only some substrings were identical.

So, the row I'm interested in, is the 9th column.
The idea is, to print on one line only non-duplicate IAB categories.

Sample from that file:

148422,0.72499999999999998,0.72499999999999998,0.72500000000165021,wpolityce.pl,300x250,standard,3,"IAB3;IAB11;IAB17;IAB12;IAB9;IAB15;IAB23,IAB3;IAB11;IAB17;IAB12;IAB9;IAB13;IAB23,IAB3;IAB11;IAB12;IAB9"
118243,0.72499999999999998,0.72499999999999998,0.72500000000058573,wpolityce.pl,728x90,standard,3,"IAB3;IAB11;IAB1;IAB12;IAB13;IAB23,IAB3;IAB11;IAB12;IAB13;IAB23,IAB3;IAB11;IAB12;IAB9"
118243,0.72499999999999998,0.72499999999999998,0.72500000000058573,wpolityce.pl,750x100,standard,3,"IAB3;IAB11;IAB1;IAB12;IAB13;IAB23,IAB3;IAB11;IAB12;IAB13;IAB23,IAB3;IAB11;IAB12;IAB9"
118243,0.72499999999999998,0.72499999999999998,0.72500000000058573,wpolityce.pl,750x200,standard,3,"IAB3;IAB11;IAB1;IAB12;IAB13;IAB23,IAB3;IAB11;IAB12;IAB13;IAB23,IAB3;IAB11;IAB12;IAB9"
118243,0.72499999999999998,0.72499999999999998,0.72500000000058573,wpolityce.pl,750x300,standard,3,"IAB3;IAB11;IAB1;IAB12;IAB13;IAB23,IAB3;IAB11;IAB12;IAB13;IAB23,IAB3;IAB11;IAB12;IAB9"


I want to delete duplicate IAB categories, so for the first line it would go like this:

148422,0.72499999999999998,0.72499999999999998,0.72500000000165021,wpolityce.pl,300x250,standard,3,"IAB3;IAB11;IAB17;IAB12;IAB9;IAB15;IAB23;IAB13;IAB23"


In my SQL query, I have something like this:

SELECT GROUP_CONCAT(DISTINCT foo) FROM t;


Now the foo-column can contain values along of these lines:

foo
bar
qrr
foo;bar
foo;qrr
foo
foo;qrr
bar
qrr
foo


Concatenating those values using
DISTINCT
will remove all direct duplicates. Split up, that's as follows:

foo
bar
qrr
foo;bar
foo;qrr


I am interested in individual values (
foo
,
bar
and
qrr
). If the delimiter used to concat is
;
it will appear as if not all duplicates have been removed.

The final output in that column after it has been concatenated with
;
should be:

foo;bar;baz


How do I remove those duplicates?

I tried to do it, but alas I'm not that advanced in AWK etc.

I'm working with Bash, althought I can do it also "one step earlier", in SQLite.

Answer

As long as the column to be processed is always the only one in double quotes, and it is okay to replace all separators with semicolons, this will do as you ask

use strict;
use warnings 'all';

use List::Util 'uniq';

while ( <> ) {
    s{ " ([^"]+) " }{ '"' . join(';', uniq $1 =~ /\w+/g) . '"' }ex;
    print;
}

output

148422,0.72499999999999998,0.72499999999999998,0.72500000000165021,wpolityce.pl,300x250,standard,3,"IAB3;IAB11;IAB17;IAB12;IAB9;IAB15;IAB23;IAB13"
118243,0.72499999999999998,0.72499999999999998,0.72500000000058573,wpolityce.pl,728x90,standard,3,"IAB3;IAB11;IAB1;IAB12;IAB13;IAB23;IAB9"
118243,0.72499999999999998,0.72499999999999998,0.72500000000058573,wpolityce.pl,750x100,standard,3,"IAB3;IAB11;IAB1;IAB12;IAB13;IAB23;IAB9"
118243,0.72499999999999998,0.72499999999999998,0.72500000000058573,wpolityce.pl,750x200,standard,3,"IAB3;IAB11;IAB1;IAB12;IAB13;IAB23;IAB9"
118243,0.72499999999999998,0.72499999999999998,0.72500000000058573,wpolityce.pl,750x300,standard,3,"IAB3;IAB11;IAB1;IAB12;IAB13;IAB23;IAB9"