user3692521 user3692521 - 6 months ago 9
Bash Question

How to Convert two columns of CSV files to consecutive integers?

Hello say I have this file file1.csv and it has 2 columns a and b which are both 22 char strings. It looks like something like this:

hWcYwgRKOD77hfm1oKE0IA,5HleiJXMsFkGEsr8Jqr3Ug
hWcYwgRKOD77hfm1oKE0IA,rCDlYd2WHJuiT05sYGxaVA
65q0c2Iw03B8eSuHHTETHw,G40NUD0/op+13yjzBw+hrw
65q0c2Iw03B8eSuHHTETHw,1u8UW/cQ4i1vbSF9wvzu3w
...


And I would like to convert the a, b columns into consecutive integers like:

1,1
1,2
2,3
2,4


Does anyone know how can I do it? I am using Ubuntu 12.04 by the way

And what if I have another file file2.csv with column a' and b'. And is there any way to do the same thing to file2 and if "hWcYwgRKOD77hfm1oKE0IA" is 1 in file1 then "hWcYwgRKOD77hfm1oKE0IA" is 1 in file2 if it appears. Same for column b and b'. And I would like to have single output from those two files: result1.csv and result2.csv

Answer
awk -F, -v OFS=, '{ if ($1 in a) { $1 = a[$1] } else { $1 = a[$1] = ++x } 
                    if ($2 in b) { $2 = b[$2] } else { $2 = b[$2] = ++y } } 1' file

Or perhaps simpler but may be less efficient:

awk -F, -v OFS=, '!($1 in a) { a[$1] = ++x } { $1 = a[$1] }
                  !($2 in b) { b[$2] = ++y } { $2 = b[$2] } 1' file

Or dynamic to any number of columns:

awk -F, -v OFS=, '{ for (i = 1; i <= NF; ++i)
                        if ((i, $i) in a) { $i = a[i, $i] }
                                     else { $i = a[i, $i] = ++x[i] } } 1' file

Which is also similar to

awk -F, -v OFS=, '{ for (i = 1; i <= NF; ++i) {
                    if (!((i, $i) in a)) a[i, $i] = ++x[i]
                    $i = a[i, $i] } } 1' file

Output:

1,1
1,2
2,3
2,4

UPDATE

To apply on two files, try:

awk -F, -v OFS=, '{ if ($1 in a) { $1 = a[$1] } else { $1 = a[$1] = ++x } 
                    if ($2 in b) { $2 = b[$2] } else { $2 = b[$2] = ++y } 
                    print > "result_" FILENAME }' file1 file2

UPDATE 02

awk -F, -v OFS=, '!($1 in a) { a[$1] = ++x } !($2 in b) { b[$2] = ++y }
                  { print $1, $2, a[$1], b[$2] }' file

Output:

hWcYwgRKOD77hfm1oKE0IA,5HleiJXMsFkGEsr8Jqr3Ug,1,1
hWcYwgRKOD77hfm1oKE0IA,rCDlYd2WHJuiT05sYGxaVA,1,2
65q0c2Iw03B8eSuHHTETHw,G40NUD0/op+13yjzBw+hrw,2,3
65q0c2Iw03B8eSuHHTETHw,1u8UW/cQ4i1vbSF9wvzu3w,2,4

File by file version:

awk -F, -v OFS=, '!($1 in a) { a[$1] = ++x } !($2 in b) { b[$2] = ++y }
                  { print $1, $2, a[$1], b[$2] > "result_" FILENAME }' file1 file2
Comments