user3150815 - 5 days ago
Bash Question

Fastest way to extract a column and then find its uniq items in a large delimited file

Hoping for help. I have a 3-million-line file, data.txt, delimited with "|", e.g.:

"4"|"GESELLSCHAFT FUER NUCLEONIC & ELECT MBH"|"DE"|"0"
"5"|"IMPEX ESSEN VERTRIEB VON WERKZEUGEN GMBH"|"DE"|"0"


I need to extract the 3rd column ("DE") and then limit it to its unique values. Here is what I've come up with (gawk and gsort because I'm running macOS and only have the "--parallel" option via GNU sort):

gawk -F "|" '{print $3}' data.txt \
| gsort --parallel=4 -u > countries.uniq


This works, but it isn't very fast. I have similar tasks coming up with some even larger (11M record) files, so I'm wondering if anyone can point out a faster way.

I hope to stay in shell, rather than say, Python, because some of the related processing is much easier done in shell.

Many thanks!

Answer

awk is tailor-made for such tasks. Here is a minimal awk one-liner that should do the trick for you.

awk -F"|" '!($3 in arr){print} {arr[$3]++} END{ for (i in arr) print i}' logFile

The logic: as awk processes each line, it records the value in $3 in the array arr only if it has not seen that value before. The command above prints each line whose $3 value occurs for the first time, and then, at the end, the list of unique $3 values.
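
For instance, on the two sample lines from the question (saved as data.txt), a run should look roughly like this: the first line prints because its $3 value ("DE") is new, the second line is skipped, and the END block then prints the single unique value.

$ awk -F"|" '!($3 in arr){print} {arr[$3]++} END{ for (i in arr) print i}' data.txt
"4"|"GESELLSCHAFT FUER NUCLEONIC & ELECT MBH"|"DE"|"0"
"DE"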

If you want the unique lines only, drop the END block:

awk -F"|" '!($3 in arr){print} {arr[$3]++}' logFile > uniqueLinesOnly

If you want only the unique $3 values from the file, remove the inline print:

awk -F"|" '!($3 in arr){arr[$3]++} END{ for (i in arr) print i}' logFile > uniqueEntriesOnly

You can see how fast it is on an 11M-record file. As shown above, you can write the output to a new file using the redirection operator.
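
To check the speed difference on your own data, you could time both approaches. A rough sketch, assuming the 3M-line file is data.txt and using a hypothetical output name countries.awk.uniq for the awk run:

# Original pipeline from the question:
time (gawk -F "|" '{print $3}' data.txt | gsort --parallel=4 -u > countries.uniq)

# Single-pass awk version (no sort step; output order is arbitrary):
time awk -F"|" '!($3 in arr){arr[$3]++} END{ for (i in arr) print i}' data.txt > countries.awk.uniq

The awk version reads the file once and keeps only the distinct $3 values in memory, so it avoids sorting millions of lines; if the column really holds country codes, the number of distinct values is small and it should be comfortably faster.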

Comments