Bash Question

Fastest way to extract a column and then find its uniq items in a large delimited file

Hoping for help. I have a 3 million line file, data.txt, delimited with "|", e.g.,


I need to extract the 3rd column ("DE") and then limit it to its unique values. Here is what I've come up with (gawk and gsort, since I'm running macOS and only have the "--parallel" option via GNU sort):

gawk -F "|" '{print $3}' data.txt \
| gsort --parallel=4 -u > countries.uniq

This works, but it isn't very fast. I have similar tasks coming up with some even larger (11M record) files, so I'm wondering if anyone can point out a faster way.

I hope to stay in the shell rather than, say, Python, because some of the related processing is much easier done in shell.

Many thanks!


awk is tailor-made for such tasks. Here is a minimal awk script that should do the trick for you:

awk -F"|" '!($3 in arr){print} {arr[$3]++} END{ for (i in arr) print i}' logFile

The logic: as awk processes each line, it adds an entry for the value of $3 only if it has not seen that value before. The command above prints the unique lines (the first line seen for each $3 value) followed by the unique entries from $3.
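As a quick sanity check, here is that command on a tiny made-up sample (the contents of sample.txt are hypothetical; note that for-in traversal order in the END block is unspecified in awk, so the unique entries can come out in any order):

```shell
# Hypothetical 4-line sample; the real data.txt has 3M rows
printf '%s\n' 'a|b|DE' 'c|d|FR' 'e|f|DE' 'g|h|US' > sample.txt

# Prints the first line seen for each $3 value, then every distinct $3 value
awk -F"|" '!($3 in arr){print} {arr[$3]++} END{ for (i in arr) print i}' sample.txt
```

Six lines come out: the three first-occurrence lines plus the three distinct codes DE, FR, US.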

If you want the unique lines only, you can drop the END clause:

awk -F"|" '!($3 in arr){print} {arr[$3]++}' logFile > uniqueLinesOnly
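On the same kind of hypothetical sample (file name and values made up), this keeps just the first line seen for each country code:

```shell
# Hypothetical sample; 'e|f|DE' repeats the DE code and will be dropped
printf '%s\n' 'a|b|DE' 'c|d|FR' 'e|f|DE' 'g|h|US' > sample.txt

# Only the first line per distinct $3 value survives
awk -F"|" '!($3 in arr){print} {arr[$3]++}' sample.txt
# a|b|DE
# c|d|FR
# g|h|US
```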

If you want only the unique values from the file, remove the inner print:

awk -F"|" '!($3 in arr){arr[$3]++} END{ for (i in arr) print i}' logFile > uniqueEntriesOnly
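Since the for-in order of the END loop is unspecified, pipe the result through sort if you need deterministic output; on a made-up sample that looks like this:

```shell
# Hypothetical sample with three distinct codes in column 3
printf '%s\n' 'a|b|DE' 'c|d|FR' 'e|f|DE' 'g|h|US' > sample.txt

# Distinct $3 values; only the handful of unique entries get sorted, not all lines
awk -F"|" '!($3 in arr){arr[$3]++} END{ for (i in arr) print i}' sample.txt | sort
# DE
# FR
# US
```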

You will see how fast it is even on an 11M-record file, since awk makes a single pass and never has to sort the full input. You can write the output to a new file using the redirect operator, as shown above.
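To measure the speedup yourself, here is a rough benchmark sketch on synthetic data (the file name big.txt, the row count, and the country codes are all made up; absolute timings will vary by machine and locale):

```shell
# Generate a synthetic 200k-row file with 5 distinct codes in column 3
awk 'BEGIN{split("DE FR US JP NL", cc, " "); srand(7)
     for (i = 0; i < 200000; i++) printf "x|y|%s\n", cc[int(rand()*5)+1]}' > big.txt

# Approach 1 (the question): extract column 3, then sort -u the whole stream
time awk -F"|" '{print $3}' big.txt | sort -u > by_sort.txt

# Approach 2 (this answer): deduplicate in awk, sort only the 5 distinct values
time awk -F"|" '!($3 in arr){arr[$3]++} END{ for (i in arr) print i}' big.txt | sort > by_awk.txt
```

Both output files should be identical; the difference is that the second approach only ever sorts the distinct values, which is where the time goes on large inputs.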