Bash Question

Get unique values of every column from a gz file

I have a gz file and I want to extract the unique values from each column of the file. The field separator is |. I tried using Python as below.

import sys,os,csv,gzip
from sets import Set
ig = 0
max_d = 1
with gzip.open("fundamentals.20170724.gz","rb") as f:
    reader = csv.reader(f,delimiter="|")
    for i in range(0,400):
        unique = Set()
        print "Unique_value for column "+str(i+1)
        flag = 0
        for line in reader:
            try:
                unique.add(line[i])
                max_d +=1
                if len(unique) >= 10:
                    print unique
                    flag = 1
                    break
            except:
                continue
        if flag == 0: print unique


It works after a fashion, but I don't find it efficient for large files, so I'm looking at this problem from a bash point of view instead.

Is there a shell script solution?

For example, the data in my file looks like this:

5C4423,COMP,ISIN,CA2372051094,2016-04-19,
41C528,COMP,ISIN,US2333774071,2000-01-01,
B62545,COMP,ISIN,NL0000344265,2000-01-01,2007-05-11
9E7F41,COMP,ISIN,CA39260W1023,2013-02-13,2013-08-09
129DC8,COMP,ISIN,US37253A1034,2012-09-07,
4DE8CD,COMP,ISIN,QA000A0NCQB1,2008-03-06,


and I want all the unique values from each column.

Answer Source

With the gunzipped file, you could do:

awk -F, 'END { for (i=1;i<=NF;i++) { print "cut -d\",\" -f "i" filename | sort -u" } }' filename | sh

Set the field separator to , and then, for each field in the file, construct a cut command piped through sort -u (plain uniq only removes adjacent duplicates, so the values need sorting first), finally piping the whole awk output through sh. Spawning cut, sort and sh for every column will slow things down and there is probably a more efficient way, but it's worth a go.
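If you would rather not gunzip the file first, a single-pass variant that reads the compressed file directly is sketched below. This is an alternative to the cut/sort approach above, not the same method; it assumes the separator really is | as stated in the question (use -F',' for the comma-separated sample) and it keeps every distinct value in memory, so it suits files whose columns hold a modest number of distinct values.

zcat fundamentals.20170724.gz | awk -F'|' '
  {
    if (NF > maxf) maxf = NF                # remember the widest row seen
    for (i = 1; i <= NF; i++)
      if (!seen[i, $i]++)                   # first occurrence of this value in column i
        vals[i] = vals[i] $i " "
  }
  END {
    for (i = 1; i <= maxf; i++)
      print "Unique values for column " i ": " vals[i]
  }
'

Each distinct value is printed once per column, in first-seen order; swap the string concatenation for separate print statements if you prefer one value per line.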
