ggupta ggupta - 3 years ago 360
Bash Question

Get unique values of every column from a gz file

I have a gz file and I want to extract the unique values from each column; the field separator is |. I tried using Python as below.

import csv
import gzip

# Collect the unique values of every column in a single pass over the file.
# gzip.open in text mode lets csv.reader consume the stream directly.
uniques = {}
with gzip.open("fundamentals.20170724.gz", "rt") as f:
    reader = csv.reader(f, delimiter="|")
    for row in reader:
        for i, value in enumerate(row):
            uniques.setdefault(i, set()).add(value)

for i in sorted(uniques):
    print("Unique values for column %d" % (i + 1))
    print(uniques[i])

This doesn't seem efficient for large files. It works, more or less, but I'd like to approach the problem from a bash point of view.

Is there a shell script solution?

For example, the data in my file looks like this:


and I want all unique values from each column.

Answer Source

With the gunzipped file, you could do:

awk -F'|' 'END { for (i=1;i<=NF;i++) { print "cut -d\"|\" -f "i" filename | sort -u" } }' filename | sh

Set the field separator to | and then, for each field in the file, construct a cut command piped through sort -u (plain uniq only removes adjacent duplicates, so the values must be sorted first), finally piping the whole awk output through sh. Spawning cut, sort and sh once per column will slow things down and there is probably a more efficient way, but it's worth a go.
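A more efficient alternative is a single pass with awk alone, keeping a hash of (column, value) pairs already seen. This is a sketch; the printf line builds a small sample file standing in for the real fundamentals.20170724.gz.

```shell
# Build a tiny |-separated sample and gzip it (placeholder for the real file).
printf 'a|1|x\nb|1|y\na|2|x\n' | gzip > sample.gz

# One pass: remember each (column, value) pair the first time it appears,
# then print the accumulated unique values per column at the end.
gunzip -c sample.gz | awk -F'|' '
    { for (i = 1; i <= NF; i++) if (!seen[i, $i]++) vals[i] = vals[i] " " $i }
    END { for (i = 1; i <= NF; i++) print "column " i ":" vals[i] }
'
# column 1: a b
# column 2: 1 2
# column 3: x y
```

This avoids re-reading the file once per column, at the cost of holding all distinct values in memory.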
