Komal Rathi Komal Rathi - 3 months ago 9
Python Question

Python: Merge two CSV files to multilevel JSON

I am very new to Python/JSON so please bear with me on this. I could do this in R but we need to use Python so as to transform this to Python/Spark/MongoDB. Also, I am just posting a minimal subset - I have a couple more file types and so if anyone can help me with this, I can build upon that to integrate more files and file types:

Getting back to my problem:

I have two tsv input files that I need to merge and convert to JSON. Both the files have gene and sample columns plus some additional columns. However, the

gene
and
sample
may or may not overlap like I have shown - f2.tsv has all genes in f1.tsv but also has an additional gene
g3
. Similarly, both files have overlapping as well as non-overlapping values in
sample
column.

# f1.tsv – has gene, sample and additional column other1

$ cat f1.tsv
gene sample other1
g1 s1 a1
g1 s2 b1
g1 s3a c1
g2 s4 d1

# f2.tsv – has gene, sample and additional columns other21, other22

$ cat f2.tsv
gene sample other21 other22
g1 s1 a21 a22
g1 s2 b21 b22
g1 s3b c21 c22
g2 s4 d21 d22
g3 s5 f21 f22


The gene forms the top level, each gene has multiple samples which form the second level and the additional columns form the
extras
which is the third level.
The extras are divided into two because one file has
other1
and the second file has
other21
and
other22
. The other files that I will include later will have other fields like
other31
and
other32
and so on but they will still have the gene and sample columns.

# expected output – JSON by combining both tsv files.
$ cat output.json
[{
"gene":"g1",
"samples":[
{
"sample":"s2",
"extras":[
{
"other1":"b1"
},
{
"other21":"b21",
"other22":"b22"
}
]
},
{
"sample":"s1",
"extras":[
{
"other1":"a1"
},
{
"other21":"a21",
"other22":"a22"
}
]
},
{
"sample":"s3b",
"extras":[
{
"other21":"c21",
"other22":"c22"
}
]
},
{
"sample":"s3a",
"extras":[
{
"other1":"c1"
}
]
}
]
},{
"gene":"g2",
"samples":[
{
"sample":"s4",
"extras":[
{
"other1":"d1"
},
{
"other21":"d21",
"other22":"d22"
}
]
}
]
},{
"gene":"g3",
"samples":[
{
"sample":"s5",
"extras":[
{
"other21":"f21",
"other22":"f22"
}
]
}
]
}]


How do convert two csv files to a single - multi level JSON based on two common columns?

I would really appreciate any help that I can get on this.

Thanks!

Answer

Here's another option. I tried to make it easy to manage when you start adding more files. You can run on the command line and provide arguments, one for each file you want to add in. Gene/sample names are stored in dictionaries to improve efficiency. The formatting of your desired JSON object is done in each class' format() method. Hope this helps.

import csv, json, sys

class Sample(object):
    def __init__(self, name, extras):
        self.name = name
        self.extras = [extras]

    def format(self):
        map = {}
        map['sample'] = self.name
        map['extras'] = self.extras
        return map

    def add_extras(self, extra):
        #always just add the new extras to the list
        self.extras.append(extra)

class Gene(object):
    def __init__(self, name, samples):
        self.name = name
        self.samples = samples

    def format(self):
        map = {}
        map ['gene'] = self.name
        map['samples'] = sorted([self.samples[sample_key].format() for sample_key in self.samples], key=lambda sample: sample['sample'])
        return map

    def create_or_add_samples(self, new_samples):
        # loop through new samples, seeing if they already exist in the gene object
        for sample_name in new_samples:
            sample = new_samples[sample_name]
            if sample.name in self.samples:
                self.samples[sample.name].add_extras(sample.extras)
            else:
                self.samples[sample.name] = sample

class Genes(object):
    def __init__(self):
        self.genes = {}

    def format(self):
        return sorted([self.genes[gene_name].format() for gene_name in self.genes], key=lambda gene: gene['gene'])

    def create_or_add_gene(self, gene):
        if not gene.name in self.genes:
            self.genes[gene.name] = gene
        else:
            self.genes[gene.name].create_or_add_samples(gene.samples)

def row_to_gene(headers, row):
    gene_name = ""
    sample_name = ""
    extras = {}
    for value in enumerate(row):
        if headers[value[0]] == "gene":
            gene_name = value[1]
        elif headers[value[0]] == "sample":
            sample_name = value[1]
        else:
            extras[headers[value[0]]] = value[1]
    sample_dict = {}
    sample_dict[sample_name] = Sample(sample_name, extras)
    return Gene(gene_name, sample_dict)

if __name__ == '__main__':
    delim = "\t"
    genes = Genes()
    files = sys.argv[1:]

    for file in files:
        print("Reading " + str(file))
        with open(file,'r') as f1:
            reader = csv.reader(f1, delimiter=delim)
            headers = []
            for row in reader:
                if len(headers) == 0:
                    headers = row
                else:
                    genes.create_or_add_gene(row_to_gene(headers, row))

    result = json.dumps(genes.format(), indent=4)
    print(result)
    with open('json_output.txt', 'w') as output:
        output.write(result)