Robin Robin - 10 months ago 52
Bash Question

How to parse a huge JSON file using a bash command?

I need to process a large JSON file to extract information from it, but unfortunately I don't have any experience with this. Could someone help me with it?
Here is what my file looks like:

"diagnoses": [
  {
    "classification_of_tumor": "not reported",
    "last_known_disease_status": "not reported",
    "updated_datetime": "2016-05-16T11:00:32.695517-05:00",
    "primary_diagnosis": "c50.9",
    "submitter_id": "TCGA-AN-A0FD_diagnosis",
    "tumor_stage": "stage iia",
    "age_at_diagnosis": 26007.0,
    "vital_status": "alive",
    "morphology": "8500/3",
    "days_to_death": null,
    "days_to_last_known_disease_status": null,
    "days_to_last_follow_up": 196.0,
    "state": null,
    "days_to_recurrence": null,
    "diagnosis_id": "9b0c5d28-5bd6-536f-8cfb-1e96044bce38",
    "tumor_grade": "not reported",
    "tissue_or_organ_of_origin": "c50.9",
    "days_to_birth": -26007.0,
    "progression_or_recurrence": "not reported",
    "prior_malignancy": "not reported",
    "site_of_resection_or_biopsy": "c50.9",
    "created_datetime": null
  }
],
"case_id": "c6086936-7544-4da0-8c0c-114166848483",
"demographic": {
  "updated_datetime": "2016-05-16T11:00:32.695517-05:00",
  "created_datetime": null,
  "gender": "female",
  "state": null,
  "submitter_id": "TCGA-AN-A0FD_demographic",
  "year_of_birth": 1939,
  "race": "white",
  "demographic_id": "423c153c-77d7-5e97-ae64-11442d5ba4f8",
  "ethnicity": "not hispanic or latino",
  "year_of_death": null
},
"exposures": [
  {
    "cigarettes_per_day": null,
    "weight": null,
    "updated_datetime": "2016-05-16T11:00:32.695517-05:00",
    "alcohol_history": null,
    "alcohol_intensity": null,
    "bmi": null,
    "years_smoked": null,
    "height": null,
    "created_datetime": null,
    "state": null,
    "exposure_id": "0abf6770-e176-523e-a94e-66d779c58e69",
    "submitter_id": "TCGA-AN-A0FD_exposure"
  }
]

The output I'm interested in is a text file with two columns, where column one is tumor_stage and column two is case_id. For this example it would look like:

stage iia c6086936-7544-4da0-8c0c-114166848483
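
To make the mapping from record to output line concrete, here is a minimal sketch. It assumes the top level of the file is a list of case records, each shaped like the excerpt above; the `sample` string and the `to_rows` helper are hypothetical names used only for illustration.

```python
import json

# Hypothetical one-record sample, trimmed to the two fields of interest.
# Assumption: the real file's top level is a list of such case records.
sample = '''
[
  {
    "diagnoses": [{"tumor_stage": "stage iia"}],
    "case_id": "c6086936-7544-4da0-8c0c-114166848483"
  }
]
'''


def to_rows(records):
    """Return one (tumor_stage, case_id) pair per case record."""
    # Only the first entry of each "diagnoses" list is read here.
    return [(r["diagnoses"][0]["tumor_stage"], r["case_id"]) for r in records]


rows = to_rows(json.loads(sample))
for stage, case_id in rows:
    # Print the two columns separated by a single space.
    print(stage, case_id)
```

Records with an empty "diagnoses" list or a missing key would raise an error in this sketch, so real data may need extra checks.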


Here is a Python program that performs the conversion you ask for. Copy this code into a file called, for example, "". You can then run the program like so:

python my_existing_file.json my_new_file.txt

Here is the program:

import argparse
import json

# Get filenames from user
parser = argparse.ArgumentParser()
parser.add_argument(
    'input', type=argparse.FileType('r'), help="input JSON filename")
parser.add_argument(
    'output', type=argparse.FileType('w'), help="output 2-col text filename")
args = parser.parse_args()

# Read data in
data = json.load(args.input)

# Convert data to abstracted format
data = [[d["diagnoses"][0]["tumor_stage"], d["case_id"]] for d in data]

# Write the data out:
for d in data: