novicebioinforesearcher - 27 days ago
Python Question

Creating a matrix using Python for biologists

I am asking this question on behalf of the many biologists and bioinformatics researchers who find it difficult to construct a matrix from their gene expression data. I tried googling for answers and was surprised that few of them address this problem in particular. I have asked the same question in the past, but the result was not executable. Here is the typical problem.

There are several files, each with one row per gene_id and columns holding a score and other meta-information. For example, sample1 will typically have 200000 rows:

gene_id score metainfo1 metainfo2
gene1 20 constitutive donor
gene2 30 alternative acceptor

For downstream analysis, biologists usually want to build a matrix: first collect all the gene_ids from all files and place them in column 1, then append the score from each file for each gene_id, adding a '0' where a score is not available. The score column for each file should be named after the file (the metainfo columns are optional, but sometimes they are required). Something like this:

gene_id score_sample1 score_sample2....score_samplen metainfo1 metainfo2

If anyone can contribute a step-by-step procedure using Python that can be applied dynamically, it will be of great help to biologists with limited programming knowledge.

unique_id col1 col2 col3 score col5 col6 col7 col8 col9 col10 col11 col12 col13 col14

I have 20 files with this layout and need to make a matrix (the col columns are metainfo) with just:

unique_id(from all files) score col3 col4 col7 col9 col14


bli
Answer Source

Suppose we have these two files:

$ cat sample1.txt 
gene_id score   metainfo1   metainfo2
gene1   20  constitutive    donor
gene2   30  alternative acceptor
$ cat sample2.txt 
gene_id score   metainfo1   metainfo2
gene1   20  constitutive    donor
gene3   30  alternative acceptor

You can read the data with pandas, keeping only the score column of each file (indexed by gene_id):

import pandas as pd
sample1 = pd.read_table("sample1.txt", index_col=0)["score"]
sample2 = pd.read_table("sample2.txt", index_col=0)["score"]

Merge them "horizontally" (axis=1) and change missing values to 0:

concatenated = pd.concat([sample1, sample2], axis=1).fillna(0)
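
To see why this works: pd.concat with axis=1 aligns the inputs on their index (here the gene_id), so a gene missing from one sample gets NaN in that column, and fillna(0) then replaces it. Here is a minimal sketch, assuming small in-memory Series stand in for the score columns of the two files above:

```python
import pandas as pd

# In-memory stand-ins for the "score" columns of sample1.txt and sample2.txt
# (an assumption for illustration; the real data comes from the files).
sample1 = pd.Series({"gene1": 20, "gene2": 30}, name="score")
sample2 = pd.Series({"gene1": 20, "gene3": 30}, name="score")

# axis=1 puts each Series in its own column; the rows are the union of gene ids.
aligned = pd.concat([sample1, sample2], axis=1)
missing_cells = int(aligned.isna().sum().sum())
print(missing_cells)  # genes absent from a sample show up as NaN here

# Replace the NaN cells with 0, as in the step above.
concatenated = aligned.fillna(0)
print(concatenated)
```

Note that the scores become floats, because NaN cannot be stored in an integer column.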

Set new column names:

concatenated.columns = ["score_sample1", "score_sample2"]

Now we can extract the meta-information (all rows, last two columns):

meta1 = pd.read_table("sample1.txt", index_col=0).iloc[:,-2:]
meta2 = pd.read_table("sample2.txt", index_col=0).iloc[:,-2:]

Merge them "vertically" (the default "axis" parameter is 0):

meta = pd.concat([meta1, meta2])

Remove duplicate lines (rows with a repeated gene_id; the first occurrence is kept):

meta = meta[~meta.index.duplicated(keep="first")]
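
The deduplication step relies on Index.duplicated, which returns a boolean array marking every repeat of an index label after its first occurrence. A small sketch on made-up rows (the frame below is an illustration, not data from the files) shows what the mask looks like:

```python
import pandas as pd

# Stacked meta-information as it would look after the vertical concat:
# gene1 appears twice because it is present in both samples.
meta = pd.DataFrame(
    {
        "metainfo1": ["constitutive", "alternative", "constitutive", "alternative"],
        "metainfo2": ["donor", "acceptor", "donor", "acceptor"],
    },
    index=["gene1", "gene2", "gene1", "gene3"],
)

# True marks a label that has already been seen earlier in the index.
mask = meta.index.duplicated(keep="first")
print(mask)

# Keep only the rows that are NOT repeats.
deduplicated = meta[~mask]
print(deduplicated)
```

If the same gene carries different meta-information in different files, only the values from the first file read are kept.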

Concatenate it "horizontally" to the scores:

concatenated = pd.concat([concatenated, meta], axis=1)

And we obtain this:

         score_sample1  score_sample2     metainfo1 metainfo2
gene1             20.0           20.0  constitutive     donor
gene2             30.0            0.0   alternative  acceptor
gene3              0.0           30.0   alternative  acceptor

Addendum (24/08/2017): With more files

Suppose you actually have 20 sample*.txt files.

You can probably generalize the above method by generating lists of DataFrames as follows:

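import pandas as pd
# File names sample1.txt ... sample20.txt
filenames = ["sample%d.txt" % n for n in range(1, 21)]
# One "score" Series per file, indexed by gene_id
samples = [pd.read_table(filename, index_col=0)["score"] for filename in filenames]
# Align on gene_id; genes missing from a file get 0
concatenated = pd.concat(samples, axis=1).fillna(0)
concatenated.columns = ["score_sample%d" % n for n in range(1, 21)]
# Meta-information: last two columns of every file
metas = [pd.read_table(filename, index_col=0).iloc[:,-2:] for filename in filenames]
meta = pd.concat(metas)
# Keep the first occurrence of each gene_id
meta = meta[~meta.index.duplicated(keep="first")]
concatenated = pd.concat([concatenated, meta], axis=1)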
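
If the files are not numbered so regularly, the same steps can be wrapped in a function that takes any list of paths and names each score column after its file. This is only a sketch: build_score_matrix and its parameters are hypothetical names, and it assumes tab-separated files whose first column is the gene id. The meta columns are selected by name, so a list such as ("col3", "col7", "col9") could be passed instead of the defaults.

```python
from pathlib import Path

import pandas as pd


def build_score_matrix(paths, score_col="score", meta_cols=("metainfo1", "metainfo2")):
    """Hypothetical helper: one score column per file, plus shared meta columns."""
    scores, metas = [], []
    for path in map(Path, paths):
        # Each file is read once; the first column (gene id) becomes the index.
        table = pd.read_csv(path, sep="\t", index_col=0)
        # Name the score column after the file, e.g. "score_sample1".
        scores.append(table[score_col].rename(f"{score_col}_{path.stem}"))
        metas.append(table[list(meta_cols)])
    # Align scores on gene id; genes missing from a file get 0.
    matrix = pd.concat(scores, axis=1).fillna(0)
    # Stack the meta columns and keep the first occurrence of each gene id.
    meta = pd.concat(metas)
    meta = meta[~meta.index.duplicated(keep="first")]
    return pd.concat([matrix, meta], axis=1)


# Usage (assumes the sample files sit in the current directory):
# matrix = build_score_matrix(sorted(Path(".").glob("sample*.txt")))
# matrix.to_csv("matrix.tsv", sep="\t")
```

Sorting the paths gives lexical order (sample10 comes before sample2), which only affects the order of the score columns, since each one is labelled with its file name.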