user1017373 user1017373 - 2 months ago 6
Python Question

Mapping multiple dataframe based on the matching columns

I have 25 data frames which I need to merge and find recurrently occurring rows from all 25 data frames,
For example, my data frame looks like following,

df1
chr start end name
1 12334 12334 AAA
1 2342 2342 SAP
2 3456 3456 SOS
3 4537 4537 ABR
df2
chr start end name
1 12334 12334 DSF
1 3421 3421 KSF
2 7689 7689 LUF
df3
chr start end name
1 12334 12334 DSF
1 3421 3421 KSF
2 4537 4537 LUF
3 8976 8976 BAR
4 6789 6789 AIN


And In the end, I am aiming to have an output data frame like following,

chr start end name Sample
1 12334 12334 AAA df1
1 12334 12334 AAA df2
1 12334 12334 AAA df3


I can get there with the following solution,
By dictionary which adds all these three data frames into one bigger data frame dfs

dfs = {'df1': df1, 'df2': df2}

Then further,

common_tups = set.intersection(*[set(df[['chr', 'start', 'end']].drop_duplicates().apply(tuple, axis=1).values) for df in dfs.values()])
pd.concat([df[df[['chr', 'start', 'end']].apply(tuple, axis=1).isin(common_tups)].assign(Sample=name) for (name, df) in dfs.items()])


This gives out the resulting data frame with matching rows from all three data frames, but I have 25 data frames which I am calling as list from the directory as following,

path = 'Fltered_vcfs/'
files = os.listdir(path)
results = [os.path.join(path,i) for i in files if i.startswith('vcf_filtered')]


And so how can I show the list 'results' in the dictionary and proceed further to get the desired output. Any help or suggestions are greatly appreciated.

Thank you

Answer

Using the glob module, you can use

import os
from glob import glob

path = 'Fltered_vcfs' 
f_names = glob(os.path.join(path, 'vcf_filtered*.*')) 

Then, your dictionary can be created with dictionary comprehension using

import pandas as pd

 {os.path.splitext(os.path.split(f_name)[1])[0]: pd.read_csv(f_name,sep='\t') for f_name in f_names}