Bhishan Poudel Bhishan Poudel - 5 months ago 52
Python Question

one way to merge two files with same "column name" and "different rows" using pandas in python

I have two datafiles

a.csv
and
b.csv
which can be obtained from pastebin:
http://pastebin.com/nzjXESYn

http://pastebin.com/PDV5Ah64

First file
a.csv
has 4 columns and some comments:

# coating file for detector A/R
# column 1 is the angle of incidence (degrees)
# column 2 is the wavelength (microns)
# column 3 is the transmission probability
# column 4 is the reflection probability
14.2 531.0 0.0618 0.9382
14.2 532.0 0.07905 0.92095
14.2 533.0 0.09989 0.90011
14.2 534.0 0.12324 0.87676
14.2 535.0 0.14674 0.85326
14.2 536.0 0.16745 0.83255
14.2 537.0 0.1837 0.8163
#
# 171 lines, 5 comments, 166 data


Second file b.csv has two columns with one common column with different number of rows:

# Version 2.0 - nm, norm@500 to 1, burrows+2006c91.21_T1350_g4.7_f100_solar
# Wavelength(nm) Flambda(ergs/cm^s/s/nm)
300.0 1.53345164121e-32
300.1 1.53345164121e-32
300.2 1.53345164121e-32

# total lines = 20003, comment lines = 2, data lines = 20001


Now, I want to merge these two files with 2nd column common (wavelength should be same in both files).

The output looks like:

# coating file for detector A/R
# column 1 is the angle of incidence (degrees)
# column 2 is the wavelength (microns)
# column 3 is the transmission probability
# column 4 is the reflection probability
# Version 2.0 - nm, norm@500 to 1, burrows+2006c91.21_T1350_g4.7_f100_solar
# Wavelength(nm) Flambda(ergs/cm^s/s/nm)
14.2 531.0 0.0618 0.9382 1.14325276212
14.2 532.0 0.07905 0.92095 1.14557732058


Note: The comments are also merged.

In file
b.csv
the wavelength is in line number = 2313.

How can we do so in python?

My initial attempt is this:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Author : Bhishan Poudel
# Date : Jun 17, 2016


# Imports
from __future__ import print_function
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


# read in dataframes
#======================================================================
# read in a file
#
infile = 'a.csv'
colnames = ['angle', 'wave','trans','refl']
print('{} {} {} {}'.format('\nreading file : ', infile, '','' ))
df1 = pd.read_csv(infile,sep='\s+', header = None,skiprows = 0,
comment='#',names=colnames,usecols=(0,1,2,3))

print('{} {} {} {}'.format('df.head \n', df1.head(),'',''))
#------------------------------------------------------------------


#======================================================================
# read in a file
#
infile = 'b.csv'
colnames = ['wave', 'flux']
print('{} {} {} {}'.format('\nreading file : ', infile, '','' ))
df2 = pd.read_csv(infile,sep='\s+', header = None,skiprows = 0,
comment='#',names=colnames,usecols=(0,1))
print('{} {} {} {}'.format('df.head \n', df2.head(),'','\n'))
#----------------------------------------------------------------------


result = df1.append(df2, ignore_index=True)
print(result.head())
print("\n")


Some useful links are following:

How to merge data frame with same column names

http://pandas.pydata.org/pandas-docs/stable/merging.html

Answer

If you want to merge the two datasets, you should use.merge() method, rather than .append().

result = pd.merge(df1,df2,on='wave')

The former joins two dataframes (similar to a SQL join), while the latter stacks the two dataframes on top of one another.