345243lkj 345243lkj - 2 months ago 14
Python Question

How to Combine CSV Files with Pandas (And Add Identifying Column)

How do I add multiple CSV files together and an extra column to indicate where each file came from?

So far I have:

import os
import pandas as pd
import glob

os.chdir('C:\...') # path to folder where all CSVs are stored
for f, i in zip(glob.glob('*.csv'), short_list):
df = pd.read_csv(f, header = None)
df.index = i * len(df)
dfs.append(df)

all_data = pd.concat(dfs, ignore_index=True)


It all works well, except for the identifying column.
i
is a list of
strings
that I want to put in column A of
all_data
. One string for every row of each column. Instead it returns a lot of numbers, and gives a
TypeError: Index(....) must be called witha collection of some kind
.

Expected output:

str1 file1entry1
str1 file1entry2
str1 file1entry3
str2 file2entry1
str2 file2entry2
str2 file2entry3


Where
short_list = ['str1', 'str2', 'str3']
, and
file1entery1, file2entry2... etc
comes from the CSV files I already have.

Solution:
I wasn't able to get it all in one line like the solution suggested, however it pointed me in the right direction.

for f zip(glob.glob('*csv')):
df = pd.read_csv(f, header = None)
df = df.assign(id = os.path.basename(f)) # simpler than pulling from the array. Adds file name to each line.
dfs.append(df)

all_data = pd.concat(dfs)

Answer

you can use .assign(id=i) method, which will add id column to each parsed CSV and will populate it with the i value:

df = pd.concat([pd.read_csv(f, header = None).assign(id=i)
                for f, i in zip(glob.glob('*.csv), short_list)],
               ignore_index=True)