zsad512 - 2 months ago
Python Question

Python: Multiple dataframes from multiple CSV, encoding cp1252 to utf8

I have a zip file containing several CSV documents, which I have extracted into a folder called "staging." The documents are encoded in Windows CP1252. I would like to read each CSV file into its own separate dataframe, remove all of the null values, and then overwrite the old files with UTF-8 encoding. (Alternatively, instead of rewriting the CSVs as UTF-8, I could load the database directly from the pandas dataframes.) I have browsed the Stack Overflow forums, but the main topic there seems to be concatenating multiple CSVs into a single dataframe; what I need is a separate dataframe for each CSV. One complication: the N/A values I have to remove come with random numbers attached (e.g. "N/A (3)" or "N/A(1)"). Any help would be greatly appreciated.

Here is the code I am working with:

import os

# Create the staging directory (exist_ok avoids an error on re-runs)
staging_dir = "staging"
os.makedirs(staging_dir, exist_ok=True)

# Confirm the staging directory path
os.path.isdir(staging_dir)

# Machine independent path to create files
zip_file = os.path.join(staging_dir, "Hospital_Revised_Flatfiles.zip")

# Write the downloaded archive to disk ("r" is the requests
# response from an earlier download step not shown here)
with open(zip_file, "wb") as zf:
    zf.write(r.content)

# Program to unzip the files
import zipfile

with zipfile.ZipFile(zip_file, "r") as z:
    z.extractall(staging_dir)

#Create the dataframes

import io
import glob
import pandas as pd

files = glob.glob(os.path.join(staging_dir, "*.csv"))

# OS independent reading of files
for file in files:
    dfs = pd.read_csv(file, header=0, encoding='cp1252')

Answer

I believe P.Tillmann's solution should have worked. Alternatively, you can load all of your dataframes into a dict first, keyed by filename, and then write each one back out with UTF-8 encoding.

files = glob.glob(os.path.join("staging", "*.csv"))

dict_ = {}
for file in files:
    dict_[file] = pd.read_csv(file, header=0, encoding='cp1252').dropna()

for file in dict_:
    # index=False keeps to_csv from adding an extra index column
    # to the rewritten files
    dict_[file].to_csv(file, encoding='utf-8', index=False)
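One thing the code above does not handle is the "N/A (3)" / "N/A(1)" markers you mentioned: dropna() only removes real NaN values, not strings that merely look like N/A. A minimal sketch of how you might normalize them first, using a regex replace before dropping (the helper name clean_na_markers and the exact pattern are my own assumptions, not part of your data spec):

```python
import pandas as pd

def clean_na_markers(df):
    # Hypothetical helper: turn strings like "N/A (3)" or "N/A(1)"
    # into real NaN values, then drop the rows that contain them.
    # The regex allows optional whitespace and a parenthesized number.
    cleaned = df.replace(to_replace=r'^\s*N/A\s*\(\d+\)\s*$',
                         value=float("nan"), regex=True)
    return cleaned.dropna()

df = pd.DataFrame({"a": ["1", "N/A (3)", "2"],
                   "b": ["x", "y", "N/A(1)"]})
print(clean_na_markers(df))
```

You could call this on each dataframe inside the first loop (i.e. `dict_[file] = clean_na_markers(pd.read_csv(...))`) so the files are written back already cleaned. If the markers always use the exact same strings, passing them to read_csv via its na_values parameter would also work, but the regex covers the "random numbers attached" case.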