Roby Roby - 1 month ago 16
Python Question

Data import (reshape, numpy, pandas)

I have multiple directories with files inside.

I want to loop over each directory, create a 2D histogram for every file, flatten the histogram, and build a matrix (one row per file, one column per histogram bin).

I also want to add the name of the file and a directory-specific parameter to each row, and finally concatenate the results of all directories, e.g.:

"File_1.csv", "FOO",0,0,1,2,3,0,0
"File_2.csv", "FOO",0,0,1,2,3,1,1
"File_3.csv", "FOO",0,0,4,5,3,0,0

"File_4.csv", "BAR",0,0,1,2,3,0,0
"File_5.csv", "BAR",0,0,1,2,3,1,1
"File_6.csv", "BAR",0,0,4,5,3,0,0


I've done:

import os
import numpy as np
from pandas import DataFrame

def read(path, b, para):
    HistList = []
    NameList = []
    files = os.listdir(path)

    for i in range(0, len(files)):
        ....
        # density=True replaces the older normed=True keyword
        hist, xe, ye = np.histogram2d(X, Y, bins=b, density=True)
        HistList.append(hist.flatten())
        NameList.append(files[i])

    return DataFrame( ??? )

Answer

Why not use a dictionary?

You can create a dictionary, Final_Dict, pass it to the function as an argument, and let the function fill it in folder by folder. The top-level keys are the folder names (Final_Dict[folder_name]). The sub-keys under each folder are the names of the files in that folder (Final_Dict[folder_name][file_name]), and the value stored under each sub-key is the flattened histogram.

Just to be clear, the following line extracts the folder name from the path:

current_folder = os.path.basename(os.path.normpath(path)) 
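A small check of why normpath is applied first, as a sketch only. The path "/data/FOO/" is a made-up example: with a trailing separator, basename on its own returns an empty string.

```python
import os.path

# Hypothetical directory path with a trailing separator.
path = "/data/FOO/"

# basename alone returns an empty string because the path ends in a separator.
print(repr(os.path.basename(path)))

# normpath strips the trailing separator, so basename then returns the folder.
current_folder = os.path.basename(os.path.normpath(path))
print(current_folder)
```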

Code (not tested):

def read(Final_Dict, path, b, para):
    # The folder name becomes the top-level key.
    current_folder = os.path.basename(os.path.normpath(path))
    Final_Dict[current_folder] = {}

    files = os.listdir(path)
    for i in range(0, len(files)):
        ....
        # density=True replaces the older normed=True keyword
        hist, xe, ye = np.histogram2d(X, Y, bins=b, density=True)
        # Store the flattened histogram under the file name.
        Final_Dict[current_folder][files[i]] = hist.flatten()

    return Final_Dict

Final_Dict = {}
b = ... 
para = ...
for folder_path in folder_path_list:
    Final_Dict = read(Final_Dict, folder_path, b, para)
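The function above leaves out how X and Y are loaded, so here is one possible end-to-end sketch with that gap filled by an assumption: each file is taken to hold two comma-separated numeric columns (x, y) without a header, read with np.loadtxt. The name read_folder, the file layout, and the synthetic test data are all hypothetical, and the para argument is dropped because the function above never uses it.

```python
import os
import tempfile

import numpy as np


def read_folder(final_dict, path, b):
    """Add one flattened 2D histogram per file in `path` to `final_dict`."""
    current_folder = os.path.basename(os.path.normpath(path))
    final_dict[current_folder] = {}
    for file_name in sorted(os.listdir(path)):
        # Assumption: two comma-separated numeric columns (x, y), no header.
        data = np.loadtxt(os.path.join(path, file_name), delimiter=",")
        hist, xe, ye = np.histogram2d(data[:, 0], data[:, 1], bins=b, density=True)
        final_dict[current_folder][file_name] = hist.flatten()
    return final_dict


# Build a throwaway folder with two small synthetic files to exercise the loop.
rng = np.random.default_rng(0)
with tempfile.TemporaryDirectory() as root:
    folder = os.path.join(root, "FOO")
    os.mkdir(folder)
    for name in ("File_1.csv", "File_2.csv"):
        np.savetxt(os.path.join(folder, name), rng.random((50, 2)), delimiter=",")
    result = read_folder({}, folder, b=3)

print(sorted(result["FOO"]))               # file names found in the folder
print(result["FOO"]["File_1.csv"].shape)   # b * b values per file
```

With an integer b, each file contributes b * b values, so every row of the final matrix has the same length.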

After that, you can convert Final_Dict to a data frame:

Final_Dataframe = pd.DataFrame.from_dict(Final_Dict, orient='index', dtype=None)

A quick example of the conversion:

import numpy as np
import pandas as pd

Final_Dict= {}
Final_Dict['folder1'] = {}
Final_Dict['folder2'] = {}

Final_Dict['folder1']['file1'] = [1,2,3]
Final_Dict['folder1']['file2'] = [1,2,3]
Final_Dict['folder2']['file1'] = [1,2,3]
Final_Dict['folder2']['file2'] = [1,2,3]

Final_Dataframe = pd.DataFrame.from_dict(Final_Dict, orient='index', dtype=None)

print(Final_Dataframe)

Result:

             file2      file1
folder1  [1, 2, 3]  [1, 2, 3]
folder2  [1, 2, 3]  [1, 2, 3]
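Note that this frame has one row per folder and keeps each histogram as a list inside a single cell. If you want the layout from the question instead (one row per file, one column per bin), a possible sketch is to walk the nested dictionary and build the rows yourself. The column names "file", "folder" and "bin_0", "bin_1", ... are made up for illustration, and the folder name stands in for the directory-specific parameter.

```python
import numpy as np
import pandas as pd

# Same nested layout as above: folder -> file -> flattened histogram.
final_dict = {
    "folder1": {"file1": np.array([1, 2, 3]), "file2": np.array([1, 2, 3])},
    "folder2": {"file1": np.array([1, 2, 3]), "file2": np.array([1, 2, 3])},
}

rows = []
for folder, files in final_dict.items():
    for file_name, hist in files.items():
        # One row per file: file name, folder label, then one entry per bin.
        rows.append([file_name, folder] + list(hist))

# Hypothetical column names; every histogram is assumed to have the same length.
n_bins = len(rows[0]) - 2
columns = ["file", "folder"] + ["bin_%d" % i for i in range(n_bins)]
frame = pd.DataFrame(rows, columns=columns)
print(frame)
```

This only works if every file was histogrammed with the same bins, so that all rows have the same number of bin columns.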