Roby Roby - 1 month ago 19
Python Question

Data import (reshape, numpy, pandas)

I have multiple directories, each containing files (index), and each directory has a state.
I want to loop over all files in a directory, create a 2D histogram for each file, and bring everything together in one object that lets me select rows based on the state.

For example (with a 3x3 2D-Histogram):

"Filename" , "State", "X_1", "X_2", "X_3", "X_4", "X_5", "X_6", "X_7", "X_8","X_9"

"File_1.csv", "FOO",0,0,1,2,3,0,0,0,0
"File_2.csv", "FOO",0,0,1,2,3,1,1,0,0
"File_3.csv", "FOO",0,0,4,5,3,0,0,0,0
"File_4.csv", "BAR",0,0,1,2,3,0,0,0,0
"File_5.csv", "BAR",0,0,1,2,3,1,1,0,0
"File_6.csv", "BAR",0,0,4,5,3,0,0,0,0


I've done:

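import os
import numpy as np
from pandas import DataFrame

def read(path, b, State):
    HistList = []
    HistName = []
    files = os.listdir(path)

    for i in range(0, len(files)):
        ....
        # density=True replaces normed=True, which newer NumPy versions no longer accept
        hist, xe, ye = np.histogram2d(X, Y, bins=b, density=True)
        HistList.append(hist.flatten())
        HistName.append(files[i])

    return DataFrame( ??? )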

Answer

Why not use a dictionary?

You can create a dictionary, Final_Dict, and pass it to the function as an argument; the function then fills it in step by step for every folder and its files. The main keys of this dictionary are the folder names (Final_Dict[folder_name]), the sub-keys of each main key are the names of the files in that folder (Final_Dict[folder_name][file_name]), and the value of each sub-key is the flattened histogram.

Just to be clear, the following line extracts the folder name from the path:

current_folder = os.path.basename(os.path.normpath(path)) 
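The normpath call matters because basename on its own returns an empty string when the path ends with a separator. A small check with made-up POSIX-style paths (the paths are illustrative only, and posixpath is used here so the result does not depend on the platform; os.path behaves the same way on Linux and macOS):

```python
import posixpath

# Hypothetical paths, for illustration only.
with_slash = "data/FOO/"
without_slash = "data/FOO"

# basename alone returns an empty string when the path ends with a separator
plain = posixpath.basename(with_slash)

# normpath strips the trailing separator, so basename finds the folder name
folder_a = posixpath.basename(posixpath.normpath(with_slash))
folder_b = posixpath.basename(posixpath.normpath(without_slash))

print(repr(plain), folder_a, folder_b)
```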

Code (not tested):

def read(Final_Dict, path, b, para):
    # The last path component is the folder name, used as the main key
    current_folder = os.path.basename(os.path.normpath(path))
    Final_Dict[current_folder] = {}

    files = os.listdir(path)
    for i in range(0, len(files)):
        ....
        # density=True replaces normed=True, which newer NumPy versions no longer accept
        hist, xe, ye = np.histogram2d(X, Y, bins=b, density=True)
        Final_Dict[current_folder][files[i]] = hist.flatten()

    return Final_Dict

Final_Dict = {}
b = ...
para = ...
for folder_path in folder_path_list:
    Final_Dict = read(Final_Dict, folder_path, b, para)
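Since the code above is untested and leaves out how X and Y are obtained, here is a self-contained sketch of the same idea that can be run as-is. It rests on assumptions that are not in the original: each file is taken to be a headerless CSV with two columns (x, y), the para argument is dropped because its meaning is unknown, and the folders and random data are created in a temporary directory purely for the demonstration.

```python
import os
import tempfile

import numpy as np


def read(final_dict, path, b):
    """Add one flattened 2D histogram per file in `path` to final_dict."""
    current_folder = os.path.basename(os.path.normpath(path))
    final_dict[current_folder] = {}
    for file_name in sorted(os.listdir(path)):
        # Assumption: each file is a headerless CSV with two columns (x, y).
        data = np.loadtxt(os.path.join(path, file_name), delimiter=",")
        hist, xe, ye = np.histogram2d(data[:, 0], data[:, 1], bins=b, density=True)
        final_dict[current_folder][file_name] = hist.flatten()
    return final_dict


# Build two throwaway folders with random data so the sketch runs anywhere.
rng = np.random.default_rng(0)
root = tempfile.mkdtemp()
folder_path_list = []
for state in ("FOO", "BAR"):
    folder = os.path.join(root, state)
    os.makedirs(folder)
    for n in (1, 2):
        file_path = os.path.join(folder, "File_%d.csv" % n)
        np.savetxt(file_path, rng.random((50, 2)), delimiter=",")
    folder_path_list.append(folder)

Final_Dict = {}
for folder_path in folder_path_list:
    Final_Dict = read(Final_Dict, folder_path, b=3)

print(sorted(Final_Dict))                      # folder names used as main keys
print(Final_Dict["FOO"]["File_1.csv"].shape)   # 3x3 bins flattened to 9 values
```

With b=3 every file contributes a flattened 3x3 histogram, which matches the nine X_1 ... X_9 columns in the question's example.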

After that you can convert Final_Dict to a DataFrame:

Final_Dataframe = pd.DataFrame.from_dict(Final_Dict, orient='index', dtype=None)

Quick example of the conversion:

import numpy as np
import pandas as pd

Final_Dict= {}
Final_Dict['folder1'] = {}
Final_Dict['folder2'] = {}

Final_Dict['folder1']['file1'] = [1,2,3]
Final_Dict['folder1']['file2'] = [1,2,3]
Final_Dict['folder2']['file1'] = [1,2,3]
Final_Dict['folder2']['file2'] = [1,2,3]

Final_Dataframe = pd.DataFrame.from_dict(Final_Dict, orient='index', dtype=None)

print(Final_Dataframe)

Result (column order may vary with the Python version):

             file2      file1
folder1  [1, 2, 3]  [1, 2, 3]
folder2  [1, 2, 3]  [1, 2, 3]
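Note that this layout keeps each histogram as a list inside a single cell, whereas the question asks for one row per file with a State column and one column per bin. One possible way to get there is sketched below with the same made-up values; the column names Filename, State and X_1 ... X_3 follow the question's example, and treating the folder name as the state is an assumption.

```python
import pandas as pd

# Same made-up nested dict as in the example above: folder -> file -> values.
Final_Dict = {
    "folder1": {"file1": [1, 2, 3], "file2": [1, 2, 3]},
    "folder2": {"file1": [1, 2, 3], "file2": [1, 2, 3]},
}

# One row per file: file name, folder (used as the state), then one value per bin.
rows = []
for state, files in Final_Dict.items():
    for file_name, hist in files.items():
        rows.append([file_name, state] + list(hist))

n_bins = len(rows[0]) - 2
columns = ["Filename", "State"] + ["X_%d" % (i + 1) for i in range(n_bins)]
flat = pd.DataFrame(rows, columns=columns)

# Select all rows that belong to one state with a boolean mask.
only_folder1 = flat[flat["State"] == "folder1"]
print(only_folder1)
```

The boolean mask flat["State"] == ... gives the row selection by state that the question asks for, and the X_ columns can be passed on as a plain numeric matrix.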