
How to restructure data using a loop (without crashing the computer!) and write it to a CSV in Python

Using read_csv in pandas, I have imported a huge data set with >500K rows; each row contains a taxonomic code with location and abundance values from a specific date and station. These values repeat for different stations over time. I cannot create a unique timestamp because the time of day was not recorded, so I only have the date.

My columns are: Cruise Name, Station Number, Latitude, Longitude, Date (YY/MM/DD), Taxonomic Code, Abundance.

I need to rearrange the data so that the columns are the individual taxonomic codes (n > 400), with abundance as the values in those columns, and the rows are occurrences with a unique index built from the location and date information. To further complicate things, I need to include zeros wherever a taxonomic code was not observed in a particular sample.




edit: I created a new data frame with a unique identifier combining all pertinent location information in a single column, plus the abundance and taxonomic ID, using:

df['ID'] = df[['timestamp','cruise','station','lat','long','depth']].apply(lambda x: ','.join(map(str, x)), axis=1)


df3 = pd.DataFrame([df.ID, df.TaxonomicCode, df.Abundance]).T


ID oldtxc zoodns100
0 1977-02-13 00:00:00,MM7701,2,41.1833,-70.6667,... 101 114.95
1 1977-02-13 00:00:00,MM7701,2,41.1833,-70.6667,... 102 40118.18


Define variables and count unique values using numpy.unique:

species = df3['TaxonomicCode']
cruise = df3['ID']

taxa = np.unique(species)    # 419 unique taxonomic codes
locats = np.unique(cruise)   # 27530 unique location IDs

I then created another data frame filled with zeros using:

aa = pd.DataFrame(index=locats, columns=taxa)  # create empty matrix
aa = aa.fillna(0)                              # fill NaN with 0

2 100 101 102 103 104 105 106 107 108 ... 4500 4504 4601 4604 4700 5000 5100 5101 5150 9114
1977-02-13 00:00:00,MM7701,2,41.1833,-70.6667,33.0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0


Now I want to loop through the raw data to fill in the values of 'aa':

for d in range(len(df3)):
    loc1 = df3.ID[d]
    tax1 = df3.oldtxc[d]
    locIndex = np.where(locats == loc1)[0][0]
    taxIndex = np.where(taxa == tax1)[0][0]
    aa[locIndex, taxIndex] = df3.zoodns100[d]


This works initially and gives the expected output, but it crashes my computer after only ~1,000 iterations (with 509K iterations left to go... gulp). I am using IPython Notebook and/or Spyder and the result is the same. Could this be a memory issue?

Ultimately I need to write the resulting dataframe 'aa' to a CSV file. Is there a way to do this during the iteration to free memory?




edit 2: Now I see that my loop is appending new columns onto my dataframe 'aa', which could explain the crashing. What I am attempting to do is fill in the values of 'aa' (which is, and should ultimately be, 419 rows x 27530 columns) from a single column in dataframe 'df3', which is 510K rows long and contains the abundance values for each observed species at each location. I need to match all of the abundance data from each location to the respective columns of 'aa' for each row of 'aa'. Each row of 'aa' is a unique location index, created using np.unique on the location values in 'df3' (510K rows -> 27K unique stations).

I am trying to iterate over the length of 'df3', find the matching location indices (rows) in 'df3', and pair the abundance data associated with those indices to the corresponding single row in 'aa'.

(clear as mud, right? this is hard to explain...)

I am sure it is my syntax. I want to assign values to dataframe 'aa' like this:

aa[row, column] = value

but this appears to cause a concatenation rather than an assignment. What am I missing here?

Answer

I can't follow your entire question, but I have a couple of pointers that might help.

First, I don't think there's any reason for your statement:

aa=aa.fillna(0)

There's no benefit to preallocating all of those zeros, and it clutters your code.

I think instead it's more efficient for you to have something along the lines of:

aa = pd.DataFrame(index=locats, columns=taxa)
for d in range(len(df3)):
    ...
    # build a Series (new_row) for just one row
    ...
    aa = aa.append(new_row, ignore_index=True)  # True/False depending on what you want

Also, you might want to reconsider your for loop. Pandas has an iterrows() function that you could use instead. You use it like this:

for row_index, row in df3.iterrows(): 
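
For example, here's a minimal sketch of your fill loop rewritten with iterrows() (this assumes the df3 column names shown in your post and keeps your preallocated aa; .loc does label-based assignment, so the np.where lookups aren't needed):

for row_index, row in df3.iterrows():
    loc1 = row['ID']       # row is a Series keyed by df3's column names
    tax1 = row['oldtxc']
    # assign by label directly into the preallocated frame
    aa.loc[loc1, tax1] = row['zoodns100']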

When it comes to concatenating, you may introduce new performance issues. There's a post here that discusses your options, but note that those answers are talking about millions of rows, and yours is much smaller, so I think there's hope.
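
The usual advice from those discussions, as a sketch: collect the pieces in a plain Python list and concatenate once at the end, because every append inside a loop copies the entire frame (build_row here is a hypothetical helper returning one pd.Series per source row):

import pandas as pd

pieces = []                       # appending to a plain list is cheap
for row_index, row in df3.iterrows():
    new_row = build_row(row)      # hypothetical: returns one pd.Series per row
    pieces.append(new_row)

# one concat at the end instead of ~500K incremental appends
aa = pd.concat(pieces, axis=1).T  # each Series becomes a row after the transpose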

Along those lines, don't feel obligated to solve the entire problem in one pass. That's another reason not to allocate everything in advance. If you have genuine memory problems, it might be possible to break the work into chunks. For example, every 1000 rows iterated, you could flush your current DataFrame to a .csv file, thereby releasing that memory. A rough sketch of that pattern (the chunk size, file names, and record layout are placeholders built from the df3 columns in your post):
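import pandas as pd

chunk_size = 1000   # how many rows to process before flushing; tune to taste
buffer = []         # accumulates one small record per processed row

for i, (row_index, row) in enumerate(df3.iterrows()):
    buffer.append({'ID': row['ID'],
                   'oldtxc': row['oldtxc'],
                   'zoodns100': row['zoodns100']})
    if len(buffer) == chunk_size:
        # write this chunk to its own file and release the memory it held
        pd.DataFrame(buffer).to_csv('chunk_%06d.csv' % i, index=False)
        buffer = []

if buffer:  # don't forget the final partial chunk
    pd.DataFrame(buffer).to_csv('chunk_final.csv', index=False)

You might end up with 500 .csv files, but then a separate function can read them all back in. Assuming they are the only .csv files in the directory: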

import os
import pandas as pd

def concatenate_files(files_path):
    # collect the paths of every .csv file in the directory
    file_list = []
    for file_ in os.listdir(files_path):
        if file_.endswith('.csv'):
            file_list.append(files_path + '/' + file_)
    # read each file and stack them into one DataFrame
    combined_df = pd.DataFrame()
    for file_name in file_list:
        df = pd.read_csv(file_name)
        combined_df = combined_df.append(df, ignore_index=False)
    return combined_df
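
You'd call it along the lines of combined = concatenate_files('output_chunks') (the directory name is just a placeholder), after which a single combined.to_csv('final.csv') gives you the one CSV file you asked about.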

Hope that helps.

UPDATE 8/20, in response to your 'edit 2'

Your problem in your most recent edit is that if 'row' and 'column' are integers, then you are trying to use integer-location indexing but not calling the correct function (iloc). Indexing a DataFrame with aa[row, column] instead creates a new column keyed by the tuple (row, column), which is why columns keep getting appended. Try running this example code to see what I mean.

import numpy as np
from pandas import DataFrame

df = DataFrame(np.random.randn(4, 4))
df['1', '2'] = 3   # not what you want: adds a new column named ('1', '2')
print(df)
df.iloc[1, 2] = 3  # what I think you mean: assigns to row 1, column 2
print(df)

Again, though, this goes back to my original suggestion. I don't think it's in your best interest to allocate the 419x27530 frame up front, and I think some of your problems come from insisting on fitting everything into that shape. Besides the preallocation, you mention that your data orientation is a problem, but I'm not clear on exactly how. It's perfectly valid to build your results as 27530x1, 27530x2, ... 27530x419 and then call DataFrame.transpose() (or just df.T) to get the 419x27530 orientation you want.
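
For example, with a toy frame just to show the flip:

import numpy as np
import pandas as pd

tall = pd.DataFrame(np.arange(15).reshape(5, 3), columns=['a', 'b', 'c'])
print(tall.shape)    # (5, 3) -- built column by column
wide = tall.T        # same data, orientation flipped
print(wide.shape)    # (3, 5)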