Using read_csv in Pandas,
I have imported a huge data set with >500K rows, each row contains a taxonomic code with location and abundance values from a specific date and station. These values repeat for different stations over time. I cannot create a unique time stamp because time was not recorded, thus I only have the date.
My columns are : Cruise Name, Station number, Latitude, Longitude, Date(YY/MM/DD), Taxonomic Code, Abundance
I need to rearrange the data such that my columns will be the individual taxonomic codes (n>400) as the column name with abundance as values for those columns, and the rows will be occurrence with unique index consisting of location and date information. To further complicate this, I need to include zeros where there were no observations for the taxonomic codes for those particular samples
df['ID'] = df[['timestamp','cruise','station','lat','long','depth']].apply(lambda x: ','.join(map(str, x)), axis=1)
df3 = pd.DataFrame([df.ID,df.TaxonomicCode,df.Abundance]
ID oldtxc zoodns100
0 1977-02-13 00:00:00,MM7701,2,41.1833,-70.6667,... 101 114.95
1 1977-02-13 00:00:00,MM7701,2,41.1833,-70.6667,... 102 40118.18
species = df3['TaxonomicCode']
locats = np.unique(cruise)
2 100 101 102 103 104 105 106 107 108 ... 4500 4504 4601 4604 4700 5000 5100 5101 5150 9114
1977-02-13 00:00:00,MM7701,2,41.1833,-70.6667,33.0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
for d in range(len(df3)):
loc1 = df3.ID[d]
tax1 = df3.oldtxc[d]
locIndex = np.where(locats == loc1)
taxIndex = np.where(taxa == tax1)
aa[locIndex, taxIndex] = df3.zoodns100[d]
aa[row, column] = value
I can't understand your entire question, but I have a couple pointers that might help.
First, I don't think there's any reason for your statement:
There's no benefit to preallocating all of those zeros, and it clutters your code.
I think instead it's more efficient for you to have something along the lines of:
aa=pd.DataFrame(index=locats, columns=taxa) for d in range(len(df3)) ... #build a Series (new_row) for Just one row ... aa = aa.append(new_row, ignore_index=True) #T/F depending on what you want
Also, you might want to reconsider your for loop. Pandas has an iterrows() function that you could use instead. You use it like this:
for row_index, row in df3.iterrows():
When it comes to concatenating, you may introduce new performance issues. There's a post here that discusses your options. But if you look, those are talking about millions, and yours are much less. So I think there's hope.
Along those lines, don't feel obligated to solve the entire problem in one iteration. That's another reason not to allocate everything in advance. If you have genuine performance issues, it might be possible to break it off in chunks. For example, every 1000 rows iterated, you could flush your current DataFrame to a .csv file, thereby releasing that memory. You might end up with 500 .csv files, but then a separate function would be able to read them all in. Assuming they are the only .csv files in the directory:
def concatinate_files(files_path): file_list=  for file_ in os.listdir(files_path): if file_.endswith('.csv'): file_list.append(files_path + '/' + file_) combined_df = DataFrame() for file_name in file_list: df = pd.read_csv(file_name) combined_df = combined_df.append(df, ignore_index=False)
Hope that helps.
UPDATE 8/20 in response to your 'edit2'
Your problem in your most recent post is that if 'row' and 'column' are integers, than you are trying to use integer location indexing, but not calling the correct function (iloc). That causes columns to be appended. Try running this example code to see what I mean.
df = DataFrame(np.random.randn(4, 4)) df['1','2']=3 #not what you want print df df.iloc[1,2]=3 #what I think you mean print df
Again, this goes back to my original suggestion though. I don't think it's in your best interest to allocate the 419x27530 up front. I think some of your problems are from your mindset/insistence to try to fit things that way. Besides the preallocation, you mention that your data orientation is a problem, but I'm not clear on exactly how that is. It's perfectly valid to build your results as 27530x1, 27530x2 ... 27530x419 and then call DataFrame.Transpose (df.T) to get the 419x27530 orientation you want.