Ruffy26 Ruffy26 - 1 month ago 6
Python Question

What is a pythonic way of populating in a dataframe from a csv file that has only few rows?

I have a csv file that has only the two following rows:

ID CODE DATE STARTDATE AVERAGE1 AGGREGATE MEDIAN1 MEDIAN2 AVERAGE2 AVERAGE3 POSITION
123 112 2016OCT25 2016OCT25 821 3 1021 4000 172 183 S8
124 111 2016OCT25 2016OCT25 834 3 1025 4210 173 185 ENG


The position column takes values from S1-S8 and T1-T8(16 in total).I would like to create a dataframe that looks like the following:

AVERAGE1 MEDIAN1 MEDIAN2 AVERAGE2 AVERAGE3 POSITION
S1

S2

S3

S4

S5

S6

S7

S8

T1

T2

T3

T4

T5

T6

T7

T8


Now the problem is if the 'POSITION' column in the csv file reads 'ENG' the same corresponding data in the column data has to be filled for all the positions in the dataframe, from S1-T8. If it reads any particular position , for example S8 in the second column above, the data from that column in the csv has to be filled to the corresponding POSITION in the dataframe. To make it more clear, the output should look like this:

AVERAGE1 MEDIAN1 MEDIAN2 AVERAGE2 AVERAGE3 POSITION
821 1021 4000 172 183 S1

821 1021 4000 172 183 S2

821 1021 4000 172 183 S3

821 1021 4000 172 183 S4

821 1021 4000 172 183 S5

821 1021 4000 172 183 S6

821 1021 4000 172 183 S7

834 1025 4210 173 185 S8

821 1021 4000 172 183 T1

821 1021 4000 172 183 T2

821 1021 4000 172 183 T3

821 1021 4000 172 183 T4

821 1021 4000 172 183 T5

821 1021 4000 172 183 T6

821 1021 4000 172 183 T7

821 1021 4000 172 183 T8


I have accomplished this by creating an empty dataframe and then filled the cells with df.iloc after reading the csv. Any method that is more efficient or pythonic will be much appreciated. Thanks in advance!

Answer

I think you can use reindex by pos, fillna by values from row with ENG in column POSITION, drop unnecessary columns, cast to int and last reset_index:

pos = ['S' + str(x) for x in range(1, 9)] + ['T' + str(x) for x in range(1,9)] 
print (pos)
['S1', 'S2', 'S3', 'S4', 'S5', 'S6', 'S7', 'S8', 
 'T1', 'T2', 'T3', 'T4', 'T5', 'T6', 'T7', 'T8']

df.set_index('POSITION', inplace=True)
print (df.reindex(pos).fillna(df.ix['ENG']))
             ID   CODE       DATE  STARTDATE  AVERAGE1  AGGREGATE  MEDIAN1  \
POSITION                                                                     
S1        124.0  111.0  2016OCT25  2016OCT25     834.0        3.0   1025.0   
S2        124.0  111.0  2016OCT25  2016OCT25     834.0        3.0   1025.0   
S3        124.0  111.0  2016OCT25  2016OCT25     834.0        3.0   1025.0   
S4        124.0  111.0  2016OCT25  2016OCT25     834.0        3.0   1025.0   
S5        124.0  111.0  2016OCT25  2016OCT25     834.0        3.0   1025.0   
S6        124.0  111.0  2016OCT25  2016OCT25     834.0        3.0   1025.0   
S7        124.0  111.0  2016OCT25  2016OCT25     834.0        3.0   1025.0   
S8        123.0  112.0  2016OCT25  2016OCT25     821.0        3.0   1021.0   
T1        124.0  111.0  2016OCT25  2016OCT25     834.0        3.0   1025.0   
T2        124.0  111.0  2016OCT25  2016OCT25     834.0        3.0   1025.0   
T3        124.0  111.0  2016OCT25  2016OCT25     834.0        3.0   1025.0   
T4        124.0  111.0  2016OCT25  2016OCT25     834.0        3.0   1025.0   
T5        124.0  111.0  2016OCT25  2016OCT25     834.0        3.0   1025.0   
T6        124.0  111.0  2016OCT25  2016OCT25     834.0        3.0   1025.0   
T7        124.0  111.0  2016OCT25  2016OCT25     834.0        3.0   1025.0   
T8        124.0  111.0  2016OCT25  2016OCT25     834.0        3.0   1025.0   

          MEDIAN2  AVERAGE2  AVERAGE3  
POSITION                               
S1         4210.0     173.0     185.0  
S2         4210.0     173.0     185.0  
S3         4210.0     173.0     185.0  
S4         4210.0     173.0     185.0  
S5         4210.0     173.0     185.0  
S6         4210.0     173.0     185.0  
S7         4210.0     173.0     185.0  
S8         4000.0     172.0     183.0  
T1         4210.0     173.0     185.0  
T2         4210.0     173.0     185.0  
T3         4210.0     173.0     185.0  
T4         4210.0     173.0     185.0  
T5         4210.0     173.0     185.0  
T6         4210.0     173.0     185.0  
T7         4210.0     173.0     185.0  
T8         4210.0     173.0     185.0  
df = df.reindex(pos)
       .fillna(df.ix['ENG'])
       .drop(['ID','CODE','DATE','STARTDATE'],axis=1)
       .astype(int)
       .reset_index()

print (df)
   POSITION  AVERAGE1  AGGREGATE  MEDIAN1  MEDIAN2  AVERAGE2  AVERAGE3
0        S1       834          3     1025     4210       173       185
1        S2       834          3     1025     4210       173       185
2        S3       834          3     1025     4210       173       185
3        S4       834          3     1025     4210       173       185
4        S5       834          3     1025     4210       173       185
5        S6       834          3     1025     4210       173       185
6        S7       834          3     1025     4210       173       185
7        S8       821          3     1021     4000       172       183
8        T1       834          3     1025     4210       173       185
9        T2       834          3     1025     4210       173       185
10       T3       834          3     1025     4210       173       185
11       T4       834          3     1025     4210       173       185
12       T5       834          3     1025     4210       173       185
13       T6       834          3     1025     4210       173       185
14       T7       834          3     1025     4210       173       185
15       T8       834          3     1025     4210       173       185

Another solution:

df.set_index('POSITION', inplace=True)
#remove unnecessary columns
df = df.drop(['ID','CODE','DATE','STARTDATE'],axis=1)

#index values
pos = ['S' + str(x) for x in range(1, 9)] + ['T' + str(x) for x in range(1,9)] 
#DataFrame constructor
df1 = pd.DataFrame(data=[df.ix['ENG']], index=pos)
df1.index.name = 'POSITION'

#get all not ENG index values
not_eng_idx = df.drop('ENG').index
print (not_eng_idx)
Index(['S8'], dtype='object', name='POSITION')

#overwrite DataFrame by values from df
df1.ix[not_eng_idx] = df.ix[not_eng_idx]
df1.reset_index(inplace=True)
print (df1)
   POSITION  AVERAGE1  AGGREGATE  MEDIAN1  MEDIAN2  AVERAGE2  AVERAGE3
0        S1       834          3     1025     4210       173       185
1        S2       834          3     1025     4210       173       185
2        S3       834          3     1025     4210       173       185
3        S4       834          3     1025     4210       173       185
4        S5       834          3     1025     4210       173       185
5        S6       834          3     1025     4210       173       185
6        S7       834          3     1025     4210       173       185
7        S8       821          3     1021     4000       172       183
8        T1       834          3     1025     4210       173       185
9        T2       834          3     1025     4210       173       185
10       T3       834          3     1025     4210       173       185
11       T4       834          3     1025     4210       173       185
12       T5       834          3     1025     4210       173       185
13       T6       834          3     1025     4210       173       185
14       T7       834          3     1025     4210       173       185
15       T8       834          3     1025     4210       173       185