eboylen eboylen - 6 months ago 69
Python Question

Appending variable length columns in Pandas dataframe Python

I have a few csv files which contain a pair of bearings for many locations. I am trying to expand the values to include every number between the bearing pairs for each location and export the variable lengths as a csv in the same format.

Example:

df = pd.read_csv('bearing.csv')


Data structure:

A B C D E
0 0 94 70 67 84
1 120 132 109 152 150


Ideal result is a variable length multidimensional array:

A B C D E
0 0 94 70 67 84
1 1 95 71 68 85
2 3 96 72 69 86
...
n 120 132 109 152 150


I am looping through each column and getting the range of the pair of values, but I am struggling when trying to overwrite the old column with the new range of values.

for col in bear:
min_val = min(bear[col])
max_val = max(bear[col])
range_vals = range(min(bear[col]), max(bear[col])+1)
bear[col] = range_vals


I am getting the following error:

ValueError: Length of values does not match length of index

Answer

You can use dict comprehension with min and max in DataFrame contructor, but you get a lot NaN in the end of columns:

df = pd.DataFrame({col: pd.Series(range(df[col].min(), 
                                        df[col].max() + 1)) for col in df.columns })
print (df)
print (df)

       A      B     C     D      E
0      0   94.0  70.0  67.0   84.0
1      1   95.0  71.0  68.0   85.0
2      2   96.0  72.0  69.0   86.0
3      3   97.0  73.0  70.0   87.0
4      4   98.0  74.0  71.0   88.0
5      5   99.0  75.0  72.0   89.0
6      6  100.0  76.0  73.0   90.0
7      7  101.0  77.0  74.0   91.0
8      8  102.0  78.0  75.0   92.0
9      9  103.0  79.0  76.0   93.0
10    10  104.0  80.0  77.0   94.0
11    11  105.0  81.0  78.0   95.0
12    12  106.0  82.0  79.0   96.0
13    13  107.0  83.0  80.0   97.0
14    14  108.0  84.0  81.0   98.0
15    15  109.0  85.0  82.0   99.0
16    16  110.0  86.0  83.0  100.0
17    17  111.0  87.0  84.0  101.0
18    18  112.0  88.0  85.0  102.0
19    19  113.0  89.0  86.0  103.0
20    20  114.0  90.0  87.0  104.0
21    21  115.0  91.0  88.0  105.0
22    22  116.0  92.0  89.0  106.0
23    23  117.0  93.0  90.0  107.0
24    24  118.0  94.0  91.0  108.0
25    25  119.0  95.0  92.0  109.0
26    26  120.0  96.0  93.0  110.0
27    27  121.0  97.0  94.0  111.0
28    28  122.0  98.0  95.0  112.0
29    29  123.0  99.0  96.0  113.0
..   ...    ...   ...   ...    ...
91    91    NaN   NaN   NaN    NaN
92    92    NaN   NaN   NaN    NaN
93    93    NaN   NaN   NaN    NaN
94    94    NaN   NaN   NaN    NaN
95    95    NaN   NaN   NaN    NaN
96    96    NaN   NaN   NaN    NaN
97    97    NaN   NaN   NaN    NaN
98    98    NaN   NaN   NaN    NaN
99    99    NaN   NaN   NaN    NaN
100  100    NaN   NaN   NaN    NaN
101  101    NaN   NaN   NaN    NaN
102  102    NaN   NaN   NaN    NaN
103  103    NaN   NaN   NaN    NaN
104  104    NaN   NaN   NaN    NaN
105  105    NaN   NaN   NaN    NaN
106  106    NaN   NaN   NaN    NaN
107  107    NaN   NaN   NaN    NaN
108  108    NaN   NaN   NaN    NaN
109  109    NaN   NaN   NaN    NaN
110  110    NaN   NaN   NaN    NaN
111  111    NaN   NaN   NaN    NaN
112  112    NaN   NaN   NaN    NaN
113  113    NaN   NaN   NaN    NaN
114  114    NaN   NaN   NaN    NaN
115  115    NaN   NaN   NaN    NaN
116  116    NaN   NaN   NaN    NaN
117  117    NaN   NaN   NaN    NaN
118  118    NaN   NaN   NaN    NaN
119  119    NaN   NaN   NaN    NaN
120  120    NaN   NaN   NaN    NaN

If you have only few columns, is possible use:

df = pd.DataFrame({'A': pd.Series(range(df.A.min(), df.A.max() + 1)),
                   'B': pd.Series(range(df.B.min(), df.B.max() + 1))})

EDIT:

If min value is in first row and the max in last, you can use iloc:

df = pd.DataFrame({col: pd.Series(range(df[col].iloc[0], 
                                        df[col].iloc[-1] + 1)) for col in df.columns })

Timings:

In [3]: %timeit ( pd.DataFrame({col: pd.Series(range(df[col].iloc[0], df[col].iloc[-1] + 1)) for col in df.columns }) )
1000 loops, best of 3: 1.75 ms per loop

In [4]: %timeit ( pd.DataFrame({col: pd.Series(range(df[col].min(), df[col].max() + 1)) for col in df.columns })  )
The slowest run took 5.50 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 2.18 ms per loop