Stefano Potter Stefano Potter - 3 months ago 24
Python Question

Splitting dataframe and saving to txt files

I have a dataframe as such:

Histogram DN Npts Total Percent Acc Pct
Band 1 -0.054741 1 1 0.0250 0.0250
Bin=0.00233 -0.052404 0 1 0.0000 0.0250
-0.050067 0 1 0.0000 0.0250
-0.047730 0 1 0.0000 0.0250
-0.045393 0 1 0.0000 0.0250
-0.043056 0 1 0.0000 0.0250
-0.040719 0 1 0.0000 0.0250
Histogram DN Npts Total Percent Acc Pct
Band 2 0.000000 346 346 9.5186 9.5186
Bin=0.00203 0.002038 0 346 0.0000 9.5186
0.004076 0 346 0.0000 9.5186
0.006114 0 346 0.0000 9.5186
0.008152 0 346 0.0000 9.5186
0.010189 0 346 0.0000 9.5186
0.012227 0 346 0.0000 9.5186


and I want to split this based on when the word Histogram occurs (in this case every 8 lines). I could split it like this:

np.array_split(df,8)


but if there is a way to do it on the keyword I would prefer it. I then want to save each split into its own textfile. Is there a way to do this?

df.head().to_json()
returns:

{"Histogram ":{"0":"Band 1 ","1":"Bin=0.00233","2":" ","3":" ","4":" "}," DN":{"0":"-0.054741","1":"-0.052404","2":"-0.050067","3":"-0.047730","4":"-0.045393"}," Npts":{"0":" 1","1":" 0","2":" 0","3":" 0","4":" 0"}," Total":{"0":" 1","1":" 1","2":" 1","3":" 1","4":" 1"}," Percent":{"0":" 0.0250","1":" 0.0000","2":" 0.0000","3":" 0.0000","4":" 0.0000"}," Acc Pct":{"0":" 0.0250","1":" 0.0250","2":" 0.0250","3":" 0.0250","4":" 0.0250"}}

Answer

Firstly you should normalize your column names, at the moment they contain spaces (this explains the KeyError you saw earlier):

In [11]: df1.columns
Out[11]:
Index(['       DN', '   Npts', '  Total', ' Acc Pct', ' Percent', 'Histogram  '], dtype='object')

In [12]: df1.columns.map(lambda x: x.strip())
Out[12]: array(['DN', 'Npts', 'Total', 'Acc Pct', 'Percent', 'Histogram'], dtype=object)

In [13]: df1.columns = df1.columns.map(lambda x: x.strip())

To group by the Band, I'd use cumsum:

In [14]: df1  # similar to your example
Out[14]:
         DN  Npts  Total  Acc Pct  Percent    Histogram
0 -0.054741     1      1    0.025    0.025  Band 1
1 -0.052404     0      1    0.025    0.000  Bin=0.00233
2 -0.050067     0      1    0.025    0.000
3 -0.047730     0      1    0.025    0.000
4 -0.045393     0      1    0.025    0.000
5 -0.054741     1      1    0.025    0.025  Band 2
6 -0.052404     0      1    0.025    0.000  Bin=0.00233
7 -0.050067     0      1    0.025    0.000
8 -0.047730     0      1    0.025    0.000
9 -0.045393     0      1    0.025    0.000

In [15]: df1["Histogram"].str.startswith("Band").cumsum()
Out[15]:
0    1
1    1
2    1
3    1
4    1
5    2
6    2
7    2
8    2
9    2
Name: Histogram, dtype: int64

You can use this to groupby (which is how you want to split):

In [16]: g = df1.groupby(df1["Histogram"].str.startswith("Band").cumsum())

Now you can extract/clean at your leisure:

In [21]: g.get_group(1)
Out[21]:
         DN  Npts  Total  Acc Pct  Percent    Histogram
0 -0.054741     1      1    0.025    0.025  Band 1
1 -0.052404     0      1    0.025    0.000  Bin=0.00233
2 -0.050067     0      1    0.025    0.000
3 -0.047730     0      1    0.025    0.000
4 -0.045393     0      1    0.025    0.000

In [22]: [x for _, x in g]
Out[22]:
[         DN  Npts  Total  Acc Pct  Percent    Histogram
 0 -0.054741     1      1    0.025    0.025  Band 1
 1 -0.052404     0      1    0.025    0.000  Bin=0.00233
 2 -0.050067     0      1    0.025    0.000
 3 -0.047730     0      1    0.025    0.000
 4 -0.045393     0      1    0.025    0.000             ,
          DN  Npts  Total  Acc Pct  Percent    Histogram
 5 -0.054741     1      1    0.025    0.025  Band 2
 6 -0.052404     0      1    0.025    0.000  Bin=0.00233
 7 -0.050067     0      1    0.025    0.000
 8 -0.047730     0      1    0.025    0.000
 9 -0.045393     0      1    0.025    0.000             ]
Comments