Ciaran Ciaran - 1 month ago 4
Python Question

Converting data in text file to dataframe

I'm having difficulty coming up with a simple solution to make a nice simple dataframe to work with from text in the format below:

Dose [Gy] Relative dose [%] Structure Volume [cm³]
0 0 45888.7
0.1 0.166667 27061.7
0.2 0.333333 18911.6
0.3 0.5 14907.6
0.4 0.666667 12602.7
0.5 0.833333 11127.8
0.6 1 10041.9
0.7 1.16667 9184.75
0.8 1.33333 8480.96
0.9 1.5 7885.19
1 1.66667 7382.82
1.1 1.83333 6947.77
1.2 2 6570.69
1.3 2.16667 6242.93
1.4 2.33333 5959.37
1.5 2.5 5713.12
1.6 2.66667 5497.12
1.7 2.83333 5305.86
1.8 3 5135.8
1.9 3.16667 4983.65
2 3.33333 4846.38
2.1 3.5 4720.5
2.2 3.66667 4604.54
2.3 3.83333 4496.7
2.4 4 4396.11
2.5 4.16667 4303.21


What I was doing was directly indexing the value on each line, like:

for line in lines:
value1 = line[10:20]
value3 = line[55:70]


However, its not very pythonic, and not robust at all.

Now I am trying to let pandas do the heavy lifting and am struggling to get the data to come out correctly. For example:

df = pd.read_csv(StringIO.StringIO(data), sep=" ",engine='python')


Which outputs something that still includes new lines "\n" and "'" along with the numbers.

Is there a smarter way to tackle this? Or do I need to do a lot of pre-processing before pandas can work with it?

Thanks for any help/advice!

Answer

use read_fwf as it's a fixed width file and pass the column positions as a list of tuple pairs:

In [63]:
t="""    Dose [Gy]   Relative dose [%]    Structure Volume [cm³]
            0                   0                   45888.7
          0.1            0.166667                   27061.7
          0.2            0.333333                   18911.6
          0.3                 0.5                   14907.6
          0.4            0.666667                   12602.7
          0.5            0.833333                   11127.8
          0.6                   1                   10041.9
          0.7             1.16667                   9184.75
          0.8             1.33333                   8480.96
          0.9                 1.5                   7885.19
            1             1.66667                   7382.82
          1.1             1.83333                   6947.77
          1.2                   2                   6570.69
          1.3             2.16667                   6242.93
          1.4             2.33333                   5959.37
          1.5                 2.5                   5713.12
          1.6             2.66667                   5497.12
          1.7             2.83333                   5305.86
          1.8                   3                    5135.8
          1.9             3.16667                   4983.65
            2             3.33333                   4846.38
          2.1                 3.5                    4720.5
          2.2             3.66667                   4604.54
          2.3             3.83333                    4496.7
          2.4                   4                   4396.11
          2.5             4.16667                   4303.21"""

You can see that the final df is formatted correctly:

df = pd.read_fwf(io.StringIO(t), colspecs=[(0,13),(14,33),(34,59)])
df

Out[63]:
    Dose [Gy]  Relative dose [%]  Structure Volume [cm³]
0         0.0           0.000000                45888.70
1         0.1           0.166667                27061.70
2         0.2           0.333333                18911.60
3         0.3           0.500000                14907.60
4         0.4           0.666667                12602.70
5         0.5           0.833333                11127.80
6         0.6           1.000000                10041.90
7         0.7           1.166670                 9184.75
8         0.8           1.333330                 8480.96
9         0.9           1.500000                 7885.19
10        1.0           1.666670                 7382.82
11        1.1           1.833330                 6947.77
12        1.2           2.000000                 6570.69
13        1.3           2.166670                 6242.93
14        1.4           2.333330                 5959.37
15        1.5           2.500000                 5713.12
16        1.6           2.666670                 5497.12
17        1.7           2.833330                 5305.86
18        1.8           3.000000                 5135.80
19        1.9           3.166670                 4983.65
20        2.0           3.333330                 4846.38
21        2.1           3.500000                 4720.50
22        2.2           3.666670                 4604.54
23        2.3           3.833330                 4496.70
24        2.4           4.000000                 4396.11
25        2.5           4.166670                 4303.21
Comments