Alexander Alexander -3 years ago 85
Python Question

Multiple delimiters in column headers also separates the row values

I had a some issue about defining multiple seperator when reading a file. It is originally solved in my previous post reading-files-with-multiple-delimiter-in-column-headers-and-skipping-some-rows thanks to @piRsquared

When I looked in detail to my real data I
realized that some columns have .cd or .dvd extensions and when I applied the solution above they are also separated as a new column and the solution above started not to work!

b.txt

skip1
A1| A2 |A3 |A4# A5# A6 A7| A8 , A9
1,2,3,4,5.cd,6,7,8.dvd,9
1,2,3,4,5.cd,6,7,8.dvd,9
1,2,3,4,5.cd,6,7,8.dvd,9

END123
Some other data starts from here


and read this b.txt file using the solution above

txt = open('b.txt').read().split('\nEND')[0]
pd.read_csv(
pd.io.common.StringIO(txt),
sep=r'\W+',
skiprows=1,index_col=False, engine='python')

A1 A2 A3 A4 A5 A6 A7 A8 A9
0 1 2 3 4 5 cd 6 7 8
1 1 2 3 4 5 cd 6 7 8
2 1 2 3 4 5 cd 6 7 8


A5 column should have the rows

5.cd
5.cd
5.cd


and same for the A9 column

8.dvd
8.dvd
8.dvd


and we should have A9 column but seems that it disappears due to this conflict.

Answer Source

I used \W+ as a fast and easy way to parse what you showed. Below I used something more specific to the actual delimiters you need.

txt = open('b.txt').read().split('\nEND')[0]
pd.read_csv(
    pd.io.common.StringIO(txt),
    sep=r'[\|, ,#,\,]+',
    skiprows=1,index_col=False, engine='python')

   A1  A2  A3  A4    A5  A6  A7     A8  A9
0   1   2   3   4  5.cd   6   7  8.dvd   9
1   1   2   3   4  5.cd   6   7  8.dvd   9
2   1   2   3   4  5.cd   6   7  8.dvd   9

However, I still think this is a cleaner way to do it. Here, I separate the parsing of the header from the parsing of the rest of the data. That way, I assume the data should only be using , as a separator.

txt = open('b.txt').read().split('END')[0]
_, h, txt = txt.split('\n', 2)
pat = r'[\|, ,#,\,]+'
names = re.split(pat, h.strip())

pd.read_csv(
    pd.io.common.StringIO(txt),
    names=names, header=None,
    engine='python')

   A1  A2  A3  A4    A5  A6  A7     A8  A9
0   1   2   3   4  5.cd   6   7  8.dvd   9
1   1   2   3   4  5.cd   6   7  8.dvd   9
2   1   2   3   4  5.cd   6   7  8.dvd   9
Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download