mrh5028 - 1 year ago 109

Python Question

I need to take a CSV file and split the rows and have them cascade. The input CSV can have a varying amount of columns(always even), but will always be split the same way. I decided to use Pandas because with some files the output will be 500,000 rows and I thought it would speed things up.

Input:

`h1 h2 h3 h4 h5 h6`

A1 A2 A3 A4 A5 A6

B1 B2 B3 B4 B5 B6

Expected Output

`h1 h2 h3 h4 h5 h6`

A1 A2

A1 A2 A3 A4

A1 A2 A3 A4 A5 A6

B1 B2

B1 B2 B3 B4

B1 B2 B3 B4 B5 B6

I tried using the code below (cobbled together from some searching and my own edits) as you can see it is close, but not quite what I need.

`importFile = pd.read_csv('file.csv')`

df = df_importFile = pd.DataFrame(importFile)

index_cols = ['h1']

cols = [c for c in df if c not in index_cols]

df2 = df.set_index(index_cols).stack().reset_index(level=1, drop=True).to_frame('Value')

df2 = pd.concat([pd.Series([v if i % len(cols) == n else ''

for i, v in enumerate(df2.Value)], name=col)

for n, col in enumerate(cols)], axis=1).set_index(df2.index)

df2.to_csv('output.csv')

That gives the following

`h1 h2 h3 h4 h5 h6`

A1 A2

A1 A3

A1 A4

A1 A5

A1 A6

Answer Source

```
# take number of columns and divide by 2
# this is the number of pairs
pairs = df.shape[1] // 2
# np.repeat takes the number of rows and returns an object to slice
# the dataframe array df.values... then slice... result should be
# of length pairs * len(df)
a = df.values[np.repeat(np.arange(df.shape[0]), pairs)]
# row values to condition with as column vector
dim0 = (np.arange(a.shape[0]) % (pairs))[:, None ]
# column values to condition with as row vector
dim1 = np.repeat(np.arange(pairs), 2)
# boolean mask to use in np.where generated
# via the magic of numpy broadcasting
mask = dim0 >= dim1
# QED
pd.DataFrame(np.where(mask, a, ''), columns=df.columns)
```