mrh5028 mrh5028 - 1 year ago 77
Python Question

Dynamically Splitting Rows in dataframe

I need to take a CSV file and split the rows and have them cascade. The input CSV can have a varying amount of columns(always even), but will always be split the same way. I decided to use Pandas because with some files the output will be 500,000 rows and I thought it would speed things up.


h1 h2 h3 h4 h5 h6
A1 A2 A3 A4 A5 A6
B1 B2 B3 B4 B5 B6

Expected Output

h1 h2 h3 h4 h5 h6
A1 A2
A1 A2 A3 A4
A1 A2 A3 A4 A5 A6
B1 B2
B1 B2 B3 B4
B1 B2 B3 B4 B5 B6

I tried using the code below (cobbled together from some searching and my own edits) as you can see it is close, but not quite what I need.

importFile = pd.read_csv('file.csv')
df = df_importFile = pd.DataFrame(importFile)

index_cols = ['h1']
cols = [c for c in df if c not in index_cols]

df2 = df.set_index(index_cols).stack().reset_index(level=1, drop=True).to_frame('Value')

df2 = pd.concat([pd.Series([v if i % len(cols) == n else ''
for i, v in enumerate(df2.Value)], name=col)
for n, col in enumerate(cols)], axis=1).set_index(df2.index)


That gives the following

h1 h2 h3 h4 h5 h6
A1 A2
A1 A3
A1 A4
A1 A5
A1 A6

Answer Source
# take number of columns and divide by 2
# this is the number of pairs
pairs = df.shape[1] // 2

# np.repeat takes the number of rows and returns an object to slice
# the dataframe array df.values... then slice... result should be 
# of length pairs * len(df)
a = df.values[np.repeat(np.arange(df.shape[0]), pairs)]

# row values to condition with as column vector
dim0 = (np.arange(a.shape[0]) % (pairs))[:, None ]

# column values to condition with as row vector
dim1 = np.repeat(np.arange(pairs), 2)

# boolean mask to use in np.where generated
# via the magic of numpy broadcasting
mask = dim0 >= dim1

pd.DataFrame(np.where(mask, a, ''), columns=df.columns)

enter image description here