ShanZhengYang ShanZhengYang -4 years ago 213
Python Question

How to sort ill-formatted pandas Dataframe columns?

I have a pandas Dataframe with over 10000 columns. These columns need to be sorted in sequential order. Normally this is quite straightforward:

import numpy as np
import pandas as pd
df = pd.read_csv("...*.csv")
df.reindex_axis(sorted(df.columns), axis=1) # sort the columns of the dataframe


For my Dataframe
df
, each of these columns is a string in the following format:

sampleFIRSTNUMBER_SECONDNUMBER


e.g.
sample42_5864183439
,
sample3_8976711222


EDIT: As an example dataframe,

df = pd.DataFrame([[1,2,3,4, 5], [5, 6, 7, 8, 9], [9, 10, 11, 12, 13]],columns=['sample1_5','sample10_5','sample100_5','sample1_50', 'sample111_500'])
print(df)

sample1_5 sample10_5 sample100_5 sample1_50 sample111_500
0 1 2 3 4 5
1 5 6 7 8 9
2 9 10 11 12 13


The problem I have is the classic issue with sorting values: At the moment,
sample10_####
comes before
sample1_####
.

The same occurs for the "SECONDNUMBER", i.e. 10000 comes before 1, before 10, etc.
That is,
chr10_10001
is currently sorted as ahead of
chr10_11
.

How would I properly format these columns such that both "FIRSTNUMER" and "SECONDNUMBER" are both sorted sequentially? I believe instead of
sample1_5
, this column must be of the format
sample01_000005
. Manually re-annotating these column names for a dataset this size is not feasibly; a for loop parsing each number may be algorithmically difficult.

Is there a pandas-friendly way to re-format these columns correctly? (Or perhaps I'm using sort incorrectly?)

Answer Source
v = df.columns.str.extract('^\D+(\d+)_(\d+)$', expand=True).values.astype(int)

df.iloc[:, np.lexsort(v.T[::-1])]

enter image description here

Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download