ShanZhengYang ShanZhengYang -4 years ago 213
Python Question

How to sort ill-formatted pandas Dataframe columns?

I have a pandas Dataframe with over 10000 columns. These columns need to be sorted in sequential order. Normally this is quite straightforward:

import numpy as np
import pandas as pd
df = pd.read_csv("...*.csv")
df.reindex_axis(sorted(df.columns), axis=1) # sort the columns of the dataframe

For my Dataframe
, each of these columns is a string in the following format:



EDIT: As an example dataframe,

df = pd.DataFrame([[1,2,3,4, 5], [5, 6, 7, 8, 9], [9, 10, 11, 12, 13]],columns=['sample1_5','sample10_5','sample100_5','sample1_50', 'sample111_500'])

sample1_5 sample10_5 sample100_5 sample1_50 sample111_500
0 1 2 3 4 5
1 5 6 7 8 9
2 9 10 11 12 13

The problem I have is the classic issue with sorting values: At the moment,
comes before

The same occurs for the "SECONDNUMBER", i.e. 10000 comes before 1, before 10, etc.
That is,
is currently sorted as ahead of

How would I properly format these columns such that both "FIRSTNUMER" and "SECONDNUMBER" are both sorted sequentially? I believe instead of
, this column must be of the format
. Manually re-annotating these column names for a dataset this size is not feasibly; a for loop parsing each number may be algorithmically difficult.

Is there a pandas-friendly way to re-format these columns correctly? (Or perhaps I'm using sort incorrectly?)

Answer Source
v = df.columns.str.extract('^\D+(\d+)_(\d+)$', expand=True).values.astype(int)

df.iloc[:, np.lexsort(v.T[::-1])]

enter image description here

Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download