Matthew Brugman Matthew Brugman - 1 month ago 4x
Python Question

Joining hundreds of columns and column indexes into a string

I've got what I hope is a unique/interesting problem for my first question on Stack Overflow!

I have data on skills assessments, currently in a very large pandas dataframe. Each row represents a student, and each column contains their scores for a particular skills assessment. There are about 200 skill assessments in total, with each student having a score in only a small subset of these assessments (1 - 20 scores is typical, but some students have more).

Example dataframe structure:

id skill1 skill2 skill3 skill4 skill5 ....
1 10 50 NaN 3 NaN
2 Nan 10 2 70 NaN
3 23 NaN 45 NaN 5

I am attempting to get this data transformed into a space-delimited string for each student, in the following format, so that we can import it into a different datastore:

skill1:10 skill2:50 skill4:3
skill2:10 skill3:2 skill4:70

(notice how skills without assessments scores don't get added to the list)

I've created a lambda function to join all of these skill values with their column labels:

skillmerge = lambda row: ' '.join([str(row.index[i])+':'+str(row[i]) for i in range(0,len(row)) if row[i]!=np.nan])

When I created a single series (1 student) to test on, the lambda function takes less than a second to create the output string in my desired format. However, when I create a dataframe with just 2 rows (again for testing purposes), the function takes several minutes just to complete those 2 rows:

testing_df['combined_skills'] = testing_df.apply(skillmerge, axis=1)

Seeing as how I have a couple million students in this dataset, I am looking for way to make this process reliably work faster. Any thoughts on where I can fix this?

Thanks in advance for helping with my first SO question! :D


Using to_json then fixing it

def to_str(x):
    return x.dropna().to_json(double_precision=0) \
            .replace('"', '').replace(',', ' ').strip("{}")


Or using list comprehension and join

def to_str(x):
    return " ".join(["{}:{}".format(k, int(v)) for k, v in x.dropna().iteritems()])


Both give

1    skill1:10 skill2:50 skill4:3
2    skill2:10 skill3:2 skill4:70
3    skill1:23 skill3:45 skill5:5
dtype: object

Making your solution work

skillmerge = lambda row: ' '.join([str(row.index[i])+':'+str(row[i]) for i in range(len(row)) if not np.isnan(row[i])])


Notice that np.nan == np.nan evaluates to False. In order to test for np.nan use np.isnan or pd.isnull or pd.notnull. This fact was throwing off your solution. I replaced it with not np.isnan and it works.

I took the opportunity to do what I'd do because I like it better.