Matthew Brugman - 1 year ago 56

Python Question

I've got what I hope is a unique/interesting problem for my first question on Stack Overflow!

I have data on skills assessments, currently in a very large pandas dataframe. Each row represents a student, and each column contains their scores for a particular skills assessment. There are about 200 skill assessments in total, with each student having a score in only a small subset of these assessments (1 - 20 scores is typical, but some students have more).

Example dataframe structure:

`id skill1 skill2 skill3 skill4 skill5 ....`

1 10 50 NaN 3 NaN

2 Nan 10 2 70 NaN

3 23 NaN 45 NaN 5

I am attempting to get this data transformed into a space-delimited string for each student, in the following format, so that we can import it into a different datastore:

`skill1:10 skill2:50 skill4:3`

skill2:10 skill3:2 skill4:70

(notice how skills without assessments scores don't get added to the list)

I've created a lambda function to join all of these skill values with their column labels:

`skillmerge = lambda row: ' '.join([str(row.index[i])+':'+str(row[i]) for i in range(0,len(row)) if row[i]!=np.nan])`

When I created a single series (1 student) to test on, the lambda function takes less than a second to create the output string in my desired format. However, when I create a dataframe with just 2 rows (again for testing purposes), the function takes several minutes just to complete those 2 rows:

`testing_df['combined_skills'] = testing_df.apply(skillmerge, axis=1)`

Seeing as how I have a couple million students in this dataset, I am looking for way to make this process reliably work faster. Any thoughts on where I can fix this?

Thanks in advance for helping with my first SO question! :D

Answer Source

Using `to_json`

then fixing it

```
def to_str(x):
return x.dropna().to_json(double_precision=0) \
.replace('"', '').replace(',', ' ').strip("{}")
df.T.apply(to_str)
```

Or using list comprehension and `join`

```
def to_str(x):
return " ".join(["{}:{}".format(k, int(v)) for k, v in x.dropna().iteritems()])
df.T.apply(to_str)
```

Both give

```
id
1 skill1:10 skill2:50 skill4:3
2 skill2:10 skill3:2 skill4:70
3 skill1:23 skill3:45 skill5:5
dtype: object
```

*Making your solution work*

```
skillmerge = lambda row: ' '.join([str(row.index[i])+':'+str(row[i]) for i in range(len(row)) if not np.isnan(row[i])])
df.T.apply(skillmerge)
```

Notice that `np.nan == np.nan`

evaluates to `False`

. In order to test for `np.nan`

use `np.isnan`

or `pd.isnull`

or `pd.notnull`

. This fact was throwing off your solution. I replaced it with `not np.isnan`

and it works.

I took the opportunity to do what I'd do because I like it better.