Cole Robertson - 6 months ago 43

Python Question

I've looked but seem to be coming up dry for an answer to the following question.

I have a pandas dataframe analogous to this (call it 'df'):

`Type Set`

1 theGreen Z

2 andGreen Z

3 yellowRed X

4 roadRed Y

I want to add another column to the dataframe (or generate a series) of the same length as the dataframe (= equal number of records/rows) which assigns a numerical coding variable (1) if the Type contains the string "Green", (0) otherwise.

Essentially, I'm trying to find a way of doing this:

`df['color'] = np.where(df['Type'] == 'Green', 1, 0)`

Except instead of the usual numpy operators (<,>,==,!=, etc.) I need a way of saying "in" or "contains". Is this possible? Any and all help appreciated!

Answer

Use `str.contains`

:

```
df['color'] = np.where(df['Type'].str.contains('Green'), 1, 0)
print (df)
Type Set color
1 theGreen Z 1
2 andGreen Z 1
3 yellowRed X 0
4 roadRed Y 0
```

Another solution with `apply`

:

```
df['color'] = np.where(df['Type'].apply(lambda x: 'Green' in x), 1, 0)
print (df)
Type Set color
1 theGreen Z 1
2 andGreen Z 1
3 yellowRed X 0
4 roadRed Y 0
```

Second solution is faster, but doesn't work with `NaN`

in column `Type`

, then return `error`

:

TypeError: argument of type 'float' is not iterable

**Timings**:

```
#[400000 rows x 4 columns]
df = pd.concat([df]*100000).reset_index(drop=True)
In [276]: %timeit df['color'] = np.where(df['Type'].apply(lambda x: 'Green' in x), 1, 0)
10 loops, best of 3: 94.1 ms per loop
In [277]: %timeit df['color1'] = np.where(df['Type'].str.contains('Green'), 1, 0)
1 loop, best of 3: 256 ms per loop
```