user6461192 user6461192 - 1 month ago 12
Python Question

Getting a list of suffixes from the company names

I have a data frame df with a column name - Company. Few examples of the company names are: ABC Inc., XYZ Gmbh, PQR Ltd, JKL Limited etc. I want a list of all the suffixes (Inc.,Gmbh, Ltd., Limited etc). Please notice that suffix length is always different. There might be companies without any suffix, for example: Apple. I need a complete list of all suffixes from the all the company names, keeping only unique suffixes in the list.

How do I accomplish this task?

Answer

try this:

In [36]: df
Out[36]:
         Company
0         Google
1      Apple Inc
2  Microsoft Inc
3       ABC Inc.
4       XYZ Gmbh
5        PQR Ltd
6    JKL Limited

In [37]: df.Company.str.extract(r'\s+([^\s]+$)', expand=False).dropna().unique()
Out[37]: array(['Inc', 'Inc.', 'Gmbh', 'Ltd', 'Limited'], dtype=object)

or ignoring punctuation:

In [38]: import string

In [39]: df.Company.str.replace('['+string.punctuation+']+','')
Out[39]:
0           Google
1        Apple Inc
2    Microsoft Inc
3          ABC Inc
4         XYZ Gmbh
5          PQR Ltd
6      JKL Limited
Name: Company, dtype: object

In [40]: df.Company.str.replace('['+string.punctuation+']+','').str.extract(r'\s+([^\s]+$)', expand=False).dropna().unique()
Out[40]: array(['Inc', 'Gmbh', 'Ltd', 'Limited'], dtype=object)

export result into Excel file:

data = df.Company.str.replace('['+string.punctuation+']+','').str.extract(r'\s+([^\s]+$)', expand=False).dropna().unique()
res = pd.DataFrame(data, columns=['Comp_suffix'])
res.to_excel(r'/path/to/file.xlsx', index=False)