Trexion Kameha - 1 month ago
Python Question

Python: Storing multiple dictionaries after replacing categoricals with integers

My data looks like this:

source  browser  sex  age  country        class
SEO     Chrome   M    39   Japan          0
Ads     Chrome   F    53   United States  0
SEO     Opera    M    53   United States  1
SEO     Safari   M    41   NULL           0
Ads     Safari   M    45   United States  0
Ads     Chrome   M    18   Canada         0


In trying to get it ready for machine learning, I wrote a function to replace categoricals with integers:

def str2int(data):
    y2 = data
    S = set(y2)  # unique category strings
    D = dict(zip(S, range(len(S))))  # assign each string an integer, and put it in a dict
    Y = [D[y2_] for y2_ in y2]  # store class labels as ints
    return Y


I then call it using the below to convert all string columns to integers:

cols = ['sex', 'browser', 'country', 'source']
for col in cols:
    df_fraud[col] = str2int(df_fraud[col])


I would like to store the dictionary associated with each column so that I can use it later. I could simply write "return Y, D" in the function, but I am not sure how I would handle the extra return value in my for loop above.

Frankly, I am not sure what the best way to store these mappings in dictionaries is, and I am open to suggestions.

I have simplified the example below:

This is not working when using the suggested code. Any ideas?

def str2int(data):
    y2 = data
    S = set(y2)  # unique category strings
    D = dict(zip(S, range(len(S))))  # assign each string an integer, and put it in a dict
    Y = [D[y2_] for y2_ in y2]  # store class labels as ints
    return Y, D

def make_str2int(data):
    categories = set(data)
    return dict(zip(categories, range(len(categories))))

raw_data = {
    'names': ['A','B','B','D','D','E','B','B','E','F'],
    'gender': ['M','F','F','F','F','M','M','M','M','M']}

str2int = {}
cols = ['names', 'gender']
for col in cols:
    str2int[col] = make_str2int(df_fraud[col])

bli
Answer

I haven't tested this, and I'm not sure I understand exactly how you intend to use the dictionaries, but here are my suggestions.

You could store the dictionaries in a dictionary of dictionaries:

def make_str2int(data):
    categories = set(data)
    return dict(zip(categories, range(len(categories))))

str2int = {}
cols = ['sex', 'browser', 'country', 'source']
for col in cols:
    str2int[col] = make_str2int(df_fraud[col])

(This assumes df_fraud represents your table; you didn't make this clear in your question.)

And then, if you want the categories that exist in a column col, you can call:

str2int[col].keys()

If you want the corresponding numbers:

str2int[col].values()

If you want the number associated with a categorical value cat_val in a known column col:

str2int[col][cat_val]
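If you also want the encoded column back, as the "return Y, D" idea in the question suggests, tuple unpacking in the loop can collect both results. This is only a sketch: the `table` dict is a hypothetical stand-in for df_fraud (using the sex and source values from the question), `encoded` and `mappings` are names I made up, and I sort the categories, which the original function does not do, so that the numbering is reproducible.

```python
def str2int(data):
    """Return (encoded values, mapping) for one column of categorical values."""
    # Sorting is an addition here: it makes the integer assignment reproducible.
    categories = sorted(set(data))
    mapping = {cat: i for i, cat in enumerate(categories)}
    return [mapping[value] for value in data], mapping

# Hypothetical stand-in for df_fraud: column name -> list of values
table = {
    'sex': ['M', 'F', 'M', 'M', 'M', 'M'],
    'source': ['SEO', 'Ads', 'SEO', 'SEO', 'Ads', 'Ads'],
}

encoded = {}   # column name -> list of ints
mappings = {}  # column name -> {category string: int}
for col in ['sex', 'source']:
    # Unpack the two return values into the two dictionaries
    encoded[col], mappings[col] = str2int(table[col])

print(encoded['sex'])      # [1, 0, 1, 1, 1, 1]
print(mappings['source'])  # {'Ads': 0, 'SEO': 1}
```

Keeping the mappings in their own dictionary, separate from the encoded data, means they can be looked up later by column name exactly as shown above.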

Edit: Applying on your raw_data example

def make_str2int(data):
    categories = set(data) 
    return dict(zip(categories, range(len(categories))))

raw_data = {
    'names': ['A','B','B','D','D','E','B','B','E','F'],
    'gender': ['M','F','F','F','F','M','M','M','M','M']} 

str2int={}
cols = raw_data.keys()
for col in cols:
    str2int[col] = make_str2int(raw_data[col])

print("Conversion examples:")

element = raw_data['names'][3]
print("%s -> %s" % (element, str2int['names'][element]))

element = raw_data['gender'][2]
print("%s -> %s" % (element, str2int['gender'][element]))

Output (the exact integers may differ, since set iteration order is not guaranteed):

Conversion examples:
D -> 3
F -> 1
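If you later need to go back from the integers to the original strings, each stored mapping can be inverted. Again just a sketch under assumptions: `int2str`, `encoded_names` and `decoded_names` are hypothetical names, and make_str2int here sorts the categories (unlike the version above) so that the integers shown in the comments are stable.

```python
raw_data = {
    'names': ['A', 'B', 'B', 'D', 'D', 'E', 'B', 'B', 'E', 'F'],
    'gender': ['M', 'F', 'F', 'F', 'F', 'M', 'M', 'M', 'M', 'M'],
}

def make_str2int(data):
    # Sorted variant of the function above, so the numbering is reproducible
    categories = sorted(set(data))
    return {cat: i for i, cat in enumerate(categories)}

# Forward mappings: column -> {category string: int}
str2int = {col: make_str2int(values) for col, values in raw_data.items()}

# Inverse mappings: column -> {int: category string}
int2str = {
    col: {i: cat for cat, i in mapping.items()}
    for col, mapping in str2int.items()
}

encoded_names = [str2int['names'][v] for v in raw_data['names']]
decoded_names = [int2str['names'][i] for i in encoded_names]

print(encoded_names)  # [0, 1, 1, 2, 2, 3, 1, 1, 3, 4]
print(decoded_names == raw_data['names'])  # True
```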