user1499144 - 5 months ago 34

Python Question

I'm using scikit-learn in Python to develop a classification algorithm to predict the gender of certain customers. Among others, I want to use the Naive Bayes classifier, but my problem is that I have a mix of categorical data (e.g. "Registered online", "Accepts email notifications") and continuous data (e.g. "Age", "Length of membership"). I haven't used scikit-learn much before, but I suppose that Gaussian Naive Bayes is suitable for continuous data and that Bernoulli Naive Bayes can be used for categorical data. However, since I want to have **both** categorical and continuous data in my model, I don't really know how to handle this. Any ideas would be much appreciated!

Answer

You have at least two options:

Transform all your data into a categorical representation by computing percentiles for each continuous variable and then binning the continuous variables using the percentiles as bin boundaries. For instance, for the height of a person, create the bins "very small", "small", "regular", "big" and "very big", ensuring that each bin contains approximately 20% of the population of your training set. We don't have any utility to perform this automatically in scikit-learn, but it should not be too complicated to do it yourself. Then fit a single multinomial NB on this categorical representation of your data.
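A minimal sketch of this first option, assuming made-up toy data (the `age` and `registered_online` columns, the labels `y` and the `one_hot` helper are all hypothetical, not from the question). It bins by hand with NumPy percentiles and one-hot encodes the result so that the multinomial NB sees non-negative indicator counts:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)

# Hypothetical toy data: one continuous column and one binary column.
n = 200
age = rng.normal(40, 12, size=n)
registered_online = rng.integers(0, 2, size=n)
y = (age + 10 * registered_online + rng.normal(0, 5, size=n) > 45).astype(int)

# Interior bin edges at the 20th, 40th, 60th and 80th percentiles,
# so each of the 5 bins holds roughly 20% of the training set.
edges = np.percentile(age, [20, 40, 60, 80])
age_bin = np.digitize(age, edges)  # integer codes 0..4


def one_hot(codes, n_values):
    """Return a 0/1 indicator matrix with one column per category."""
    return np.eye(n_values, dtype=int)[codes]


# One categorical representation for the whole dataset.
X_cat = np.hstack([one_hot(age_bin, 5), one_hot(registered_online, 2)])

clf = MultinomialNB().fit(X_cat, y)
pred = clf.predict(X_cat)
```

At prediction time you would reuse the `edges` computed on the training set rather than recomputing them. More recent scikit-learn releases also ship `KBinsDiscretizer` (with `strategy="quantile"`) and `CategoricalNB`, which may replace the manual steps above.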

Independently fit a Gaussian NB model on the continuous part of the data and a multinomial NB model on the categorical part. Then transform the whole dataset by taking the class assignment probabilities (with the `predict_proba` method) as new features, `np.hstack((multinomial_probas, gaussian_probas))`, and then refit a new model (e.g. a new Gaussian NB) on the new features.
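The second option could look roughly like the sketch below. The data is synthetic and the names `X_cont`, `X_bin` and `X_onehot` are assumptions for illustration only; the binary columns are expanded into explicit 0/1 indicator pairs before being passed to the multinomial NB:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB

rng = np.random.default_rng(0)

# Hypothetical toy data with two classes.
n = 300
y = rng.integers(0, 2, size=n)
# Continuous features whose mean shifts with the class.
X_cont = rng.normal(loc=y[:, None] * 1.5, scale=1.0, size=(n, 2))
# Binary features that are 1 more often for class 1.
X_bin = (rng.random((n, 3)) < (0.3 + 0.4 * y[:, None])).astype(int)
# Indicator columns for both values of each binary feature.
X_onehot = np.hstack([X_bin, 1 - X_bin])

# Stage 1: one model per feature type, fitted independently.
gnb = GaussianNB().fit(X_cont, y)
mnb = MultinomialNB().fit(X_onehot, y)

# Class assignment probabilities become the new features.
gaussian_probas = gnb.predict_proba(X_cont)
multinomial_probas = mnb.predict_proba(X_onehot)
X_stacked = np.hstack((multinomial_probas, gaussian_probas))

# Stage 2: refit a new model on the stacked probabilities.
final = GaussianNB().fit(X_stacked, y)
pred = final.predict(X_stacked)
```

Note that this sketch computes the stage 1 probabilities on the same rows the models were trained on, which can make the second stage look better than it is. Generating them out of fold (for example with `sklearn.model_selection.cross_val_predict` and `method="predict_proba"`) is probably a safer choice on real data.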