srytoomanyquestions - 1 year ago
Python Question

SVM; training data doesn't contain target

I'm trying to predict whether a fan is going to turn out to a sporting event or not. My data (pandas DataFrame) consists of fan information (demographics, etc.), and whether or not they attended the last 10 matches (g1_attend - g10_attend).

fan_info  age  neighborhood  g1_attend  g2_attend  ...  g1_neigh_turnout
2717      22   downtown      0          1          ...  .47
2219      67   east side     1          1          ...  .78

How can I predict if they're going to attend g11_attend, when g11_attend doesn't exist in the DataFrame?

Originally, I was going to look into applying some of the basic models in scikit-learn for classification, and possibly just add a g11_attend column into the DataFrame. This all has me quite confused for some reason. I'm thinking now that it would be more appropriate to treat this as a time-series, and was looking into other models.

Answer Source

You are correct: you can't just add a new category (i.e. output class) to a classifier -- this requires something that handles time series.

But there is a fairly standard technique for using a classifier on time series: asserting (conditional) time independence, and using windowing.

In short, we are going to assume that whether or not someone attends a game depends only on the variables we have captured, and not on some other time factor (or any other factor in general). That is, we assume we can shift their history of games attended to any point in the year and the probability will stay the same. This is clearly wrong (some people will avoid games in winter because it is too cold, for example), but we do it anyway because machine learning techniques can cope with some noise in the data.

So now on to the classifier:

We have inputs, and we want just one output. So the basic idea is to train a model that, given whether they attended the first 9 games as input, predicts whether they will attend the 10th.

So our inputs are [1] age, neighborhood, g1_attend, g2_attend, ..., g9_attend, and the output is g10_attend -- a binary value.

This gives us training data.

Then, when it is time to test, we move everything across: switch g1_attend for g2_attend, g2_attend for g3_attend, and so on up to g9_attend for g10_attend. The prediction output will then be for g11_attend.
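The train-then-shift idea can be sketched roughly like this. It is only an illustration under several assumptions: the data is randomly generated, LogisticRegression is my choice of classifier (any scikit-learn classifier would slot in), the helper `window` is a made-up name, and neighborhood is left out because it would first need encoding (for example with `pd.get_dummies`).

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the fan DataFrame (all values are made up).
rng = np.random.default_rng(0)
n_fans = 200
df = pd.DataFrame({"age": rng.integers(18, 80, size=n_fans)})
for g in range(1, 11):
    df[f"g{g}_attend"] = rng.integers(0, 2, size=n_fans)

def window(frame, first, last):
    """Return age plus attendance for games first..last as a plain array.

    Converting to an array drops the column names, so the model only sees
    positions (oldest game in the window, ..., most recent game).
    """
    cols = ["age"] + [f"g{g}_attend" for g in range(first, last + 1)]
    return frame[cols].to_numpy()

# Train: games 1-9 are the inputs, game 10 is the target.
model = LogisticRegression(max_iter=1000)
model.fit(window(df, 1, 9), df["g10_attend"])

# Predict: slide the window one game forward, so games 2-10 are the inputs
# and the output is read as attendance at game 11.
g11_pred = model.predict(window(df, 2, 10))
g11_proba = model.predict_proba(window(df, 2, 10))[:, 1]
```

Passing arrays rather than named columns is deliberate here: the model should treat "the 9th column" as "the most recent game", whichever game that happens to be.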

You can also train several models with different window sizes, e.g. looking only at the last 2 games to predict attendance at the 3rd. This gives you a lot more training data, since you can use g1,g2->g3, g2,g3->g4, etc. for each row.
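To make the windowing concrete, here is a small sketch of how one fan's history turns into several training examples. The function name `sliding_windows` and the example history are made up for illustration.

```python
def sliding_windows(attendance, window_size):
    """Turn one fan's attendance history into (inputs, target) pairs.

    attendance  : list of 0/1 values, oldest game first
    window_size : number of past games used as inputs
    """
    pairs = []
    for start in range(len(attendance) - window_size):
        inputs = attendance[start:start + window_size]
        target = attendance[start + window_size]
        pairs.append((inputs, target))
    return pairs

# Hypothetical history for games g1..g10.
history = [0, 1, 1, 0, 1, 1, 1, 0, 1, 1]

# Window of 2: g1,g2->g3, g2,g3->g4, ..., g8,g9->g10.
pairs = sliding_windows(history, window_size=2)
```

With 10 games and a window of 2, each fan contributes 8 training rows instead of 1, which is where the extra training data comes from.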

You could train a bundle of different window sizes and merge the results with some ensemble technique.
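One simple way to merge the results is to average the predicted probabilities across the models. The sketch below assumes that choice (it is just one possible ensemble technique), uses made-up data, and again uses LogisticRegression as a placeholder classifier; `fit_for_window` is a hypothetical helper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up attendance matrix: one row per fan, columns are g1..g10.
rng = np.random.default_rng(1)
habit = rng.random(300)  # hypothetical per-fan attendance rate
games = (rng.random((300, 10)) < habit[:, None]).astype(int)

def fit_for_window(games, w):
    """Stack every w-game window (inputs) with the game that follows it (target)."""
    n_games = games.shape[1]
    X = np.vstack([games[:, s:s + w] for s in range(n_games - w)])
    y = np.concatenate([games[:, s + w] for s in range(n_games - w)])
    return LogisticRegression(max_iter=1000).fit(X, y)

window_sizes = [2, 4, 9]
models = {w: fit_for_window(games, w) for w in window_sizes}

# Each model scores g11 from the most recent w games.
probas = [models[w].predict_proba(games[:, -w:])[:, 1] for w in window_sizes]

# Merge by averaging the probabilities, then threshold at 0.5.
g11_proba = np.mean(probas, axis=0)
g11_pred = (g11_proba >= 0.5).astype(int)
```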

In particular, it is a good idea to train g1,...,g8->g9, and then use that to predict g10 (using g2,...,g9 as inputs) to check whether it is working.
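That sanity check might look something like the following. The data is synthetic, LogisticRegression is again only a placeholder classifier, and comparing against a majority-class baseline is my own suggestion for judging the number.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Made-up attendance matrix: one row per fan, columns are g1..g10.
rng = np.random.default_rng(0)
habit = rng.random(300)  # hypothetical per-fan attendance rate
games = (rng.random((300, 10)) < habit[:, None]).astype(int)

# Train on g1..g8 -> g9.
model = LogisticRegression(max_iter=1000)
model.fit(games[:, 0:8], games[:, 8])

# Check on g2..g9 -> g10, where the true answer is known.
predicted_g10 = model.predict(games[:, 1:9])
accuracy = accuracy_score(games[:, 9], predicted_g10)

# Compare against always guessing the more common outcome for g10.
g10_rate = games[:, 9].mean()
baseline = max(g10_rate, 1 - g10_rate)
print(f"accuracy={accuracy:.2f}  baseline={baseline:.2f}")
```

If the accuracy is not clearly above the baseline, the shifted-window assumption (or the chosen features) is probably not holding up for your data.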

In future, I suggest you ask questions like this on Cross Validated. While this may be on topic on Stack Overflow, it is more on topic there, and that site has a lot more statisticians and machine learning experts.

[1] I suggest discarding fan_id for now as an input. I just don't think it will get you anywhere, but it is beyond the scope of this question to explain why.
