I'm trying to predict whether or not a fan will turn out to a sporting event. My data (a pandas DataFrame) consists of fan information (demographics, etc.) and whether or not they attended each of the last 10 matches (g1_attend - g10_attend).
fan_info  age  neighborhood  g1_attend  g2_attend  ...  g1_neigh_turnout
2717      22   downtown      0          1          ...  .47
2219      67   east side     1          1          ...  .78
You are correct: you can't just add a new category (i.e. an output class) to a classifier -- this requires something that handles time series.
But there is a fairly standard technique for using a classifier on time series: assert (conditional) time independence, and use windowing.
In short, we are going to assume that whether or not someone attends a game depends only on the variables we have captured, and not on some other time factor (or any other factor). I.e. we assume we can shift their history of games attended around the year and the probability will still be the same. This is clearly wrong, because some people are going to avoid games in winter when it is too cold, etc. We do it anyway because machine learning techniques can cope with some noise in the data.
So now on to the classifier:
We have inputs, and we want just one output. So the basic idea is to train a model that, given as input whether they attended the first 9 games, predicts whether they will attend the 10th.
So our inputs are the fan information columns (see note 1) plus g1_attend, ..., g9_attend,
and the output is
g10_attend -- a binary value.
This gives us training data.
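As a rough sketch of what that training data could look like in pandas: the toy values below are made up, and one-hot encoding `neighborhood` with `pd.get_dummies` is my own assumption about how to feed a categorical column to a classifier, not something the question specifies.

```python
import pandas as pd

# Toy frame shaped like the one in the question (all values are made up).
df = pd.DataFrame({
    "age": [22, 67, 35],
    "neighborhood": ["downtown", "east side", "downtown"],
    **{f"g{i}_attend": [i % 2, 1, 0] for i in range(1, 11)},
})

# The window: attendance at games 1..9 is the input history.
history_cols = [f"g{i}_attend" for i in range(1, 10)]

# One-hot encode the categorical column so a classifier can consume it
# (an assumption; any sensible encoding would do).
X_train = pd.get_dummies(
    df[["age", "neighborhood"] + history_cols], columns=["neighborhood"]
)

# The single binary output: did they attend game 10?
y_train = df["g10_attend"]
```

`X_train` and `y_train` can then be handed to whatever classifier you like.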
Then when it is time to test it, we move everything across: switch g1_attend to g2_attend, g2_attend to g3_attend, and ... and g9_attend to g10_attend.
Our prediction output will then be for g11_attend.
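Moving everything across by one game might be done along these lines. `shift_window` is a hypothetical helper name, and relabelling the columns to slot numbers is just one way to make sure the trained model sees the same feature names at prediction time.

```python
import pandas as pd

def shift_window(df, first, last):
    """Return games first..last relabelled as slots 1..n (hypothetical helper)."""
    cols = [f"g{i}_attend" for i in range(first, last + 1)]
    out = df[cols].copy()
    # Rename so that the oldest game in the window is always slot 1.
    out.columns = [f"g{k}_attend" for k in range(1, len(cols) + 1)]
    return out

# Toy attendance history for two fans (made-up values).
df = pd.DataFrame({f"g{i}_attend": [i % 2, 1] for i in range(1, 11)})

X_train = shift_window(df, 1, 9)   # games 1..9, paired with target g10_attend
X_next = shift_window(df, 2, 10)   # games 2..10, used to predict the next game
```

Because both frames have identical column names, a model fitted on `X_train` can be applied to `X_next` directly.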
You can also train several models with different window sizes, e.g. looking only at the last 2 games to predict attendance at the 3rd.
This gives you a lot more training data, since you can do g1, g2 -> g3, then g2, g3 -> g4, etc. for each row.
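To illustrate how a smaller window multiplies the training data, here is a minimal sketch that stacks every (window of games -> next game) example from each row. `sliding_windows` is a name I made up, and the sketch only uses the attendance columns; demographic columns would need to be repeated alongside each window.

```python
import numpy as np
import pandas as pd

def sliding_windows(df, n_games, window):
    """Stack every (window games -> next game) example from each row."""
    X_parts, y_parts = [], []
    for start in range(1, n_games - window + 1):
        cols = [f"g{i}_attend" for i in range(start, start + window)]
        X_parts.append(df[cols].to_numpy())
        y_parts.append(df[f"g{start + window}_attend"].to_numpy())
    return np.vstack(X_parts), np.concatenate(y_parts)

# Toy attendance history for two fans (made-up values).
df = pd.DataFrame({f"g{i}_attend": [i % 2, 1] for i in range(1, 11)})

# Window of 2 over 10 games: 8 examples per fan instead of 1.
X, y = sliding_windows(df, n_games=10, window=2)
```

With 10 games and a window of 2, each fan contributes 8 training examples rather than a single one.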
You could train a bundle of different window sizes and merge the results with some ensemble technique.
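One simple way to merge the results is soft voting: average the attendance probabilities from each model and threshold the mean. This is only one choice of ensemble technique, and the probabilities below are invented for illustration rather than produced by real models.

```python
import numpy as np

def soft_vote(prob_by_model, threshold=0.5):
    """Average per-model attendance probabilities, then threshold the mean."""
    mean_prob = np.mean(np.vstack(prob_by_model), axis=0)
    return (mean_prob >= threshold).astype(int)

# Made-up probabilities from three hypothetical models, for four fans.
probs = [
    np.array([0.9, 0.2, 0.6, 0.4]),
    np.array([0.8, 0.1, 0.4, 0.5]),
    np.array([0.7, 0.3, 0.7, 0.3]),
]
votes = soft_vote(probs)
```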
In particular, it is a good idea to train on g1,...,g8 -> g9 and then use that to predict g10 (using g2,...,g9 as inputs) to check if it is working.
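A sketch of that check, assuming it means: fit on games 1..8 with game 9 as the target, then score predictions for game 10 made from games 2..9. `backtest` and `MajorityBaseline` are hypothetical names; the baseline is only a stand-in so the example runs on its own, and any object with `fit` and `predict` methods (a scikit-learn classifier, for instance) could be passed instead.

```python
import numpy as np
import pandas as pd

class MajorityBaseline:
    """Stand-in model: always predicts the most common training label."""
    def fit(self, X, y):
        self.label_ = int(np.round(np.mean(y)))
        return self

    def predict(self, X):
        return np.full(len(X), self.label_)

def backtest(df, model):
    """Train on g1..g8 -> g9, then report accuracy predicting g10 from g2..g9."""
    train_cols = [f"g{i}_attend" for i in range(1, 9)]   # g1..g8
    test_cols = [f"g{i}_attend" for i in range(2, 10)]   # g2..g9
    model.fit(df[train_cols].to_numpy(), df["g9_attend"].to_numpy())
    pred = model.predict(df[test_cols].to_numpy())
    return float(np.mean(pred == df["g10_attend"].to_numpy()))

# Toy attendance history for four fans (made-up values).
df = pd.DataFrame({f"g{i}_attend": [1, 1, 1, 0] for i in range(1, 11)})
accuracy = backtest(df, MajorityBaseline())
```

Since game 10 is already known, the accuracy tells you how well the windowing assumption holds before you trust a prediction for a game that has not happened yet.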
I suggest that in future you ask these questions on Cross Validated. While this may be on topic on Stack Overflow, it is more on topic there, and that site has a lot more statisticians and machine learning experts.
1. I suggest discarding fan_id for now as an input. I just don't think it will get you anywhere, but it is beyond this question to explain why.