PhDeveloper - 8 months ago
Java Question

Weka: how to predict a new unseen instance using Java code?

I wrote WEKA Java code to train 4 classifiers. I saved the classifier models and want to use them to predict new unseen instances (think of it as someone who wants to test whether a tweet is positive or negative).

I used the StringToWordVector filter on the training data. To avoid the "Src and Dest differ in # of attributes" error, I used the following code to initialize the filter on the training data before applying it to the new instance, to predict whether the new instance is positive or negative. I just can't get it right.

Classifier cls = (Classifier) weka.core.SerializationHelper.read("models/myModel.model"); //load one of the trained classifiers

BufferedReader datafile = readDataFile("Tweets/tone1.ARFF"); //read training data

Instances data = new Instances(datafile);
data.setClassIndex(data.numAttributes() - 1);

Filter filter = new StringToWordVector(50);//keep 50 words
filter.setInputFormat(data); //initialize the filter on the training data
Instances filteredData = Filter.useFilter(data, filter);

// rebuild classifier

String testInstance= "Text that I want to use as an unseen instance and predict whether it's positive or negative";
System.out.println(">create test instance");
FastVector attributes = new FastVector(2);
attributes.addElement(new Attribute("text", (FastVector) null));

// Add class attribute.
FastVector classValues = new FastVector(2);
classValues.addElement("negative"); //assumed class labels; adjust to match your ARFF header
classValues.addElement("positive");

attributes.addElement(new Attribute("Tone", classValues));
// Create dataset with initial capacity of 100, and set index of class.
Instances tests = new Instances("test instance", attributes, 100);
tests.setClassIndex(tests.numAttributes() - 1);

Instance test = new Instance(2);
// Set value for message attribute
Attribute messageAtt = tests.attribute("text");
test.setValue(messageAtt, messageAtt.addStringValue(testInstance));
tests.add(test); //the instance must be added to the dataset before filtering


Filter filter2 = new StringToWordVector(50);
filter2.setInputFormat(tests); //a filter must be initialized before use
Instances filteredTests = Filter.useFilter(tests, filter2);

System.out.println(">train Test filter using training data");
Standardize sfilter = new Standardize(); //Match the number of attributes between src and dest.
sfilter.setInputFormat(filteredData); // initializing the filter with training set
filteredTests = Filter.useFilter(filteredData, sfilter); // create new test set

ArffSaver saver = new ArffSaver(); //save test data to an ARFF file
saver.setInstances(filteredTests);
File unseenFile = new File ("Tweets/unseen.ARFF");
saver.setFile(unseenFile);
saver.writeBatch();

When I try to standardize the input data using the filtered training data, I get a new ARFF file (unseen.ARFF), but it contains 2000 instances (the same number as the training data), and most of the values are negative. I don't understand why, or how to remove those instances.
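For reference, the usual way to avoid the "Src and Dest differ in # of attributes" problem in Weka is batch filtering: initialize a single StringToWordVector on the training data, then pass the test data through that same filter instance so both sets share one dictionary. Below is a minimal sketch of that idiom; the unseen-data file name `Tweets/unseen_raw.ARFF` is a placeholder, not a file from the question.

```java
import java.io.BufferedReader;
import java.io.FileReader;

import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class BatchFilterSketch {
    public static void main(String[] args) throws Exception {
        // Raw training data: a string attribute plus a nominal class
        Instances train = new Instances(new BufferedReader(new FileReader("Tweets/tone1.ARFF")));
        train.setClassIndex(train.numAttributes() - 1);

        // One filter, initialized once on the training data
        StringToWordVector filter = new StringToWordVector(50); // keep 50 words
        filter.setInputFormat(train);
        Instances filteredTrain = Filter.useFilter(train, filter);

        // Raw test data with the SAME raw structure as the training file
        Instances test = new Instances(new BufferedReader(new FileReader("Tweets/unseen_raw.ARFF")));
        test.setClassIndex(test.numAttributes() - 1);

        // Reusing the same initialized filter keeps the attribute sets identical
        Instances filteredTest = Filter.useFilter(test, filter);

        System.out.println(filteredTrain.numAttributes() == filteredTest.numAttributes());
    }
}
```

Alternatively, `weka.classifiers.meta.FilteredClassifier` wraps the filter and the classifier together, so raw string instances can be classified directly without filtering by hand.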

System.out.println(">Evaluation"); //without the following 2 lines I get ArrayIndexOutOfBoundException.
filteredData.setClassIndex(filteredData.numAttributes() - 1);
filteredTests.setClassIndex(filteredTests.numAttributes() - 1);

Evaluation eval = new Evaluation(filteredData);
eval.evaluateModel(cls, filteredTests);
System.out.println(eval.toSummaryString("\nResults\n======\n", false));

Printing the evaluation results, I want to see, for example, a percentage of how positive or negative this instance is, but instead I get the following. I also want to see 1 instance instead of 2000. Any help on how to do this would be great.

> Results

Correlation coefficient 0.0285
Mean absolute error 0.8765
Root mean squared error 1.2185
Relative absolute error 409.4123 %
Root relative squared error 121.8754 %
Total Number of Instances 2000
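If the goal is a positive/negative percentage for a single tweet, a classifier's `distributionForInstance` method returns per-class probabilities for one instance, rather than the 2000-row summary that `Evaluation` produces. A sketch, assuming `cls` is the trained classifier and `filteredTests` holds the single filtered test instance from the code above:

```java
import weka.classifiers.Classifier;
import weka.core.Instances;

public class ProbabilitySketch {
    // Print the probability of each class value for the first instance.
    public static void printDistribution(Classifier cls, Instances filteredTests) throws Exception {
        double[] dist = cls.distributionForInstance(filteredTests.instance(0));
        for (int i = 0; i < dist.length; i++) {
            System.out.printf("%s: %.1f%%%n",
                    filteredTests.classAttribute().value(i), 100 * dist[i]);
        }
    }
}
```

This prints one line per class value (e.g. positive/negative) with its probability, instead of aggregate error statistics over the whole dataset.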



I have reached a good solution, and here I share my code with you. It trains a classifier using WEKA Java code, then uses it to predict new unseen instances. Some parts, like paths, are hardcoded, but you can easily modify the method to take parameters.

    /**
     * This method performs classification of unseen instances.
     * It starts by training a model using a selection of classifiers,
     * then classifies new unlabeled instances.
     */

    public static void predict() throws Exception {
        // start by providing the paths for your training and testing ARFF files;
        // make sure both files have the same structure and exactly the same classes in the header

        //initialise classifier
        Classifier classifier = null;

        System.out.println("read training arff");

        Instances train = new Instances(new BufferedReader(new FileReader("Train.arff")));
        train.setClassIndex(0); // in my case the class is the first attribute, hence index zero; otherwise it's numAttributes() - 1

        System.out.println("read testing arff");
        Instances unlabeled = new Instances(new BufferedReader(new FileReader("Test.arff")));
        unlabeled.setClassIndex(0); // same class index as the training set

        // train using a collection of classifiers: NaiveBayes, SMO (a.k.a. SVM), kNN, and J48 decision trees
        String[] algorithms = {"nb", "smo", "knn", "j48"};
        for (int w = 0; w < algorithms.length; w++) {
            // pick ONE classifier per iteration; assigning all four in a row
            // would silently train only the last one
            switch (algorithms[w]) {
                case "nb":  classifier = new NaiveBayes(); break;
                case "smo": classifier = new SMO();        break;
                case "knn": classifier = new IBk();        break;
                case "j48": classifier = new J48();        break;
            }

            System.out.println("training using " + algorithms[w] + " classifier");

            Evaluation eval = new Evaluation(train);
            // perform 10-fold cross-validation
            eval.crossValidateModel(classifier, train, 10, new Random(1));
            System.out.println(eval.toSummaryString());
            System.out.println(eval.toClassDetailsString());

            // cross-validation does not leave a trained model behind,
            // so build the classifier on the full training set before predicting
            classifier.buildClassifier(train);

            Instances labeled = new Instances(unlabeled);

            // label instances (use the trained classifier to classify new unseen instances)
            for (int i = 0; i < unlabeled.numInstances(); i++) {
                double clsLabel = classifier.classifyInstance(unlabeled.instance(i));
                labeled.instance(i).setClassValue(clsLabel);
                System.out.println(clsLabel + " -> " + unlabeled.classAttribute().value((int) clsLabel));
            }

            // save the model for future use (once per classifier, not once per instance)
            ObjectOutputStream out = new ObjectOutputStream(
                    new FileOutputStream("myModel_" + algorithms[w] + ".dat"));
            out.writeObject(classifier);
            out.close();
            System.out.println("===== Saved model =====");
        }
    }
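To reuse a model saved this way later, deserialize it with plain Java I/O (Weka's `SerializationHelper.read` does the same job). A minimal sketch; the path passed to `load` is whichever file name was written above:

```java
import java.io.FileInputStream;
import java.io.ObjectInputStream;

import weka.classifiers.Classifier;

public class LoadModelSketch {
    // Load a classifier previously written with ObjectOutputStream
    public static Classifier load(String path) throws Exception {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(path))) {
            return (Classifier) in.readObject();
        }
    }
}
```

For example, `Classifier cls = LoadModelSketch.load("myModel.dat");` restores a classifier that can then be used with `classifyInstance` or `distributionForInstance`, as long as new instances are filtered into the same attribute space the model was trained on.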