James James - 6 months ago 42
Java Question

Classifier4j output is flawed

I'm working through a book on machine learning, and it gives an example of how to check string input to see whether a word is likely to be a misspelling of another word (different spellings of a celebrity's name in this case). After running the example, every output is either 0.0, .999, or 0.7071067811865475. I looked through the API and it's supposed to be able to give a range of values between 0 and 1, but I couldn't find anything that explains this issue. I know there are probably other tools out there that do the same thing, but I would like to get this tool working properly. Here is the code I used to test it.

import java.util.ArrayList;
import java.util.List;

import net.sf.classifier4J.ClassifierException;
import net.sf.classifier4J.vector.HashMapTermVectorStorage;
import net.sf.classifier4J.vector.TermVectorStorage;
import net.sf.classifier4J.vector.VectorClassifier;

public class BritneyDilemma {

    public BritneyDilemma() {
        // Candidate spellings to score against the correct one
        List<String> terms = new ArrayList<String>();
        terms.add("brittany spears");
        terms.add("brittney spears");
        terms.add("britany spears");
        terms.add("britny spears");
        terms.add("briteny spears");
        terms.add("britteny spears");
        terms.add("briney spears");
        terms.add("brittny spears");
        terms.add("brintey spears");
        terms.add("britanny spears");
        terms.add("britiny spears");
        terms.add("britnet spears");
        terms.add("britiney spears");
        terms.add("britney spears");
        terms.add("britney spearssssss");
        terms.add("britne spessssss");

        TermVectorStorage storage = new HashMapTermVectorStorage();
        VectorClassifier vc = new VectorClassifier(storage);
        String correctString = "britney spears";
        for (String term : terms) {
            try {
                // Teach the correct spelling, then score the candidate against it
                vc.teachMatch("britCatagory", correctString);
                double result = vc.classify("britCatagory", term);
                System.out.println(term + " = " + result);
            } catch (ClassifierException e) {
                e.printStackTrace();
            }
        }
    }

    public static void main(String[] args) {
        BritneyDilemma bd = new BritneyDilemma();
    }
}

VectorClassifier basically splits the incoming string into tokens (words) and checks whether they appear in the correct strings (i.e. the strings that were passed to the teachMatch method), also taking the frequency of these words into account. It does not measure how similar two words are to each other, so a misspelled word simply counts as a non-match. You can check this yourself by looking at its source code.

You have the correct last name "spears" in every string, and the first name "britney" misspelled in most of the strings, so VectorClassifier finds one matching word and one non-matching word. For the string "britney spearssssss" it also finds one matching word (the first name "britney" in this case) and one non-matching word (the last name). That's why VectorClassifier gives the same result, 0.7071067811865475, for all of these strings.

For the string that matches exactly ("britney spears"), it gives the best score, which is close to 1.

For the string that has no matching words ("britne spessssss") it gives zero.
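
To make the three cases above concrete, here is a minimal sketch of whole-word matching scored with cosine similarity over word counts. This is not Classifier4J's actual code: the class TokenCosineSketch and its methods wordCounts and score are made-up names, and the use of cosine similarity is an assumption on my part, although it is consistent with the 0.7071067811865475 you are seeing (one matching word out of two gives 1/sqrt(2)).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not part of Classifier4J.
public class TokenCosineSketch {

    // Count how often each whitespace-separated word occurs in the text.
    static Map<String, Integer> wordCounts(String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String word : text.toLowerCase().trim().split("\\s+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Cosine between the taught string's word counts and the input's
    // counts for those same words. Words are compared as whole tokens,
    // so a misspelled word contributes nothing.
    static double score(String taught, String input) {
        Map<String, Integer> taughtCounts = wordCounts(taught);
        Map<String, Integer> inputCounts = wordCounts(input);
        double dot = 0, taughtNorm = 0, inputNorm = 0;
        for (Map.Entry<String, Integer> e : taughtCounts.entrySet()) {
            int t = e.getValue();
            int i = inputCounts.getOrDefault(e.getKey(), 0);
            dot += t * i;
            taughtNorm += t * t;
            inputNorm += i * i;
        }
        if (taughtNorm == 0 || inputNorm == 0) {
            return 0.0; // no shared words at all
        }
        return dot / (Math.sqrt(taughtNorm) * Math.sqrt(inputNorm));
    }

    public static void main(String[] args) {
        String taught = "britney spears";
        String[] inputs = {
            "britney spears", "brittany spears",
            "britney spearssssss", "britne spessssss"
        };
        for (String input : inputs) {
            System.out.println(input + " = " + score(taught, input));
        }
    }
}
```

With only two taught words, a scheme like this can only ever produce three distinct values: roughly 1 when both words match, about 0.707 when exactly one matches, and 0 when neither does. Getting a graded score for misspellings would need a character-level comparison instead of whole-word matching.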