Zteve Zteve - 1 year ago 36
Java Question

Lucene Index problems with "-" character

I'm having trouble with a Lucene index that contains words with the "-" character.

Searching works for some words that contain "-" but not for all of them, and I can't find the reason why.

The field I'm searching in is analyzed and contains versions of the word with and without the "-" character.

I'm using the analyzer: org.apache.lucene.analysis.standard.StandardAnalyzer

Here is an example:

If I search for "gsx-*" I get a result. The indexed field contains:

But if I search for "v-*" I get no result. The indexed field of the expected result contains:

If I search for "v-strom" without "*" it works, but if I search for just "v-str", for example, I don't get the result. (There should be a result, because this is for a live search in a webshop.)

So what's the difference between the two expected results? Why does it work for "gsx-" but not for "v-"?

Answer Source

StandardAnalyzer will treat the hyphen as whitespace, I believe. So it turns your query "gsx-*" into "gsx*" and "v-*" into nothing, because it also eliminates single-letter tokens. What you see as the field contents in the search result is the stored value of the field, which is completely independent of the terms that were indexed for that field.

So what you want is for "v-strom" as a whole to be an indexed term. StandardAnalyzer is not suited to this kind of text. Maybe have a go with the WhitespaceAnalyzer or SimpleAnalyzer. If that still doesn't cut it, you also have the option of throwing together your own analyzer, or of starting from one of the two mentioned and composing it with further TokenFilters. A very good explanation is given in the Lucene Analysis package Javadoc.
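To make the difference concrete, here is a toy model in plain Java. It does not call Lucene at all: `splitOnPunctuation` and `splitOnWhitespace` are hypothetical helpers that only imitate an analyzer that breaks on hyphens and one that breaks on whitespace alone, and the field text "Suzuki V-Strom 650" is a made-up sample. The point it illustrates is that a prefix search can only succeed if a single indexed term starts with the prefix.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Toy model only: these helpers are hypothetical and do not use Lucene.
class HyphenTokenDemo {

    // Imitates an analyzer that treats every non-alphanumeric character
    // (including '-') as a separator and lowercases the remaining pieces.
    static List<String> splitOnPunctuation(String text) {
        List<String> terms = new ArrayList<>();
        for (String part : text.toLowerCase(Locale.ROOT).split("[^\\p{Alnum}]+")) {
            if (!part.isEmpty()) {
                terms.add(part);
            }
        }
        return terms;
    }

    // Imitates an analyzer that splits on whitespace only, so hyphens survive.
    static List<String> splitOnWhitespace(String text) {
        List<String> terms = new ArrayList<>();
        for (String part : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!part.isEmpty()) {
                terms.add(part);
            }
        }
        return terms;
    }

    // A prefix query matches only if one single term starts with the prefix.
    static boolean prefixMatches(List<String> terms, String prefix) {
        for (String term : terms) {
            if (term.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String field = "Suzuki V-Strom 650"; // made-up sample field value

        List<String> split = splitOnPunctuation(field);
        System.out.println(split + " matches v-str*: " + prefixMatches(split, "v-str"));

        List<String> whole = splitOnWhitespace(field);
        System.out.println(whole + " matches v-str*: " + prefixMatches(whole, "v-str"));
    }
}
```

With the hyphen-splitting model the terms are "v" and "strom", so nothing starts with "v-str". With the whitespace-only model "v-strom" stays one term and the prefix matches. Which real analyzer behaves which way depends on your Lucene version, so it is worth printing the actual tokens your analyzer produces before settling on one.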

BTW there's no need to enter all the variants in the index, like V-strom, V-Strom, etc. The idea is for the same analyzer to normalize all these variants to the same string both in the index and while parsing the query.
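A minimal sketch of that idea, again in plain Java rather than Lucene: the hypothetical `normalize` method stands in for whatever analyzer chain you end up with, and it is applied both when a value is added to the (here simulated) index and when the user's input is turned into a prefix query.

```java
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

// Toy model only: a TreeSet stands in for the index, normalize() for the analyzer.
class NormalizeDemo {

    // Hypothetical normalizer: trims and lowercases, keeps the hyphen.
    static String normalize(String text) {
        return text.trim().toLowerCase(Locale.ROOT);
    }

    // Index side: every spelling variant collapses to the same term.
    static Set<String> buildIndex(String... values) {
        Set<String> terms = new TreeSet<>();
        for (String value : values) {
            terms.add(normalize(value));
        }
        return terms;
    }

    // Query side: the user's input goes through the same normalizer
    // before it is compared with the indexed terms.
    static boolean prefixSearch(Set<String> terms, String userInput) {
        String prefix = normalize(userInput);
        for (String term : terms) {
            if (term.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> index = buildIndex("V-strom", "V-Strom", "v-STROM");
        System.out.println(index);
        System.out.println(prefixSearch(index, "V-Str"));
    }
}
```

The three spellings end up as one term, so only one of them needs to be indexed, and a query typed in any casing still finds it.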