RoryB - 6 months ago
Java Question

Solr StandardTokenizer: how is underscore processed with wildcards?

So, I have a Solr instance which processes inputs and queries using StandardTokenizer (as well as ClassicFilterFactory, LowerCaseFilterFactory and StopFilterFactory).

In my index are a number of files with underscore-separated names (e.g. some_indexed_file.jpg).

I've noticed that if I query for "some_indexed_file.jpg", I get the file I'm looking for returned correctly.

However, if I instead search for "some_indexed_file.jp*" (that's with an asterisk, which I presume acts as a wildcard), I get no results, even though by my understanding it should produce similar results.

Any idea what's going on? I assume I'm misunderstanding something about the way Solr processes queries.

edit: as requested, here are the schema XML configuration entries:

<fieldType name="default" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.ClassicFilterFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.StopFilterFactory" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.ClassicFilterFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.StopFilterFactory" />
  </analyzer>
</fieldType>



<field name="filename" type="default" multiValued="true" omitNorms="false" termVectors="false"/>

Answer

Well, a bit more research has solved the problem: the base issue is that Solr doesn't run wildcard terms through the field's normal text analysis, so a wildcard term is never tokenised.

This meant that it was searching for a single indexed term matching some_indexed_file.jp*. However, when the filename was indexed, it was tokenised into "some", "indexed" and "file.jpg", and none of those terms matches that pattern.
The plain query some_indexed_file.jpg, by contrast, was tokenised in the same way as the indexed value, and therefore returned the right results.
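
To make the mismatch concrete, here is a rough sketch in plain Java. It is a toy model, not Solr or Lucene code: the class WildcardAnalysisDemo and its methods are hypothetical, and analyze() simply splits on underscores and whitespace so that it reproduces the tokens described above. It only illustrates why an untokenised wildcard pattern cannot match any single indexed term.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

public class WildcardAnalysisDemo {

    // Toy stand-in for the analyzer chain (an assumption, not Lucene's logic):
    // lowercases the text and splits it on underscores and whitespace.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.toLowerCase(Locale.ROOT).split("[_\\s]+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // Converts a wildcard pattern ('*' = any run of characters, '?' = one
    // character) into a regex; every other character is matched literally.
    static Pattern toRegex(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                regex.append(".*");
            } else if (c == '?') {
                regex.append('.');
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString());
    }

    // Wildcard query: the pattern is NOT analyzed, and it has to match one
    // whole indexed term.
    static boolean wildcardMatches(List<String> indexedTerms, String wildcard) {
        Pattern pattern = toRegex(wildcard);
        for (String term : indexedTerms) {
            if (pattern.matcher(term).matches()) {
                return true;
            }
        }
        return false;
    }

    // Plain query: the query text is analyzed like the indexed value, and
    // every resulting token must be present among the indexed terms.
    static boolean analyzedQueryMatches(List<String> indexedTerms, String query) {
        return indexedTerms.containsAll(analyze(query));
    }

    public static void main(String[] args) {
        List<String> indexed = analyze("some_indexed_file.jpg");
        System.out.println(indexed);
        System.out.println(analyzedQueryMatches(indexed, "some_indexed_file.jpg"));
        System.out.println(wildcardMatches(indexed, "some_indexed_file.jp*"));
        System.out.println(wildcardMatches(indexed, "file.jp*"));
    }
}
```

In this model the full-name wildcard fails while file.jp* succeeds, because only the latter lines up with one of the indexed terms. The Analysis screen in the Solr admin UI shows the tokens a real field type produces, which may differ from this toy split.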
