Robson Robson - 3 months ago 12

Apache Solr Search API default result filters

I'm using Solr with Apache Nutch to index a website.

My JSON result looks like this:

"response": {
"numFound": 0,
"start": 0,
"docs": [
{
"id": "http://mysite.pl/cl-BR/link/link",
"url": "http://mysite.pl/cl-BR/link/link",
"content": [
"content"
],
"_version_": 0000
},
{
"id": "http://mysite.pl/ru-RU/link/link",
"url": "http://mysite.pl/ru-RU/link/link",
"content": [
"content"
],
"_version_": 0000
},
{
"id": "http://mysite.pl/en-EN/link/link",
"url": "http://mysite.pl/en-EN/link/link",
"content": [
"content"
],
"_version_": 0000
},


I would like to add a parameter to my query that carries a language code, in a format like this:
en-EN

and then return only the search results whose url contains that value.

For example, when my query is:
/solr/CoreName/select?q=you&fl=id,url,content&urlContains=en-EN


the result should be:

"response": {
"numFound": 0,
"start": 0,
"docs": [
{
"id": "http://mysite.pl/en-EN/link/link",
"url": "http://mysite.pl/en-EN/link/link",
"content": [
"content"
],
"_version_": 0000
},


And when my query is:
/solr/CoreName/select?q=you&fl=id,url,content&urlContains=ru-RU


the result should be:

"response": {
"numFound": 0,
"start": 0,
"docs": [
{
"id": "http://mysite.pl/ru-RU/link/link",
"url": "http://mysite.pl/ru-RU/link/link",
"content": [
"content"
],
"_version_": 0000
},


How can I do this?

Answer

The cleanest implementation would be to add a custom field, url_tokenized, to your schema, and then use copyField to copy the value of url into it.

<copyField source="url" dest="url_tokenized" />

By using a PatternTokenizer you can tell Solr to split the value on /, so that you get ru-RU as a token in the url_tokenized field:

<analyzer>
    <tokenizer class="solr.PatternTokenizerFactory" pattern="/"/>
</analyzer>

Put together, the field type used by the url_tokenized field should look something like:

<fieldType name="url_tokenized" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.PatternTokenizerFactory" pattern="/"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Adding the LowerCaseFilterFactory makes the match case-insensitive, so ru-RU and ru-ru are both found regardless of the casing used in the URL or in the query.
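
To see which tokens that analyzer chain produces, here is a rough Python sketch that mimics it outside Solr. It is only an approximation for illustration: the function names analyze and matches are made up here, and it assumes that empty pieces between consecutive slashes are dropped. The Analysis screen in the Solr admin UI shows the real output for your schema.

```python
import re


def analyze(value):
    """Approximate the url_tokenized analyzer (assumption: not Solr's actual code)."""
    # PatternTokenizerFactory with pattern="/" splits the value on every slash.
    # Empty pieces, such as the one between the two slashes in "http://", are dropped.
    tokens = [piece for piece in re.split("/", value) if piece]
    # LowerCaseFilterFactory lowercases each token.
    return [token.lower() for token in tokens]


def matches(url, term):
    """Return True if every token of the analyzed term is a token of the analyzed URL."""
    url_tokens = set(analyze(url))
    term_tokens = analyze(term)
    return bool(term_tokens) and all(token in url_tokens for token in term_tokens)


print(analyze("http://mysite.pl/ru-RU/link/link"))
# ['http:', 'mysite.pl', 'ru-ru', 'link', 'link']
```

Because the same analysis is applied to the query term, matches(url, "ru-RU") and matches(url, "ru-ru") give the same answer in this sketch.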

Querying would then be done by applying a filter query (fq) to the query string:

...&fq=url_tokenized:ru-ru

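This will limit the response to documents whose URL contains ru-ru as a path segment (for example "/ru-ru/" somewhere in the URL), in any casing. Documents have to be reindexed after the schema change so that url_tokenized is populated.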
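
If the language code comes from your own application, the request URL could be assembled along these lines. This is a minimal sketch, not part of Solr: build_select_url is a hypothetical helper, the host http://localhost:8983 is an assumed local instance, and the two-letter pattern is an assumed format based on the codes in the question. The check is there because the value is placed into the fq parameter unescaped.

```python
import re
from urllib.parse import urlencode

# Assumed format: two letters, a hyphen, two letters (e.g. en-EN, ru-RU).
LANGUAGE_CODE = re.compile(r"[A-Za-z]{2}-[A-Za-z]{2}")


def build_select_url(base_url, core, text, language):
    """Build a /select URL that filters on the url_tokenized field (hypothetical helper)."""
    if not LANGUAGE_CODE.fullmatch(language):
        raise ValueError(f"unexpected language code: {language!r}")
    params = {
        "q": text,
        "fl": "id,url,content",
        # The filter query restricts results without affecting scoring.
        "fq": f"url_tokenized:{language.lower()}",
    }
    # urlencode percent-encodes the "," and ":" characters in the values.
    return f"{base_url}/solr/{core}/select?{urlencode(params)}"


print(build_select_url("http://localhost:8983", "CoreName", "you", "ru-RU"))
# http://localhost:8983/solr/CoreName/select?q=you&fl=id%2Curl%2Ccontent&fq=url_tokenized%3Aru-ru
```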
