SYU88 - 19 days ago
Question

Custom analyzer appearing in type mapping but not working in Elasticsearch

I'm trying to add a custom analyzer to my index while also mapping that analyzer to a property on a type. Here is my JSON object for doing this:

{
    "settings": {
        "analysis": {
            "analyzer": {
                "test_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "asciifolding"],
                    "char_filter": ["html_strip"]
                }
            }
        }
    },
    "mappings": {
        "test": {
            "properties": {
                "checkanalyzer": {
                    "type": "string",
                    "analyzer": "test_analyzer"
                }
            }
        }
    }
}


I know this analyzer works because I've tested it using /wp2/_analyze?analyzer=test_analyzer -d '<p>Testing analyzer.</p>', and it also shows up as the analyzer for the checkanalyzer property when I check /wp2/test/_mapping. However, if I add a document like {"checkanalyzer": "<p>The tags should not show up</p>"}, the HTML tags don't get stripped out when I retrieve the document using the _search endpoint. Am I misunderstanding how the mapping works, or is there something wrong with my JSON object? I'm creating the wp2 index and the test type dynamically when I make this call to Elasticsearch; I'm not sure whether that matters.

Answer

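The HTML doesn't get removed from the _source; it gets removed from the terms generated from that source. The analyzer only controls what goes into the index, while _search returns the original document exactly as you sent it. You can see the indexed terms if you use a terms aggregation: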

POST /test_index/_search
{
    "aggs": {
        "checkanalyzer_field_terms": {
            "terms": {
                "field": "checkanalyzer"
            }
        }
    }
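}

In the response, _source still holds the original HTML, while the aggregation buckets contain only the lowercased words, with no p tags:

{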
   "took": 77,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 1,
      "hits": [
         {
            "_index": "test_index",
            "_type": "test",
            "_id": "1",
            "_score": 1,
            "_source": {
               "checkanalyzer": "<p>The tags should not show up</p>"
            }
         }
      ]
   },
   "aggregations": {
      "checkanalyzer_field_terms": {
         "doc_count_error_upper_bound": 0,
         "sum_other_doc_count": 0,
         "buckets": [
            {
               "key": "not",
               "doc_count": 1
            },
            {
               "key": "should",
               "doc_count": 1
            },
            {
               "key": "show",
               "doc_count": 1
            },
            {
               "key": "tags",
               "doc_count": 1
            },
            {
               "key": "the",
               "doc_count": 1
            },
            {
               "key": "up",
               "doc_count": 1
            }
         ]
      }
   }
}

Here's some code I used to test it:

http://sense.qbox.io/gist/2971767aa0f5949510fa0669dad6729bbcdf8570
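
If it helps to see the distinction without a cluster, here is a rough Python sketch that imitates the three stages of test_analyzer (html_strip, then the standard tokenizer, then lowercase and asciifolding). It is only an approximation: the function names are made up for illustration, and the regex-based steps are simplified stand-ins, not how Elasticsearch actually implements them. The point is just that the stored document and the indexed terms are two separate things.

```python
import html
import re
import unicodedata


def strip_html(text):
    """Rough stand-in for the html_strip char filter: drop tags, decode entities."""
    return html.unescape(re.sub(r"<[^>]+>", " ", text))


def tokenize(text):
    """Rough stand-in for the standard tokenizer: split on non-word characters."""
    return re.findall(r"\w+", text)


def ascii_fold(token):
    """Rough stand-in for asciifolding: decompose accents and drop the marks."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def analyze(text):
    """Apply the char filter, then the tokenizer, then the token filters."""
    return [ascii_fold(tok.lower()) for tok in tokenize(strip_html(text))]


doc = {"checkanalyzer": "<p>The tags should not show up</p>"}

# What _search hands back: the document exactly as it was sent.
source = dict(doc)

# What ends up in the index for the field: the analyzed terms.
terms = sorted(set(analyze(doc["checkanalyzer"])))

print(source["checkanalyzer"])
print(terms)
```

The printed terms line up with the bucket keys in the aggregation response above, while the source string still contains the p tags.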