Shelari Shelari - 4 months ago 15
Python Question

Elasticsearch doesn't find the last word in a sentence that ends with a dot

I use Elasticsearch with the following settings:

ES = {
    "mappings": {
        ES_DOC_TYPE: {
            "properties": {
                "message": {
                    "type": "string",
                    "analyzer": "liza_analyzer",
                    "include_in_all": False
                }
            }
        }
    },
    "settings": {
        "number_of_shards": 4,
        "analysis": {
            "tokenizer": {
                "liza_tokenizer": {
                    "type": "pattern",
                    "pattern": r"(\. )|[\s,\[\]\(\)\"\!\'\?\`\*\;\:\/<>«»\#]+",
                    "flags": "UNICODE_CASE"
                }
            },
            "analyzer": {
                "liza_analyzer": {
                    "type": "custom",
                    "tokenizer": "liza_tokenizer",
                    "filter": ["lowercase"]
                }
            },
        }
    }
}


When I search for the word 'hello' in the sentence 'hello world', Elasticsearch finds it.

When I search for the word 'hello' in the sentence 'hello. world', Elasticsearch finds it.

When I search for the word 'hello' in the sentence 'hello', Elasticsearch finds it too.

But when I search for the word 'hello' in the sentence 'hello.' (with a dot at the end), Elasticsearch doesn't find it.

At the same time, the tokens for the last two sentences look like this:

{
    "tokens": [{
        "token": "hello",
        "start_offset": 0,
        "end_offset": 5,
        "type": "<ALPHANUM>",
        "position": 0
    }]
}


(they are identical)

The question is: why does this happen, and how can I fix it?

Answer

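Your pattern is wrong. It treats a dot as a separator only when a space follows it ("\. "), so a dot at the very end of the text stays attached to the last word and the indexed token is 'hello.' rather than 'hello'. Match a dot followed by optional whitespace instead: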

"pattern": r"(\.\s*)|[\s,\[\]\(\)\"\!\'\?\`\*\;\:\/<>«»\#]+"
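
To see the difference without touching the index, here is a rough sketch that imitates the tokenizer in plain Python. It assumes the pattern tokenizer simply splits the text on every match of the pattern and that Python's re module treats these two patterns the same way the Java regex engine does. The names OLD_PATTERN, NEW_PATTERN and simulate_tokens are made up for this illustration and are not part of any Elasticsearch API.

```python
import re

# Separator pattern from the question: a dot counts only when a space follows it.
OLD_PATTERN = r"(\. )|[\s,\[\]\(\)\"\!\'\?\`\*\;\:\/<>«»\#]+"
# Separator pattern from the answer: a dot followed by optional whitespace.
NEW_PATTERN = r"(\.\s*)|[\s,\[\]\(\)\"\!\'\?\`\*\;\:\/<>«»\#]+"


def simulate_tokens(pattern, text):
    """Approximate a pattern tokenizer plus lowercase filter.

    Hypothetical helper: the text between separator matches becomes a
    token, empty pieces are dropped, and tokens are lowercased.
    """
    tokens = []
    start = 0
    for match in re.finditer(pattern, text):
        if match.start() > start:
            tokens.append(text[start:match.start()].lower())
        start = match.end()
    if start < len(text):
        tokens.append(text[start:].lower())
    return tokens


if __name__ == "__main__":
    for sentence in ["hello world", "hello. world", "hello", "hello."]:
        print(repr(sentence),
              simulate_tokens(OLD_PATTERN, sentence),
              simulate_tokens(NEW_PATTERN, sentence))
```

With the old pattern the last sentence yields the single token 'hello.', which a search for 'hello' cannot match; with the new pattern it yields 'hello'. Two caveats: the new pattern splits on every dot, so a value such as '3.14' becomes the two tokens '3' and '14', and the mapping has to be recreated and the documents reindexed before the change takes effect. Also, as far as I know the pattern tokenizer reports its tokens with type "word", so the "<ALPHANUM>" output in the question was probably produced by the standard analyzer rather than by liza_analyzer, which would explain why both sentences looked identical there.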