
Elasticsearch: aggregation, count by field

I inserted this data into Elasticsearch:

[
{ "name": "Cassandra Irwin", "location": "Monzon de Campos" .. },
{ "name": "Gayle Mooney", "location": "Villarroya del Campo" .. },
{ "name": "Angelita Charles", "location": "Revenga de Campos" .. },
{ "name": "Sheppard Sweet", "location": "Santiago del Campo" .. },
..
..


Sidenote: to reproduce:
1) download: http://wmo.co/20160928_es_query/bulk.json
2) execute: curl -s -XPOST 'http://localhost:9200/testing/external/_bulk?pretty' --data-binary @bulk.json
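
As a quick sanity check that the bulk load worked, something like the _count API should report the number of indexed documents (using the same index name as above):

curl -s -XGET 'http://localhost:9200/testing/_count?pretty'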

Question: obtain a count of how many records there are per "location".

Solution 1: bucket aggregation .. doesn't give the desired results

curl -s -XPOST 'localhost:9200/testing/_search?pretty' -d '
{
"aggs": { "location_count": { "terms": { "field":"location", "size":100 }}}
}' | jq '.aggregations'


Result:

{"location_count":{"doc_count_error_upper_bound":0,"sum_other_doc_count":0,
"buckets":[
{"key":"campo", "doc_count":47},
{"key":"del", "doc_count":47},
{"key":"campos", "doc_count":29},
{"key":"de", "doc_count":29},
{"key":"villarroya","doc_count":15},
{"key":"torre", "doc_count":12},
{"key":"monzon", "doc_count":11},
{"key":"santiago", "doc_count":11},
{"key":"pina", "doc_count":9},
{"key":"revenga", "doc_count":9},
{"key":"uleila", "doc_count":9}
]}}


Problem: it splits the 'location' field values into words and returns a doc count per word.
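
Presumably this happens because the 'location' field is analyzed at index time, so the terms aggregation buckets the individual tokens rather than the whole value. As a quick check (a sketch, assuming the default standard analyzer and a 1.x/2.x-style _analyze call), the analyzer output for one value can be inspected like this:

curl -s -XGET 'http://localhost:9200/testing/_analyze?analyzer=standard&text=Monzon%20de%20Campos&pretty'

which should come back with the lowercased tokens "monzon", "de" and "campos" rather than the full string.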

Solution 2: desired results, but performance worries.

I can get the desired result with this query, pulling out ALL locations and doing the aggregation in jq (the ever-handy JSON CLI tool),
but this can turn into a performance nightmare when applied to huge volumes of data:

curl -s -XPOST 'localhost:9200/testing/_search?pretty' -d '
{
"query": { "wildcard": { "location": "*" } }, "size":1000,
"_source": ["location"]
}' | jq '[.hits.hits[] |
{location:._source.location,"count":1}] |
group_by(.location) |
map({ key: .[0].location, value: map(.count)|add })'


Result:

[
{ "key": "Monzon de Campos", "value": 11 },
{ "key": "Pina de Campos", "value": 9 },
{ "key": "Revenga de Campos", "value": 9 },
{ "key": "Santiago del Campo", "value": 11 },
{ "key": "Torre del Campo", "value": 12 },
{ "key": "Uleila del Campo", "value": 9 },
{ "key": "Villarroya del Campo", "value": 15 }
]


This is the exact result that I want.

QUESTION: how can I obtain the same results via an Elasticsearch query?
(i.e. with the aggregation handled by Elasticsearch, and not by jq)

Val
Answer

You need to add a not_analyzed sub-field to your location field.

First modify your mapping like this:

curl -XPOST 'http://localhost:9200/testing/_mapping/external' -d '{
   "properties": {
      "location": {
         "type": "string",
         "fields": {
            "raw": {
               "type": "string",
               "index": "not_analyzed"
            }
         }
      }
   }
}'
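
To double-check that the sub-field was picked up (the exact response layout varies by Elasticsearch version), you can fetch the mapping back:

curl -s -XGET 'http://localhost:9200/testing/_mapping/external?pretty'

It should now show location.raw with "index": "not_analyzed".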

Then reindex your data:

curl -s -XPOST 'http://localhost:9200/testing/external/_bulk?pretty' --data-binary @bulk.json

Finally, you'll be able to run your query like this (on the location.raw field) and get the results you expect:

curl -s -XPOST 'localhost:9200/testing/_search?pretty' -d '
{
  "aggs": { "location_count": { "terms": { "field":"location.raw", "size":100 }}}
}' | jq '.aggregations'
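
With the sample data from the question, the buckets should now hold the whole location values instead of individual words, roughly like this (counts taken from the jq result above):

{"location_count":{"doc_count_error_upper_bound":0,"sum_other_doc_count":0,
"buckets":[
{"key":"Villarroya del Campo","doc_count":15},
{"key":"Torre del Campo","doc_count":12},
{"key":"Monzon de Campos","doc_count":11},
{"key":"Santiago del Campo","doc_count":11},
{"key":"Pina de Campos","doc_count":9},
{"key":"Revenga de Campos","doc_count":9},
{"key":"Uleila del Campo","doc_count":9}
]}}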