John K John K - 2 months ago 17
Groovy Question

How to take the shortest distance per person (with multiple addresses) to an origin point and sort on that value

I have People documents in my elastic index and each person has multiple addresses, each address has a lat/long point associated.

I'd like to geo sort all the people by proximity to a specific origin location, however multiple locations per person complicates this matter. What has been decided is [Objective:] to take the shortest distance per person to the origin point and use that number as the sort number.

Example of my people index roughed out in 'pseudo-JSON' showing a couple of person documents each having multiple addresses:

person {
name: John Smith
addresses [
{ lat: 43.5234, lon: 32.5432, 1 Main St. }
{ lat: 44.983, lon: 37.3432, 2 Queen St. W. }
{ ... more addresses ... }
]
}

person {
name: Jane Doe
addresses [
... she has a bunch of addresses too ...
]
}

... many more people docs each having multiple addresses like above ...


Currently I'm using an elastic script field with an inline groovy script like so - it uses a groovy script to calculate meters from origin for each address, shoves all those meter distances into an array per person and picks the minimum number from the array per person making it the sort value.

string groovyShortestDistanceMetersSortScript = string.Format("[doc['geo1'].distance({0}, {1}), doc['geo2'].distance({0}, {1})].min()",
origin.Latitude,
origin.Longitude);

var shortestMetersSort = new SortDescriptor<Person>()
.Script(sd => sd
.Type("number")
.Script(script => script
.Inline(groovyShortestDistanceMetersSortScript)
)
.Order(SortOrder.Ascending)
);


Although this works, I wonder if using a scripted field might be more expensive or too complex at querying time, and if there is a better way to achieve the desired sort order outcome by indexing the data differently and/or by using aggregations, maybe even doing away with the script field altogether.

Any thoughts and guidance are appreciated. I'm sure somebody else has run into this same requirement (or similar) and has found a different or better solution.

I'm using the Nest API in this code sample but will gladly accept answers in elasticsearch JSON format because I can port those into the NEST API code.

Answer

When sorting on distance from a specified origin where the field being sorted on contains a collection of values (in this case geo_point types), we can specify how a value should be collected from the collection using the sort_mode. In this case, we can specify a sort_mode of "min" to use the nearest location to the origin as the sort value. Here's an example

public class Person
{
    public string Name { get; set; }
    public IList<Address> Addresses { get; set; }
}

public class Address
{
    public string Name { get; set; }
    public GeoLocation Location { get; set; }
}

void Main()
{
    var pool = new SingleNodeConnectionPool(new Uri("http://localhost:9200"));
    var indexName = "people";
    var connectionSettings = new ConnectionSettings(pool)
            .InferMappingFor<Person>(m => m.IndexName(indexName));

    var client = new ElasticClient(connectionSettings);

    if (client.IndexExists(indexName).Exists)
        client.DeleteIndex(indexName);

    client.CreateIndex(indexName, c => c
        .Settings(s => s
            .NumberOfShards(1)
            .NumberOfReplicas(0)
        )
        .Mappings(m => m
            .Map<Person>(mm => mm
                .AutoMap()
                .Properties(p => p
                    .Nested<Address>(n => n
                        .Name(nn => nn.Addresses.First().Location)
                        .AutoMap()
                    )
                )
            )
        )
    );

    var people = new[] {
        new Person {
            Name = "John Smith",
            Addresses = new List<Address>
            {
                new Address 
                {
                    Name = "Buckingham Palace",
                    Location = new GeoLocation(51.501476, -0.140634)
                },
                new Address
                {
                    Name = "Empire State Building",
                    Location = new GeoLocation(40.748817, -73.985428)
                }
            }
        },
        new Person {
            Name = "Jane Doe",
            Addresses = new List<Address>
            {
                new Address
                {
                    Name = "Eiffel Tower",
                    Location = new GeoLocation(48.858257, 2.294511)
                },
                new Address
                {
                    Name = "Uluru",
                    Location = new GeoLocation(-25.383333, 131.083333)
                }
            }
        }
    };

    client.IndexMany(people);

    // call refresh for testing (avoid in production)
    client.Refresh("people");

    var towerOfLondon = new GeoLocation(51.507313, -0.074308);

    client.Search<Person>(s => s
        .MatchAll()
        .Sort(so => so
            .GeoDistance(g => g
                .Field(f => f.Addresses.First().Location)
                .PinTo(towerOfLondon)
                .Ascending()
                .Unit(DistanceUnit.Meters)
                // Take the minimum address location distance from
                // our target location, The Tower of London
                .Mode(SortMode.Min)
            )
        )
    );
}

This creates the following search

{
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "_geo_distance": {
        "addresses.location": [
          {
            "lat": 51.507313,
            "lon": -0.074308
          }
        ],
        "order": "asc",
        "mode": "min",
        "unit": "m"
      }
    }
  ]
}

which returns

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : null,
    "hits" : [ {
      "_index" : "people",
      "_type" : "person",
      "_id" : "AVcxBKuPlWTRBymPa4yT",
      "_score" : null,
      "_source" : {
        "name" : "John Smith",
        "addresses" : [ {
          "name" : "Buckingham Palace",
          "location" : {
            "lat" : 51.501476,
            "lon" : -0.140634
          }
        }, {
          "name" : "Empire State Building",
          "location" : {
            "lat" : 40.748817,
            "lon" : -73.985428
          }
        } ]
      },
      "sort" : [ 4632.035195223564 ]
    }, {
      "_index" : "people",
      "_type" : "person",
      "_id" : "AVcxBKuPlWTRBymPa4yU",
      "_score" : null,
      "_source" : {
        "name" : "Jane Doe",
        "addresses" : [ {
          "name" : "Eiffel Tower",
          "location" : {
            "lat" : 48.858257,
            "lon" : 2.294511
          }
        }, {
          "name" : "Uluru",
          "location" : {
            "lat" : -25.383333,
            "lon" : 131.083333
          }
        } ]
      },
      "sort" : [ 339100.6843074794 ]
    } ]
  }
}

The value returned in the sort array for each hit is the minimum distance in the sort unit specified (in our case, metres) from the specified point (The Tower of London) and the addresses for each person.

Per the guidelines in Sorting By Distance documentation, often it can make more sense to score by distance, which can be achieved by using function_score query with a decay function.

Comments