Depressio Depressio - 5 months ago 45
Java Question

Multimaps in ChronicleMap

There is definitely a disclaimer on ChronicleMap's GitHub about Multimaps in ChronicleMap:


Chronicle Map is not...

... No secondary indexes.

A multimap. Using a
ChronicleMap<K, Collection<V>>
as multimap is technically possible, but often leads to problems...


Unfortunately, this is one of my use cases and using off-heap storage for that (with ChronicleMap) would certainly be the easiest route to do so.

Let me try to explain my problem with pizzas. I have 100,000 different pizzas. Each pizza has an ID and lots of different toppings and crusts. I have three access patterns:


  • Give me the pizza by ID.

  • Give me all the pizzas that have a particular topping.

  • Give me all the pizzas that have a particular crust.



I can store the pizzas easily using a
ChronicleMap<UUID,Pizza>
. But that's only one access pattern. I don't want to have to iterate through every pizza to find the ones with a matching topping or crust. So, I'd want to store something like
ChronicleMap<Topping,Collection<UUID>>
and
ChronicleMap<Crust,Collection<UUID>>
.

Then, if someone asks me for all the pizzas with pepperoni, I look up in the topping ChronicleMap to get the UUIDs of the matching pizzas, then in the main pizza map.

But the documentation quoted above scares me. Does anyone know what these "problems" such a thing often leads to might be? Why should I not do this, even though it seems to be working for me? Does it have something to do with how ChronicleMap stores the serialized objects, specifically the collection?

A few additional notes for potential questions:


  1. We may add pizzas later that will require updating the collections, too.

  2. Many processes are trying to do these operations, hence the need for sharing the map via ChronicleMap instead of just a basic ConcurrentMap.


Answer

If actual data indeed resemble pizzas, toppings and crusts, i. e. there are only a handful of distinct toppings/crusts, and thousands of pizzas contain each of them, I would say that having a proper multimap is overkill for this case and you would better have pepperoni_pizzas.dat, onions_pizzas.dat, ... distinct appendable shared lists with UUIDs, you can use Chronicle Queue for access and update them from multiple processes conveniently.

If there are 10s-100s of thousands of toppings/crusts, only 10s-100s of pizzas have a specific topping on average, you should use multimap indeed.

Essentially there are 3 kinds of "problems" with Chronicle-Maps-as-multimaps:

Excessive garbage allocation on each query

If you create a Chronicle Map with List<UUID> or Set<UUID> type of value without specifying custom value serializers, it will work, but it will be utterly inefficient, because it will default to built-in Java serialization for serializing and deserializing the whole value collection on each request, without reusing neither collection heap objects, nor individual UUID heap objects for elements. Hence a lot of garbage will be generated on each request to the ChronicleMap.

Solution However, if you specify value serializer as ListMarshaller or SetMarshaller (or your custom collection marshaller, which you could write based on ListMarshaller and SetMarshaller implementations) in conjunction with reusable UUID heap object, it will resolve this garbage problem:

ListMarshaller<ReusableUuid> valueMarshaller = ListMarshaller.of(
     ReusableUuidReader.INSTANCE, ReusableUuidWriter.INSTANCE);
List<ReusableUuid> averageValue = Stream
    .generate(() -> ReusableUuid.random())
    .limit(averagePizzasForTopping)
    .collect(Collectors.toList());
 ChronicleMap<Topping, List<ReusableUuid>> map = ChronicleMap
     .of(Topping.class, (Class<List<ReusableUuid>>) (Class) List.class)
     .averageKey(pepperoni)
     .valueMarshaller(valueMarshaller)
     .averageValue(averageValue)
     .entries(numberOfToppings)
     .createPersistedTo(new File("toppings_to_pizza_ids.dat"));

Inefficient value updates and replication

When you append another pizza UUID to a list of 100 UUIDs, and insert the new value back to Chronicle Map, Chronicle Map will re-write the whole list again, instead of appending one UUID to the end of off-heap memory chunk. And if you use replication, it will send the whole list of 100 UUIDs as an updated value to other nodes, instead of sending just one added UUID.

Both (value update and replication) could be optimized via terrible hacks, but it requires very deep knowledge of Chronicle Map implementation and will be very fragile.

Fragmentation of Chronicle-Map's memory

If you plan add new pizzas during data store lifetime, memory areas initially allocated for entires will become too small to hold new values with more UUIDs, so memory areas will be re-allocated (possibly several times for each list of UUIDs). Chronicle Map's data stucture design implies simplified memory allocation scheme, which suffers badly from fragmentation, if entries are re-allocated many times.

If you have a lot of UUIDs in lists, and you run your application on Linux, you can mitigate this problem by pre-allocating a lot of memory (more than will practically be needed by any list) for each entry (by specifying .actualChunkSize() in ChronicleMapBuilder configuration) and relying on Linux's feature of lazy mapped memory allocation (page-by-page, as needed). So you will lose at most 4KB of memory for each UUID list, that might be OK if lists are many KBs of size.

On the other hand, if your lists are so long (and they are lists of UUIDs i. e. small structures), and you have only 100 000 pizzas in total, you don't need multimap in the first place, see the beginning of this answer.

The trick with memory overcommit and relying on lazy mapped memory allocation in Linux also would work for short list (collections) of values, but only if elements themselves are big, so that the average total value size is many KBs.

Fragmentation is also less an issue when you can avoid entry memory re-allocation any other way, i. e. new pizza UUIDs are added in time but removed as well, so topping-to-uuids list sizes float around some average and re-allocation is rarely hitted.

Memory fragmentation is never an issue if values are never updated (or never change in size) after entry is inserted into Chronicle Map.

Conclusion

In some use cases and with proper configuration, Chronicle Map could serve as a multimap well. In other cases Chronicle Map as multimap is inherently inefficient.

Factors that matter:

  • Total number of key -> List<Value> entries in a multimap
  • Total number of values
  • Average and distribution of key sizes
  • Average and distribution of distinct value sizes
  • Average and distribution of value list sizes
  • Value lists dynamics over Chronicle Map lifetime (never updated, append only, remove and append. Removes from beginning and middle of lists are more expensive.)
  • If Chronicle Map is replicated, or not
Comments