Java Question

GAE Objectify massive import

I need to import about 1,000,000 records into Datastore. What's more, I want to combine some of these records into a single one. Everything I've tried so far takes forever and can't really recover if the backend terminates the task halfway through and restarts it on another machine.

My first attempt was to query Datastore before every insert, so I could add the data to an existing matching record or insert a new one otherwise.

Crops local = // read from CSV
for (...)
{
    // Look up an existing record by its composite id.
    Crops db = ObjectifyService.ofy().load().type(Crops.class)
        .id(local.country + "_" + local.cropType + "_" +
            new Integer(local.year).toString()).now();

    if (db == null)
    {
        db = local;
    }
    else
    {
        // add additional data to db
    }
    ObjectifyService.ofy().save().entity(db).now();
}


The estimated time for this to finish was 13 hours.

So I tried to aggregate the data locally:

Crops local = // read from CSV
HashMap<String, Crops> crops = ...
for (...)
{
    String composite = local.country + "_" + local.cropType + "_" +
        new Integer(local.year).toString();
    Crops db = crops.get(composite);

    if (db == null)
    {
        db = local;
        crops.put(composite, db);
    }
    else
    {
        // add additional data to db
    }
}
ObjectifyService.ofy().save().entities(crops.values()).now();


This led to the program being terminated because the heap grew too big.

A variant that I did get to work was to split the aggregated data into chunks of 1,000 records and store them chunk by chunk.

Iterator<Crops> sit = crops.values().iterator();
List<Crops> list = new ArrayList<Crops>(1000);
int i = 0;
while (sit.hasNext())
{
    list.add(sit.next());
    i++;
    if (i >= 1000)
    {
        // Persist the current chunk and start a new one.
        ObjectifyService.ofy().save().entities(list).now();
        list.clear();
        i = 0;
    }
}
// Save whatever is left in the final, partial chunk.
ObjectifyService.ofy().save().entities(list).now();


But the estimated time for this to finish is 80 hours.

The next thing that I want to try is to insert these chunks of 1000 in parallel instead of sequentially.
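
Roughly, what I have in mind is to stop calling .now() on every save so the RPCs can overlap, and only wait for them all at the end. Here is an untested sketch; it assumes the crops map built in the aggregation step above and relies on Objectify returning an asynchronous Result when .now() is deferred:

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import com.googlecode.objectify.ObjectifyService;
import com.googlecode.objectify.Result;

List<Result<?>> pending = new ArrayList<Result<?>>();
List<Crops> batch = new ArrayList<Crops>(1000);

Iterator<Crops> it = crops.values().iterator();
while (it.hasNext())
{
    batch.add(it.next());
    if (batch.size() >= 1000)
    {
        // save() without .now() returns a Result immediately;
        // the datastore call proceeds asynchronously.
        // Copy the batch so clearing it doesn't affect the pending save.
        pending.add(ObjectifyService.ofy().save().entities(new ArrayList<Crops>(batch)));
        batch.clear();
    }
}
if (!batch.isEmpty())
{
    pending.add(ObjectifyService.ofy().save().entities(batch));
}

// Block until every asynchronous save has finished.
for (Result<?> r : pending)
{
    r.now();
}

Whether this actually helps probably depends on how much of the 80 hours is RPC latency rather than CPU time on the instance.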

But before I waste many more hours on this, I wanted to ask whether I'm on the right path or going about it all wrong. Maybe it's not possible to get such an import below 13 hours?

tl;dr

What's the fastest way of importing large datasets into Datastore?

Answer
  1. Take a look at MapReduce - it is specifically designed for massive jobs that can be split into smaller chunks.

  2. There is no need to check if an entity already exists, unless there is some data in this entity that will be lost if you overwrite it. If it can be safely overwritten, just insert your entities. This should cut your time in half or more.

  3. Batching database calls will significantly speed it up (see the combined sketch after this list).

  4. I don't know what type local.year is, but if it is int, you can simply do:

    String composite = local.country + "_" + local.cropType + "_" + local.year;
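
Putting points 2 to 4 together, the import loop could be reduced to something like the sketch below. It is only an outline: it assumes overwriting existing entities is acceptable, that Crops keeps its composite key in an @Id String field called id, and that readNextCropFromCsv() is a placeholder for your existing CSV reading code.

import java.util.ArrayList;
import java.util.List;

import com.googlecode.objectify.ObjectifyService;

List<Crops> batch = new ArrayList<Crops>(1000);
Crops local;
while ((local = readNextCropFromCsv()) != null)   // placeholder for your CSV reader
{
    // Point 4: build the composite key directly from the int year.
    local.id = local.country + "_" + local.cropType + "_" + local.year;
    batch.add(local);

    if (batch.size() >= 1000)
    {
        // Points 2 and 3: no existence check, one batched save per 1,000 records.
        ObjectifyService.ofy().save().entities(batch).now();
        batch.clear();
    }
}
if (!batch.isEmpty())
{
    ObjectifyService.ofy().save().entities(batch).now();
}

If some rows genuinely add data to an already-imported entity, the local aggregation from your second attempt still has to happen before this loop; the sketch only removes the per-record load().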