masch82 masch82 - 3 years ago 69
Java Question

Does Java LongAdder's increment() & sum() prevent getting the same value twice?

Currently I am using

AtomicLong
as a synchronized counter in my application, but I have found that with high concurrency/contention, e.g. with 8 threads my throughput is much lower (75% lower) then single-threaded for obvious reasons (e.g. concurrent CAS).

Use case:
A counter variable which


  • is updated by multiple threads concurrently

  • has high write contention, basically every usage in a thread will consist of a write with an immediate read afterwards

  • Requirement is that each read from the counter (immediately after the writing) gets a unique incremented value.
    It is not required that each retrieved counter value is increasing in the same order as the different threads(writers) increment the value.



So I tried to replace
AtomicLong
with a
LongAdder
, and indeed it looks from my measurements that my throughput with 8 threads is much better - (only) about 20% lower than single-threaded (compared to 75%).

However I'm not sure I correctly understand the way
LongAdder
works.
The JavaDoc says:


This class is usually preferable to AtomicLong when multiple threads
update a common sum that is used for purposes such as collecting
statistics, not for fine-grained synchronization control.


and for
sum()



Returns the current sum. The returned value is NOT an atomic snapshot;
invocation in the absence of concurrent updates returns an accurate
result, but concurrent updates that occur while the sum is being
calculated might not be incorporated.


What is meant by fine-grained synchronization control ...
From looking at this so question and the source of
AtomicLong
and
Striped64
, I think I understand that if the update on an
AtomicLong
is blocked because of a CAS instruction issued by another thread, the update is stored thread-local and accumulated later to get some eventual consistency. So without further synchronization and because the
incrementAndGet()
in
LongAdder
is not atomic but two instructions, I fear the following is possible:

private static final LongAdder counter = new LongAdder(); // == 0
// no further synchronisation happening in java code
Thread#1 : counter.increment();
Thread#2 : counter.increment(); // CAS T#1 still ongoing, storing +1 thread-locally
Thread#2 : counter.sum(); // == 1
Thread#3 : counter.increment(); // CAS T#1 still ongoing, storing +1 thread-locally
Thread#3 : counter.sum(); // == 1
Thread#1 : counter.sum(); // == 3 (after merging everything)


If this is possible,
AtomicLong
is not really suitable for my use case, which probably then counts as "fine-grained synchronization control".

And then with my write/read^n pattern I probably can't do better then
AtomicLong
?

Answer Source

LongAdder is definitely not suitable for your use case of unique integer generation, but you don't need to understand the implementation or dig into the intricacies of the java memory model to determine that. Just look at the API: it has no compound "increment and get" type methods that would allow you to increment the value and get the old/new value back, atomically.

In terms of adding values, it only offers void add(long x) and void increment() methods, but these don't return a value. You mention:

the incrementAndGet in LongAdder is not atomic

... but I don't see incrementAndGet at all in LongAdder. Where are you looking?

Your idea of:

  • usage in a thread will consist of a w rite with an immediate read afterwards
  • Requirement is that each read from the counter (immediately after the writing) gets a unique incremented value. It is not required that each retrieved counter value is increasing in the same order as the different threads(writers) increment the value.

Doesn't work even for AtomicLong, unless by "write followed by a read" you mean calling the incrementAndGet method. I think it goes without saying that two separate calls on an AtomicLong or LongAdder (or any other object really) can never be atomic without some external locking.

So the Java doc, in my opinion, is a bit confusing. Yes, you should not use sum() for synchronization control, and yes "concurrent updates that occur while the sum is being calculated might not be incorporated"; however, the same is true of AtomicLong and its get() method. Increments that occur while calling get() similarly may or may not be reflected in the value returned by get().

Now there are some guarantees that are weaker with LongAdder compared to AtomicLong. One guarantee you get with AtomicLong is that a series of operations transition the object though a specific series of values, and where there is no guarantee on what specific value a thread will see, all the values should come from the true set of transition values.

For example, consider starting with an AtomicLong with value zero, and two threads incrementing it concurrently, by 1 and 3 respetively. The final value will always be 4, and only two possible transition paths are possible: 0 -> 1 -> 4 or 0 -> 3 -> 4. For a given execution, only one of those can have occurred and all concurrent reads will be consistent with that execution. That is, if any thread reads a 1, then no thread may read a 3 and vice versa (of course, there is no guarantee that any thread will see a 1 or 3 at all, they may all see 0 or 4.

LongCounter doesn't provide that guarantee. Since the write process is not locked, and the read process adds together several values in a not-atomic fashion, it is possible for one thread to see a 1 and another to see a 3 in the same execution. Of course, it still doesn't synthesize "fake" values - you should never read a "2" for example.

Now that's a bit of a subtle concept and the Javadoc doesn't get it across well. They go with a pretty weak and not particularly formal statement instead. Finally, I don't think you can observe the behavior above with pure increments (rather than additions) since there is only one path then: 0 -> 1 -> 2 -> 3, etc. So for increments, I think AtomicLong.get() and LongCounter.sum() have pretty much the same guarantees.

Something Useful

OK, so I'll give you something that might be useful. You can still implement what you want for efficiently, as long as you don't have strict requirements on the exact relationship between the counter value each thread gets and the order they were read.

Re-purpose the LongAdder Idea

You could make the LongAdder idea work fine for unique counter generation. The underlying idea of LongAdder is to spread the counter into N distinct counters (which live on separate cache lines). Any given call updates one of those counters based on the current thread ID2, and a read needs to sum the values from all counters. This means that writes have low contention, at the cost of a bit more complexity, and at a large cost to reads.

Now way the write works by design doesn't let you read the full LongAdder value, but since you just want a unique value you could use the same code except with the top or bottom N bits3 set uniquely per counter.

Now the write can return the prior value, like getAndIncrement and it will be unique because the fixed bits keep it unique among all counters in that object.

Thread-local Counters

A very fast and simple way is to use a unique value per thread, and a thread-local counter. When the thread local is initialized, it gets a unique ID from a shared counter (only once per thread), and then you combine that ID with a thread-local counter - for example, the bottom 24-bits for the ID, and the top 40-bits for the local counter1. This should be very fast, and more importantly essentially zero contention.

The downside is that the values of the counters won't have any specific relationship among threads (although they may still be strictly increasing within a thread). For example, a thread which has recently requested a counter value may get a much smaller one than a long existing value. You haven't described how you'll use these so I don't know if it is a problem.

Also, you don't have a single place to read the "total" number of counters allocated - you have to examine all the local counters to do that. This is doable if your application requires it (and has some of the same caveats as the LongAdder.sum() function).

A different solution, if you want the numbers to be "generally increasing with time" across threads, and know that every thread requests counter values reasonably frequently, is to use a single global counter, which threads request a local "allocation" of a number of IDs, from which it will then allocate individual IDs in a thread-local manner. For example, threads may request 10 IDs, so that three threads will be allocated the range 0-9, 10-19, and 20-29, etc. They then allocate out of that range until it is exhausted and which point they go back to the global counter. This is similar to how memory allocators carve out chunks of a common pool which can then be allocated thread-local.

The example above will keep the IDs roughly in increasing order over time, and each threads IDs will be strictly increasing as well. It doesn't offer any strict guarantees though: a thread that is allocated the range 0-9, could very well sleep for hours after using 0, and then use "1" when the counters on other threads are much higher. It would reduce contention by a factor of 10.

There are a variety of other approaches you could use and mostof them trade-off contention reduction versus the "accuracy" of the counter assignment versus real time. If you had access to the hardware, you could probably use a quickly incrementing clock like the cycle counter (e.g., rdtscp) and the core ID to get a unique value that is very closely tied to realtime (assuming the OS is synchronizing the counters).


1 The bit-field sizes should be chosen carefully based on the expected number of threads and per-thread increments in your application. In general, if you are constantly creating new threads and your application is long-lived, you may want to err on the side of more bits to the thread ID, since you can always detect a wrap of the local counter and get a new thread ID, so bits allocated to the thread ID can be efficiently shared with the local counters (but not the other way around).

2 The optimal is to use the 'CPU ID', but that's not directly accessible in Java (and even at the assembly level there is no fast and portable way to get it, AFAIK) - so the thread ID is used as a proxy.

3 Where N is lg2(number of counters).

Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download