Kin Cheung Kin Cheung - 1 year ago 79
Java Question

JVM crashes when calling JNI function during gc

We have a Java application that has a JNI layer that is multi-threaded (pthread) and will call back to the Java level upon messages received from the underlying network.

We notice that every time it crashes, it is caused by a gc. We can even simulate such a crash by manually trigger a gc by calling

jmap -histo <pid>
while the JNI layer is receiving messages from the network.

Given the information that we have read about the behaviours in JVM during GC in this post, http://stackoverflow.com/a/39401467/4523221, we still couldn't figure out why such crash is related to gc since JNI function calls are blocked during gc.

If anyone can shed light on this, it will be great. Thanks in advance.

The following is a stack trace that we have collected after a crash in our application.

Program terminated with signal 6, Aborted.
#0 0x0000003cdce325e5 in raise () from /lib64/libc.so.6
#1 0x0000003cdce33dc5 in abort () from /lib64/libc.so.6
#2 0x00007fdafe2516b5 in os::abort(bool) () from /usr/java/jdk1.8.0_65/jre/lib/amd64/server/libjvm.so
#3 0x00007fdafe3efbf3 in VMError::report_and_die() ()
from /usr/java/jdk1.8.0_65/jre/lib/amd64/server/libjvm.so
#4 0x00007fdafde2f3e2 in report_vm_error(char const*, int, char const*, char const*) ()
from /usr/java/jdk1.8.0_65/jre/lib/amd64/server/libjvm.so
#5 0x00007fdafe24c1ff in os::PlatformEvent::park() ()
from /usr/java/jdk1.8.0_65/jre/lib/amd64/server/libjvm.so
#6 0x00007fdafe20c538 in Monitor::ILock(Thread*) ()
from /usr/java/jdk1.8.0_65/jre/lib/amd64/server/libjvm.so
#7 0x00007fdafe20c73f in Monitor::lock_without_safepoint_check() ()
from /usr/java/jdk1.8.0_65/jre/lib/amd64/server/libjvm.so
#8 0x00007fdafe2e7a1f in SafepointSynchronize::block(JavaThread*) ()
from /usr/java/jdk1.8.0_65/jre/lib/amd64/server/libjvm.so
#9 0x00007fdafe39bcdd in JavaThread::check_safepoint_and_suspend_for_native_trans(JavaThread*) ()
from /usr/java/jdk1.8.0_65/jre/lib/amd64/server/libjvm.so
#10 0x00007fdafe0123d8 in jni_NewByteArray ()
from /usr/java/jdk1.8.0_65/jre/lib/amd64/server/libjvm.so
#11 0x00007fdaa447b7d1 in JNIEnv_::NewByteArray (this=0x7fdaf800c9f8, len=7)
at /usr/java/jdk1.8.0_65/include/jni.h:1643
---omitted---


The way we obtained JNIEnv*
e.g.

JNIEnv *env = 0;
jint result = jvm->GetEnv((void **) &env, JNI_VERSION_1_8);
if (result != JNI_OK) {
result = jvm->AttachCurrentThread((void **) &env, NULL);

Answer Source

After spending days investigating this JNI issue, we have finally found out the reason and I would like to share our experience here so that hopefully it will help others.

First of all, the reason why we needed to use JNI in the first place was because we needed to make use of a 3rd party network library that was a Linux native lib, and unfortunately that was the cause of our problem.

The library provided us a callback handle that we implemented to receive incoming network messages from it, and this callback, we later found out, was simply a signal handler. So, it means that this signal handler would get called whenever a signal popped up, even during gc.

Since C threads keep running during safepoints in JVM, it would have been fine if those C threads weren't attached to the JVM, otherwise disasters would certainly strike.

Here is kind of what we thought had happened. (everything below happened in the JNI layer)

  1. App starts. We init and cached JNI resources, e.g. Jmv*, Method ID, etc.
  2. We registered a C function to the library to receive messages. The C function is a function that would call JNI APIs to allocate memory to accommodate the received messages and pass them onto Java. After that, we just started to wait for incoming messages.
  3. When a message finally arrived, the C function mentioned above was called to handle the message, but wait... what was this thread that's handling the callback. That would have been the main thread or hmm... any available threads.
  4. As taught in any JNI textbook, we attached the thread to JVM first if not yet done so before calling any JNI APIs. Great!
  5. Now, during a GC, all Java threads were blocked, but the C layer was still running. At this critical moment, if a message arrived, some thread (any threads) was called up to handle the message. But what threads were still available during gc? All application threads were blocked and the only ones that were still running at this moment (our guess) were unfortunately the gc threads.

The gdb stacktrace that we were seeing was basically what happened when a gc thread that was actually in a middle of doing some work on the heap and then got a call from our application to do some application work and then a few JNI API calls... BOOM

Solution:

  • Have a C thread that handles the library callback
  • Never attach that thread to JVM
  • Have other threads attach to the JVM to do the Native-Java transition.

p.s. maybe some of the details weren't exactly accurate, so any JVM expert advice is welcomed. I will try to correct them as advised.

Thanks

Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download