vikky.rk - 1 year ago
Java Question

Java DNS resolution hangs forever

I am using the Curator framework to connect to a ZooKeeper server, but I am running into a weird DNS resolution issue. Here is the jstack dump for the thread:

#21 prio=5 os_prio=0 tid=0x0000000001888800 nid=0x3a46 runnable [0x00007f25e69f3000]
java.lang.Thread.State: RUNNABLE
at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
at org.apache.zookeeper.client.StaticHostProvider.resolveAndShuffle(
at org.apache.zookeeper.client.StaticHostProvider.<init>(
at org.apache.zookeeper.ZooKeeper.<init>(
at org.apache.zookeeper.ZooKeeper.<init>(
at org.apache.zookeeper.ZooKeeper.<init>(
at org.apache.zookeeper.ZooKeeper.<init>(
- locked <0x00000000fd761f40> (a$1)

The thread seems to be stuck in the native method and never returns. It also occurs very randomly, so I haven't been able to reproduce it consistently. Any ideas?
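For reference, here is a minimal sketch of the code path the stack trace points at: ZooKeeper's StaticHostProvider ultimately calls InetAddress.getAllByName(), which enters native getaddrinfo() and is where the hang occurs ("localhost" below is just a stand-in host):

```java
import java.net.InetAddress;

public class LookupDemo {
    public static void main(String[] args) throws Exception {
        // This call descends into the native lookupAllHostAddr / getaddrinfo
        // path shown in the jstack dump above.
        InetAddress[] addrs = InetAddress.getAllByName("localhost");
        for (InetAddress a : addrs) {
            System.out.println(a.getHostAddress());
        }
        System.out.println("exit success");
    }
}
```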

Answer Source

We are also trying to solve this problem. It looks like this is due to a glibc bug, or a kernel bug, depending on who you ask ;)


To confirm that this is indeed the case, attach gdb to the Java process:

gdb --pid <JavaProcessPid>

then from gdb:

info threads 

find a thread that is blocked in recvmsg and switch to it:

thread <HangingThreadId>

and then get its backtrace:

bt

If you see something like this, then you know that a glibc/kernel upgrade will help:

#0  0x00007fc726ff27cd in recvmsg () from /lib64/
#1  0x00007fc727018765 in make_request () from /lib64/
#2  0x00007fc727018b9a in __check_pf () from /lib64/
#3  0x00007fc726fdbd57 in getaddrinfo () from /lib64/
#4  0x00007fc706dd9635 in Java_java_net_Inet6AddressImpl_lookupAllHostAddr () from /usr/lib/jvm/java-1.8.0-openjdk-
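While waiting for an upgrade, one mitigation on the Java side (a hedged sketch, not part of the original answer) is to run the lookup on a separate thread and bound the wait. Note this only unblocks the caller; the worker thread stuck inside native getaddrinfo() cannot be interrupted and will linger until the syscall returns:

```java
import java.net.InetAddress;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class BoundedLookup {
    public static InetAddress[] resolve(String host, long timeoutMs)
            throws Exception {
        ExecutorService ex = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "dns-lookup");
            t.setDaemon(true); // don't keep the JVM alive if the lookup hangs
            return t;
        });
        try {
            Future<InetAddress[]> f =
                    ex.submit(() -> InetAddress.getAllByName(host));
            // Throws TimeoutException instead of hanging forever.
            return f.get(timeoutMs, TimeUnit.MILLISECONDS);
        } finally {
            ex.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        for (InetAddress a : resolve("localhost", 5000)) {
            System.out.println(a.getHostAddress());
        }
    }
}
```

This trades a hung caller for a leaked daemon thread, which is usually the lesser evil in a long-running service.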

Update: Looks like the kernel wins; see this thread for details. There is also a simple program you can use to verify that your system is affected by the kernel bug:

curl -O pf_dump.c
gcc pf_dump.c -pthread -o pf_dump

And if the output is:

[26170] glibc: check_pf: netlink socket read timeout

Then the system is affected. If the output is something like:

exit success
[7618] exit success
[7265] exit success

then the system is OK. In the AWS context, upgrading AMIs to (2016.3.2), which ships the new kernel, seems to have fixed the problem.
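If you cannot run the C tool, a rough Java analog of the same check (hedged: this is not the original pf_dump.c, just a sketch of the same idea) is to hammer getaddrinfo() from many threads at once, which is the pattern that trips the __check_pf netlink hang on affected kernels. On a healthy system every worker finishes quickly and prints "exit success":

```java
import java.net.InetAddress;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class PfCheck {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(8);
        for (int i = 0; i < 8; i++) {
            final int id = i;
            pool.submit(() -> {
                try {
                    // Each call goes through native getaddrinfo()/__check_pf.
                    InetAddress.getAllByName("localhost");
                    System.out.println("[" + id + "] exit success");
                } catch (Exception e) {
                    System.out.println("[" + id + "] failed: " + e);
                }
                return null;
            });
        }
        pool.shutdown();
        // On an affected system some workers never return within the timeout.
        if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
            System.out.println("lookup threads hung; system may be affected");
        }
    }
}
```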