vikky.rk - 3 months ago
Java Question

Java DNS resolution hangs forever

I am using the Curator framework to connect to a ZooKeeper server, but I am running into a weird DNS resolution issue. Here is the jstack dump for the affected thread:

#21 prio=5 os_prio=0 tid=0x0000000001888800 nid=0x3a46 runnable [0x00007f25e69f3000]
java.lang.Thread.State: RUNNABLE
at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)
at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928)
at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1323)
at java.net.InetAddress.getAllByName0(InetAddress.java:1276)
at java.net.InetAddress.getAllByName(InetAddress.java:1192)
at java.net.InetAddress.getAllByName(InetAddress.java:1126)
at org.apache.zookeeper.client.StaticHostProvider.resolveAndShuffle(StaticHostProvider.java:117)
at org.apache.zookeeper.client.StaticHostProvider.<init>(StaticHostProvider.java:81)
at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:1096)
at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:1006)
at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:804)
at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:679)
at com.netflix.curator.HandleHolder$1.getZooKeeper(HandleHolder.java:72)
- locked <0x00000000fd761f40> (a com.netflix.curator.HandleHolder$1)
at com.netflix.curator.HandleHolder.getZooKeeper(HandleHolder.java:46)
at com.netflix.curator.ConnectionState.reset(ConnectionState.java:122)
at com.netflix.curator.ConnectionState.start(ConnectionState.java:95)
at com.netflix.curator.CuratorZookeeperClient.start(CuratorZookeeperClient.java:137)
at com.netflix.curator.framework.imps.CuratorFrameworkImpl.start(CuratorFrameworkImpl.java:167)


The thread seems to be stuck in the native method and never returns. It also occurs very randomly, so I haven't been able to reproduce it consistently. Any ideas?

Answer

We are also trying to solve this problem. It looks like this is due to a glibc bug: https://bugzilla.kernel.org/show_bug.cgi?id=99671 or a kernel bug: https://bugzilla.redhat.com/show_bug.cgi?id=1209433 depending on who you ask ;)

Also worth reading: https://access.redhat.com/security/cve/cve-2013-7423 and https://alas.aws.amazon.com/ALAS-2015-617.html

To confirm that this is indeed the case, attach gdb to the Java process:

gdb --pid <JavaProcessPid>

then from gdb:

info threads

find the thread that is blocked in recvmsg and switch to it:

thread <HangingThreadId>

and then

backtrace

If you see something like this, then you know that a glibc/kernel upgrade will help:

#0  0x00007fc726ff27cd in recvmsg () from /lib64/libc.so.6
#1  0x00007fc727018765 in make_request () from /lib64/libc.so.6
#2  0x00007fc727018b9a in __check_pf () from /lib64/libc.so.6
#3  0x00007fc726fdbd57 in getaddrinfo () from /lib64/libc.so.6
#4  0x00007fc706dd9635 in Java_java_net_Inet6AddressImpl_lookupAllHostAddr () from /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.65-0.b17.el6_7.x86_64/jre/lib/amd64/libnet.so
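If you need to check many processes, the interactive gdb steps above can be collapsed into a non-interactive dump plus a grep. This is a sketch with hypothetical helper and file names (`check_pf_hang`, `bt.txt`); it assumes gdb is installed and you have ptrace permission on the target process:

```shell
# Dump every thread's backtrace in one shot (non-interactive gdb):
#   gdb --batch --pid <JavaProcessPid> -ex 'thread apply all backtrace' > bt.txt
#
# Hypothetical helper: flag the glibc hang signature (__check_pf blocked in
# recvmsg) in a saved backtrace dump.
check_pf_hang() {
    if grep -q '__check_pf' "$1"; then
        echo affected   # a thread is stuck inside glibc's __check_pf
    else
        echo clean      # signature not found in this dump
    fi
}
```

Usage: `check_pf_hang bt.txt` prints `affected` or `clean`.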

Update: it looks like the kernel wins. Please see this thread for details: http://www.gossamer-threads.com/lists/linux/kernel/2264958. There is also a simple program you can use to verify that your system is affected by the kernel bug: https://gist.github.com/stevenschlansker/6ad46c5ccb22bc4f3473

To verify, download, compile, and run it (note that curl needs -o, not -O, to write to a given filename):

curl -o pf_dump.c https://gist.githubusercontent.com/stevenschlansker/6ad46c5ccb22bc4f3473/raw/22cfe72f6708de1e3468c1e0fa3888aafae42db4/pf_dump.c
gcc pf_dump.c -pthread -o pf_dump
./pf_dump

And if the output is:

[26170] glibc: check_pf: netlink socket read timeout
Aborted

Then the system is affected. If the output is something like:

[7618] exit success
[7265] exit success

then the system is OK. In the AWS context, upgrading AMIs to 2016.3.2 with the new kernel seems to have fixed the problem.
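Until the kernel/glibc fix is rolled out, one application-side mitigation is to run lookups on a separate thread with a timeout. Note the limits: the hang is inside a native call (Inet4AddressImpl.lookupAllHostAddr) that ignores interrupts, so a timeout cannot free the stuck thread; it only lets the caller fail fast instead of blocking forever alongside it. A minimal sketch (the class name DnsWithTimeout is hypothetical):

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class DnsWithTimeout {
    // Daemon threads so stuck lookups cannot keep the JVM alive at shutdown.
    private static final ExecutorService RESOLVER =
            Executors.newCachedThreadPool(r -> {
                Thread t = new Thread(r, "dns-resolver");
                t.setDaemon(true);
                return t;
            });

    /** Resolve host, failing with UnknownHostException after timeoutMs. */
    public static InetAddress[] resolve(String host, long timeoutMs)
            throws UnknownHostException {
        Future<InetAddress[]> f =
                RESOLVER.submit(() -> InetAddress.getAllByName(host));
        try {
            return f.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Sets the interrupt flag, but the native lookup ignores it;
            // the worker thread stays stuck until the lookup returns.
            f.cancel(true);
            throw new UnknownHostException("DNS lookup timed out for " + host);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new UnknownHostException("interrupted while resolving " + host);
        } catch (ExecutionException e) {
            throw new UnknownHostException(
                    "lookup failed for " + host + ": " + e.getCause());
        }
    }
}
```

This does not fix the leak of stuck worker threads, but it keeps request threads (or, here, the Curator startup path) from hanging forever on a single bad lookup.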
