
Linux - Too many closed connections

I'm writing an application that opens about 1800 connections/minute from a single Linux machine using Netty (async NIO). A connection lives for a few seconds and is then closed, or it is timed out after 20 seconds if no answer is received. In addition, the read/write timeout is 30 seconds and the request header contains connection=close.
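
Roughly, the client setup looks like this (a simplified sketch rather than my exact code; the host, port and the request-writing handler are placeholders):

import io.netty.bootstrap.Bootstrap;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.ChannelOption;
import io.netty.channel.EventLoopGroup;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.SocketChannel;
import io.netty.channel.socket.nio.NioSocketChannel;
import io.netty.handler.codec.http.HttpClientCodec;
import io.netty.handler.timeout.ReadTimeoutHandler;
import io.netty.handler.timeout.WriteTimeoutHandler;

public class ClientSetupSketch {

    public static Bootstrap newBootstrap(EventLoopGroup group) {
        return new Bootstrap()
            .group(group)
            .channel(NioSocketChannel.class)
            // give up the connection attempt after 20 seconds
            .option(ChannelOption.CONNECT_TIMEOUT_MILLIS, 20_000)
            .handler(new ChannelInitializer<SocketChannel>() {
                @Override
                protected void initChannel(SocketChannel ch) {
                    ch.pipeline().addLast(new ReadTimeoutHandler(30));  // 30 s read timeout
                    ch.pipeline().addLast(new WriteTimeoutHandler(30)); // 30 s write timeout
                    ch.pipeline().addLast(new HttpClientCodec());
                    // the handler that writes the request (with a
                    // "Connection: close" header) and reads the answer goes here
                }
            });
    }

    public static void main(String[] args) throws InterruptedException {
        EventLoopGroup group = new NioEventLoopGroup();
        try {
            // placeholder host/port, just to show the wiring
            newBootstrap(group).connect("example.com", 80).sync().channel().close().sync();
        } finally {
            group.shutdownGracefully();
        }
    }
}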
After a while (2-3 hours) I get a lot of exceptions in the logs because Netty is unable to create new connections due to a lack of resources.
I increased the max number of open files in limits.conf as:

root hard nofile 200000
root soft nofile 200000


Here is the output of netstat -nat | awk '{print $6}' | sort | uniq -c | sort -n:

1 established)
1 FIN_WAIT2
1 Foreign
2 TIME_WAIT
6 LISTEN
739 SYN_SENT
6439 LAST_ACK
6705 CLOSE_WAIT
12484 ESTABLISHED


This is the output of the ss -s command:

Total: 194738 (kernel 194975)
TCP: 201128 (estab 13052, closed 174321, orphaned 6477, synrecv 0, timewait 3/0), ports 0

Transport Total IP IPv6
* 194975 - -
RAW 0 0 0
UDP 17 12 5
TCP 26807 8 26799
INET 26824 20 26804
FRAG 0 0 0


Also, ls -l /proc/2448/fd | wc -l gives about 199K.

That said, the questions are about the closed connections reported in the ss -s output:

1) What are they exactly?

2) Why do they keep dangling without being destroyed?

3) Is there any setting (a timeout or whatever) that can help keep them under a reasonable limit?

Answer

1) What are they exactly?

They are sockets that were either never connected or were disconnected and weren't closed.

In Linux, an outgoing TCP socket goes through the following stages (roughly):

  1. You create the socket (unconnected), and the kernel allocates a file descriptor for it.
  2. You connect() it to the remote side, establishing a network connection.
  3. You do data transfer (read/write).
  4. When you are done with reading/writing, you shutdown() the socket for both reading and writing, closing the network connection.
  5. You close() the socket, and the kernel frees the file descriptor.

So those 174K connections ss reports as closed are sockets that either never got past stage 1 (maybe connect() failed or was never even called) or went through stage 4 but not stage 5. Effectively, they are sockets with underlying open file descriptors but without any network binding (which is why the netstat / ss listings don't show them).
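
In plain Java, those stages look roughly like this (a minimal blocking sketch; example.com is just a placeholder host, and Netty walks through the same kernel-level steps asynchronously):

import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class SocketLifecycle {
    public static void main(String[] args) throws Exception {
        // Stage 1: socket is created; the kernel allocates a file descriptor.
        Socket s = new Socket();
        try {
            // Stage 2: connect to the remote side (20 s connect timeout).
            s.connect(new InetSocketAddress("example.com", 80), 20_000);

            // Stage 3: data transfer.
            s.getOutputStream().write("GET / HTTP/1.0\r\n\r\n".getBytes(StandardCharsets.US_ASCII));
            int firstByte = s.getInputStream().read();

            // Stage 4: shut down both directions; the TCP connection is torn down.
            s.shutdownOutput();
            s.shutdownInput();
        } finally {
            // Stage 5: close() releases the file descriptor.
            // Skipping this is exactly the "closed but still held" state ss reports.
            s.close();
        }
    }
}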

2) Why do they keep dangling without being destroyed?

Because nobody called close() on them. I would call it a "file descriptor leak" or a "socket descriptor leak".
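
With an async client like Netty, the leak usually hides in a code path that never reaches close(): a failed connect, an exception, or a timeout whose handling forgets to close the channel. Something along these lines (the bootstrap field and the connectAndSend method are made up for illustration, not taken from your code):

import io.netty.bootstrap.Bootstrap;
import io.netty.channel.Channel;
import io.netty.channel.ChannelFuture;
import io.netty.channel.ChannelFutureListener;

public class CloseOnFailure {

    private final Bootstrap bootstrap;

    CloseOnFailure(Bootstrap bootstrap) {
        this.bootstrap = bootstrap;  // assumed to be configured elsewhere
    }

    void connectAndSend(String host, int port) {
        ChannelFuture f = bootstrap.connect(host, port);
        f.addListener((ChannelFutureListener) future -> {
            if (!future.isSuccess()) {
                // Depending on the Netty version and how the connect fails, the
                // channel's descriptor may not be released automatically;
                // closing explicitly here is safe either way.
                future.channel().close();
                return;
            }
            Channel ch = future.channel();
            // ... write the request; in the response handler, call ch.close()
            // (or ctx.close()) on success, on exception, and on timeout.
        });
    }
}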

3) Is there any setting (a timeout or whatever) that can help keep them under a reasonable limit?

From the Linux point of view, no. You have to explicitly call close() on them (or terminate the process that owns them so the kernel knows they aren't used anymore).

From the Netty/Java point of view, maybe, I don't know.

Essentially, it's a bug in your code, or in Netty's code (less likely), or in the JRE (much less likely). You are not releasing the resources when you should. If you show the code, maybe somebody can spot the error.
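
One way to narrow it down is to log the process's descriptor count around the suspect code paths. A small sketch, assuming a HotSpot/OpenJDK JVM on Unix (com.sun.management.UnixOperatingSystemMXBean is not part of the standard API):

import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import com.sun.management.UnixOperatingSystemMXBean;

public class FdMonitor {

    // Print how many file descriptors this process holds versus its limit.
    public static void logOpenFds() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof UnixOperatingSystemMXBean) {
            UnixOperatingSystemMXBean unix = (UnixOperatingSystemMXBean) os;
            System.out.println("open fds: " + unix.getOpenFileDescriptorCount()
                    + " / max: " + unix.getMaxFileDescriptorCount());
        }
    }

    public static void main(String[] args) {
        logOpenFds();
    }
}

If the count climbs in step with connect failures or timeouts, that is the path that never calls close().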