I have a requirement to enable TCP keepalive on any connections and now I am struggling with the results from our test case. I think this is because I do not really understand when the first keepalive probe is sent. I read the following in the documentation for
the interval between the last data packet sent (simple ACKs are not considered data) and the first keepalive probe; after the
connection is marked to need keepalive, this counter is not used any
I have a connection where data is only sent from a server to a client at rather high rates.
Then you'll never see keepalives. Keepalives are sent when there is "silence on the wire". RFC1122 has some explanation re keepalives.
A "keep-alive" mechanism periodically probes the other end of a connection when the connection is otherwise idle, even when there is no data to be sent
Back to your question:
Some other sources state that this is the time a connection is idle, but they do not further define what this means.
This is how long TCP will wait before poking the peer "hoy! still alive?".
$ cat /proc/sys/net/ipv4/tcp_keepalive_time
7200
In other words, you've been using a TCP connection and it has been great. However, for the past 2 hours there hasn't been anything to send. Is it reasonable to assume the connection is still alive? Is it reasonable to assume all the middleboxes in the middle still have state about your connection? Opinions vary and keepalives aren't part of RFC793.
The TCP specification does not include a keep-alive mechanism because it could: (1) cause perfectly good connections to break during transient Internet failures; (2) consume unnecessary bandwidth ("if no one is using the connection, who cares if it is still good?")
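The 7200-second value is only the system-wide default; on Linux it can be overridden per socket. Here is a minimal sketch in Python, assuming the Linux-specific TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT options are available. The helper name `enable_keepalive` and the values 60/10/5 are just illustrative, not anything from the question.

```python
import socket


def enable_keepalive(sock, idle=60, interval=10, count=5):
    """Enable keepalive on one socket and override the system defaults.

    Hypothetical helper. The TCP_KEEP* options are Linux-specific, so each
    one is applied only if this platform's socket module exposes it.
    """
    # Without SO_KEEPALIVE no probes are ever sent, whatever the sysctls say.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    if hasattr(socket, "TCP_KEEPIDLE"):
        # Idle seconds before the first probe (per-socket tcp_keepalive_time).
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):
        # Seconds between unanswered probes.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):
        # Unanswered probes allowed before the connection is dropped.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)


keepalive_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(keepalive_sock)
```

Even with a short idle time, the first probe only goes out after the connection has been silent for that long, so a connection that is constantly pushing data will still not show any probes.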
To test keepalive, we unplugged the cable on the client's NIC.
This isn't testing keepalive. This is testing your TCP's retransmission strategy, i.e. how many times and how often TCP will try to get your message across. On a Linux box this (likely) ends up testing
How many times to retry before killing alive TCP connection. RFC 1122 says that the limit should be longer than 100 sec. It is too small number. Default value 15 corresponds to 13-30min depending on RTO.
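To see where a figure in that range comes from, here is a rough back-of-the-envelope sketch in Python. It assumes a simplified model: the retransmission timeout starts at 200 ms, doubles after every unanswered attempt, and is capped at 120 s. The function name and the model are my own illustration; a real connection starts from its measured RTO, so treat the result as a lower bound rather than what your test will show.

```python
def retransmit_give_up_time(retries=15, initial_rto=0.2, max_rto=120.0):
    """Seconds until TCP gives up, under a simplified backoff model.

    Assumptions (not measured values): the RTO starts at `initial_rto`,
    doubles after each unanswered attempt, and never exceeds `max_rto`.
    """
    # The original transmission plus `retries` retransmissions each wait
    # one (backed-off) RTO before the connection is finally killed.
    return sum(min(initial_rto * 2 ** k, max_rto) for k in range(retries + 1))


total = retransmit_give_up_time()
print(f"{total:.1f} s, about {total / 60:.1f} min")
```

With the default of 15 retries this model gives 924.6 s, a bit over 15 minutes, which is why pulling the cable on a busy connection takes so long to surface as an error.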
But see also RFC5482 for more ways to influence it.
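On Linux, one such knob is the TCP_USER_TIMEOUT socket option, which caps how long transmitted data may stay unacknowledged before the connection is closed. A minimal sketch, assuming a Linux system where Python exposes `socket.TCP_USER_TIMEOUT`; the helper name and the 30-second value are only examples.

```python
import socket


def set_user_timeout(sock, timeout_ms):
    """Cap how long sent data may remain unacknowledged, in milliseconds.

    Hypothetical helper. Returns False without touching the socket if this
    platform does not expose TCP_USER_TIMEOUT (it is Linux-specific).
    """
    if not hasattr(socket, "TCP_USER_TIMEOUT"):
        return False
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_USER_TIMEOUT, timeout_ms)
    return True


timeout_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
applied = set_user_timeout(timeout_sock, 30_000)  # example value: 30 seconds
```

Unlike changing the retry sysctl, this only affects the socket it is set on.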
Is it correct that keep alive probes are not sent during retransmission?
It makes sense: TCP is already trying to elicit a response from the other peer, an empty keepalive would be superfluous.