sourcediver.org

2023/03/12

Keeping the TCP Window a byte open

Maximilian Güntner

Over the weekend, I was adding TCP support to cannelloni and used a pattern that only consumes as many bytes as needed for the next step while reading from the socket. This is done with a decoder built around a state machine: each call returns the number of bytes that must be available before the decoder can be called again to decode the next segment.
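To make the pattern concrete, here is a minimal sketch of such a decoder. It is not cannelloni's actual decoder: the wire format (a 2-byte big-endian length followed by the payload) and the names Decoder, DecodeState and decodeStep are assumptions made up for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical wire format, NOT cannelloni's protocol:
// a 2-byte big-endian payload length, followed by the payload itself.
enum class DecodeState { Header, Payload };

struct Decoder {
  DecodeState state = DecodeState::Header;
  std::size_t expectedBytes = 2;      // bytes needed for the next step
  std::vector<std::uint8_t> payload;  // last fully decoded payload
  bool frameReady = false;            // true once a whole frame was decoded
};

// Consumes exactly d.expectedBytes bytes from data and returns the number
// of bytes that must be available before the next call.
std::size_t decodeStep(Decoder& d, const std::uint8_t* data) {
  d.frameReady = false;
  if (d.state == DecodeState::Header) {
    const std::size_t len = (static_cast<std::size_t>(data[0]) << 8) | data[1];
    if (len == 0) {
      // Empty frame: complete right away, wait for the next header.
      d.payload.clear();
      d.frameReady = true;
      d.expectedBytes = 2;
    } else {
      d.state = DecodeState::Payload;
      d.expectedBytes = len;
    }
  } else {
    d.payload.assign(data, data + d.expectedBytes);
    d.frameReady = true;
    d.state = DecodeState::Header;
    d.expectedBytes = 2;
  }
  return d.expectedBytes;
}
```

The caller only ever has to hand over the number of bytes the previous call asked for, which is why the read path needs to know how many bytes are really there.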

Reading exactly n bytes

A quote from the man page of read(2):

On success, the number of bytes read is returned (zero indicates
end of file), and the file position is advanced by this number.
It is not an error if this number is smaller than the number of
bytes requested; this may happen for example because fewer bytes
are actually available right now (maybe because we were close to
end-of-file, or because we are reading from a pipe, or from a
terminal), or because read() was interrupted by a signal.

This means one needs to pay extra attention to how many bytes have actually been read, not only to how many bytes can be read.
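A common way to deal with short reads is to loop until the requested number of bytes has arrived. The sketch below shows that generic approach. It is not what cannelloni does, and readExact is a hypothetical helper name. Note that it blocks until all n bytes are there, the peer closes the connection, or an error occurs.

```cpp
#include <cerrno>
#include <cstddef>
#include <unistd.h>

// Reads until exactly n bytes are stored in buf, end of file is reached,
// or an error occurs.
// Returns n on success, a smaller count on end of file, and -1 on error.
ssize_t readExact(int fd, void* buf, std::size_t n) {
  char* out = static_cast<char*>(buf);
  std::size_t total = 0;
  while (total < n) {
    const ssize_t r = read(fd, out + total, n - total);
    if (r > 0) {
      total += static_cast<std::size_t>(r);  // possibly a short read, keep going
    } else if (r == 0) {
      break;  // end of file: report the short count
    } else if (errno != EINTR) {
      return -1;  // real error, errno was set by read()
    }
    // EINTR: interrupted by a signal before any data arrived, so retry
  }
  return static_cast<ssize_t>(total);
}
```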

Since cannelloni is written in C++ and designed to run on Linux machines, I am using ioctl with FIONREAD to find out how many bytes are available in the read buffer of the socket. If the decoder has advertised more bytes than are currently present, the process waits briefly so that more bytes can arrive. The whole process looks like this:

while (1) {
  /* [...] */
  /* check whether we can read enough bytes */
  ssize_t expectedBytes = decoder.expectedBytes;
  int available;
  if (ioctl(socket, FIONREAD, &available) == -1) {
    lerror << "ioctl failed" << std::endl;
    disconnect();
    continue;
  } else if (available > 0 && available < static_cast<int>(expectedBytes)) {
    /* not enough bytes are available, let's wait a bit */
    std::this_thread::sleep_for(std::chrono::milliseconds(20));
    continue;
  }

  receivedBytes = read(socket, buffer, expectedBytes);
  if (receivedBytes < 0) {
    lerror << "read error." << std::endl;
    /* close connection */
    disconnect();
    continue;
  } else if (receivedBytes == 0){
    disconnect();
    continue;
  }
  /* [...] */
  decoder.expectedBytes = decodeFrame(buffer, receivedBytes, &decoder.tempFrame, &decoder.state);
  /* [...] */
}

So far so good. The check above is there so that the code does not read fewer than expectedBytes from the socket, which is the case that needs to be accounted for.

Load testing closes the window shut

It worked fine under normal operation, but things broke quite fast during load testing with cangen vcan0 -c 1 -v -g 0 > /dev/null, which generates 30-40 Mbit/s of CAN traffic on my laptop. No data would flow between the two instances of cannelloni and no frames were bridged between the two SocketCAN interfaces. Even the TCP connection itself was silent apart from regular keep-alives.

The receiving end would actually wait for more bytes to become available on the socket: for example, 2 bytes would be in the receive buffer instead of the expected 4.

This was also visible when inspecting the connection state using ss -t, but what was more confusing were the several hundred kilobytes of data in the Send-Q of the sender. What was going on? One side was waiting for just a few more bytes of data while the other side was sitting on several hundred kilobytes! I started Wireshark and stared at the output.

Figure: Wireshark log of the TCP zero window condition being reached

The output shows that at some point a zero window is announced by the receiver. This happens when the receiver cannot accept more data, which typically occurs when the application is not consuming data fast enough and the receive buffer (ss: Recv-Q) fills up. But ss -t clearly shows an almost empty queue instead of a full one! So I replaced all the code above with a simple loop that reads up to 10 bytes from the socket and then waits 10 milliseconds. This is really inefficient: a perfect bottleneck. With this I could reproduce the zero window condition without the decoder blocking the connection forever or ending up in weird states in the protocol. Monitoring the socket again with ss -t, I was able to see that the receive queue would only fill up again once Recv-Q was really close to zero or actually zero.
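The deliberately slow reader can be sketched roughly like this. It is a reconstruction from the description above rather than the exact code I ran, and the function name drainSlowly is made up for this example.

```cpp
#include <chrono>
#include <thread>
#include <unistd.h>

// Deliberately slow consumer: reads at most 10 bytes per iteration and then
// sleeps for 10 ms, so the sender can easily outpace it.
// Returns the total number of bytes consumed until end of file, -1 on error.
long drainSlowly(int fd) {
  char buffer[10];
  long total = 0;
  for (;;) {
    const ssize_t n = read(fd, buffer, sizeof(buffer));
    if (n < 0) {
      return -1;  // read error
    }
    if (n == 0) {
      return total;  // peer closed the connection
    }
    total += n;
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
  }
}
```

Running this on the receiving socket while watching ss -t in a second terminal should make it easy to follow Recv-Q and Send-Q as the window closes and reopens.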

Keeping the window open

With the assumption that the Recv-Q actually needs to be empty before the kernel lifts the zero window condition on that connection again, I set TCP_WINDOW_CLAMP to 1, which means that the window will never be zero.

const int min_window_size = 1;
if (setsockopt(m_socket, IPPROTO_TCP, TCP_WINDOW_CLAMP, &min_window_size, sizeof(min_window_size))) {
  lerror << "Could not set window size to " << min_window_size << std::endl;
}

The receiving end is now able to fill the receive buffer byte-by-byte until the expectedBytes can be read, which fully resolved the deadlock on localhost while stress testing!

Theory

Most TCP congestion control algorithms like Reno, New Reno or CUBIC use the round-trip time (RTT) as a key metric for the latency between two network endpoints.

The RTT affects a lot of variables of the connection like the estimated bandwidth of the connection, retransmission time-outs, window size and possibly also the threshold when a once full queue is deemed empty enough to accept new data.

Since the RTT on localhost should be close to zero, the threshold of the Recv-Q should also be relatively close to zero if RTT is a factor in the calculation.

I couldn’t find reliable information on the underlying algorithms without reading the kernel source code. If you know more, shoot me an e-mail and I will update this blog post.