native-tls TLS streams are not guaranteed to adhere to AsyncWrite due to TLS libraries incorrectly reporting the amount of consumed bytes

Question

native-tls TLS streams are not guaranteed to adhere to AsyncWrite due to TLS libraries incorrectly reporting the amount of consumed bytes

Matthias247 opened this issue 4 years ago · comments

The AsyncWrite::poll_write method is defined to return

Poll::Ready(Ok(n)) if n bytes of data have been immediately written
Poll::Pending means that no data was written from the buffer provided

This means that if call to poll_write returns Ready(Ok(3)) it should be guaranteed that 3 bytes have been written and the remaining buf.len() - 3 bytes have not been touched. Based on that assumption, an application could present a different buffer the in the next poll_write call, and expect content from that buffer to be correctly transmitted.

This behavior is currently not guaranteed if the native-tls implementations are utilized, due to how some TLS libraries are behaving in case of non-blocking IO:

When the TLS libraries public write API is called, it isn't aware yet how many byte can later on be actually written into the socket. Therefore it at this point in time will try to copy as much bytes as possible (e.g. N) from the user supplied buffer into its internal buffer, and create and encrypt a TLS record. At this point M bytes have been copied, creating a TLS record of size M+K. After this is done, the TLS library tried to flush the created record to the socket. This might however lead to a partial write (e.g of M+K-5, which equals to N-5).

Now the TLS library could theoretically return N as the return value of the write call, since all data had been buffered. However if it would do that, the application wouldn't be aware that not all data was written, and wouldn't e.g. be aware that a new epoll registration would be necessary. Therefore some of the libraries return a value < N towards the application, in order to make sure the write call is repeated if the socket gets ready for writing again. On the next write the first set of bytes is skipped, since those already have been copied into a record, and only new bytes after a certain offset will really be taken into account.

This means that if an application presents different data on the next write call, a certain amount of bytes at the beginning of the buffer might be ignored/skipped, and only the remaining buffer would be written.

If an application relies on being able to dynamically change the data before each poll_write call, it might end up with a corrupted stream. This could e.g. happen with a HTTP/2 library which creates frames in an on-demand fashion before each poll_write call, and where an event between 2 poll_write calls - like closing a Stream - could lead to the previous data being discarded in favor of newer data.

TLS libraries therefore sometimes document that representing the same data on each write is necessary. E.g. openssl provides the following warning for SSL_write:

When a write function call has to be repeated because SSL_get_error(3) returned SSL_ERROR_WANT_READ or SSL_ERROR_WANT_WRITE, it must be repeated with the same arguments. The data that was passed might have been partially processed. When SSL_MODE_ACCEPT_MOVING_WRITE_BUFFER was set using SSL_CTX_set_mode(3) the pointer can be different, but the data and length should still be the same.

The same was reported for s2n.

Applications which purely make use of write_all APIs should not encounter any issues.

The rustls version of tokio-tls does not have the problem, since it always reports all bytes have been copied from the input buffer into the TLS buffer. It might therefore not be able to report that some bytes are "stuck" in the TLS session, and haven't been flushed. But since it has a dedicated poll_flush method that is expected and ok.

Potential fixes

Only offer async fn write_all(&[u8]) for native TLS instead of AsyncWrite
Improve documentation to highlight the issue