native-tls TLS streams are not guaranteed to adhere to AsyncWrite due to TLS libraries incorrectly reporting the amount of consumed bytes
Matthias247 opened this issue · comments
The AsyncWrite::poll_write
method is defined to return
Poll::Ready(Ok(n))
if n bytes of data have been immediately writtenPoll::Pending
means that no data was written from the buffer provided
This means that if call to poll_write
returns Ready(Ok(3))
it should be guaranteed that 3 bytes have been written and the remaining buf.len() - 3
bytes have not been touched. Based on that assumption, an application could present a different buffer the in the next poll_write
call, and expect content from that buffer to be correctly transmitted.
This behavior is currently not guaranteed if the native-tls
implementations are utilized, due to how some TLS libraries are behaving in case of non-blocking IO:
When the TLS libraries public write
API is called, it isn't aware yet how many byte can later on be actually written into the socket. Therefore it at this point in time will try to copy as much bytes as possible (e.g. N
) from the user supplied buffer into its internal buffer, and create and encrypt a TLS record. At this point M
bytes have been copied, creating a TLS record of size M+K
. After this is done, the TLS library tried to flush the created record to the socket. This might however lead to a partial write (e.g of M+K-5
, which equals to N-5
).
Now the TLS library could theoretically return N
as the return value of the write
call, since all data had been buffered. However if it would do that, the application wouldn't be aware that not all data was written, and wouldn't e.g. be aware that a new epoll
registration would be necessary. Therefore some of the libraries return a value < N
towards the application, in order to make sure the write
call is repeated if the socket gets ready for writing again. On the next write the first set of bytes is skipped, since those already have been copied into a record, and only new bytes after a certain offset will really be taken into account.
This means that if an application presents different data on the next write
call, a certain amount of bytes at the beginning of the buffer might be ignored/skipped, and only the remaining buffer would be written.
If an application relies on being able to dynamically change the data before each poll_write
call, it might end up with a corrupted stream. This could e.g. happen with a HTTP/2 library which creates frames in an on-demand fashion before each poll_write
call, and where an event between 2 poll_write
calls - like closing a Stream - could lead to the previous data being discarded in favor of newer data.
TLS libraries therefore sometimes document that representing the same data on each write
is necessary. E.g. openssl provides the following warning for SSL_write:
When a write function call has to be repeated because SSL_get_error(3) returned SSL_ERROR_WANT_READ or SSL_ERROR_WANT_WRITE, it must be repeated with the same arguments. The data that was passed might have been partially processed. When SSL_MODE_ACCEPT_MOVING_WRITE_BUFFER was set using SSL_CTX_set_mode(3) the pointer can be different, but the data and length should still be the same.
The same was reported for s2n.
Applications which purely make use of write_all
APIs should not encounter any issues.
The rustls version of tokio-tls
does not have the problem, since it always reports all bytes have been copied from the input buffer into the TLS buffer. It might therefore not be able to report that some bytes are "stuck" in the TLS session, and haven't been flushed. But since it has a dedicated poll_flush
method that is expected and ok.
Potential fixes
- Only offer
async fn write_all(&[u8])
for native TLS instead ofAsyncWrite
- Improve documentation to highlight the issue