agourlay / dlm

Minimal HTTP download manager

[feature request] use reqwest's connection pool?

opened this issue

Thanks for making a fast and simple file downloader. For my use case (downloading millions of files) it is the first tool I found that works; the more popular aria2c simply runs out of memory when you pass it such a big file list.

In src/client.rs, I removed the .pool_max_idle_per_host(0); line. Since I am downloading many files from the same host, it is a lot faster to let reqwest keep its connection pool rather than have it close and open a new connection for every download.

Probably the most typical use case is downloading files from many different hosts, in which case you don't gain anything from keeping a connection pool open, but I also doubt it has a measurable cost.

Especially when downloading many small files from the same host, reusing the connection gives a big speedup, so maybe it makes sense as the default? Or it could be exposed as a CLI flag. The ideal (but most work) option would be to read the input file first: if the hosts are all different, add .pool_max_idle_per_host(0); to the builder; otherwise leave it off and sort (or otherwise group) the input file by host, so the connection can be reused when requesting multiple files from the same host.

Hi @amulepeweichan 👋 ,

thank you for the kind words about dlm, I am happy it is handling your use case properly!

as I am downloading many files from the same host, so it is a lot faster to let reqwest keep its connection pool

Is it something you have actually witnessed when changing the code locally?

AFAIK, setting pool_max_idle_per_host to zero does not disable connection pooling per se.
It instructs reqwest not to keep idle connections around.

An idle connection is defined via the following configuration knob:

    /// Set an optional timeout for idle sockets being kept-alive.
    ///
    /// Pass `None` to disable timeout.
    ///
    /// Default is 90 seconds.
    pub fn pool_idle_timeout<D>(mut self, val: D) -> ClientBuilder

So the current behavior is, I believe, that all connections idle for at least 90 seconds are terminated.

The number of concurrent downloads is set at the application level, so the same number of connections is always in use, with no time to become idle. A connection is reused right away at the end of a download for the next file.
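For reference, here is a sketch of how these two knobs fit together in reqwest's ClientBuilder. This is illustrative, not dlm's exact src/client.rs:

    use std::time::Duration;

    // Sketch: build a client that closes connections as soon as they go
    // idle. Pooling itself stays enabled; only idle retention is turned off.
    fn build_client() -> reqwest::Result<reqwest::Client> {
        reqwest::Client::builder()
            // keep zero idle connections per host
            .pool_max_idle_per_host(0)
            // the default idle timeout is 90 seconds; with zero idle
            // connections allowed, it never comes into play
            .pool_idle_timeout(Duration::from_secs(90))
            .build()
    }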

This is my mental model for the current internals of dlm.

I am happy to change things if this appears to not reflect your experience.

I realized that my answer does not cover the case where multiple hosts are targeted.

In that case, depending on the order of the links in the input files, connections could be recreated.

However, keeping a potentially unbounded number of idle connections open is not something desirable at scale.

A practical workaround is to sort the links in the input file by host, to ensure the best utilization of the warm connections.
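That workaround could be sketched in Rust as follows. The host extraction here is a naive string split, not a full URL parser, and dlm does not currently do this:

```rust
/// Return the host part of a URL, e.g. "http://a.com/x" -> "a.com".
/// Naive string splitting for illustration; a real implementation
/// would use a proper URL parser.
fn host_of(url: &str) -> &str {
    let rest = url.split_once("://").map(|(_, r)| r).unwrap_or(url);
    rest.split(|c| c == '/' || c == ':').next().unwrap_or(rest)
}

/// Sort links so that all URLs sharing a host are adjacent,
/// letting a warm connection serve them back to back.
fn sort_by_host(links: &mut [String]) {
    links.sort_by(|a, b| host_of(a).cmp(host_of(b)));
}
```

Since the sort is stable, the original order of links within each host is preserved.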

You're right, I should definitely benchmark and test it to confirm. I ran some tests locally, just by adding to my nginx config, outside the server block:

log_format connections '[$time_local] "$request" $connection $connection_requests';

and inside the server block:

access_log /var/log/nginx/connections.log connections;

Now the last number in each log line is the number of times the connection has been reused. Then, in my document root, I created test files:

for i in `seq 1 256`; do echo $i > $i.txt; done

Then I made an input file for dlm:

for i in `seq 1 256`; do echo http://localhost/$i.txt >> filelist.txt; done

Then I ran:

time dlm -i filelist.txt -o out/ -M 1 2>/dev/null > /dev/null

In the nginx log, it shows a new connection being made for every request.
I then did the same with my build of dlm without .pool_max_idle_per_host(0).

Now the nginx log shows each connection being reused for 100 requests before a new one is made. I don't know if the limit of 100 comes from nginx or reqwest (nginx's keepalive_requests directive has long defaulted to 100, so it is likely nginx).

I ran each several times (doing rm out/* between runs): the official build consistently takes 1 second, while the build without .pool_max_idle_per_host(0) consistently takes 0.5 seconds.

And that is for a server running on localhost; for a webserver across the internet, where reconnecting incurs more latency, I assume the speed difference will be even bigger.

The number of concurrent downloads is set at the application level, so the same number of connections is always in use, with no time to become idle. A connection is reused right away at the end of a download for the next file.

I think that with an idle limit of 0, the connection is closed as soon as a request ends, before the next request is made, even if the next request follows immediately. Maybe an idle timeout of 1 second would keep it open for the next request if it targets the same host, while keeping the pool of open connections small when requests go to different hosts.
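The compromise suggested here would look roughly like this in the builder. This is a sketch; the 1-second value is the commenter's hypothetical, not a benchmarked recommendation:

    use std::time::Duration;

    // Sketch: keep idle connections for a short grace period instead of
    // closing them immediately. Back-to-back requests to the same host
    // can reuse the socket, while the idle pool stays small when the
    // input targets many different hosts.
    fn build_client() -> reqwest::Result<reqwest::Client> {
        reqwest::Client::builder()
            .pool_idle_timeout(Duration::from_secs(1))
            .build()
    }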

One more small, unrelated thing I did, which I'll mention here rather than open another issue, is enabling compression. I added this to Cargo.toml:

reqwest = { version = "0.11.11", features = ["gzip"] }

Brotli compression is probably better and faster, but in my case the server doesn't support it. No code changes are needed: when reqwest is built with that feature, it sends the Accept-Encoding: gzip header by default.

For my current downloads it has sped things up by 20%.

Thank you for your investigation 👍

Given the time you have spent on this issue, I have decided to remove the .pool_max_idle_per_host(0) constraint and rely on Reqwest's defaults.

Regarding the gzip feature, I am happy to enable it if it helps.

EDIT: I took the liberty to edit your messages due to formatting issues