guzzle / guzzle

Guzzle, an extensible PHP HTTP client

Home Page: https://docs.guzzlephp.org/


Add curl handler streaming support

mvorisek opened this issue

Feature request

Based on the discussion below, it seems the main bottleneck is caching the whole response through disk. This issue is therefore a feature request to add streaming support to the curl handler.

Guzzle version(s) affected: any (6.5, 7.5)
PHP version: any
cURL version: any

Description

PHP's internal web server can serve data at >4 GB/s. The biggest bottleneck is the Guzzle client, which seems to be capped at around 500 MB/s.

How to reproduce

Server side:

server.php

<?php

$sizeMb = 1 * 1024;
$sizeBytes = $sizeMb * 1024 * 1024;

header('Content-Type: application/octet-stream');
header('Content-Length: ' . $sizeBytes);
header('Content-Disposition: attachment; filename="test.bin"');

$string = str_repeat(bin2hex(random_bytes(128)), 256); // 64 kB chunk
for ($i = $sizeBytes / strlen($string); $i > 0; $i--) { // "> 0": emit exactly $sizeBytes, matching Content-Length
    echo $string;
}

cmd:

php -S 127.0.0.1:9687 -t dir_path

Client side:

    public function testHugeOutputStream(): void
    {
        $client = new \GuzzleHttp\Client();
        $sizeMb = 1 * 1024;
        $sizeBytes = $sizeMb * 1024 * 1024;
        $response = $client->request('GET', 'http://127.0.0.1:9687/server.php');
        static::assertSame(200, $response->getStatusCode());
        static::assertSame('application/octet-stream', $response->getHeaderLine('Content-Type'));
        static::assertSame((string) $sizeBytes, $response->getHeaderLine('Content-Length'));

        $stream = $response->getBody();
        $length = 0;
        while (!$stream->eof()) {
            $length += strlen($stream->read(16 * 1024));
        }
        static::assertSame($sizeBytes, $length);
    }

The test passes, but the typical run time is around 2.2 seconds on both Windows and Linux. I would expect >2 GB/s.

Possible Solution
Maybe the response is internally read in small receive buffers, etc.; I don't know.

Can you replicate this in Guzzle 7? Guzzle 6 is EOL.

Possibly the slow part here is just your chosen read size: $stream->read(16 * 1024)? Did you try setting that much higher?

How long are the DNS lookup and connection setup taking? That may be responsible for up to ~100 ms of the timing there.

I tried 10 GB and 100 GB. With 10 GB, the speed is about the same. With 100 GB I hit the PHP time limit, but the speed would probably be the same too.

With Guzzle v7.5.0 (latest), I do not see any significant performance improvement, and none with a 100x larger read chunk size either.

Are you able to provide a flame graph so we can determine if this is just PHP being slow, or if we have some bottleneck on our side? Are you using the curl adapter, or the PHP streams adapter?

I am on Windows, so I cannot provide a flame graph. On Linux, I tested the performance using PHPUnit and GH Actions, so getting one would probably not be that easy...

I benchmarked the server side with wget and I can get >2 GB/s consistently, so the bottleneck has to be on the client side.
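
For reference, an invocation along these lines, writing to /dev/null so the client's own disk is kept out of the measurement (the URL assumes the test server above):

wget -O /dev/null http://127.0.0.1:9687/server.php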

Are you using the curl adapter, or the PHP streams adapter?

I have allow_url_fopen disabled, so probably curl (which is enabled). The Guzzle client is created like: new \GuzzleHttp\Client().

Are you able to reproduce it on your side? Is there anything that can be improved in this lib or in php-src (curl/streams)?

I have not had any time to look into this. If you could get a flamegraph going, that would be helpful.

The default curl handler eagerly retrieves the full response body into a php://temp stream before returning the response. php://temp will spill over into a disk-backed file after a certain size (2 MB by default). Thus you are reading the entire response body from the network, writing it into a file, and then reading the entire file. You are likely bottlenecked by your disk there.
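
A minimal standalone sketch of that spill-over behavior (the explicit /maxmemory suffix below is just for illustration; 2 MB is the default):

<?php

// php://temp buffers data in memory up to a threshold and transparently
// spills over into a temporary file on disk beyond it. The threshold is
// configurable via the /maxmemory:<bytes> suffix.
$sink = fopen('php://temp/maxmemory:' . (2 * 1024 * 1024), 'r+b');

// Writing more than the threshold forces the disk-backed spill-over;
// this is what makes a large, eagerly buffered response disk-bound.
fwrite($sink, str_repeat('a', 4 * 1024 * 1024)); // 4 MB > 2 MB threshold

rewind($sink);
$firstChunk = fread($sink, 8192); // reads now go through the temp file
fclose($sink);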

By setting 'stream' => true to use the StreamHandler instead of the CurlHandler, I achieve 654 MB/s against a local nginx without TLS. This is bottlenecked by PHP's default chunk size of 8192 bytes, resulting in quite a bit of syscall overhead. By calling stream_set_chunk_size($resource, 1024 * 1024) and raising the ->read() size accordingly, I was able to achieve 760 MB/s on a 2.4 GHz CPU.
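
A sketch of that approach, assuming allow_url_fopen is enabled (so the default handler stack can serve streaming requests via the StreamHandler) and assuming the local test server from above:

<?php

use GuzzleHttp\Client;

$client = new Client();
$response = $client->request('GET', 'http://127.0.0.1:9687/server.php', [
    'stream' => true, // lazy body, no php://temp buffering
]);

// detach() hands back the underlying PHP stream resource so that
// stream_set_chunk_size() can raise the per-read chunk size from the
// 8192-byte default, cutting down on syscall overhead.
$resource = $response->getBody()->detach();
stream_set_chunk_size($resource, 1024 * 1024);

$length = 0;
while (!feof($resource)) {
    $length += strlen((string) fread($resource, 1024 * 1024));
}
fclose($resource);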

In the atk4/ui project we have a huge data streaming test: https://github.com/atk4/ui/blob/9ba705b3ced0b99b31b8378b8fdd33774d0647ca/tests/DemosTest.php#L347

I modified the class to use plain HTTP by changing these methods:

    protected function getClient(): Client
    {
        return new Client(['base_uri' => 'http://localhost/vendor/atk4/ui/']);
    }

    protected function getResponseFromRequest(string $path, array $options = []): ResponseInterface
    {
        $options['stream'] = true;
        ...


    public function testHugeOutputStream(): void
    {
        $sizeMb = 5_000; // 5 GB
        ...

With $options['stream'] = true; it seems there is still something transferred via the disk, as the disk write spikes to >200 MB/s. And the test fails :(.

@TimWolla can you please test this with the atk4/ui test?

Also, as allow_url_fopen is very often disabled, can streaming support be added to curl?

can you please test this with the atk4/ui test?

I'm not going to debug your non-trivial codebase for you. I've provided the necessary pointers; the rest is up to you.

By setting 'stream' => true to use the StreamHandler instead of the CurlHandler

The different output surprised me.

Thank you. As I wrote above, I need to use curl anyway, as allow_url_fopen cannot be enabled. So this is a feature request to add streaming support for curl. This library is quite popular and there are many use cases where streaming support is needed or at least helpful, as caching through disk is definitely not wanted for repeated large data transfers.

And the test fails :(

Ignore this comment; I was using a test server with a limited response body size.

With CURLOPT_WRITEFUNCTION it should be possible - example usage: https://stackoverflow.com/questions/2294344/what-for-do-we-use-curlopt-writefunction-in-phps-curl
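
A minimal standalone sketch of that technique, assuming the local test server from above; each chunk is handled as it arrives, with no intermediate file:

<?php

$ch = curl_init('http://127.0.0.1:9687/server.php');

$length = 0;
curl_setopt($ch, CURLOPT_WRITEFUNCTION, function ($ch, string $chunk) use (&$length): int {
    // Process each chunk as it arrives. Returning the number of
    // bytes consumed tells cURL to continue the transfer.
    $length += strlen($chunk);

    return strlen($chunk);
});

curl_exec($ch);
curl_close($ch);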

The current implementation already uses it - https://github.com/guzzle/guzzle/blob/7.7.0/src/Handler/CurlFactory.php#L413 - so my question is: what is blocking the curl implementation from allowing streaming without an intermediate file?

In case using cURL is not mandatory for you, here is an alternative:

The stream option does not seem to be supported by Guzzle's default cURL handler.

I was able to prevent responses from being cached as temporary files by configuring the client as follows:

use GuzzleHttp\Handler\StreamHandler;
use GuzzleHttp\RequestOptions;
use GuzzleHttp\HandlerStack;
use GuzzleHttp\Client;
...

$streamHandler = new StreamHandler();
$stack = HandlerStack::create($streamHandler);

$client = new Client([
    // ... (here you can add other options)
    RequestOptions::STREAM => true,
    'handler' => $stack,
]);
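
For completeness, a usage sketch with the client configured above (the URL is a placeholder for the test server used earlier):

$response = $client->get('http://127.0.0.1:9687/server.php');

$body = $response->getBody();
while (!$body->eof()) {
    $chunk = $body->read(64 * 1024); // process each chunk as it arrives
}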

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 2 weeks if no further activity occurs. Thank you for your contributions.