amphp / amp

A non-blocking concurrency framework for PHP applications. 🐘

Home Page: https://amphp.org/amp

Getting access to child process data on SIGCHLD in Loop::onSignal.

whataboutpereira opened this issue

I'm playing around with launching processes in a fiber, suspending the fiber, and then resuming it on SIGCHLD to collect results based on the child PID/exit code from pcntl_signal(SIGCHLD, ...).

However, I don't see a way to access the terminating child's PID in Loop::onSignal(). Is it possible? If not, it would be a nice feature to have.

With the event loop I'm now resuming all fibers on SIGCHLD in Loop::onSignal and suspending them again immediately inside the fiber.
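For context, this is roughly what I'm emulating with pcntl; since the signal callback never receives a PID, I reap in a loop (a sketch only, fiber bookkeeping omitted):

    use Amp\Loop;

    Loop::run(function () {
        Loop::onSignal(SIGCHLD, function () {
            // SIGCHLD carries no PID and signals can coalesce,
            // so reap every pending child in a loop.
            while (($pid = pcntl_waitpid(-1, $status, WNOHANG)) > 0) {
                $exitCode = pcntl_wexitstatus($status);
                // Look up the fiber waiting on $pid and resume it with $exitCode.
            }
        });

        // ... fork children and suspend fibers elsewhere ...
    });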

It seems like you're using Amp v2. If you want to use fibers, I'd strongly suggest taking a look at Amp v3 instead; it has fiber support built in. I'd also suggest taking a look at https://github.com/amphp/process for running child processes:
https://github.com/amphp/process/blob/1a37f978ef515e672eaf0c88601eb7aaf3510588/examples/ping-many.php
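That example boils down to roughly this (a trimmed sketch of the idea, not the file's exact contents):

    use Amp\Future;
    use Amp\Process\Process;
    use function Amp\async;
    use function Amp\ByteStream\buffer;

    $futures = [];

    foreach (['8.8.8.8', '1.1.1.1', '8.8.4.4'] as $host) {
        $futures[$host] = async(function () use ($host): string {
            $process = Process::start(['ping', '-c', '5', $host]);
            $output = buffer($process->getStdout()); // read stdout to completion
            $process->join();                        // await the exit code

            return $output;
        });
    }

    // All pings run concurrently; outputs are collected once each finishes.
    foreach (Future\await($futures) as $host => $output) {
        echo $host, ":\n", $output, "\n";
    }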

I actually eyed v3 the other day, but was put off by the lack of docs for the newer versions. :) I'm still trying to wrap my head around all the async.

I have a process that currently uses forked PHP processes to scrape 130k iterations' worth of data with curl multi (<30 requests per item) and then stores it in MySQL using PDO. I was recently forced to switch SSL on for the database connection, and it turned out PDO leaks memory with SSL when the main process has to open/close its database handle for every fork it makes. So now I'm working on doing away with forks.

I've experimented with amphp/parallel, but I couldn't figure out how to have a lean rolling worker queue that I could stop at the first error, should a child process run into problems. Promise\all etc. want to go in batches, and with Promise\first I could only stop the queue once all workers had errored, so I could be running for a long time with one functional worker. :)

The ideal scenario would be:

  • Get the 130k ids from the database, stick them into a generator and use the generator to loop through them.
  • Have a worker pool of say 4 processes and check the results one by one as they come in.
  • Queue another one once any worker finishes and stop the queue if a worker encounters a problem (letting the other 3 workers finish).

I'm currently writing docs for AMPHP v3 at https://v3.amphp.org/, so any input on what's currently hard to find out helps.

This is a perfect use case for Amp and its components, you probably don't even need multiple worker processes for that.

* You can fetch the IDs from the database using https://github.com/amphp/mysql.

* You can use [`amphp/pipeline`](https://github.com/amphp/pipeline/blob/597747610fb5ce9322db88d9e65ef5a060d593a4/examples/concurrent.php) to run all your tasks with a limited concurrency.

* You can use https://github.com/amphp/http-client instead of multi curl to run your HTTP requests.

* And use `amphp/mysql` again to persist the results back to the database. (A rough sketch of the whole flow follows below.)
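Putting those pieces together, a rough sketch with `amphp/pipeline` could look like the following. This is a sketch only: `fetchLeagueIds()` and `scrapeAndStore()` are placeholder names for your own code, not library functions.

    use Amp\Pipeline\Pipeline;

    Pipeline::fromIterable(fetchLeagueIds()) // e.g. a generator backed by amphp/mysql
        ->concurrent(4)   // at most four tasks in flight at once
        ->unordered()     // deliver results as they complete, not in input order
        ->forEach(function (int $id): void {
            // Non-blocking HTTP requests + MySQL writes for one ID.
            scrapeAndStore($id);
            // An uncaught exception here fails the pipeline and stops
            // further IDs from being dispatched.
        });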

That would indeed be the ultimate goal. :)

I managed to get a worker pool working now (parallel v2) and it's marvelous; I hadn't realized the same processes can be reused over and over via channels.

I'm still having problems with consuming results on a first-ready basis. :) Pipeline seems to preserve order, and the whole queue stalls if I happen upon a task that's slower for some reason.

    use Amp\Parallel\Worker;
    use Amp\Parallel\Worker\DefaultWorkerPool;
    use Amp\Pipeline\Queue;
    use function Amp\async;

    $pool = Worker\workerPool(new DefaultWorkerPool($workers));

    // Queue::push() waits once $workers items are buffered, giving backpressure.
    $queue = new Queue($workers);

    $pipeline = $queue->pipe();

    /** @var int[] $leagues */
    $leagues = getLeagues($mode);

    // Producer fiber: submit tasks to the pool and push their result futures.
    async(function () use ($pool, $queue, $leagues, $mode): void {
        foreach ($leagues as $id) {
            $task = new ScannerTask($id, $mode);
            $queue->push($pool->submit($task)->getResult());
        }
        $queue->complete();
    });

    // Consumer: awaits each future in submission order.
    foreach ($pipeline as $task) {
        /** @var MZLive\Scanner\Result $result */
        $result = $task->await();
        printf("League [%d] - SQL: %.4fs cURL: %.4fs RAM: %.2fM\n",
            $result->id,
            $result->sqlTime,
            $result->curlTime,
            $result->memoryUsage);
    }

Awesome! You can use `Future::iterate` to await them in order of resolution, see

    function awaitFirst(iterable $futures, ?Cancellation $cancellation = null): mixed
    {
        foreach (Future::iterate($futures, $cancellation) as $first) {
            return $first->await();
        }

        throw new CompositeLengthException('Argument #1 ($futures) is empty');
    }

for an example.
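Applied to your code, the consuming side would look roughly like this:

    foreach (Future::iterate($pipeline) as $future) {
        $result = $future->await(); // futures arrive in completion order
        // ...
    }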


Wonderful, it's always something simple. :) I'm encountering something curious though.

With `foreach ($pipeline as $task)`:

    enqueue 18534
    enqueue 18535
    enqueue 18536
    enqueue 18537
    enqueue 18538
    League [18534] - SQL: 1.1602s cURL: 6.1018s RAM: 4.00M
    enqueue 18539
    League [18535] - SQL: 0.6634s cURL: 5.6853s RAM: 4.00M
    enqueue 18540
    League [18536] - SQL: 0.3809s cURL: 5.2650s RAM: 4.00M
    enqueue 18541
    League [18537] - SQL: 1.4679s cURL: 5.9700s RAM: 4.00M
    enqueue 18542
    League [18538] - SQL: 1.6496s cURL: 4.7306s RAM: 4.00M
    enqueue 18543

With `foreach (Future::iterate($pipeline) as $task)`:

    enqueue 18535
    enqueue 18536
    enqueue 18537
    League [18535] - SQL: 0.3585s cURL: 5.3134s RAM: 4.00M
    enqueue 18538
    enqueue 18539
    League [18534] - SQL: 0.6633s cURL: 5.6861s RAM: 4.00M
    enqueue 18540
    enqueue 18541
    League [18537] - SQL: 1.0964s cURL: 5.9215s RAM: 4.00M
    enqueue 18542
    enqueue 18543
    League [18536] - SQL: 1.4769s cURL: 5.9908s RAM: 4.00M

Once I start using Future::iterate(), it starts to enqueue two items for every completed item. I can only prevent that by using a LocalSemaphore.

So this is the best I've come up with for now (I realized I don't need the Queue at all).

    use Amp\Future;
    use Amp\Parallel\Worker;
    use Amp\Parallel\Worker\DefaultWorkerPool;
    use Amp\Sync\LocalSemaphore;

    /**
     * @return \Generator<int, Future>
     */
    function getLeagueTask(array &$leagues, $mode, $pool, $semaphore): \Generator
    {
        foreach ($leagues as $id) {
            $task = new ScannerTask($id, $mode);
            // Blocks while $workers locks are held, limiting tasks in flight.
            $lock = $semaphore->acquire();
            echo "enqueue $id\n";
            yield $pool->submit($task)->getResult()->finally(fn () => $lock->release());
        }
    }

function run(int $mode, int $workers)
{
    $pool = Worker\workerPool(new DefaultWorkerPool($workers));
    $semaphore = new LocalSemaphore($workers);

    $statistics = (new ScraperStatistics($workers))->started();

    /** @var int[] $leagues */
    $leagues = getLeagues($mode);
    $tasks = getLeagueTask($leagues, $mode, $pool, $semaphore);

    // Iterate tasks in completion order.
    foreach (Future::iterate($tasks) as $task) {
        /** @var MZLive\Scanner\Result $result */
        $result = $task->await();
        $statistics->completed($result);
        printf(
            "League [%d] - SQL: %.4fs cURL: %.4fs RAM: %.2fM\n",
            $result->id,
            $result->sqlTime,
            $result->curlTime,
            $result->memoryUsage
        );
    }

    echo (string)$statistics->finished();
}

LGTM! Once you have everything on non-blocking I/O, you can skip amphp/parallel entirely and replace `yield $pool->submit($task)->getResult()->finally(fn () => $lock->release());` with `yield async($task->run(...))->finally(fn () => $lock->release());`.

Thanks! Any idea about the double enqueue without a semaphore?

@whataboutpereira `Future::iterate` iterates the passed iterable as fast as it can, so it can subscribe to the futures inside and select the next one to complete. As items are pulled from the queue that way, `Queue::push` no longer waits, because there's space in the queue buffer again. Not sure why it's two items like that, but this at least explains why it no longer does what you expect.

Aye, I noticed the same: the next one enqueues right before the results from the previous one arrive, because Future::iterate() pulls it and frees up space.

> Not sure why it's two items like that

Maybe I'll do more digging one day. :) It felt related to the worker pool. Thanks again!

Thanks once more. Got it all working now. :)

Instead of running out of memory, it's now ticking along at 110MB of total RAM usage for all processes. :)

You're welcome!

If you say all processes, I guess you're still using amphp/parallel instead of amphp/http-client directly?
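For reference, a minimal request with amphp/http-client looks roughly like this:

    use Amp\Http\Client\HttpClientBuilder;
    use Amp\Http\Client\Request;

    $client = HttpClientBuilder::buildDefault();

    $response = $client->request(new Request('https://example.com/'));
    $body = $response->getBody()->buffer(); // only this fiber waits, not the process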

Correct! It's another can of worms I might open in the future, but for now I'll sit back and enjoy. This hobby has taken too much time recently. :D

And that's done too now: running nicely with http-client v5.

Awesome, that's great news!