roach-php / core

The complete web scraping toolkit for PHP.

Home page: https://roach-php.dev


Duplicate requests being dispatched even with RequestDeduplicationMiddleware in place

awebartisan opened this issue

I have a list of URLs in the database and I'm scraping specific information from these URLs.
I have split the URLs into batches of 50 and dispatch a job for each batch, passing it the database offset to start from.

Each job fetches its 50 URLs from the database and the spider starts sending requests: 2 concurrent requests with a 1-second delay.
At some point it starts sending duplicate requests, as can be seen below, and the deduplication middleware doesn't report or drop them. Not sure what's going on here. Any thoughts?

[2022-04-24 04:11:20] local.INFO: Dispatching request {"uri":"https://brooklinen.com"}
[2022-04-24 04:11:20] local.INFO: Dispatching request {"uri":"https://brooklinen.com"}
[2022-04-24 04:11:20] local.INFO: Dispatching request {"uri":"https://brooklinen.com"}
[2022-04-24 04:11:20] local.INFO: Dispatching request {"uri":"https://taotronics.com"}
[2022-04-24 04:11:20] local.INFO: Dispatching request {"uri":"https://taotronics.com"}
[2022-04-24 04:11:20] local.INFO: Dispatching request {"uri":"https://taotronics.com"}
[2022-04-24 04:11:23] local.INFO: Item scraped {"store_id":260,"name":"Brooklinen® | The Internet's Favorite Sheets","description":"Luxury bed sheets, pillows, comforters, & blankets delivered straight to your door. The best way to outfit your bedroom.","twitter":"https://twitter.com/brooklinen","facebook":"https://www.facebook.com/Brooklinen/","instagram":"https://www.instagram.com/brooklinen/","contact_us":"https://www.brooklinen.com/pages/contact"}
[2022-04-24 04:11:23] local.INFO: Item scraped {"store_id":260,"name":"Brooklinen® | The Internet's Favorite Sheets","description":"Luxury bed sheets, pillows, comforters, & blankets delivered straight to your door. The best way to outfit your bedroom.","twitter":"https://twitter.com/brooklinen","facebook":"https://www.facebook.com/Brooklinen/","instagram":"https://www.instagram.com/brooklinen/","contact_us":"https://www.brooklinen.com/pages/contact"}
[2022-04-24 04:11:23] local.INFO: Item scraped {"store_id":260,"name":"Brooklinen® | The Internet's Favorite Sheets","description":"Luxury bed sheets, pillows, comforters, & blankets delivered straight to your door. The best way to outfit your bedroom.","twitter":"https://twitter.com/brooklinen","facebook":"https://www.facebook.com/Brooklinen/","instagram":"https://www.instagram.com/brooklinen/","contact_us":"https://www.brooklinen.com/pages/contact"}
[2022-04-24 04:11:24] local.INFO: Item scraped {"store_id":261,"name":"TaoTronics Official Site - Technology Enhances Life – TaoTronics US","description":"TaoTronics official website offers ice makers, air conditioner, tower fan, air cooler, humidifiers, air purifier, True Wireless headphones, noise cancelling headphones, sports headphones, TV sound bar and PC sound bar, LED lamp, therapy lamp, ring light, desk lamp as well as floor lamp at factory direct prices.","twitter":"https://twitter.com/TaoTronics","facebook":"https://www.facebook.com/TaoTronics/","instagram":"https://www.instagram.com/taotronics_official/","contact_us":"https://taotronics.com/pages/contact-us"}
[2022-04-24 04:11:24] local.INFO: Item scraped {"store_id":261,"name":"TaoTronics Official Site - Technology Enhances Life – TaoTronics US","description":"TaoTronics official website offers ice makers, air conditioner, tower fan, air cooler, humidifiers, air purifier, True Wireless headphones, noise cancelling headphones, sports headphones, TV sound bar and PC sound bar, LED lamp, therapy lamp, ring light, desk lamp as well as floor lamp at factory direct prices.","twitter":"https://twitter.com/TaoTronics","facebook":"https://www.facebook.com/TaoTronics/","instagram":"https://www.instagram.com/taotronics_official/","contact_us":"https://taotronics.com/pages/contact-us"}
[2022-04-24 04:11:24] local.INFO: Item scraped {"store_id":261,"name":"TaoTronics Official Site - Technology Enhances Life – TaoTronics US","description":"TaoTronics official website offers ice makers, air conditioner, tower fan, air cooler, humidifiers, air purifier, True Wireless headphones, noise cancelling headphones, sports headphones, TV sound bar and PC sound bar, LED lamp, therapy lamp, ring light, desk lamp as well as floor lamp at factory direct prices.","twitter":"https://twitter.com/TaoTronics","facebook":"https://www.facebook.com/TaoTronics/","instagram":"https://www.instagram.com/taotronics_official/","contact_us":"https://taotronics.com/pages/contact-us"}

Is it possible that multiple instances of the same spider are sharing the same requests?

Are these logs from multiple spider runs or are they all from the same run? The RequestDeduplicationMiddleware only looks at requests that were sent during the current run, so if you start multiple spiders with the same URLs, they will all scrape the same sites.

My first guess would be that you are dispatching multiple jobs at the same time and they all query the same records from the database. Can you maybe show what the code that dispatches your jobs looks like?

This is how I am dispatching jobs from a console command.

    public function handle(): int
    {
        for ($offset = 1; $offset <= 1000; $offset = $offset + 50) {
            dispatch(new ScrapeStoreSocialLinksJob($offset));
        }

        return 0;
    }

Below is what my job looks like:

    public $timeout = 300;

    public function __construct(public int $offset)
    {}

    public function handle()
    {
        Roach::startSpider(StoreSocialLinksSpider::class, context: ['offset' => $this->offset]);
    }

These logs are from different runs, but from the logs I can see that these runs start at the same time and end at the same time.

I have even tried chaining these jobs so that the next job is dispatched only after the previous one completes, but I still get duplicate runs.
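For reference, the chaining I tried looked roughly like this (a sketch using Laravel's Bus facade; job class and offsets as in my command above):

    use Illuminate\Support\Facades\Bus;

    // Each job in the chain is dispatched only after the previous one
    // finishes, so the spiders should never run concurrently.
    Bus::chain([
        new ScrapeStoreSocialLinksJob(1),
        new ScrapeStoreSocialLinksJob(51),
        new ScrapeStoreSocialLinksJob(101),
        // …
    ])->dispatch();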

Can you show what the initialRequests method of your spider looks like?

    protected function initialRequests(): array
    {
        return ShopifyStore::query()
            ->offset($this->context['offset'])
            ->limit(50)
            ->get()
            ->map(function (ShopifyStore $shopifyStore) {
                $request = new Request(
                    'GET',
                    "https://" . $shopifyStore->url,
                    [$this, 'parse']
                );
                return $request->withMeta('store_id', $shopifyStore->id);
            })->toArray();
    }

Behaviour I noticed in the logs:

  • When the first 5 jobs are dispatched, everything works as expected.
  • When one of those 5 jobs completes and the 6th is dispatched, I see requests being duplicated 2 times.
  • When a second job from the first 5 completes and the 7th is dispatched, I see requests being duplicated 3 times.

Below are some stats from the logs

[2022-04-25 05:31:36] local.INFO: Run statistics {"duration":"00:00:57","requests.sent":150,"requests.dropped":0,"items.scraped":146,"items.dropped":0}
[2022-04-25 05:31:36] local.INFO: Run statistics {"duration":"00:00:57","requests.sent":100,"requests.dropped":0,"items.scraped":98,"items.dropped":0}
[2022-04-25 05:31:36] local.INFO: Run statistics {"duration":"00:00:57","requests.sent":50,"requests.dropped":0,"items.scraped":48,"items.dropped":0}
[2022-04-25 05:31:36] local.INFO: Run finished
[2022-04-25 05:31:36] local.INFO: Run finished
[2022-04-25 05:31:36] local.INFO: Run finished

This may be a silly question, but does your ShopifyStore table contain any duplicates? I can't really see what could be going wrong otherwise. It's also a little strange that requests.sent and items.scraped both change by exactly 50 (which is also your limit). Does your parse method dispatch additional requests for certain responses?
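In case it helps, something like this should surface duplicate URLs (a sketch using Laravel's query builder; the table and column names are assumed):

    use Illuminate\Support\Facades\DB;

    // Hypothetical check: list every URL that appears more than once
    // in the (assumed) shopify_stores table.
    $duplicates = DB::table('shopify_stores')
        ->select('url', DB::raw('COUNT(*) as occurrences'))
        ->groupBy('url')
        ->having('occurrences', '>', 1)
        ->get();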

After your comment I went ahead and checked the table for duplicates. There were indeed some, and I removed them.

But the problem is still happening.

Below is my Spider's full source code:

<?php

namespace App\Spiders;

use App\Extractors\Stores\AssignCategory;
use App\Extractors\Stores\ExtractContactUsPageLink;
use App\Extractors\Stores\ExtractDescription;
use App\Extractors\Stores\ExtractFacebookProfileLink;
use App\Extractors\Stores\ExtractInstagramProfileLink;
use App\Extractors\Stores\ExtractLinkedInProfileLink;
use App\Extractors\Stores\ExtractTikTokProfileLink;
use App\Extractors\Stores\ExtractTitle;
use App\Extractors\Stores\ExtractTwitterProfileLink;
use App\Models\ShopifyStore;
use App\Processors\SocialLinksDatabaseProcessor;
use Generator;
use Illuminate\Pipeline\Pipeline;
use RoachPHP\Downloader\Middleware\RequestDeduplicationMiddleware;
use RoachPHP\Extensions\LoggerExtension;
use RoachPHP\Extensions\StatsCollectorExtension;
use RoachPHP\Http\Request;
use RoachPHP\Http\Response;
use RoachPHP\Spider\BasicSpider;
use RoachPHP\Spider\ParseResult;

class StoreSocialLinksSpider extends BasicSpider
{
    public array $startUrls = [
        //
    ];

    public array $downloaderMiddleware = [
        RequestDeduplicationMiddleware::class,
    ];

    public array $spiderMiddleware = [
        //
    ];

    public array $itemProcessors = [
        //SocialLinksDatabaseProcessor::class,
    ];

    public array $extensions = [
        LoggerExtension::class,
        StatsCollectorExtension::class,
    ];

    public int $concurrency = 2;

    public int $requestDelay = 1;

    /**
     * @return Generator<ParseResult>
     */
    public function parse(Response $response): Generator
    {
        $storeData = [
            'store_id' => $response->getRequest()->getMeta('store_id')
        ];

        [, $storeData] = app(Pipeline::class)
            ->send([$response, $storeData])
            ->through([
                ExtractTitle::class,
                ExtractDescription::class,
                ExtractTwitterProfileLink::class,
                ExtractFacebookProfileLink::class,
                ExtractInstagramProfileLink::class,
                ExtractTikTokProfileLink::class,
                ExtractLinkedInProfileLink::class,
                ExtractContactUsPageLink::class
            ])
            ->thenReturn();

        yield $this->item($storeData);
    }

    protected function initialRequests(): array
    {
        return ShopifyStore::query()
            ->offset($this->context['offset'])
            ->limit(50)
            ->get()
            ->map(function (ShopifyStore $shopifyStore) {
                $request = new Request(
                    'GET',
                    "https://" . $shopifyStore->url,
                    [$this, 'parse']
                );
                return $request->withMeta('store_id', $shopifyStore->id);
            })->toArray();
    }
}

The parse() method is not making any additional requests.

My thinking here is that something is going on with the spider's instance and the container.

So my thinking is that the spiders aren't actually sending duplicate requests, but that the extensions (the Logger and StatsCollector, specifically) are reacting to events from different spiders. Couple more questions:

  • Are your jobs actually being queued or do they run on the sync queue?
  • Can you verify that you actually get duplicated items in your SocialLinksDatabaseProcessor?
  • Are you using Laravel Octane?
  • I am using Redis + Laravel Horizon for queues.
  • I can verify that shortly (but it can be assumed that this processor just receives the items being scraped, so it will contain duplicates).
  • I am not using Laravel Octane.

Hey @ksassnowski, you are right about the second part: in my SocialLinksDatabaseProcessor I am not getting duplicate items for the duplicate URLs.

So your thinking about the extensions like Logger and StatsCollector sounds right to me.

Just wanted to chime in that I'm experiencing something similar. I have two spiders executed from a single Laravel command. Executing either one on its own results in the StatsCollector outputting the expected results. However, if I execute both spiders, I get a third StatsCollector output that looks like a combination of the two. Even if I put a sleep(5) between their executions in the command, the third, cumulative StatsCollector output still occurs.

I understand why this happens in your case, @code-poel. Assuming your handle method looks something like this:

public function handle()
{
    Roach::startSpider(MySpider1::class);
    Roach::startSpider(MySpider2::class);
}

This is because the EventDispatcher that all extensions rely on gets registered as a singleton. So every spider you run in the same PHP "process" will essentially register its extensions as event listeners again. That's why I was wondering whether @awebartisan used Laravel Octane or something similar. It sounded like his commands only spawn a single spider per command, so that shouldn't happen.
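To illustrate the mechanism, here is a minimal, self-contained sketch (these are not Roach's actual classes) of how a shared dispatcher double-fires when a second run registers its listeners on the same singleton:

```php
<?php

// Toy dispatcher standing in for the shared singleton.
class EventDispatcher
{
    /** @var array<string, callable[]> */
    private array $listeners = [];

    public function listen(string $event, callable $listener): void
    {
        $this->listeners[$event][] = $listener;
    }

    public function dispatch(string $event, string $payload): void
    {
        foreach ($this->listeners[$event] ?? [] as $listener) {
            $listener($payload);
        }
    }
}

$dispatcher = new EventDispatcher(); // singleton shared by both runs
$log = [];

// Run 1 registers its logger extension and sends a request.
$dispatcher->listen('request.sending', function (string $uri) use (&$log) {
    $log[] = "run1 saw {$uri}";
});
$dispatcher->dispatch('request.sending', 'https://example.com/a');

// Run 2 starts in the same process and registers *again*.
$dispatcher->listen('request.sending', function (string $uri) use (&$log) {
    $log[] = "run2 saw {$uri}";
});
$dispatcher->dispatch('request.sending', 'https://example.com/b');

// Run 2's single request was logged twice: once per registered listener.
print_r($log);
```

This matches the duplicated "Dispatching request" lines above: each additional run in the same process adds one more set of listeners, so every request is logged once per run started so far (2 duplicates after the 6th job, 3 after the 7th).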

The solution might be to assign every run a unique id and include that as part of the event payload. Then I could scope the events and all corresponding handlers to just that id, even if multiple spiders get started in the same process. I have to check if this can be done without a BC break.
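A minimal sketch of that idea (a hypothetical API, not Roach's real classes): every event carries its run's id, and each listener ignores events from other runs sharing the same dispatcher:

```php
<?php

// Toy dispatcher standing in for the shared singleton.
class EventDispatcher
{
    /** @var callable[] */
    private array $listeners = [];

    public function listen(callable $listener): void
    {
        $this->listeners[] = $listener;
    }

    public function dispatch(array $event): void
    {
        foreach ($this->listeners as $listener) {
            $listener($event);
        }
    }
}

// Wrap a listener so it only reacts to events from its own run.
function makeRunScopedListener(string $runId, array &$log): callable
{
    return function (array $event) use ($runId, &$log) {
        if ($event['runId'] !== $runId) {
            return; // belongs to another spider run in this process
        }
        $log[] = $event['uri'];
    };
}

$dispatcher = new EventDispatcher(); // still a singleton
$log1 = [];
$log2 = [];

$dispatcher->listen(makeRunScopedListener('run-1', $log1));
$dispatcher->listen(makeRunScopedListener('run-2', $log2));

// An event from run 2 reaches both listeners, but only run 2 records it.
$dispatcher->dispatch(['runId' => 'run-2', 'uri' => 'https://example.com/b']);

print_r($log1); // []
print_r($log2); // ['https://example.com/b']
```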

> This is because the EventDispatcher that all extensions rely on gets registered as a singleton. So every spider you run in the same PHP "process" will essentially register its extensions as event listeners again.

Yup, that's exactly right. Thanks for the clarification on the root cause!

This bug has existed for more than a year; why hasn't it been fixed by now?

Because no one has opened a PR yet to fix it.