spatie / crawler

An easy to use, powerful crawler implemented in PHP. Can execute Javascript.

Home Page: https://freek.dev/308-building-a-crawler-in-php

Modify html before DomCrawler

posipa opened this issue

Can you add a handler to modify the content before https://github.com/spatie/crawler/blob/master/src/LinkAdder.php#L61 ? Please :)
I had to crawl a page that starts like this:

<?xml version="1.0" encoding="utf-8"?><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="pl-pl" lang="pl-pl">
<head>

The Symfony DomCrawler (https://symfony.com/doc/current/components/dom_crawler.html) cannot find any links on it, because the document is not recognized as HTML, so this line does not work as it should: https://github.com/spatie/crawler/blob/master/src/LinkAdder.php#L63

Feel free to send a PR that adds that feature. Make sure to update the readme and tests as well. Thanks!

You can already do this with a middleware on the client's handler stack (feel free to inject dependencies through the constructor):

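use GuzzleHttp\Psr7\Utils;
use Psr\Http\Message\RequestInterface;
use Psr\Http\Message\ResponseInterface;

class HTMLEditorMiddleware
{
    public function __invoke(callable $nextHandler)
    {
        return function (RequestInterface $request, array $options) use ($nextHandler) {
            /** @var \GuzzleHttp\Promise\PromiseInterface $promise */
            $promise = $nextHandler($request, $options);

            return $promise->then(
                function (ResponseInterface $response) {
                    // Leave non-200 responses untouched
                    if (200 !== $response->getStatusCode()) {
                        return $response;
                    }
                    // Casting to string rewinds a seekable stream and reads the whole body
                    $html = (string) $response->getBody();
                    // edit the html as you wish
                    // Utils::streamFor() replaces the stream_for() helper removed in guzzlehttp/psr7 2.x
                    return $response->withBody(Utils::streamFor($html));
                }
            );
        };
    }
}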
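
For the page in the original report, one possible edit at the "edit the html as you wish" step is to strip the leading XML declaration before the body reaches DomCrawler. This is only a sketch: `stripXmlDeclaration` is a hypothetical helper name, and the thread does not confirm that the declaration alone is what keeps the links from being found.

```php
<?php

/**
 * Hypothetical helper (not part of spatie/crawler or Guzzle):
 * removes a single leading XML declaration from a document, if present.
 */
function stripXmlDeclaration(string $html): string
{
    // ^\s* tolerates leading whitespace; "s" lets . span newlines; "i" ignores case
    $stripped = preg_replace('/^\s*<\?xml\b.*?\?>\s*/is', '', $html, 1);

    // preg_replace() returns null on a regex engine error; fall back to the input
    return $stripped ?? $html;
}

$page = '<?xml version="1.0" encoding="utf-8"?><!DOCTYPE html><html><head></head></html>';

// prints: <!DOCTYPE html><html><head></head></html>
echo stripXmlDeclaration($page), PHP_EOL;
```

Inside the middleware this would be called as `$html = stripXmlDeclaration($html);` just before the body is wrapped in a new stream.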

Then in the handler creation:

// HandlerStack and Client live in GuzzleHttp, CurlMultiHandler in GuzzleHttp\Handler
$handler = HandlerStack::create(new CurlMultiHandler());
$handler->push(new HTMLEditorMiddleware());
$client = new Client(['handler' => $handler]);

EDIT: Guzzle's HandlerStack has "before" and "after" methods that insert a middleware relative to another, named one, so you can control the order of the middlewares. Just like #342
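
A minimal sketch of that ordering, assuming guzzlehttp/guzzle is installed through Composer. The middleware names `append_b` and `append_c` and the URL are made up for illustration, and `MockHandler` stands in for the network so the effect can be observed offline.

```php
<?php

require __DIR__ . '/vendor/autoload.php'; // assumption: Composer autoloader with guzzlehttp/guzzle

use GuzzleHttp\Client;
use GuzzleHttp\Handler\MockHandler;
use GuzzleHttp\HandlerStack;
use GuzzleHttp\Middleware;
use GuzzleHttp\Psr7\Response;
use GuzzleHttp\Psr7\Utils;
use Psr\Http\Message\ResponseInterface;

// Returns a response middleware that appends $suffix to the body
$append = function (string $suffix): callable {
    return Middleware::mapResponse(function (ResponseInterface $response) use ($suffix) {
        return $response->withBody(Utils::streamFor((string) $response->getBody() . $suffix));
    });
};

// The mock handler answers every request with a canned body of "a"
$stack = HandlerStack::create(new MockHandler([new Response(200, [], 'a')]));

// Register a named middleware, then place another one before it in the stack
$stack->push($append('b'), 'append_b');
$stack->before('append_b', $append('c'), 'append_c');

$client = new Client(['handler' => $stack]);
$body = (string) $client->get('http://example.test/')->getBody();

// "append_c" sits outside "append_b", so it sees the response after "append_b" did
echo $body, PHP_EOL; // prints: abc
```

A middleware placed "before" another handles the request first and therefore the response last, which matters when one response edit depends on another.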

Thanks for your work on this @posipa. I'm not going to merge it in, since there already seems to be a way to do this via Guzzle middleware (thanks @Redominus for explaining).