Modify html before DomCrawler
posipa opened this issue · comments
Can you add a handler to modify the content before https://github.com/spatie/crawler/blob/master/src/LinkAdder.php#L61 ? Please :)
I had to crawl a page that starts like this:
<?xml version="1.0" encoding="utf-8"?> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="pl-pl" lang="pl-pl">
<head>
Symfony's DomCrawler (https://symfony.com/doc/current/components/dom_crawler.html) cannot find any links because the page is not recognized as HTML, so this line does not work as it should: https://github.com/spatie/crawler/blob/master/src/LinkAdder.php#L63
Feel free to send a PR that adds that feature. Make sure to update the readme and tests as well. Thanks!
You can already do this with a middleware on the client's handler (feel free to inject dependencies in the constructor):
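use GuzzleHttp\Promise\PromiseInterface;
use GuzzleHttp\Psr7\Utils;
use Psr\Http\Message\RequestInterface;
use Psr\Http\Message\ResponseInterface;

class HTMLEditorMiddleware
{
    public function __invoke(callable $nextHandler)
    {
        return function (RequestInterface $request, $options) use ($nextHandler) {
            /** @var PromiseInterface $promise */
            $promise = $nextHandler($request, $options);

            return $promise->then(
                function (ResponseInterface $response) {
                    // Leave non-200 responses untouched.
                    if (200 !== $response->getStatusCode()) {
                        return $response;
                    }

                    // Casting the body to a string reads it from the start.
                    $html = (string) $response->getBody();
                    // edit the html as you wish

                    // Utils::streamFor() replaces the older stream_for() function (guzzlehttp/psr7 >= 1.7).
                    return $response->withBody(Utils::streamFor($html));
                }
            );
        };
    }
}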
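As a minimal sketch of what the "edit the html as you wish" step could do for the page in this issue: the helper below removes a leading XML declaration before the HTML reaches DomCrawler. It assumes the `<?xml ... ?>` prolog is what keeps the page from being parsed as HTML, which is not confirmed in this thread, and `stripXmlDeclaration` is a hypothetical name, not part of spatie/crawler or Guzzle.

```php
<?php

/**
 * Hypothetical helper: remove a leading XML declaration (and the
 * whitespace around it) so the document starts at the DOCTYPE or <html> tag.
 * Assumption: the declaration is what trips up the HTML parser.
 */
function stripXmlDeclaration(string $html): string
{
    // preg_replace() returns null on error; fall back to the original input.
    return preg_replace('/^\s*<\?xml[^>]*\?>\s*/i', '', $html) ?? $html;
}
```

Inside the middleware this would be called as `$html = stripXmlDeclaration($html);` just before the body is wrapped in a new stream. Documents without a declaration pass through unchanged.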
Then in the handler creation:
use GuzzleHttp\Client;
use GuzzleHttp\HandlerStack;
use GuzzleHttp\Handler\CurlMultiHandler;
$handler = HandlerStack::create(new CurlMultiHandler());
$handler->push(new HTMLEditorMiddleware());
$client = new Client(['handler' => $handler]);
EDIT: Guzzle's HandlerStack has "before" and "after" methods that let you control the order of the middlewares. Just like #342.
Thanks for your work on this @posipa. I'm not going to merge it in, since there already seems to be a way to do this via Guzzle middleware (thanks @Redominus for explaining).