spekulatius / PHPScraper

A universal web-util for PHP.

Home Page: https://phpscraper.de


Provide example with authentication

tacman opened this issue · comments

How can I scrape a website that requires authentication?

That is, I want to start at https://jardinado.herokuapp.com/login, fill in my credentials, and THEN start scraping the site.

In other words, I want the $goutteClient to execute something like this first, then scrape:

    if ($username) {
        $crawler = $goutteClient->request('GET', $baseUrl . "/login");

        // select the form and fill in some values
        $form = $crawler->selectButton('login-btn')->form();
        $form['_username'] = 'user';
        $form['_password'] = 'pass';

        // submit that form
        $crawler = $goutteClient->submit($form);
        $response = $goutteClient->getResponse();
    }

Now that cookies are set, when I fetch a URL that requires login, I should get the page instead of a 302 (redirect to login).

I'm not sure how to implement this within the context of PHPScraper. One idea would be to expose the Goutte client.
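To make the idea concrete, here is a rough sketch of what an exposed client could look like. This is hypothetical: the `client()` accessor is an assumption for illustration, not part of the actual PHPScraper API at the time of this thread.

```php
<?php
// Hypothetical sketch, assuming PHPScraper exposed its internal
// Goutte client via an accessor (the name `client()` is invented here).
require 'vendor/autoload.php';

$web = new \spekulatius\phpscraper();

// Assumed accessor returning the underlying Goutte\Client instance.
$goutteClient = $web->client();

// Log in via Symfony BrowserKit's form handling.
$crawler = $goutteClient->request('GET', 'https://jardinado.herokuapp.com/login');
$form = $crawler->selectButton('login-btn')->form();
$form['_username'] = 'user';
$form['_password'] = 'pass';
$goutteClient->submit($form);

// The client keeps the session cookie, so subsequent navigation
// through PHPScraper would be authenticated.
$web->go('https://jardinado.herokuapp.com/account');
```

Since PHPScraper wraps Goutte, sharing one client instance between the login step and the scraping calls is what would carry the session cookie across.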

Hmmm, while this should work, it's quite a bit of work to debug with the site being down (it doesn't load for me). Can you bring it back up, @tacman?

Try now. It's a slow site, at least initially, because it's running on a free Heroku dyno. It can take up to 30 seconds to "wake up" if it's been inactive for a while.

I set up a login for you -- spekulatius@jardinado.com, password: spekulatius

Hey @tacman

can you share some more code on how you add this to PHPScraper?

    if ($username) {
        $crawler = $goutteClient->request('GET', $baseUrl . "/login");

        // select the form and fill in some values
        $form = $crawler->selectButton('login-btn')->form();
        $form['_username'] = 'user';
        $form['_password'] = 'pass';

        // submit that form
        $crawler = $goutteClient->submit($form);
        $response = $goutteClient->getResponse();
    }

Thanks :)

Well, that's kind of the point of this issue -- I don't know how to do that. I only see how to click links with PHPScraper:

https://github.com/spekulatius/PHPScraper/blob/master/src/phpscraper.php#L918

I was hoping there was a way to submit a form, which would keep the cookies for that session. So instead of ->clickLink(), a method like ->submitForm(), where I could send in the credentials, and then load pages and follow links that require authentication.
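The requested call might look something like this. To be clear, ->submitForm() is the method being asked for, not something PHPScraper ships at this point; the signature below is one guess at how it could work.

```php
<?php
// Hypothetical API sketch only: submitForm() does not exist in
// PHPScraper at the time of this thread. Imagined signature:
// a button selector plus an array of field values.
require 'vendor/autoload.php';

$web = new \spekulatius\phpscraper();

$web->go('https://jardinado.herokuapp.com/login');

// Fill in and submit the login form in one call (imagined method).
$web->submitForm('login-btn', [
    '_username' => 'user',
    '_password' => 'pass',
]);

// Cookies from the login would carry over to subsequent navigation,
// so protected pages load instead of redirecting to /login.
$web->go('https://jardinado.herokuapp.com/protected-page');
$links = $web->links;
```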

Ah okay, now we are getting a bit closer. I was wondering how you did it. Did you get it working with Goutte only?

I have a Symfony bundle that crawls a website: https://github.com/survos/SurvosCrawlerBundle

The idea is that if it can create a set of links that are visible (based on different logins), those links can then be used in a simple PHPUnit test. It basically does what almost all testers do in the beginning -- log in, and click blindly on every link. It's amazing how often someone finds a broken page that way.

So I was trying to use PHPScraper to do that. In the end, I couldn't, so I just used what other tools I had available:

    public function authenticateClient(?string $username = null, ?string $plainPassword = null): void
    {
        // might be worth checking out: https://github.com/liip/LiipTestFixturesBundle/pull/62#issuecomment-622191412
        static $clients = [];
        if (!array_key_exists($username, $clients)) {
            $goutteClient = new Client();
            $goutteClient->setMaxRedirects(0);
            $this->username = $username;
            $baseUrl = $this->baseUrl;
            $clients[$username] = $goutteClient;
            if ($username) {
                $crawler = $goutteClient->request('GET', $url = $baseUrl . trim($this->loginPath, '/'), [
                    'proxy' => '127.0.0.1:7080',
                ]);

                $response = $goutteClient->getResponse();
                assert($response->getStatusCode() === 200, "Invalid route: " . $url);

                // select the form and fill in some values
                try {
                    $form = $crawler->selectButton($this->submitButtonSelector)->form();
                } catch (\Exception $exception) {
                    throw new \Exception($this->submitButtonSelector . ' does not find a form on ' . $this->loginPath);
                }
                $form['_username'] = $username;
                $form['_password'] = $plainPassword;

                // submit that form
                $crawler = $goutteClient->submit($form);
                $response = $goutteClient->getResponse();
                assert($response->getStatusCode() === 200, substr($response->getContent(), 0, 512) . "\n\n" . $url);
                // ... (continues; see full file below)

https://github.com/survos/SurvosCrawlerBundle/blob/main/src/Services/CrawlerService.php#L108

I don't love the code, though it's functional. If I could drop it all and replace it with PHPScraper, I would. Of course, if there's anything of value you can grab from my bundle, please do so!