Corion / WWW-Mechanize-Chrome

automate the Chrome browser

Home Page: https://metacpan.org/release/WWW-Mechanize-Chrome

The content is always wrapped in HTML

robrwo opened this issue · comments

When requesting a resource that returns a non-HTML document (e.g. application/json), if the server responds with HTTP 304, the content_type is undefined and the content (presumably the cached copy) comes back wrapped in HTML, e.g.:

<html><head></head><body><pre style="word-wrap: break-word; white-space: pre-wrap;">{ value => 1 }</pre></body></html>

This happens consistently when the response is HTTP 304, and it also seems to happen occasionally when the response is HTTP 200.
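Until the library handles this itself, a caller could strip the wrapper on their side. The following is only a rough sketch of that idea: `unwrap_pre_wrapper` is a hypothetical helper, not part of WWW::Mechanize::Chrome, and it assumes the wrapper always has the exact shape shown in the sample above, which may vary between Chrome versions.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of WWW::Mechanize::Chrome): if $content looks
# like the <pre> wrapper shown in the report, return the inner text;
# otherwise return $content unchanged.
sub unwrap_pre_wrapper {
    my ($content) = @_;
    return $content unless defined $content;

    # Assumption: the wrapper has exactly this shape, with an empty <head>.
    if ( $content =~ m{\A<html><head></head><body><pre[^>]*>(.*)</pre></body></html>\s*\z}s ) {
        my $inner = $1;
        # Undo the escaping applied when the text node was serialized as HTML.
        # &amp; is handled last so that "&amp;lt;" becomes "&lt;", not "<".
        $inner =~ s/&lt;/</g;
        $inner =~ s/&gt;/>/g;
        $inner =~ s/&nbsp;/\x{A0}/g;
        $inner =~ s/&amp;/&/g;
        return $inner;
    }
    return $content;
}

my $wrapped = '<html><head></head><body>'
            . '<pre style="word-wrap: break-word; white-space: pre-wrap;">'
            . '{ value =&gt; 1 }</pre></body></html>';
my $plain = unwrap_pre_wrapper($wrapped);
print "$plain\n";
```

A genuine HTML page whose body consists of a single pre element would be unwrapped too, so this is a stopgap rather than a replacement for checking the content type.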

Looking at the code, the decoded_content method is treating the content as HTML:

sub decoded_content($self) {
    $self->document_future->then(sub( $root ) {
        # Join _all_ child nodes together to also fetch DOCTYPE nodes
        # and the stuff that comes after them
        my @content = map {
            my $nodeId = $_->{nodeId};
            $self->log('trace', "Fetching HTML for node " . $nodeId );
            $self->target->send_message('DOM.getOuterHTML', nodeId => 0+$nodeId )
        } @{ $root->{root}->{children} };
 
        Future->wait_all( @content )
    })->then( sub( @outerHTML_f ) {
        Future->done( join "", map { $_->get->{outerHTML} } @outerHTML_f )
    })->get;
};

It should check the content type, and perhaps just return the raw content instead?
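As a rough illustration of what such a check could look like, here is a sketch of a standalone predicate. `is_html_content_type` is a hypothetical name, and treating an undefined content type as "not HTML" is an assumption made here; whether that is the right default for the HTTP 304 case is a separate decision.

```perl
use strict;
use warnings;

# Hypothetical helper: decide whether a Content-Type value denotes HTML.
# Assumption: an undefined or empty value means "unknown" and yields false;
# a caller may prefer to default to HTML instead.
sub is_html_content_type {
    my ($ct) = @_;
    return 0 unless defined $ct && length $ct;

    # Drop parameters such as "; charset=utf-8", in case the value carries them
    my ($type) = split /;/, $ct, 2;
    $type =~ s/^\s+|\s+$//g;
    $type = lc $type;

    return ( $type eq 'text/html' || $type eq 'application/xhtml+xml' ) ? 1 : 0;
}

for my $ct ( 'text/html', 'Text/HTML; charset=utf-8', 'application/json' ) {
    printf "%-26s %s\n", $ct, is_html_content_type($ct) ? 'HTML' : 'raw';
}
```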

Yes, this is an ugly problem. Can you check whether the following works well enough for your use case(s)? It uses the content from the response, but that content will already have been decoded, and I'm not sure how well it works with binary content:

sub decoded_content($self) {
    $self->document_future->then(sub( $root ) {
        my $ct = $self->ct;

        my $res;
        if( $ct eq 'text/html' ) {
            # Join _all_ child nodes together to also fetch DOCTYPE nodes
            # and the stuff that comes after them
            my @content = map {
                my $nodeId = $_->{nodeId};
                $self->log('trace', "Fetching HTML for node " . $nodeId );
                $self->target->send_message('DOM.getOuterHTML', nodeId => 0+$nodeId )
            } @{ $root->{root}->{children} };

            $res = Future->wait_all( @content )
            ->then( sub( @outerHTML_f ) {
                Future->done( join "", map { $_->get->{outerHTML} } @outerHTML_f );
            });
        } else {

            # Return the raw body
            #use Data::Dumper;
            #warn Dumper $self->response;
            #warn $self->response->content;

            # The content is already decoded (?!)
            # I'm not sure how well this plays with encodings, and
            # binary content
            $res = Future->done($self->response->content);
        };
        return $res;
    })->get;
};
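The "already decoded" caveat matters to callers who parse the body as JSON: a character string and a UTF-8 byte string need different decoder settings. The sketch below uses only core modules (Encode, JSON::PP); `parse_json_body` and its `$is_decoded` flag are hypothetical, and whether the body returned by WWW::Mechanize::Chrome is characters or bytes in a given case is exactly the open question above.

```perl
use strict;
use warnings;
use Encode qw(encode);
use JSON::PP ();

# Hypothetical helper: parse a JSON body that is either a character string
# ($is_decoded true) or a UTF-8 encoded byte string ($is_decoded false).
sub parse_json_body {
    my ($body, $is_decoded) = @_;
    my $json = JSON::PP->new;
    # The utf8 option tells JSON::PP that its input is UTF-8 encoded bytes
    $json->utf8 unless $is_decoded;
    return $json->decode($body);
}

my $chars = qq({"name":"caf\x{e9}"});    # characters, as after decoding
my $bytes = encode('UTF-8', $chars);     # bytes, as sent on the wire

my $from_chars = parse_json_body($chars, 1);
my $from_bytes = parse_json_body($bytes, 0);

print length($from_chars->{name}), " ", length($from_bytes->{name}), "\n";
```

Both calls should yield the same four-character name; passing the character string through the byte-oriented path, or the reverse, is where mojibake or decode errors would come from.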

That works for JSON data, but I get an error for HTML pages:

Could not find node with given id

-32000 at perl5/perlbrew/perls/perl-5.28.1/lib/site_perl/5.28.1/Chrome/DevToolsProtocol/Target.pm line 491
at perl5/perlbrew/perls/perl-5.28.1/lib/site_perl/5.28.1/Future.pm line 882

Whoops - sorry, I didn't run the test suite properly. This one passes my new test and the existing test suite - does it work for your case too?

sub decoded_content($self) {
    my $res;
    my $ct = $self->ct || 'text/html';
    if( $ct eq 'text/html' ) {
        $res = $self->document_future->then(sub( $root ) {
            # Join _all_ child nodes together to also fetch DOCTYPE nodes
            # and the stuff that comes after them

            my @content = map {
                my $nodeId = $_->{nodeId};
                $self->log('trace', "Fetching HTML for node " . $nodeId );
                $self->target->send_message('DOM.getOuterHTML', nodeId => 0+$nodeId )
            } @{ $root->{root}->{children} };

            return Future->wait_all( @content )
            ->then( sub( @outerHTML_f ) {
                Future->done( join "", map { $_->get->{outerHTML} } @outerHTML_f );
            });
        });
    } else {
        # Return the raw body
        #use Data::Dumper;
        #warn Dumper $self->response;
        #warn $self->response->content;

        # The content is already decoded (?!)
        # I'm not sure how well this plays with encodings, and
        # binary content
        $res = Future->done($self->response->content);
    };
    return $res->get;
};

This seems better.