ariya / phantomjs

Scriptable Headless Browser

Home Page: http://phantomjs.org

Get amount of transferred bytes

ariya opened this issue · comments

marcon...@gmail.com commented:

There is no way to reliably get the amount of transferred bytes for a request.

bodySize is not available for responses with stage == end, and the Content-Length header is not very reliable; in particular it seems to be unset for responses with Content-Encoding: gzip.

I guess the bodySize has to be summed up for each chunk of data received and made available in a response with stage == end.

Disclaimer:
This issue was migrated on 2013-03-15 from the project's former issue tracker on Google Code, Issue #156.
🌟   9 people had starred this issue at the time of migration.

ariya.hi...@gmail.com commented:

Metadata Updates

  • Label(s) removed:
    • Type-Defect
  • Label(s) added:
    • Type-Enhancement
  • Milestone updated: FutureRelease (was: ---)
  • Status updated: Accepted

marceldu...@gmail.com commented:

Any reason why the Content-Length header isn't available for Content-Encoding: gzip responses?

While issue 158 is still open there's no way to get both compressed and uncompressed sizes of gzip responses, so the netsniff.js example that generates a HAR file is a bit misleading:

https://github.com/ariya/phantomjs/blob/master/examples/netsniff.js#L51

According to HAR spec (http://www.softwareishard.com/blog/har-12-spec/#content), content.size:
"... should be equal to response.bodySize if there is no compression and bigger when the content has been compressed."
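To make the quoted relationship concrete, here is a minimal sketch. The entry object is a made-up minimal HAR fragment, and compressionSaved is an illustrative helper name, not part of any API:

```javascript
// Sketch of the HAR 1.2 relationship quoted above: content.size is the
// uncompressed length, response.bodySize the bytes on the wire, and
// their difference is what the spec calls "compression".
function compressionSaved(entry) {
    var size = entry.response.content.size; // uncompressed length
    var bodySize = entry.response.bodySize; // transferred (possibly compressed)
    if (size < 0 || bodySize < 0) {
        return 0; // -1 means "unknown" in HAR
    }
    return size - bodySize; // 0 when no compression was applied
}
```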

marceldu...@gmail.com commented:

Another gzip/raw content issue:

By running:
phantomjs netsniff.js http://search.yahoo.com

The generated HAR shows that the main HTML response headers contain Content-Encoding: gzip and the bodySize is 12726.

However, running curl with compression gives a different result:

curl search.yahoo.com -H "Accept-Encoding:gzip" | wc -c
4328

And without compression the size is similar to what phantomjs is returning:

curl search.yahoo.com | wc -c
12120

I see this was migrated to 'feature enhancement', but I think this should be considered a bug. Anyone using the HAR output from netsniff.js is seeing uncompressed bytes only, and is getting an inaccurate representation of actual bytes transferred.

Is this data not easily accessible from QT?

+1 on this, any suggestion where the extra bytes are coming from?

For me all byte sizes on CSS/JS files are shown significantly smaller than they are in reality (according to Chrome Dev Tools and Firebug). Compared to the gzipped file sizes they are also shown too small.

Image sizes are all shown correctly. Anybody else having that kind of problem?

Seems to still be an issue in 2.0. I get the impression Qt/Webkit changes might be needed?

I believe if you are talking to a chunking server, Content-Length is not set; instead the size of each chunk is passed before the data itself, and when a size of zero is returned the resource is complete. That may explain why Content-Length is not present sometimes.
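The framing described above can be illustrated with a toy parser. In HTTP/1.1 chunked transfer coding each chunk is "&lt;hex size&gt;\r\n&lt;data&gt;\r\n", terminated by a zero-size chunk, so the server never needs a Content-Length header. sumChunkedBody is an illustrative name, and this sketch ignores chunk extensions and trailers:

```javascript
// Toy parser for a chunked body: walk "<hex size>\r\n<data>\r\n" frames
// and total the payload bytes until the terminating zero-size chunk.
function sumChunkedBody(raw) {
    var total = 0;
    var pos = 0;
    for (;;) {
        var lineEnd = raw.indexOf('\r\n', pos);
        var size = parseInt(raw.slice(pos, lineEnd), 16);
        if (size === 0) {
            return total; // zero-size chunk marks the end of the body
        }
        total += size;
        pos = lineEnd + 2 + size + 2; // skip size line, data, trailing CRLF
    }
}
```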

Looking at networkaccessmanager.cpp, NetworkAccessManager::handleStarted() sets bodySize to reply->size(); NetworkAccessManager::handleFinished() does not set bodySize, so presumably it is left as is and holds the size of the content (when not chunking) or of the first chunk.

QNetworkReply has a downloadProgress signal which returns bytesReceived and bytesTotal. Perhaps that could be used.

NetworkAccessManager::handleFinished could set the bodySize to the content-length where it is available.

It's a pity there does not appear to be a signal for each chunk (unless downloadProgress provides that), as it would then be possible to determine the downloaded size correctly by simply adding the chunk size to bodySize.

I did some more research and it appears QT must be removing the Content-Length header when gzip is used. I did the same request via telnet and via phantomjs; note chunking is not in use.

telnet response:-

Cache-Control: private
Content-Type: text/html; charset=utf-8
Content-Encoding: gzip
Vary: Accept-Encoding
Server: Microsoft-IIS/8.0
Set-Cookie: ASP.NET_SessionId=onq34pudvbwazeh04ksylpfs; path=/; HttpOnly
X-AspNetMvc-Version: 4.0
X-AspNet-Version: 4.0.30319
X-Powered-By: ASP.NET
X-Frame-Options: SAMEORIGIN
Date: Wed, 29 Apr 2015 14:54:17 GMT
Content-Length: 22767

phantom response:-

Cache-Control = private
Content-Type = text/html; charset=utf-8
Content-Encoding = gzip
Vary = Accept-Encoding
Server = Microsoft-IIS/8.0
Set-Cookie = ASP.NET_SessionId=e02yvkniwvolblo31qyt42ia; path=/; HttpOnly
X-AspNetMvc-Version = 4.0
X-AspNet-Version = 4.0.30319
X-Powered-By = ASP.NET
X-Frame-Options = SAMEORIGIN
Date = Wed, 29 Apr 2015 14:51:42 GMT

It would appear QT is for some reason removing the header.

In networkaccessmanager.cpp, change

data["bodySize"] = reply->size();

to

data["bodySize"] = reply->header(QNetworkRequest::ContentLengthHeader);

This means that when Content-Length is passed, bodySize is correct.
It won't work (but then neither does the current code) when chunking is in use or Content-Length is not passed by QT, such as when gzip is used. Disabling gzip in the second case works around that issue.

From what I can see, size() is just the size of the QByteArray...

For gzip, you need to set the Accept-Encoding header yourself to accept gzip, as there is a bug in QT:

https://bugreports.qt.io/browse/QTBUG-41840

Content-Length is then returned; unfortunately you then run across bug https://forum.qt.io/topic/2308/content-encoding-gzip-with-qt-webkit/9 and the content is not decompressed.

Great digging work @djberriman! Go on!
Thousands of people are supporting you!

👍 👍 👍

QT does indeed specifically remove the Content-Length header on gzipped data:

void QHttpNetworkReplyPrivate::removeAutoDecompressHeader()
{
    // The header "Content-Encoding = gzip" is retained.
    // Content-Length is removed since the actual one send by the server is for compressed data
    QByteArray name("content-length");
    QList<QPair<QByteArray, QByteArray> >::Iterator it = fields.begin(),
                                                   end = fields.end();
    while (it != end) {
        if (qstricmp(name.constData(), it->first.constData()) == 0) {
            fields.erase(it);
            break;
        }
        ++it;
    }
}

From what I can see from the QT source code, it may well be worth using the QNetworkReply downloadProgress signal, which returns bytesReceived and bytesTotal. I believe this will also mean chunked data will work correctly, as it will fire for each chunk.

I appear to have a fix for this; not sure how to submit it, so I will work on that in a moment.

Basically phantomjs is not trapping one of the emits from QT, so the size returned is that of the first read. We need to add another stage, as well as 'start' and 'end', which I have called 'data'. If you cater for this in your onResourceReceived function and add up the res.bodySize returned each time it is triggered for a particular resource ('end' will return 0), then you will have the true size of the content. This should, I believe, work regardless of Content-Length being passed, gzip, or chunking. Do not rely on Content-Length.

Replace handleStarted() in networkaccessmanager.cpp with the following code.

void NetworkAccessManager::handleStarted()
{
    // Pointer types restored here; the original post's asterisks were
    // eaten by markdown formatting.
    QNetworkReply *reply = qobject_cast<QNetworkReply *>(sender());
    if (!reply)
        return;

    QVariantList headers;
    foreach (QByteArray headerName, reply->rawHeaderList()) {
        QVariantMap header;
        header["name"] = QString::fromUtf8(headerName);
        header["value"] = QString::fromUtf8(reply->rawHeader(headerName));
        headers += header;
    }

    QVariantMap data;
    if (!m_started.contains(reply)) {
        m_started += reply;
        data["stage"] = "start";
    } else {
        data["stage"] = "data";
    }
    data["id"] = m_ids.value(reply);
    data["url"] = reply->url().toEncoded().data();
    data["status"] = reply->attribute(QNetworkRequest::HttpStatusCodeAttribute);
    data["statusText"] = reply->attribute(QNetworkRequest::HttpReasonPhraseAttribute);
    data["contentType"] = reply->header(QNetworkRequest::ContentTypeHeader);
    data["bodySize"] = reply->size();
    data["redirectURL"] = reply->header(QNetworkRequest::LocationHeader);
    data["headers"] = headers;
    data["time"] = QDateTime::currentDateTime();

    emit resourceReceived(data);
}

Just be aware the total size returned appears to be the uncompressed size, not the Content-Length, when gzip is being used; I ran a test allowing gzip and one not allowing gzip and got the same results.

@djberriman Any thoughts on getting the gzip sizes?

@djberriman Thanks so much for this fix, this is exactly what I need for my project.

Can anyone give a general idea of the changes that should be made to the onResourceReceived function, especially in the context of the netsniff.js example (https://github.com/ariya/phantomjs/blob/master/examples/netsniff.js)? I've built phantomjs with this fix but I'm a little unsure how to implement it in a script. Thanks!

EDIT: I seem to have solved my issue. For anyone else with as little phantomjs experience as I have who finds this thread, in the above example, you can change

page.onResourceReceived = function (res) {
    if (res.stage === 'start') {
        page.resources[res.id].startReply = res;
    }
    if (res.stage === 'end') {
        page.resources[res.id].endReply = res;
    }
};

to

page.onResourceReceived = function (res) {
    if (res.stage === 'start') {
        page.resources[res.id].startReply = res;
    }
    if (res.stage === 'data') {
        page.resources[res.id].startReply.bodySize += res.bodySize;
    }
    if (res.stage === 'end') {
        page.resources[res.id].endReply = res;
    }
};

And it should work with @djberriman's change.

@ariya @djberriman what's the resolution on this one?

@tufandevrim Just waiting for @ariya to put it in the main line

@ariya @djberriman ... was this finally merged in 2.1.1? The fix looks good to me.

Has this been solved? Thanks.

The onResourceReceived function should read more like:-

if (res.stage == 'start') {
    urlRequestedBytes[res.id] = res.bodySize;
} else {
    if (res.bodySize != undefined) {
        urlRequestedBytes[res.id] += res.bodySize;
    }
}

During my testing I found both 'data' and 'end' could return a size depending on whether chunking is in use, and that it can also be returned as undefined. To get the correct size in all cases you need to add up the value returned in bodySize at each of 'start', 'data' and 'end'.
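That rule (sum bodySize from every stage, guarding against undefined) can be packaged as a small self-contained helper. The callback shape mirrors PhantomJS's onResourceReceived payload as described in this thread; pageTotal is an illustrative name, not part of the API:

```javascript
// Accumulate per-resource transferred bytes across 'start', 'data'
// and 'end' notifications; any stage may report undefined bodySize.
var urlRequestedBytes = {};

function onResourceReceived(res) {
    if (res.stage === 'start') {
        urlRequestedBytes[res.id] = res.bodySize || 0;
    } else if (res.bodySize !== undefined) {
        urlRequestedBytes[res.id] += res.bodySize;
    }
}

// Total bytes across all resources seen so far.
function pageTotal() {
    return Object.keys(urlRequestedBytes).reduce(function (sum, id) {
        return sum + urlRequestedBytes[id];
    }, 0);
}
```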

Just a quick update on Content-Length with encoded responses (gzip). The lack of a Content-Length header was due to a feature of QT whereby it physically removed the header if the content was compressed. Following proof of the bug/feature and some discussion, the code that does this will now be removed from QT, which means Content-Length will always be passed if returned by the server (chunking servers, for instance, don't return a length).

@djberriman for a gzipped response you will probably have no Content-Length header, as the content will be streamed, which you can verify by checking for the header "Transfer-Encoding: chunked".
If the content has already been gzipped before (cache, disk, ...), the server will set the Content-Length header, as it knows the length of the gzip archive.

@djberriman with regards to the content length: the current version of QT will emit the Content-Length header through 'downloadMetaData', but I'm not convinced the value of the Content-Length header is really the best thing to use if you actually want the amount of bytes transferred; it omits the size of the headers, which, if you have a lot of cookies, can be significant, especially across all the requests required to render a web page.

It seems like using downloadProgress, which you mentioned earlier, might be a better approach, depending on your use case. Better yet would be if the QT library had something like reply->bytes_transferred. Based on the documentation of downloadProgress [1], it does seem like that is the best approach. Though I think QT removing the Content-Length header is kind of dumb too.

[1] http://doc.qt.io/qt-5/qnetworkreply.html#downloadProgress
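To illustrate the header-overhead point above, here is a rough sketch that estimates the on-the-wire cost of a header block from name/value pairs. headerBytes is a hypothetical helper, and the arithmetic only approximates the framing ("Name: value\r\n" per header plus the blank line ending the block, ignoring the status line):

```javascript
// Estimate bytes consumed by a header block: each header costs its name,
// the ": " separator, its value and a CRLF; the block ends with a CRLF.
function headerBytes(headers) {
    return headers.reduce(function (sum, h) {
        return sum + h.name.length + 2 + h.value.length + 2;
    }, 0) + 2;
}
```

Large cookies multiply this across every request on a page, which is why Content-Length alone undercounts the transfer.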

Strangely, I'm in need of the Content-Length header only. What's the state of play on this? Has this been resolved in a later version of Phantom? I'm using 2.1.1.