hugovk / top-pypi-packages

A regular dump of the most-downloaded packages from PyPI

Home Page: https://hugovk.github.io/top-pypi-packages

pypinfo now uses more quota, no more 365-day data?

hugovk opened this issue

pypinfo now uses an updated BigQuery table to get download numbers. The new table is more accurate and uses less quota for most queries, but quota usage has gone up for some.

For example:

    $ pypinfo --days 365 "" project
    Served from cache: False
    - Data processed: 87.84 GiB
    + Data processed: 1.69 TiB
    - Data billed: 87.84 GiB
    + Data billed: 1.69 TiB
    - Estimated cost: $0.43
    + Estimated cost: $8.45

https://github.com/ofek/pypinfo/pull/112/files#diff-7b3ed02bc73dc06b7db906cf97aa91dec2b2eb21f2d92bc5caa761df5bbc168fR233
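For reference, the "Estimated cost" lines track BigQuery's on-demand pricing. A minimal sketch of that arithmetic, assuming the standard $5.00-per-TiB on-demand rate:

```python
# Sketch: how BigQuery's on-demand rate produces the "Estimated cost"
# lines above (assumes the standard $5.00 per TiB on-demand pricing).

TIB = 1024 ** 4
PRICE_PER_TIB = 5.00  # USD per TiB billed

def estimated_cost(bytes_billed: int) -> float:
    """Return the query cost in USD, rounded to cents."""
    return round(bytes_billed / TIB * PRICE_PER_TIB, 2)

# Old table: 87.84 GiB billed
print(estimated_cost(int(87.84 * 1024 ** 3)))  # → 0.43
# New table: 1.69 TiB billed
print(estimated_cost(int(1.69 * TIB)))         # → 8.45
```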

The 1st May cron successfully fetched the 30-day data:

    {
      "last_update": "2021-05-01 14:30:19",
      "query": {
        "bytes_billed": 224987709440,
        "bytes_processed": 224987499284,
        "cached": false,
        "estimated_cost": "1.03"
      },
      ...
That's ~225 GB.

This is up from "bytes_billed": 50120884224 (~50 GB) on 1st April, a ~4.5× increase.

But it failed on the 365-day query:

    ...
      File "/usr/local/lib/python3.6/dist-packages/google/cloud/_http.py", line 293, in api_request
        raise exceptions.from_http_response(response)
    google.api_core.exceptions.Forbidden: 403 GET https://www.googleapis.com/bigquery/v2/projects/top-pypi-packages/queries/...?maxResults=0&timeoutMs=10000: Quota exceeded: Your project exceeded quota for free query bytes scanned. For more information, see https://cloud.google.com/bigquery/troubleshooting-errors

On 1st April the 365-day query was "bytes_billed": 951669751808 (~951 GB), so ×4.5 projects to ~4.28 TB!

The free monthly quota is 1 TB.

  • 1 April: ~50 GB + ~951 GB, which must have come in just under the 1 TB limit.

  • 1 May: ~225 GB + an estimated ~4.28 TB...
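A quick sanity check of the numbers above, using the bytes_billed figures from the two cron runs:

```python
# Sanity-check of the quota arithmetic, using the bytes_billed
# figures quoted in the issue.

april_30  = 50_120_884_224    # 1 April, 30-day query (~50 GB)
may_30    = 224_987_709_440   # 1 May, 30-day query (~225 GB)
april_365 = 951_669_751_808   # 1 April, 365-day query (~951 GB)

TB = 1000 ** 4  # decimal terabyte, matching the ~GB figures above

growth = may_30 / april_30
print(f"30-day growth: x{growth:.1f}")  # x4.5

# Projecting the same growth onto the 365-day query puts it far
# over the 1 TB free monthly quota.
projected_365 = april_365 * growth
print(f"projected 365-day: {projected_365 / TB:.2f} TB")  # ~4.27 TB
```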

Option 1: rough calculation: there's enough quota left to get 365-day data for ~724 packages. So rounding down, perhaps it would still work for, say, 500 or 100 packages? Would that still be useful?
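One way to arrive at a figure like 724 (an assumption on my part; the working isn't shown above): take the quota left after the ~225 GB 30-day query and scale the current 4,000-package query down proportionally:

```python
# Hypothetical reconstruction of the ~724-package estimate; the exact
# working isn't shown in the issue, so treat this as an assumption.
# It also assumes scan cost scales linearly with package count, as the
# rough calculation above seems to.

free_quota = 1000 ** 4       # 1 TB free per month (decimal bytes)
may_30 = 224_987_709_440     # already spent on the 30-day query
projected_365 = int(951_669_751_808 * 4.5)  # ~4.28 TB for 4,000 packages

remaining = free_quota - may_30
packages = round(remaining / projected_365 * 4000)
print(packages)  # → 724
```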

Option 2: alternatively, ditch the 365-day data altogether, and perhaps bump the 30-day data from 4,000 packages back up to, say, 5,000.

Feedback welcome!

In the meantime, I've pushed the 30-day data.