httparchive query is invalid
arbazkiraak opened this issue
Arbaz Hussain commented
It looks like some changes have been made to the httparchive dataset:
INFO[0000] Generated SQL template for HackerNews. Mode=Subdomains
INFO[0000] Generated SQL template for HTTPArchive. Mode=Subdomains
INFO[0000] Executing BigQuery SQL... this could take some time. Mode=Subdomains Source=hackernews
INFO[0022] Total rows extracted 74160. Mode=Subdomains Silent=false Source=hackernews Verbose=false
INFO[0022] Executing BigQuery SQL... this could take some time. Mode=Subdomains Source=httparchive
FATA[0025] Error executing BigQuery SQL. Error="googleapi: Error 400: Unrecognized name: origin at [17:16], invalidQuery" Mode=Subdomains Source=httparchive
Arbaz Hussain commented
Seems like httparchive has been removed from Google Cloud data.
cqsd commented
I believe assets/assets.go needs to be updated to pick up the new SQL. A fresh build still uses origin at the moment:
./commonspeak2 -c credentials.json -p $PROJECT --verbose subdomains -o test.txt -l 100
...
<snip>
...
INFO[0000] Compiled SQL Template: CREATE TEMPORARY FUNCTION
getSubdomain(x STRING)
RETURNS STRING
LANGUAGE js AS """
function getSubdomain(s) {
try {
return URI(s).subdomain();
} catch (ex) {
return s;
}
}
return getSubdomain(x);
"""
OPTIONS
( library="gs://commonspeak-udf/URI.min.js" );
SELECT
getSubdomain(origin) AS subdomain,
COUNT(origin) AS count
FROM
`httparchive.urls.*`
GROUP BY
subdomain
ORDER BY
count DESC
LIMIT
100; Mode=Subdomains
INFO[0000] Generated SQL template for HTTPArchive. Mode=Subdomains
After updating with go-bindata -pkg assets -o assets/assets.go data/... I get the expected SQL:
INFO[0000] Compiled SQL Template: CREATE TEMPORARY FUNCTION
getSubdomain(x STRING)
RETURNS STRING
LANGUAGE js AS """
function getSubdomain(s) {
try {
return URI(s).subdomain();
} catch (ex) {
return s;
}
}
return getSubdomain(x);
"""
OPTIONS
( library="gs://commonspeak-udf/URI.min.js" );
SELECT
getSubdomain(url) AS subdomain,
COUNT(url) AS count
FROM
`httparchive.urls.*`
GROUP BY
subdomain
ORDER BY
count DESC
LIMIT
100; Mode=Subdomains
Edit: PR: #11
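For anyone hitting the same problem: the root cause above is that the embedded asset file can lag behind the SQL sources, so a build compiles a template that still references a column BigQuery no longer recognizes. A rough sketch of a guard that would catch this before the query is sent is shown below. Note that staleColumnRefs is a hypothetical helper written for illustration and is not part of commonspeak2; the SQL strings are abbreviated from the logs in this thread.

```go
package main

import (
	"fmt"
	"regexp"
)

// staleColumnRefs is a hypothetical helper (not part of commonspeak2).
// It returns the entries of columns that still appear as whole-word
// identifiers in the given SQL text.
func staleColumnRefs(sql string, columns []string) []string {
	var found []string
	for _, col := range columns {
		// Word boundaries keep "origin" from matching e.g. "original".
		re := regexp.MustCompile(`\b` + regexp.QuoteMeta(col) + `\b`)
		if re.MatchString(sql) {
			found = append(found, col)
		}
	}
	return found
}

func main() {
	// Abbreviated versions of the two compiled templates logged above.
	oldSQL := "SELECT getSubdomain(origin) AS subdomain, COUNT(origin) AS count FROM `httparchive.urls.*`"
	newSQL := "SELECT getSubdomain(url) AS subdomain, COUNT(url) AS count FROM `httparchive.urls.*`"

	fmt.Println(staleColumnRefs(oldSQL, []string{"origin"})) // [origin]
	fmt.Println(staleColumnRefs(newSQL, []string{"origin"})) // []
}
```

A check like this could run in a unit test against the compiled template, so a stale assets/assets.go fails locally instead of surfacing as an invalidQuery error from BigQuery.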