assetnote / commonspeak2

Leverages publicly available datasets from Google BigQuery to generate content discovery and subdomain wordlists

httparchive query is invalid

arbazkiraak opened this issue

It looks like some changes have been made to httparchive:

INFO[0000] Generated SQL template for HackerNews.        Mode=Subdomains
INFO[0000] Generated SQL template for HTTPArchive.       Mode=Subdomains
INFO[0000] Executing BigQuery SQL... this could take some time.  Mode=Subdomains Source=hackernews

INFO[0022] Total rows extracted 74160.                   Mode=Subdomains Silent=false Source=hackernews Verbose=false
INFO[0022] Executing BigQuery SQL... this could take some time.  Mode=Subdomains Source=httparchive
FATA[0025] Error executing BigQuery SQL.                 Error="googleapi: Error 400: Unrecognized name: origin at [17:16], invalidQuery" Mode=Subdomains Source=httparchive
  • It seems that httparchive has been removed from Google Cloud data.
commented

This was fixed in PR #8 - thanks for reporting.

commented

I believe assets/assets.go needs to be updated to pick up the new SQL. A fresh build still uses origin at the moment:

./commonspeak2 -c credentials.json -p $PROJECT --verbose subdomains -o test.txt -l 100
  ...
  <snip>
  ...
INFO[0000] Compiled SQL Template: CREATE TEMPORARY FUNCTION
  getSubdomain(x STRING)
  RETURNS STRING
  LANGUAGE js AS """
  function getSubdomain(s) {
    try {
      return URI(s).subdomain();
    } catch (ex) {
      return s;
    }
  }
  return getSubdomain(x);
"""
OPTIONS
  ( library="gs://commonspeak-udf/URI.min.js" );
SELECT
  getSubdomain(origin) AS subdomain,
  COUNT(origin) AS count
FROM
  `httparchive.urls.*`
GROUP BY
  subdomain
ORDER BY
  count DESC
LIMIT
  100;  Mode=Subdomains
INFO[0000] Generated SQL template for HTTPArchive.       Mode=Subdomains

After updating with go-bindata -pkg assets -o assets/assets.go data/..., I get the expected SQL:

INFO[0000] Compiled SQL Template: CREATE TEMPORARY FUNCTION
  getSubdomain(x STRING)
  RETURNS STRING
  LANGUAGE js AS """
  function getSubdomain(s) {
    try {
      return URI(s).subdomain();
    } catch (ex) {
      return s;
    }
  }
  return getSubdomain(x);
"""
OPTIONS
  ( library="gs://commonspeak-udf/URI.min.js" );
SELECT
  getSubdomain(url) AS subdomain,
  COUNT(url) AS count
FROM
  `httparchive.urls.*`
GROUP BY
  subdomain
ORDER BY
  count DESC
LIMIT
  100;
  Mode=Subdomains

Edit: PR: #11
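
One way to catch a stale embedded template before running a query is to check the compiled SQL for the old column name. The following is a minimal sketch, not part of commonspeak2: the helper name referencesColumn is hypothetical, and the two SQL strings are abbreviated from the log output above.

```go
package main

import (
	"fmt"
	"regexp"
)

// referencesColumn is a hypothetical helper (not part of commonspeak2).
// It reports whether sql mentions column as a standalone identifier,
// so "url" does not match inside "urls".
func referencesColumn(sql, column string) bool {
	re := regexp.MustCompile(`\b` + regexp.QuoteMeta(column) + `\b`)
	return re.MatchString(sql)
}

func main() {
	// Abbreviated versions of the two compiled templates shown in the logs.
	stale := "SELECT getSubdomain(origin) AS subdomain, COUNT(origin) AS count FROM `httparchive.urls.*`"
	fresh := "SELECT getSubdomain(url) AS subdomain, COUNT(url) AS count FROM `httparchive.urls.*`"

	fmt.Println(referencesColumn(stale, "origin")) // true: assets.go is out of date
	fmt.Println(referencesColumn(fresh, "origin")) // false: regenerated assets are in use
}
```

A check like this only inspects the SQL text. It does not confirm that the column exists in the BigQuery table, so a dry run against the dataset is still the real test.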