hn_mining

Welcome! The goal of this repo is to learn new things from HackerNews without having to read everything.

Watch HackerNews from your CLI

Explore 39 years of content without leaving your teletype machine!

pip install xklb
wget https://github.com/chapmanjacobd/hn_mining/raw/main/hackernews_only_direct.tw.db

library watch hackernews_only_direct.tw.db --random --ignore-errors
$ lb pl hackernews.tw.db -a
╒════════════════════════╤═════════════════╤═════════════════════════════════╤═══════════════════╤════════════════╕
│ path                   │ duration        │ avg_playlist_duration           │   playlists_count │   videos_count │
╞════════════════════════╪═════════════════╪═════════════════════════════════╪═══════════════════╪════════════════╡
│ Aggregate of playlists │ 39 years, 2     │ 4 days, 14 hours and 58 minutes │              3098 │         741500 │
│                        │ months, 27 days │                                 │                   │                │
│                        │ and 20 hours    │                                 │                   │                │
╘════════════════════════╧═════════════════╧═════════════════════════════════╧═══════════════════╧════════════════╛

This is what I mean by 39 years of content: 39 years of video playing 24/7 (not including the 62,876 videos [~8%] where the duration is unknown).
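If you want to reproduce that unknown-duration count yourself, a query along these lines should work (a sketch; it assumes the media table leaves duration NULL or 0 when it could not be determined):

sqlite3 hackernews.tw.db "SELECT count(*) FROM media WHERE duration IS NULL OR duration = 0;"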

$ lb wt hackernews_only_direct.tw.db -pa
╒═══════════╤═════════════════╤══════════════════════════╤════════╤═════════╕
│ path      │ duration        │ avg_duration             │ size   │   count │
╞═══════════╪═════════════════╪══════════════════════════╪════════╪═════════╡
│ Aggregate │ 18 years, 3     │ 1 hour and 22.92 minutes │        │  115987 │
│           │ months, 17 days │                          │        │         │
│           │ and 22 hours    │                          │        │         │
╘═══════════╧═════════════════╧══════════════════════════╧════════╧═════════╛

hackernews_only_direct.tw.db is a subset that includes only direct URLs and excludes playlist URLs. It is a bit smaller but still indexes over 18 years of content. Edit: only about 5 years of that content has an HN score greater than 7.
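You can check that score cutoff against the database once the score column exists (it is added in the recipe at the bottom of this page); a sketch, assuming duration is stored in seconds:

sqlite3 hackernews_only_direct.tw.db "SELECT sum(duration) / (60.0 * 60 * 24 * 365.25) AS years FROM media WHERE score > 7;"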

Zenodo vs GitHub TubeWatch database

I had to remove some records so that the file would fit on GitHub. See the hackernews tubewatch database on github recipe below for the technical details.

The original was uploaded to zenodo.

NB: The zenodo version does not contain all metadata (subtitles, etc) either. For that, after downloading either the zenodo or GitHub version, you would need to run:

library tubeupdate --extra hackernews.tw.db  # this will likely take several days
library optimize hackernews.tw.db  # optional: this will build an fts index

Recipes

Prep

pip install xklb aiohttp
library hnadd hn.db -v  # takes a few days to download 34 million records but you can press ctrl+c when your patience catalyzes

For reference, it took 22 minutes to catch up to the latest 150,000 comments and stories since the last time I ran it. The file is 24GB now.
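You can sanity-check how many records have landed with a quick count (a sketch; it assumes the hn_story and hn_comment tables used later in this README):

sqlite3 hn.db "SELECT (SELECT count(*) FROM hn_story) AS stories, (SELECT count(*) FROM hn_comment) AS comments;"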

I was able to compress the whole database down to 11GB using zstd. You can download it from zenodo.
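For reference, compressing and unpacking with zstd looks something like the following (the exact compression level used for the 11GB file is a guess here):

zstd -T0 -19 hn.db -o hn.db.zst  # higher levels are slower but smaller
unzstd hn.db.zst                 # restores hn.db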

hn_story_nouns / hn_comment_nouns

# Exported `SELECT title || ' ' || COALESCE(text, '') text FROM hn_story` via dbeaver
# For hn_comments I used `SELECT text FROM hn_comment`. sqlite3 or sqlite-utils might've been faster but dbeaver was fast enough
cat hn_story_202210242110.csv | library nouns > hn_story_nouns

function asc
    sort | uniq -c | sort -g
end

cat hn_story_nouns | asc > hn_story_nouns_common

# nouns that only appear once come first in the count-sorted file; everything before the first "      2 " count is unique
sed '/^      2 /Q' hn_story_nouns_common > hn_story_nouns_unique
# then delete those single-occurrence lines from the top of the _common file
sed -i -e 1,(cat hn_story_nouns_unique | count)d hn_story_nouns_common
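
The same split can also be done directly on the uniq -c counts without counting lines first; a sketch (not what was actually used to build the files in this repo):

awk '$1 == 1' hn_story_nouns_common > hn_story_nouns_unique
awk '$1 > 1' hn_story_nouns_common | sponge hn_story_nouns_common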

hn_story_common_domains

function domains
    # print the last two labels of the URL host ($3 when split on /), or the last three
    # when the second-to-last label is "com" (e.g. example.com.au); X is just an unset (empty) awk variable
    awk -F \/ '{l=split($3,a,"."); print (a[l-1]=="com"?a[l-2] OFS:X) a[l-1] OFS a[l]}' OFS="." $argv
end

cat hn_story_urls | domains | asc > hn_story_common_domains
sed -i -e 1,(sed '/^      2 /Q' hn_story_common_domains | count)d hn_story_common_domains  # remove boring unique values

hackernews tubewatch database

sqlite-utils hn.db 'select distinct path from hn_story' --csv > hn_story_urls
xsv select path hn_story_urls | grep -v twitter | sponge hn_story_urls  # strip csv quoting; drop twitter because it 429s so easily
split -d -C 10MB hn_story_urls hn_story_urls_

# run a command as a background job in its own fish process
function b
    fish -c (string join -- ' ' (string escape -- $argv)) &
end

for f in hn_story_urls_*
    b library tubeadd --playlist-files --no-sanitize --safe examples/hackernews.tw.db $f
end
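
To block until all of those background jobs are done and then see what was indexed, something like this should work (lb pl is the same playlists command used at the top of this page):

wait  # waits for the backgrounded fish jobs
lb pl examples/hackernews.tw.db -a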

hackernews tubewatch database on github

JWPlatform, Imgur, CBSNewsEmbed, CNN, Viidea, etc did not seem to work after a few tests:

lb-dev watch ~/github/xk/hn_mining/hackernews.tw.db -w "ie_key = 'JWPlatform'"
https://www.businessinsider.com/australia-face-scan-porn-home-affairs-internet-2019-10
Player exited with code 2
  • ie_key WSJ and FoxNews loaded nearly every time, but it was always a video unrelated to the article lol... NBCNews sometimes suffered from this too
  • ie_key Twitter would often hang or not have a video, but I did see this really cool thing! edit: link removed lol; the URL pointed to something else. Same problem as above.
  • ie_key WashingtonPost and CBCPlayer do not save a valid URL id, so all of that data was impossible to pipe to mpv; ie_key iheartradio too, to some extent
  • ie_key SafariApi just seemed to point to O'Reilly books
  • holy cow batman, CCC and InfoQ videos load fast. They must have a 16 core PC with a 10 Gbit pipe?
DELETE FROM media WHERE ie_key IN (
  'JWPlatform', 'Imgur', 'WSJ', 'Twitter', 'WashingtonPost', 'CBSNewsEmbed', 'CNN',
  'SafariApi', 'Viidea', 'NBCNews', 'FoxNews', 'NineCNineMedia', 'Mixcloud',
  'CBCPlayer', 'LinkedIn', 'AmazonStore', 'Spotify'
);
DELETE FROM media WHERE playlist_path LIKE '%bostonglobe.com%';
ALTER TABLE media DROP COLUMN tags;
sqlite-utils disable-fts hackernews.tw.db media
sqlite-utils vacuum hackernews.tw.db
zstd hackernews.tw.db
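
After the cleanup, a quick way to double-check which extractors are left and how many rows each contributes (a sketch):

sqlite-utils hackernews.tw.db "select ie_key, count(*) from media group by 1 order by 2 desc" -t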

Recipe hackernews_only_direct.tw.db

cp hackernews.tw.db hackernews_only_direct.tw.db
sqlite-utils hackernews_only_direct.tw.db 'delete from media where playlist_path in (select path from playlists)'
sqlite3  # the rest of this recipe runs inside the sqlite3 shell
ATTACH 'hackernews_only_direct.tw.db' AS h;
ATTACH '/home/xk/lb/hn.db' AS hn;

create table hn_score as
  select path, max(score) as score
  from hn_story
  where path in (select path from h.media)
  group by 1
;

alter table h.media add column score int;

UPDATE
  h.media SET score = (
        SELECT score
        FROM hn_score
        WHERE hn_score.path = h.media.path
    )
  WHERE
      EXISTS(
          SELECT 1
          FROM hn_score
          WHERE hn_score.path = h.media.path
      );  -- this takes about eight minutes
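
The eight minutes could probably be shaved down by indexing the lookup column first, and on SQLite 3.33+ the correlated UPDATE can be written with UPDATE ... FROM; a sketch of both (not what was actually run):

CREATE INDEX hn_score_path_idx ON hn_score(path);

UPDATE h.media
SET score = hn_score.score
FROM hn_score
WHERE hn_score.path = h.media.path;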

Mistakes

If you have a careful eye you will notice that the results are not perfect. Here are some weird things I noticed which are probably my fault:

mistakes, hn_story_common_domains

  • duplicates (or maybe this is a bug in uniq (GNU coreutils) 9.0 ?)
  • co.uk and other two-letter second-level domains are not handled by my awk query (see the sketch below)
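
A slightly more forgiving version of the domains function would keep a third label whenever the second-to-last label is short; a rough sketch, not a proper public-suffix lookup:

function domains
    # keep three labels when the second-to-last one is 3 characters or fewer (co.uk, com.au, ...)
    awk -F \/ '{l=split($3,a,"."); print (l>2 && length(a[l-1])<=3 ? a[l-2] OFS : X) a[l-1] OFS a[l]}' OFS="." $argv
end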

mistakes, nouns

  • the curly quote character ” is included (fixed in v1.19.027) but I'm too lazy to update the data in this repo unless someone submits a PR here

mistakes, tubewatch

  • In hindsight I probably should've filtered out some spam by using is_dead IS NULL or something (see the sketch below)
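
A hypothetical version of the URL export that skips dead submissions would look roughly like this, assuming hn_story carries the is_dead flag mentioned above:

sqlite-utils hn.db 'select distinct path from hn_story where is_dead is null and path is not null' --csv > hn_story_urls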

About

hackernews data

License: Creative Commons Zero v1.0 Universal