EleutherAI / hn-scraper

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Hacker News Scraper

A scraper for Hacker News (duh)

Usage:

run hn_scraper.py (modify the main function call to change the end_id to parse to)

Filtering Details

Parallelism details

Generating comment trees can be expensive. We use two levels of parallelism to solve this. Items can either be stories or comments, we are given the total number of items on Hacker News (~25,000,000). One pool of workers check if items are stories, and if so initiate another pool of workers that traverse the n comment trees (n is the numebr of good top-level comments) in order to build a single comment chain for each top level comment.

Example:

TODO:

[] Add in flags for better data management [] Fix requests issue

About


Languages

Language:Python 100.0%