arjungarg07 / Stackoverflow-Crawler

Node.js-based recursive Stack Overflow question crawler.

Stackoverflow Crawler

This is a Node.js-based recursive question crawler. It harvests all questions it encounters on Stack Overflow, records how often each one is encountered, and stores the results in a MySQL database as well as in a CSV file.
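
The "encountered frequency" idea can be illustrated with a small sketch. It is not taken from the repository: the helper names `normalizeQuestionUrl` and `recordEncounter` are hypothetical, and it assumes question links have the form /questions/<id>/<slug>, so that two links with the same id count as the same question.

```javascript
// Hypothetical helper (not from the repository): reduce a link to a canonical
// question URL so that the same question is counted under a single key.
function normalizeQuestionUrl(href, base = 'https://stackoverflow.com') {
  const url = new URL(href, base);
  // Assumed link shape: /questions/<numeric id>/<slug>; keep only the id part.
  const match = url.pathname.match(/^\/questions\/(\d+)/);
  return match ? `${url.origin}/questions/${match[1]}` : null;
}

// `counts` is a Map from canonical URL to the number of times it was seen.
// Returns true when the URL is new, i.e. when it is worth crawling.
function recordEncounter(counts, href) {
  const key = normalizeQuestionUrl(href);
  if (key === null) return false; // not a question link, ignore it
  const seenBefore = counts.has(key);
  counts.set(key, (counts.get(key) || 0) + 1);
  return !seenBefore;
}
```

A crawler built this way would call `recordEncounter` for every link found on a page and only enqueue the links for which it returns true, which keeps the recursion from revisiting pages while still counting every reference.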

Demo

YouTube link: https://www.youtube.com/watch?v=H6zzndSSEQM

Features

  • Limits the number of concurrent API requests.
  • The concurrency limit is configurable.
  • The user can choose the page limit for the seed URLs taken from the Stack Overflow homepage.
  • Scrapes the total number of upvotes and the total number of answers for every question.
  • Delays API requests to simulate human behavior and keep the IP address from being blocked.
  • Tracks the total reference count for every encountered URL.
  • Dumps the data to a CSV file when the user kills the script.
  • Saves the data to the MySQL database when the user kills the script.
  • Modular code that follows consistent naming conventions.
  • Clean, readable, easy-to-follow code.
  • Uses cheerio for HTML parsing.
  • The solution is asynchronous.
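
The concurrency limit and request delay listed above could be implemented in several ways. The following is a minimal sketch of one of them, a fixed pool of workers pulling from a shared list. The names `runWithLimit`, `limit` and `delayMs` are assumptions made for illustration and are not taken from the repository.

```javascript
// Resolve after `ms` milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Run `task(item)` for every item with at most `limit` tasks in flight.
// Each worker pauses `delayMs` after finishing a task before taking the next.
async function runWithLimit(items, task, { limit = 5, delayMs = 0 } = {}) {
  const results = new Array(items.length);
  let next = 0; // index of the next item to hand out

  async function worker() {
    while (next < items.length) {
      // No await between the check and the increment, so two workers
      // can never claim the same index.
      const index = next++;
      results[index] = await task(items[index]);
      if (delayMs > 0) await sleep(delayMs);
    }
  }

  const workerCount = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results; // same order as `items`
}
```

With this shape, changing the concurrency limit or the delay is a matter of passing different option values, for example `runWithLimit(urls, fetchPage, { limit: 3, delayMs: 1000 })`, where `fetchPage` stands for whatever function requests and parses a single page.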

TechStack

  • JavaScript
  • Node.js
  • MySQL

Note

If you cannot connect to your local MySQL database, comment out the line below. The script will then save the data to the CSV file only.

await saveData(data); // comment out this line if you don't want to save data to the database or are having trouble connecting to it
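
As an alternative to commenting the line out, the database save could be treated as optional at shutdown, so that a failed MySQL connection never prevents the CSV dump. The sketch below only illustrates that idea and is not the repository's code: `toCsv`, `persist`, `installShutdownHook`, the `csvPath` and `saveToDb` options, and the column names are all assumptions.

```javascript
const fs = require('fs');

// Quote a value for CSV: wrap it in double quotes and double embedded quotes.
function csvField(value) {
  return `"${String(value ?? '').replace(/"/g, '""')}"`;
}

// `rows` is assumed to be an array of { url, count, upvotes, answers } objects.
function toCsv(rows) {
  const header = ['url', 'count', 'upvotes', 'answers'];
  const lines = rows.map((row) => header.map((key) => csvField(row[key])).join(','));
  return [header.join(','), ...lines].join('\n') + '\n';
}

// Always write the CSV first; the database save is attempted afterwards and
// a failure there is logged instead of aborting the shutdown.
async function persist(rows, { csvPath, saveToDb }) {
  fs.writeFileSync(csvPath, toCsv(rows));
  if (!saveToDb) return { csv: true, db: false };
  try {
    await saveToDb(rows); // e.g. the saveData function mentioned above
    return { csv: true, db: true };
  } catch (err) {
    console.error(`Database save skipped: ${err.message}`);
    return { csv: true, db: false };
  }
}

// Run `persist` once when the user presses Ctrl+C, then exit.
// `getRows` is a hypothetical callback returning the data collected so far.
function installShutdownHook(getRows, options) {
  process.once('SIGINT', async () => {
    await persist(getRows(), options);
    process.exit(0);
  });
}
```

Under these assumptions, the script would call something like `installShutdownHook(() => collectedRows, { csvPath: 'questions.csv', saveToDb: saveData })` once at startup, and users without a database could simply omit `saveToDb`.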

WorkFlow

[Workflow diagram]

Installation

npm install

Execution

node index.js
