This is a simple crawler for Slashdot. It collects information about each article and dumps it into a MongoDB database. The information collected is:
- title
- url
- content of the article (raw HTML)
- datetime of the article
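As a rough illustration of what one stored record might look like, here is a minimal Python sketch. The key names (`title`, `url`, `content`, `datetime`), the `articles` collection name, and both helper functions are assumptions made for this example and are not taken from the crawler's code; only the database name `slashdot_db` and port 27017 come from the commands in this README.

```python
from datetime import datetime, timezone


def build_article_document(title, url, content_html, published_at):
    """Return a dict holding the four fields listed above (hypothetical key names)."""
    return {
        "title": title,
        "url": url,
        "content": content_html,   # raw HTML, stored unmodified
        "datetime": published_at,  # pymongo stores datetime objects as BSON dates
    }


def save_article(document, mongo_uri="mongodb://localhost:27017"):
    """Insert one document; needs pymongo installed and a reachable MongoDB server."""
    from pymongo import MongoClient  # imported lazily so the rest runs without pymongo

    client = MongoClient(mongo_uri)
    try:
        # "articles" is an assumed collection name
        return client["slashdot_db"]["articles"].insert_one(document).inserted_id
    finally:
        client.close()


example = build_article_document(
    title="Example headline",
    url="https://slashdot.org/story/example",
    content_html="<p>Example body</p>",
    published_at=datetime(2020, 1, 1, 12, 0, tzinfo=timezone.utc),
)
```

Calling `save_article(example)` would write the record to the local MongoDB instance, provided the server is running.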
Create a MongoDB server via Docker:
docker run -d -p 27017:27017 --name mongo_slashdot mongo
Once the crawling is complete, you can dump the DB like this:
mongodump --db=slashdot_db
Run the mongorestore command on the dump/ folder generated with mongodump:
mongorestore dump/