Arthurlpgc / InfoRetrievalProject


Websites included

The crawler retrieves information from the following online judges:

Running Crawler

To run the crawler, follow these steps:

  • First, make sure you have Python 3.6 and pip installed on your system. Then:
  1. Go to src folder: cd src
  2. Install project requirements: pip install -r requirements.txt
  3. Run the crawler: scrapy runspider crawler/questions.py

This will start a spider that performs a heuristic breadth-first search, downloading all pages in the specified domain. You can watch the retrieved pages arrive on the fly in the src/retrieved/documents and src/retrieved/objects folders.
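The breadth-first crawl above can be pictured as a plain FIFO traversal over the site's link graph. The sketch below is illustrative only, not the project's actual spider code: it uses a standard-library queue in place of Scrapy, and `get_links` is a hypothetical stand-in for fetching a page and extracting its in-domain links.

```python
from collections import deque

def bfs_crawl(start_url, get_links, max_pages=100):
    """Breadth-first crawl: visit pages level by level, skipping duplicates.

    `get_links(url)` stands in for fetching a page and extracting its
    in-domain links (Scrapy handles that part in the real crawler).
    """
    seen = {start_url}
    queue = deque([start_url])
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()  # FIFO queue => breadth-first order
        order.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

# Tiny in-memory "site" used as a stand-in for real HTTP fetches.
site = {
    "/": ["/problems", "/about"],
    "/problems": ["/problems/1", "/problems/2"],
    "/about": [],
    "/problems/1": [],
    "/problems/2": [],
}
pages = bfs_crawl("/", lambda u: site.get(u, []))
# All level-1 pages are visited before any level-2 page.
```

In the real spider the heuristic decides which extracted links are worth enqueuing at all; the FIFO discipline is what makes the crawl breadth-first.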

Creating an Index

After running the crawler and retrieving documents, you have to build an index to work with. To do this:

  1. Go to src folder: cd src
  2. Run the indexer: python3 indexer/indexer.py

It will search for documents stored at src/retrieved/objects and create various indexes accordingly. The indexes will be available for later queries in the src/indexes folder.
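The core structure an indexer like this builds is an inverted index: a map from each term to the documents that contain it. The sketch below is a minimal illustration under assumed inputs, not the project's indexer.py; the document texts are made up.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of doc ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():  # naive whitespace tokenizer
            index[term].add(doc_id)
    # Sorted posting lists make query-time intersection straightforward.
    return {term: sorted(ids) for term, ids in index.items()}

# Hypothetical documents standing in for pages under src/retrieved/objects.
docs = {
    1: "two sum problem",
    2: "graph shortest path problem",
    3: "two pointers technique",
}
index = build_inverted_index(docs)
```

A query then reduces to looking up each query term's posting list and intersecting them, which is why the indexes are persisted to src/indexes for reuse.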

Languages

Language: HTML 99.5%, Python 0.5%