diegov / searchbox

Personal crawling and indexing

Intro

Simple crawling and indexing of personal bookmarks with an Elasticsearch backend. Both indexing and search are WIP and need lots of work.

Sources:

  • Github stars
  • Pocket bookmarks
  • Saved Reddit posts
  • Liked tweets

Configuration

The application expects a ~/.config/searchbox/secrets.py file, with the following structure:

github = {
    # Users whose stars will be crawled
    "users_to_crawl": ["user1"],
    # Username for API access
    "username": "apiuser1",
    "personal_access_token": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
}

pocket = {
    "consumer_key": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
    "access_token": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
}

elastic = {
    "url": "https://xxxxxxxx",
    # Optional
    "username": "xxxxxxx",
    # Optional
    "password": "xxxxxxx"
}

The elastic section is required; github and pocket can be omitted if the corresponding spider won't be used.
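
Since secrets.py is a plain Python module, it can be loaded with importlib. A minimal sketch of reading the config, assuming the standard path above (the repo's actual loading code may differ):

import importlib.util
from pathlib import Path

# Load ~/.config/searchbox/secrets.py as a regular Python module.
path = Path.home() / ".config" / "searchbox" / "secrets.py"
spec = importlib.util.spec_from_file_location("secrets", path)
secrets = importlib.util.module_from_spec(spec)
spec.loader.exec_module(secrets)

print(secrets.elastic["url"])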

Unfortunately, each service requires its own type of API access (registered app keys, OAuth, etc.), so:

Pocket

Pocket requires an application to be registered at https://getpocket.com/developer/apps/, which will give you the consumer_key. After that you have to do the OAuth dance described in https://getpocket.com/developer/docs/authentication, which will result in an authenticated access_token.
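
A minimal sketch of that dance using the requests library; the redirect URI is a placeholder (Pocket only uses it to send you back after authorizing), and this is not the repo's own code:

import requests

CONSUMER_KEY = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"  # from getpocket.com/developer/apps/
REDIRECT_URI = "http://localhost/"  # placeholder; any URL works for a one-off

headers = {"X-Accept": "application/json"}

# Step 1: obtain a request token (the "code").
resp = requests.post(
    "https://getpocket.com/v3/oauth/request",
    json={"consumer_key": CONSUMER_KEY, "redirect_uri": REDIRECT_URI},
    headers=headers,
)
code = resp.json()["code"]

# Step 2: authorize the request token in a browser.
print("Visit: https://getpocket.com/auth/authorize"
      f"?request_token={code}&redirect_uri={REDIRECT_URI}")
input("Press enter after authorizing the app... ")

# Step 3: exchange the request token for the access_token.
resp = requests.post(
    "https://getpocket.com/v3/oauth/authorize",
    json={"consumer_key": CONSUMER_KEY, "code": code},
    headers=headers,
)
print(resp.json()["access_token"])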

Github

This is much simpler; see https://help.github.com/en/github/authenticating-to-github/creating-a-personal-access-token-for-the-command-line
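
Once you have a token, a quick way to sanity-check it before crawling is to hit the stars endpoint directly. A sketch with requests, reusing the hypothetical names from the secrets.py example above:

import requests

USERNAME = "apiuser1"  # the API username from secrets.py
TOKEN = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"  # the personal access token

# List a few repositories starred by one of the users to crawl.
resp = requests.get(
    "https://api.github.com/users/user1/starred",
    auth=(USERNAME, TOKEN),
    params={"per_page": 5},
)
for repo in resp.json():
    print(repo["full_name"])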

Elastic

See the details for your Elasticsearch provider, local deployment, container, etc.

If the Elasticsearch configuration does not automatically create indexes, you'll need to create one called scrapy before indexing.
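
Creating it is a single PUT against the cluster; for example, a sketch with requests, using the placeholder values from secrets.py:

import requests

url = "https://xxxxxxxx"  # elastic["url"] from secrets.py
auth = ("xxxxxxx", "xxxxxxx")  # drop this if no authentication is configured

# PUT /<index> creates an empty index named "scrapy".
resp = requests.put(f"{url}/scrapy", auth=auth)
print(resp.status_code, resp.json())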

Running

From the root of the project

python3 -m venv venv
. venv/bin/activate
pip install -r requirements.txt
bin/crawl

There are two spiders, github_stars and pocket. crawl is a shell script that runs both, but they can also be run individually, e.g.:

scrapy crawl github_stars
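
Spiders can also be launched from Python via Scrapy's CrawlerProcess; a minimal sketch, assuming it's run from the project root so spider names resolve:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Run a single spider programmatically instead of via the scrapy CLI.
process = CrawlerProcess(get_project_settings())
process.crawl("github_stars")  # or "pocket"
process.start()  # blocks until the crawl finishes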

At the end of this, some data should be stored in Elasticsearch. There's a simple test script that will query the results:

bin/query q python

Will return the first 30 results matching python.

bin/query q python performance

Will return the first 30 results matching python AND performance.
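
Under the hood this presumably boils down to an Elasticsearch match query with operator set to and. A hedged sketch of the equivalent raw request (the field name text is a guess, and the actual query built by the script may differ):

import requests

# Roughly equivalent to: bin/query q python performance
body = {
    "size": 30,  # "first 30 results"
    "query": {
        "match": {
            "text": {  # hypothetical field name
                "query": "python performance",
                "operator": "and",
            }
        }
    },
}
resp = requests.post("https://xxxxxxxx/scrapy/_search", json=body)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_source"])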

bin/query reset-index

Will delete all the data in the Elasticsearch index and re-create it.


License

GNU General Public License v3.0


Languages

Python 98.9%, Shell 1.1%