Website content extractor

This repository contains a Django application with a microservice that accepts a URL from the user, along with parameters selecting text, images, or both, and saves the request to the database as a task. A Celery worker then periodically processes the saved URLs according to those parameters and stores the extracted content in the database. All tasks, saved images, and saved texts are visible through the REST API.

Running the application

First, copy the environment variable template .env_template to a new file named .env:

cp .env_template .env

The whole environment is containerized with Docker, so make sure that docker and docker-compose are installed. To pull all images, create the database storage directory, and start all services, run:

docker-compose up -d

Tasks

To add a new task that extracts text, images, or both, run a curl command such as:

curl -d "url=http://www.example_url.com/&get_image=true" -X POST http://localhost:8000/api/tasks/

The command above fetches all images from the URL 'http://www.example_url.com/'. To also extract the full text, add get_text=true to the request data.
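The same request can be built from Python. The following is a minimal sketch using only the standard library; the helper name build_task_request is hypothetical, and the field names url, get_image and get_text are taken from the curl example above. The request is only constructed here, not sent, so it can be inspected without the service running.

```python
from urllib.parse import urlencode, parse_qs
from urllib.request import Request

# Endpoint from the curl example above.
API_TASKS_URL = "http://localhost:8000/api/tasks/"


def build_task_request(page_url, get_image=False, get_text=False):
    """Build (but do not send) the POST request that creates an extraction task.

    Hypothetical helper: it mirrors the form fields used by the curl command.
    """
    fields = {"url": page_url}
    if get_image:
        fields["get_image"] = "true"
    if get_text:
        fields["get_text"] = "true"
    # Form-encode the fields, the same body format that `curl -d` sends.
    body = urlencode(fields).encode("ascii")
    return Request(API_TASKS_URL, data=body, method="POST")


request = build_task_request(
    "http://www.example_url.com/", get_image=True, get_text=True
)
print(request.get_method(), request.full_url)
print(request.data.decode("ascii"))

# To actually send it (requires the service to be running):
# from urllib.request import urlopen
# with urlopen(request) as response:
#     print(response.status, response.read().decode())
```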

All tasks are listed at localhost:8000/api/tasks/, where they can be filtered and ordered. For example, to see only completed tasks, add the query parameter ?state=success.
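As an illustration, a filtered list URL can be assembled in Python as sketched below. The helper name tasks_list_url is hypothetical, and only the state filter is confirmed by this README; the name of the ordering parameter is not stated here, so it is left out.

```python
from urllib.parse import urlencode


def tasks_list_url(base="http://localhost:8000/api/tasks/", **filters):
    """Return the task list URL with optional query-string filters.

    Hypothetical helper; `state` is the only filter named in this README.
    """
    query = urlencode(filters)
    return f"{base}?{query}" if query else base


print(tasks_list_url(state="success"))
# → http://localhost:8000/api/tasks/?state=success
```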

Images extractor

All completed image extraction tasks are listed at localhost:8000/api/images/, where you can download an image by clicking the path value of its image key.
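To give a rough idea of what image extraction involves, the sketch below collects image URLs from a page's HTML with the standard library. It is an assumption made for illustration, not necessarily how this repository's worker is implemented; the names ImageSourceParser and find_image_urls are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageSourceParser(HTMLParser):
    """Collect absolute URLs from the src attribute of <img> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src")
        if src:
            # Resolve relative paths against the page URL.
            self.image_urls.append(urljoin(self.base_url, src))


def find_image_urls(html, base_url):
    """Return all image URLs found in `html`, in document order."""
    parser = ImageSourceParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.image_urls


sample = '<p><img src="/static/logo.png"> <img src="http://cdn.example.com/a.jpg"></p>'
print(find_image_urls(sample, "http://www.example_url.com/"))
```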

Texts extractor

All completed text extraction tasks are listed at localhost:8000/api/texts/. Texts are stored in the database as a JSON list, because this structure should be more useful to ML developers than one large concatenated string.
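The difference between a list of text fragments and one joined string can be sketched as follows. This is a standard-library illustration under assumed behavior, not the repository's actual extractor; TextFragmentParser and extract_text_fragments are hypothetical names, and skipping script and style content is an assumption.

```python
import json
from html.parser import HTMLParser


class TextFragmentParser(HTMLParser):
    """Collect visible text as separate fragments instead of one string."""

    SKIPPED_TAGS = {"script", "style"}  # assumption: their content is not page text

    def __init__(self):
        super().__init__()
        self.fragments = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.fragments.append(text)


def extract_text_fragments(html):
    """Return the non-empty text fragments of `html`, in document order."""
    parser = TextFragmentParser()
    parser.feed(html)
    parser.close()
    return parser.fragments


sample = "<h1>Title</h1><script>var x = 1;</script><p>First paragraph.</p><p>Second one.</p>"
fragments = extract_text_fragments(sample)
stored = json.dumps(fragments)  # the JSON list form that would be saved
print(stored)
```

Keeping the fragments separate preserves element boundaries, so a consumer can still join them if a single string is needed.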

About

REST API service that receives a URL and downloads all images or text from that site.
