raja-d / v_crawler

A simple project to crawl the Amazon Prime Video platform for movies and series

v_crawler

v_crawler is a simple project to crawl the Amazon Prime Video platform for movies and series.

It uses the Scrapy framework to extract information about the content it finds and saves the results to an AWS DynamoDB table.

Requirements

Privoxy

Install Privoxy
Uncomment the following line in the config.txt:

forward-socks4a / 127.0.0.1:9050 .
Change the port in that line from 9050 to 9150, the SOCKS port the TOR Browser listens on.
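The two config changes above can also be scripted. The following is a minimal sketch, not part of the project: the helper name `enable_tor_forwarding` is hypothetical, and it assumes the commented-out line in your Privoxy config looks like the one shown above (a leading `#`, arbitrary whitespace, port 9050).

```python
import re

# Matches a commented-out forwarding line such as
#   "#  forward-socks4a / 127.0.0.1:9050 ."
_FORWARD_LINE = re.compile(
    r"^\s*#\s*(forward-socks4a\s+/\s+127\.0\.0\.1:)9050(\s+\.)\s*$"
)


def enable_tor_forwarding(config_text: str, port: int = 9150) -> str:
    """Uncomment the forward-socks4a line and point it at `port`.

    Hypothetical helper: it only rewrites lines that match the
    pattern above and leaves every other line untouched.
    """
    lines = []
    for line in config_text.splitlines():
        match = _FORWARD_LINE.match(line)
        if match:
            # Drop the leading "#" and swap in the requested port.
            line = f"{match.group(1)}{port}{match.group(2)}"
        lines.append(line)
    return "\n".join(lines) + "\n"
```

Read the config file, pass its text through the function, and write the result back. Keep a backup of the original file, since the location and exact layout of the Privoxy config differ between platforms.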

TOR

Install the TOR Browser and keep it running while you crawl.
If you want to use the TOR service instead of the TOR Browser, skip the Privoxy port change (leave 9050 in the forwarding line) and change the ports inside the code to 9050.
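For orientation, this is roughly how a Scrapy project can send its requests through Privoxy, which then forwards them to TOR. It is a sketch under assumptions, not the project's actual code: the class name `PrivoxyProxyMiddleware` is hypothetical, and the URL assumes Privoxy listens on its default address, 127.0.0.1:8118.

```python
# Assumption: Privoxy listens on its default address, 127.0.0.1:8118.
PRIVOXY_URL = "http://127.0.0.1:8118"


class PrivoxyProxyMiddleware:
    """Hypothetical Scrapy downloader middleware that routes requests via Privoxy."""

    def process_request(self, request, spider):
        # Scrapy honors the "proxy" key in request.meta; setdefault keeps
        # any proxy that was already chosen for this request.
        request.meta.setdefault("proxy", PRIVOXY_URL)
        return None  # let Scrapy continue processing the request
```

To use a middleware like this, register it under `DOWNLOADER_MIDDLEWARES` in the Scrapy settings with a priority below 750, so that it runs before Scrapy's built-in `HttpProxyMiddleware`.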

PostgreSQL

Create a "database.ini" file with the following schema:

[postgresql]
host=<yourHost>
port=<yourPort>
dbname=<yourDbName>
user=<yourUser>
password=<yourPassword>
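A file in this format can be read with Python's standard `configparser`. The sketch below is illustrative only: the function name `load_db_config` is hypothetical, and how the project itself loads the file may differ.

```python
from configparser import ConfigParser


def load_db_config(path="database.ini", section="postgresql"):
    """Return the connection settings from `path` as a plain dict."""
    # interpolation=None keeps "%" characters in passwords literal.
    parser = ConfigParser(interpolation=None)
    if not parser.read(path):
        raise FileNotFoundError(f"Could not read {path}")
    if not parser.has_section(section):
        raise KeyError(f"Section [{section}] not found in {path}")
    return dict(parser.items(section))
```

The keys in the file (host, port, dbname, user, password) match the keyword arguments that PostgreSQL drivers such as psycopg2 accept, so, assuming psycopg2 is the driver in use, the dict can be passed on as `psycopg2.connect(**load_db_config())`.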

Create a table using the following specs:

CREATE TABLE amazon_video_de
(
    movie_id   VARCHAR(10)  NOT NULL PRIMARY KEY,
    url        VARCHAR(255) NOT NULL,
    title      VARCHAR(255) NOT NULL,
    rating     FLOAT,
    imdb       FLOAT,
    genres     VARCHAR[],
    year       NUMERIC,
    fsk        NUMERIC,
    movie_type VARCHAR(255),
    poster     BYTEA
);
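To illustrate how a crawled item maps onto this table, here is a sketch of a parameterized upsert. It is an assumption about usage, not the project's pipeline: `build_upsert` and `COLUMNS` are hypothetical names, the `%s` placeholders follow psycopg2's parameter style, and `ON CONFLICT` requires PostgreSQL 9.5 or later.

```python
# Column order mirrors the CREATE TABLE statement above.
COLUMNS = (
    "movie_id", "url", "title", "rating", "imdb",
    "genres", "year", "fsk", "movie_type", "poster",
)


def build_upsert(item: dict):
    """Return (sql, params) that inserts `item` or updates the row with the same movie_id."""
    placeholders = ", ".join(["%s"] * len(COLUMNS))
    # On a primary-key conflict, overwrite every column except the key itself.
    updates = ", ".join(
        f"{col} = EXCLUDED.{col}" for col in COLUMNS if col != "movie_id"
    )
    sql = (
        f"INSERT INTO amazon_video_de ({', '.join(COLUMNS)}) "
        f"VALUES ({placeholders}) "
        f"ON CONFLICT (movie_id) DO UPDATE SET {updates}"
    )
    # Missing fields become None, which the driver sends as NULL.
    params = tuple(item.get(col) for col in COLUMNS)
    return sql, params
```

With a DB-API cursor this would be executed as `cursor.execute(*build_upsert(item))`. Re-crawling the same title then refreshes the existing row instead of failing on the primary key.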

To speed up case-insensitive title lookups, create an index on the result of PostgreSQL's LOWER() function:

CREATE INDEX ON amazon_video_de (LOWER(title));
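PostgreSQL can only use an expression index when the query contains the same expression, so lookups should compare against `LOWER(title)` rather than `title`. A minimal sketch, with the hypothetical helper name `title_lookup` and psycopg2-style `%s` placeholders assumed:

```python
def title_lookup(title: str):
    """Return (sql, params) for a case-insensitive exact-title search."""
    # LOWER(title) matches the indexed expression; lowercasing the
    # parameter on the database side keeps both sides comparable.
    sql = (
        "SELECT movie_id, title, url "
        "FROM amazon_video_de "
        "WHERE LOWER(title) = LOWER(%s)"
    )
    return sql, (title.strip(),)
```

A query written as `WHERE title = %s` or `WHERE title ILIKE %s` would not benefit from this particular index.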

IMDb API server

Install my imdb-api-server
By default it listens on localhost:8555.
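Since the crawler depends on several local services, a quick check before starting can save a failed run. The sketch below only tests whether something accepts TCP connections on each port. The function names are hypothetical, and the Privoxy entry assumes its default listen address, 127.0.0.1:8118.

```python
import socket


def is_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Ports taken from this README, except Privoxy's, which is its default (assumption).
REQUIRED_SERVICES = {
    "Privoxy": ("127.0.0.1", 8118),
    "TOR Browser": ("127.0.0.1", 9150),
    "imdb-api-server": ("127.0.0.1", 8555),
}


def missing_services(services=REQUIRED_SERVICES):
    """Return the names of services that are not reachable."""
    return [
        name
        for name, (host, port) in services.items()
        if not is_listening(host, port)
    ]
```

An empty list from `missing_services()` means all three ports answered. It does not verify that the right program is behind each port.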

Python dependencies

Install the Python dependencies listed in requirements.txt:

pip install -r requirements.txt
