v_crawler
v_crawler is a simple project to crawl the Amazon Prime Video platform for movies and series.
It uses the Scrapy framework to extract various information about the content it finds and saves it to an AWS DynamoDB table.
Requirements
Privoxy
- Install Privoxy
- Uncomment the following line in the config.txt:
    forward-socks4a / 127.0.0.1:9050 .
- Change the given port from 9050 to 9150
TOR
- Install the TOR browser and have it running
- If you want to use the TOR service instead of the TOR browser, skip the port change in the Privoxy setup (keep 9050) and change the ports inside the code to 9050.
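As a rough illustration of how the two proxies fit together: Scrapy sends a request through whatever proxy URL is stored in `request.meta["proxy"]`, so a small downloader middleware can point every request at Privoxy, which then forwards it to TOR. The sketch below assumes Privoxy listens on its default address 127.0.0.1:8118. The class name `PrivoxyProxyMiddleware` and the constant `PRIVOXY_URL` are hypothetical and not taken from this project's code.

```python
# Hypothetical downloader middleware; names are illustrative, not from v_crawler.
PRIVOXY_URL = "http://127.0.0.1:8118"  # assumption: Privoxy's default listen address


class PrivoxyProxyMiddleware:
    """Route every outgoing request through the local Privoxy instance."""

    def __init__(self, proxy_url=PRIVOXY_URL):
        self.proxy_url = proxy_url

    def process_request(self, request, spider):
        # Scrapy picks up the proxy to use from request.meta["proxy"].
        request.meta["proxy"] = self.proxy_url
        # Returning None tells Scrapy to keep processing the request normally.
        return None
```

A middleware like this would be enabled through Scrapy's `DOWNLOADER_MIDDLEWARES` setting.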
PostgreSQL
- Create a "database.ini" file with the following schema:
    [postgresql]
    host=<yourHost>
    port=<yourPort>
    dbname=<yourDbName>
    user=<yourUser>
    password=<yourPassword>
- Create a table using the following specs:
    CREATE TABLE amazon_video_de (
        movie_id   VARCHAR(10)  NOT NULL PRIMARY KEY,
        url        VARCHAR(255) NOT NULL,
        title      VARCHAR(255) NOT NULL,
        rating     FLOAT,
        imdb       FLOAT,
        genres     VARCHAR[],
        year       NUMERIC,
        fsk        NUMERIC,
        movie_type VARCHAR(255),
        poster     BYTEA
    );
- To improve query speed, create an index on PostgreSQL's LOWER() function applied to the title column:
    CREATE INDEX ON amazon_video_de (LOWER(title));
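The PostgreSQL setup above could be consumed from Python roughly as follows. This is a minimal sketch, not the project's actual code: the helper `load_db_config` and the query constant `FIND_BY_TITLE` are hypothetical names, and the lookup query is only a guess at how the table might be searched. It uses the standard library's `configparser` to read the `[postgresql]` section of "database.ini".

```python
from configparser import ConfigParser


def load_db_config(path="database.ini", section="postgresql"):
    """Return the key/value pairs of one section of an INI file as a dict."""
    # interpolation=None keeps "%" characters (e.g. in passwords) literal.
    parser = ConfigParser(interpolation=None)
    # ConfigParser.read() silently skips missing files, so check its result.
    if not parser.read(path):
        raise FileNotFoundError(f"Could not read {path}")
    if not parser.has_section(section):
        raise KeyError(f"Section [{section}] not found in {path}")
    return dict(parser.items(section))


# Filtering on LOWER(title) matches the indexed expression, so PostgreSQL
# can consider the index for case-insensitive title lookups.
# Illustrative query only; the columns selected are an assumption.
FIND_BY_TITLE = (
    "SELECT movie_id, url FROM amazon_video_de "
    "WHERE LOWER(title) = LOWER(%s)"
)
```

The keys in the file (`host`, `port`, `dbname`, `user`, `password`) are standard PostgreSQL connection parameters, so with a driver such as psycopg2 (an assumption; the README does not name the driver) the dictionary could be passed on directly, for example `psycopg2.connect(**load_db_config())`.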
IMDb API server
- Install my imdb-api-server
- By default this runs on localhost:8555
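Since the crawler depends on several local services, a quick reachability check before starting it can save some debugging. The sketch below only tests whether a TCP port accepts connections; it does not call any endpoint of the IMDb API server. The function names and the `SERVICES` mapping are hypothetical, and port 8118 for Privoxy is its default rather than something stated in this README.

```python
import socket


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


# Ports named in this README, plus Privoxy's default port (an assumption).
SERVICES = {
    "Privoxy": ("127.0.0.1", 8118),
    "TOR browser": ("127.0.0.1", 9150),
    "IMDb API server": ("localhost", 8555),
}


def check_services(services=SERVICES):
    """Map each service name to whether its port currently accepts connections."""
    return {
        name: is_port_open(host, port)
        for name, (host, port) in services.items()
    }
```

Calling `check_services()` returns a dictionary of service names to booleans; when using the TOR service instead of the TOR browser, the TOR entry would use port 9050.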
Python dependencies
- Install the Python dependencies listed in requirements.txt, for example with pip install -r requirements.txt