A tool to extract recommendations from YouTube.
It's a client / server architecture.
The server centralizes the recommendations crawled from YouTube, and any number of clients can be spawned to crawl YouTube.
Each client will:
- ask the server for a YouTube video URL to crawl
- crawl it (main info) as well as its channel and the first 10 recommendations
- send the result to the server
- iterate, using a brand new browser session such that history doesn't play a role
The most compute intensive operations are performed by the clients, so it's OK to have one server and many clients (hundreds or thousands probably work fine).
We don't know yet what we will do with the dataset exactly, this is a basis for research.
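The per-client loop described above can be sketched roughly as follows. This is purely illustrative: the real client is a node app, and the function names here are placeholders, not the project's actual API.

```shell
#!/usr/bin/env bash
# Illustrative sketch of the client loop; NOT the project's real API.
# Each placeholder stands in for a real server call or crawl step.

next_url() {         # ask the server for a YouTube video URL to crawl
  echo "https://www.youtube.com/watch?v=example"
}

crawl() {            # crawl the video, its channel, and the first 10 recommendations
  echo "{\"url\": \"$1\", \"recommendations\": 10}"
}

send_result() {      # send the crawl result back to the server
  echo "sent: $1"
}

# The real client loops indefinitely, starting a brand new browser
# session each iteration so history doesn't play a role.
for i in 1 2 3; do
  url=$(next_url)
  result=$(crawl "$url")
  send_result "$result"
done
```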
- a linux machine or something that runs `bash`
- the latest `docker` with `docker compose`

`docker` and `docker compose` are the easiest way to run either the server or the clients.
A script named `setup-ubuntu` is provided to install docker on a brand new Ubuntu Jammy machine. Otherwise, follow the official instructions, on which the script is heavily based.
You'll need the node server itself, and another server that understands virtual hosts and SSL to do the SSL termination and forward the traffic to node. I describe how to do that with `apache` because I'm more familiar with it, but a similar result could be obtained with `nginx`, for instance.
If you have `docker compose`, running:

```
./server <password>
```

should be enough.
It is recommended to secure the connection with SSL. Since handling SSL certificates in node apps is usually a pain, I'm using `Apache2` for the SSL termination, and it uses `mod_proxy` to forward the requests to node.
If you have `apache2` installed, you can use the example vhost, adapting what's necessary (a priori only the `ServerName`) to route the traffic to node. You'll need to enable two `apache` modules:

```
sudo a2enmod proxy
sudo a2enmod remoteip
sudo systemctl reload apache2
```

Note that if your vhost uses `ProxyPass` to an `http://` backend, you may also need `sudo a2enmod proxy_http`.
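For reference, a vhost of this shape might look roughly like the following. This is illustrative only: the example vhost shipped in the repository is the authoritative starting point, and the node port used here is an assumption.

```apache
# Illustrative sketch -- adapt ServerName; port 3000 is an assumption,
# use the port your node server actually listens on.
<VirtualHost *:80>
    ServerName crawl.example.com

    # Forward all traffic to the node server (requires mod_proxy / mod_proxy_http)
    ProxyPreserveHost On
    ProxyPass        / http://127.0.0.1:3000/
    ProxyPassReverse / http://127.0.0.1:3000/
</VirtualHost>
```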
There is another vhost to expose the database administration interface; it uses Basic Auth to protect access to the services exposed. In this one you'll have to modify `ServerName` and `ServerAdmin`.

Quick reminder on how to add a user for Basic Auth:

```
sudo htpasswd -c /etc/apache2/.htpasswd <user>
```

(omit `-c` when adding further users: `-c` creates the file and would overwrite an existing one)
Just copy the 2 vhosts you have just adapted to `/etc/apache2/sites-available`, then run `sudo a2ensite` on each one, then enable SSL on both of them by following the instructions from the certbot website (typically via `sudo certbot --apache`).
Customize the seed for this client: edit `seed_video` and `client_name` in `config/production-docker.yaml`. A client is identified by its name and IP address (IP as determined by the server). The seed video is associated with the client at its creation and never changes.
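As an illustration, the relevant part of `config/production-docker.yaml` might look like this. The values are placeholders, and only the `seed_video` and `client_name` keys are documented above; the actual file in the repository is the reference.

```yaml
# Illustrative fragment -- values are placeholders, adapt to your setup.
client_name: my-client-01
seed_video: https://www.youtube.com/watch?v=dQw4w9WgXcQ
```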
Start the client(s)

Still assuming you have `docker compose` installed, just run:

```
./explore <url> <password> [concurrency=4]
```
First start the server somewhere with `./server some_password`. Then from any computer run:

```
node dist/bin/createProject.js https://server.com some_password data/urls.sample.txt
```

Then answer the questions. The project name must be unique. The file that contains the URLs is a regular text file with one URL per line and no header.
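Such a URL list can be produced with anything that writes one URL per line; for example (the video IDs below are placeholders):

```shell
# Create a minimal URL list: one YouTube URL per line, no header.
# The path matches the createProject invocation above; video IDs are placeholders.
mkdir -p data
printf '%s\n' \
  'https://www.youtube.com/watch?v=AAAAAAAAAAA' \
  'https://www.youtube.com/watch?v=BBBBBBBBBBB' \
  > data/urls.sample.txt

wc -l < data/urls.sample.txt
```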
It should display something like this:

```
Server: https://server.com
Password: some_password
{"level":50,"time":1668524728301,"pid":2938054,"hostname":"maison","msg":"Loading config from test.yaml by default. This may be a mistake."}
Project name: test project again
Project description: a test project
Successfully created project Project {
  id: 2,
  name: 'test project again',
  type: 'first level recommendations',
  description: 'a test project',
  createdAt: '2022-11-15T15:05:44.615Z',
  updatedAt: '2022-11-15T15:05:44.615Z'
}
```
Once the project is created, you can start clients to scrape only the first-level recommendations with:

```
./client server.com some_password 2
```

where `2` is the `id` of the project you've just created. The recommendations will be stored in the `video` table as usual, with `project_id` set to the project you've used.
You can parallelize the scraping across many machines if the list of URLs is long.
In the `video` table, videos are unique for a given project, i.e. there is a unique index on `project_id` and `url`, so re-crawling the same URL within a project cannot create a duplicate row.