mycaule / dd-assessment

Coding test for Datadog

June 2020 DD Assessment

Observations

In the Linux world, users and admins frequently have to process relatively large text files. In particular, for the problem of sorting and filtering with a limited amount of computing resources, the GNU utils have long provided very efficient tools.

Considering the hourly pageviews file is only about 50MB compressed and the blacklist file about 3MB, we first try a pragmatic approach using only basic CLI tools from Debian-based distributions.

Hence, using wget, cat, gzip, grep, awk, sort and head, we can pretty much solve the exercise in a few lines of code using these one-liners.

# download the hourly pageviews file
wget https://dumps.wikimedia.org/other/pageviews/2020/2020-06/pageviews-20200601-020000.gz -P data

# list of unique domains (about 1500 different domains)
zcat pageviews-20200601-020000.gz | awk '{print $1}' | uniq

# top 25 results for fr
zcat pageviews-20200601-020000.gz | grep -E '^fr ' | grep -vF -f blacklist_domains_and_pages | sort -nrk3,3 | head -25 | awk '{print $2" "$3}'

Because we can read the files as streams and pipe them through different tools, we come up with a nice solution that can run on a computer without much RAM or CPU, which is in fact the case for my personal computer. I believe cluster and cloud computing shouldn't always be used as a hammer to solve problems.

The solutions we study are similar to the merge sort algorithm: we show the ideas behind three implementations, in Bash (GNU utils), Python (Pandas) and Scala (Spark).

Bash solution

We read the gzipped file sequentially, splitting the work into subproblems by domain, and keep appending to the result file.

cd bash

./run.sh
# or
./run.sh 2020 06 01 00

Running tests with bats.

bats tests.bats
  • runs in about half an hour
  • can be optimized by splitting the file and using GNU parallel to take full advantage of multi-core processing.

Python solution

We investigate further using Pandas in a Colab notebook and finally settle on a simple derived implementation using Metaflow in Python.
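As a rough illustration of the per-domain computation explored in the notebook, here is a minimal Pandas sketch; the file paths and the whitespace-separated domain/page/views/bytes column layout are assumptions based on the hourly dump format, and the actual notebook differs in the details.

import pandas as pd

# Assumed paths; the hourly dump is whitespace-separated with
# columns: domain code, page title, view count, transferred bytes.
pageviews = pd.read_csv(
    "data/pageviews-20200601-020000.gz",
    sep=" ",
    names=["domain", "page", "views", "bytes"],
)
blacklist = pd.read_csv(
    "data/blacklist_domains_and_pages",
    sep=" ",
    names=["domain", "page"],
)

# Solve one sub-problem per domain, then merge the partial results.
tops = []
for domain, group in pageviews.groupby("domain"):
    banned = blacklist.loc[blacklist["domain"] == domain, "page"]
    top25 = (
        group[~group["page"].isin(banned)]
        .nlargest(25, "views")[["domain", "page", "views"]]
    )
    tops.append(top25)

result = pd.concat(tops)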

Metaflow is an open-source tool from Netflix for ML pipelines which offers multiple advantages, from rapid prototyping to useful abstractions between the local host and the cloud on AWS using S3, Batch and Step Functions.

You can write simple DAGs using their Python library and get CLI boilerplate (documentation, logging, etc.) out of the box.

We let users choose the list of domains to compute in parallel using multiple processes.

Doing the job completely would require choosing the subjobs in a smart manner over the distribution of domains, and then joining the results at the end. A minimal sketch of that pattern with Metaflow's foreach construct is shown below (the step and artifact names are hypothetical, not the ones used in stats.py):
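from metaflow import FlowSpec, Parameter, step
import json

class TopPagesFlow(FlowSpec):
    # JSON-encoded list of domains, mirroring the --domains argument used below
    domains = Parameter("domains", help="JSON list of domains", default='["fr"]')

    @step
    def start(self):
        self.domain_list = json.loads(self.domains)
        # fan out: one branch per domain, run as separate processes
        self.next(self.compute_top, foreach="domain_list")

    @step
    def compute_top(self):
        self.domain = self.input
        # filter the hourly file for this domain, drop blacklisted pages,
        # sort by view count and keep the top 25 (omitted here)
        self.top25 = []
        self.next(self.join)

    @step
    def join(self, inputs):
        # merge the per-domain results back together
        self.results = {inp.domain: inp.top25 for inp in inputs}
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    TopPagesFlow()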

Pre-requisites

Create a virtual environment with all the required Python packages.

python3 -m venv venv
source venv/bin/activate
pip3 install -r requirements.txt

To run the workflow

cd python

# Command line help on the arguments
python3 stats.py run --help

# Shows the DAG
python3 stats.py show

# Running on the hourly file 1 day ago
python3 stats.py run --domains '["zu", "zu.d", "zu.m"]'

Running tests

python3 tests.py

See also GitHub Actions logs

  • runs in about an hour in the default Google Colab instance.

References

This second approach is more pragmatic than Spark, which would be required to compute analytics over long periods of time. This exercise was only about computing the analytics for one-hour periods, possibly looping or scheduling cron jobs every few hours to automate.

Wikimedia says they compute their analytics using Hadoop and provide functionality similar to this exercise through a REST API.

Scala solution

Lastly, we use Spark, which provides more safety when dealing with the datasets over time, but requires running the code on a cluster of machines with more sophisticated software installed.
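For illustration, the core computation (top 25 pages per domain, minus the blacklist) can be expressed with Spark's DataFrame API; here is a rough PySpark sketch of the idea. The project itself is written in Scala and organised differently, and the paths and column names below are assumptions.

from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.appName("pageviews-top25").getOrCreate()

# Assumed paths and the whitespace-separated domain/page/views/bytes layout
pageviews = (
    spark.read.option("delimiter", " ")
    .csv("data/pageviews-20200601-020000.gz")
    .toDF("domain", "page", "views", "bytes")
    .withColumn("views", F.col("views").cast("int"))
)
blacklist = (
    spark.read.option("delimiter", " ")
    .csv("data/blacklist_domains_and_pages")
    .toDF("domain", "page")
)

# Drop blacklisted (domain, page) pairs, then keep the 25 most viewed pages per domain
by_views = Window.partitionBy("domain").orderBy(F.col("views").desc())
top25 = (
    pageviews.join(blacklist, ["domain", "page"], "left_anti")
    .withColumn("rank", F.row_number().over(by_views))
    .filter(F.col("rank") <= 25)
    .select("domain", "page", "views")
)
top25.show()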

I first did my investigations in a Databricks notebook and then wrote a small Scala project.

cd scala
sbt run
# or
sbt run 2020 06 01 00

Running the tests

sbt test

Going further into production would require more work on packaging the archive, and configuring EMR or Dataproc to submit the job. AWS Step Functions and GCP Cloud Composer are also options to schedule the job in production.

  • runs in about a minute on the default Databricks cluster
