rafaelleinio / meli-challenge

A nice Graph and Spark based solution for the Characters Interactions problem.

⚠️ The external dependency graphframes:0.8.1-spark2.4-s_2.11 currently fails to install; I think its repository was deprecated. The solution needs a refactoring/fix before the commands work properly. I will come back to this when I have the time, sorry 😅

Meli Challenge

The Challenge and Proposed Arch for Solution

The objective of the challenge is to analyze the networks of interactions between the characters in the books of the saga A Song of Ice and Fire, written by George R. R. Martin.

A dataset (data/dataset.csv) was provided with data about all interactions of every character in the first 3 books.

The challenges are:

  • 1st Challenge: compute and display the total of interactions for all characters, for each book in the knowledge base.
  • 2nd Challenge: compute and display all of the mutual friendships between all pairs of characters with more than one friend in common.
  • 3rd Challenge: create an API to register new relations and query mutual friends between two characters.

The rest of this README gives more detail about each challenge and the expected results for each one.

Proposed Solution

The problem was addressed by modeling the data as a graph. Since the data describes characters and the relationships between them, the characters map naturally to vertices and the interactions to edges.

The processing engine for the data is Apache Spark. Spark is a fast, big-data-ready technology for data processing and has several extensions for different data domains. This repository uses the GraphFrames Spark extension, a graph processing engine built on top of Spark's DataFrame API.
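To make the vertex/edge modeling concrete, here is a minimal sketch in plain Python, with no Spark required. It assumes each record carries `source`, `target`, `weight` and `book` fields (the names used by the API payload later in this README; the actual CSV header may differ). The output rows use the column names GraphFrames expects: `id` for vertices, `src` and `dst` for edges. `to_vertices_and_edges` is a hypothetical helper, not a function from this repository.

```python
def to_vertices_and_edges(records):
    """Split interaction records into vertex rows and edge rows."""
    names = set()
    edges = []
    for record in records:
        # Every character seen on either side of an interaction is a vertex.
        names.update((record["source"], record["target"]))
        # Each interaction becomes one edge that keeps its weight and book.
        edges.append(
            {
                "src": record["source"],
                "dst": record["target"],
                "weight": record["weight"],
                "book": record["book"],
            }
        )
    vertices = [{"id": name} for name in sorted(names)]
    return vertices, edges


# Illustrative records, not taken from data/dataset.csv.
records = [
    {"source": "A", "target": "B", "weight": 3, "book": 1},
    {"source": "B", "target": "C", "weight": 1, "book": 2},
]
vertices, edges = to_vertices_and_edges(records)
print(vertices)
print(edges[0])
```

Rows shaped like these could then be loaded into two Spark DataFrames and handed to GraphFrames, which is the part this sketch leaves out.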

Main benefits from this technology decision:

  • Apache Spark is a widely adopted big data engine in modern data platforms.
  • The same core code used here to process a few KBs of data on a single machine can process multiple TBs of data on a large cluster with little to no change.
  • Spark handles batch and streaming data equally easily, and both modes are used and available in the core module.
    • This means we can maintain an in-memory graph representation that updates in real time from a streaming knowledge base dataset.

[Figure: simple representation of the core module]

Getting started

Clone the project:

git clone git@github.com:rafaelleinio/meli-challenge.git
cd meli-challenge

Build Docker image

docker build --tag meli-challenge .

Get in container context

docker run --network host -it meli-challenge

CLI

Command line client to interact with the graph knowledge base and print the summarized interactions and mutual friendships between characters.

Run commands from container context:

python meli_challenge/cli.py --help

Output:

usage: cli.py [-h] {summarize_interactions,summarize_mutual_friendships} ...

positional arguments:
  {summarize_interactions,summarize_mutual_friendships}
                        Desired action to perform
    summarize_interactions
                        Display the sum of interactions over defined books for
                        all characters.
    summarize_mutual_friendships
                        Display mutual friendships between all pair of
                        characters.

optional arguments:
  -h, --help            show this help message and exit

There are two execution modes:

  • summarize_interactions: compute and print the aggregated sum of interactions over the selected books for all characters. (Challenge 1)
  • summarize_mutual_friendships: compute and print the aggregated list of mutual friends for every pair of characters. (Challenge 2)
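The help output above has the shape produced by Python's `argparse` with sub-commands. The sketch below wires a parser with the same two actions and the options documented in the sections that follow. It is an assumption about how such a CLI could be built, not the repository's actual `cli.py`; `build_parser` is a hypothetical name, and parsing book numbers as integers is also an assumption.

```python
import argparse


def build_parser():
    """Build a parser with one sub-command per execution mode."""
    parser = argparse.ArgumentParser(prog="cli.py")
    subparsers = parser.add_subparsers(
        dest="action", required=True, help="Desired action to perform"
    )

    interactions = subparsers.add_parser(
        "summarize_interactions",
        help="Display the sum of interactions over defined books for all characters.",
    )
    # dest="csv_path" makes argparse display the option as "--csv CSV_PATH".
    interactions.add_argument(
        "--csv", dest="csv_path", required=True,
        help="Knowledge base input on CSV format.",
    )
    # nargs="+" accepts one or more book numbers after --books.
    interactions.add_argument(
        "--books", type=int, nargs="+", required=True,
        help="Book numbers to query for.",
    )

    friendships = subparsers.add_parser(
        "summarize_mutual_friendships",
        help="Display mutual friendships between all pair of characters.",
    )
    friendships.add_argument(
        "--csv", dest="csv_path", required=True,
        help="Knowledge base input on CSV format.",
    )
    return parser


args = build_parser().parse_args(
    ["summarize_interactions", "--csv", "data/dataset.csv", "--books", "1", "2", "3"]
)
print(args.action, args.csv_path, args.books)
```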

summarize_interactions

python meli_challenge/cli.py summarize_interactions --help

Output:

usage: cli.py summarize_interactions [-h] --csv CSV_PATH --books BOOKS
                                     [BOOKS ...]

optional arguments:
  -h, --help            show this help message and exit
  --csv CSV_PATH        Knowledge base input on CSV format.
  --books BOOKS [BOOKS ...]
                        Book numbers to query for.

Args:

  • csv: path to the CSV file containing the knowledge base used to build the graph.
  • books: book numbers to aggregate over.

Example:

python meli_challenge/cli.py summarize_interactions --csv data/dataset.csv --books 1 2 3

Output:

Tyrion-Lannister	650,829,782,2261
Jon-Snow	784,360,756,1900
Joffrey-Baratheon	422,629,598,1649
Eddard-Stark	1284,169,94,1547
Sansa-Stark	545,313,532,1390
...
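Each output line holds a character name, a tab, the per-book sums in the order the books were requested, and finally the total (for example 1284 + 169 + 94 = 1547 for Eddard-Stark). The plain-Python sketch below reproduces that aggregation on a toy dataset. It assumes records with `source`, `target`, `weight` and `book` fields and that an interaction counts for both characters involved; the repository's Spark implementation may differ, and this `summarize_interactions` function is a hypothetical stand-in.

```python
from collections import defaultdict


def summarize_interactions(records, books):
    """Return 'name<TAB>sum_per_book...,total' lines, highest total first."""
    totals = defaultdict(lambda: defaultdict(int))
    for record in records:
        if record["book"] not in books:
            continue
        # Assumption: an interaction counts for both characters involved.
        for name in (record["source"], record["target"]):
            totals[name][record["book"]] += record["weight"]

    rows = []
    for name, per_book in totals.items():
        sums = [per_book.get(book, 0) for book in books]
        rows.append((name, sums + [sum(sums)]))
    # Sort by total descending; ties are broken by name to keep output stable.
    rows.sort(key=lambda row: (-row[1][-1], row[0]))
    return ["{}\t{}".format(name, ",".join(map(str, values))) for name, values in rows]


# Illustrative records, not taken from data/dataset.csv.
records = [
    {"source": "A", "target": "B", "weight": 2, "book": 1},
    {"source": "A", "target": "C", "weight": 5, "book": 2},
    {"source": "B", "target": "C", "weight": 1, "book": 1},
]
lines = summarize_interactions(records, books=[1, 2])
print("\n".join(lines))
```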

summarize_mutual_friendships

python meli_challenge/cli.py summarize_mutual_friendships --help

Output:

usage: cli.py summarize_mutual_friendships [-h] --csv CSV_PATH

optional arguments:
  -h, --help      show this help message and exit
  --csv CSV_PATH  Knowledge base input on CSV format.

Args:

  • csv: path to the CSV file containing the knowledge base used to build the graph.

Example:

python meli_challenge/cli.py summarize_mutual_friendships --csv data/dataset.csv

Output:

Addam-Marbrand	Kevan-Lannister	Tywin-Lannister,Tyrion-Lannister,Varys,Joffrey-Baratheon,Jaime-Lannister
Alayaya	Mandon-Moore	Cersei-Lannister,Tyrion-Lannister,Bronn
Alyn	Maron-Greyjoy	Eddard-Stark
Amabel	Chiswyck	Arya-Stark
Arthur-Dayne	Lewyn-Martell	Gerold-Hightower
...
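Each output line holds the two characters of a pair followed by the friends they share, separated by tabs. One way to compute this, sketched here in plain Python, is to intersect the neighbour sets of every pair of characters. The sketch assumes an undirected graph stored as a mapping from character to neighbour set; it is not the repository's GraphFrames implementation. The friends are sorted only to make the output deterministic, and the pairwise loop is quadratic in the number of characters, so it suits small graphs only.

```python
from itertools import combinations


def summarize_mutual_friendships(graph):
    """Return 'first<TAB>second<TAB>friend,friend,...' for pairs sharing friends."""
    lines = []
    for first, second in combinations(sorted(graph), 2):
        # Mutual friends are the characters adjacent to both members of the pair.
        common = sorted(graph[first] & graph[second])
        if common:
            lines.append("\t".join([first, second, ",".join(common)]))
    return lines


# Illustrative graph: c1 and c3 both know c2 and c4.
graph = {
    "c1": {"c2", "c4"},
    "c2": {"c1", "c3"},
    "c3": {"c2", "c4"},
    "c4": {"c1", "c3"},
}
lines = summarize_mutual_friendships(graph)
print("\n".join(lines))
```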

API

Run command from container context:

python meli_challenge/api.py

This command starts the server, which listens for requests on port 5000 (the Flask default).

There are 2 endpoints:

HTTP POST /interaction

Receives as its body a JSON document describing the interaction between 2 characters, in the following format:

{
    "source": "Character Name 1",
    "target": "Character Name 2",
    "weight": "Number of interactions between the two characters in one particular book",
    "book": "Number of the book in the saga where the interaction took place"
}

HTTP GET /common-friends

Format: /common-friends?source=P1_NAME&target=P2_NAME

  • P1_NAME: refers to character 1
  • P2_NAME: refers to character 2

On success, the API returns status code 200 and the list of mutual friends between the two characters, in the following format:

{
    "common_friends": ["Cersei-Lannister", "Arya-Stark"]
}

Example:

From another shell tab, which can be outside the container context, you can perform the requests.

Adding a new connection:

curl -i -X POST -H "Content-Type: application/json" \
	-d '{"source":"c1","target":"c2","weight":3,"book":4}' \
	'http://localhost:5000/interaction'

Output:

HTTP/1.0 201 CREATED
Content-Type: application/json
Content-Length: 2
Server: Werkzeug/1.0.1 Python/3.7.8
Date: Thu, 01 Oct 2020 14:13:40 GMT

Ok
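For callers that prefer Python over curl, the same POST can be prepared with the standard library. The sketch below only builds the request object; sending it needs the API server from this section to be running, so that step is left commented out. `build_interaction_request` is a hypothetical helper, and the base URL is taken from the curl example above.

```python
import json
import urllib.request


def build_interaction_request(source, target, weight, book,
                              base_url="http://localhost:5000"):
    """Prepare a POST /interaction request carrying the JSON payload."""
    payload = {"source": source, "target": target, "weight": weight, "book": book}
    return urllib.request.Request(
        url=base_url + "/interaction",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_interaction_request("c1", "c2", weight=3, book=4)
print(request.get_method(), request.full_url)

# Sending requires the server to be up:
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```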

Adding more connections:

curl -i -X POST -H "Content-Type: application/json" \
	-d '{"source":"c3","target":"c2","weight":3,"book":4}' \
	'http://localhost:5000/interaction'

curl -i -X POST -H "Content-Type: application/json" \
	-d '{"source":"c1","target":"c4","weight":3,"book":4}' \
	'http://localhost:5000/interaction'

curl -i -X POST -H "Content-Type: application/json" \
	-d '{"source":"c3","target":"c4","weight":3,"book":4}' \
	'http://localhost:5000/interaction'

Query the Graph for mutual connections between c1 and c3:

curl -i -X GET -H "Content-Type: application/json" \
	'http://localhost:5000/common-friends?source=c1&target=c3'

Output:

HTTP/1.0 200 OK
Content-Type: application/json
Content-Length: 31
Server: Werkzeug/1.0.1 Python/3.7.8
Date: Thu, 01 Oct 2020 14:15:04 GMT

{"common_friends":["c2","c4"]}
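The session above can be traced without HTTP using a small in-memory stand-in for the graph behind the two endpoints. This is a sketch under assumptions: `InteractionGraph` is a hypothetical class, edges are treated as undirected, and the real service keeps its graph in Spark rather than in a Python dictionary.

```python
from collections import defaultdict


class InteractionGraph:
    """Hypothetical in-memory stand-in for the graph behind the API."""

    def __init__(self):
        self._neighbours = defaultdict(set)

    def add_interaction(self, source, target, weight, book):
        # weight and book mirror the payload; this sketch does not need them
        # to answer the common-friends query.
        self._neighbours[source].add(target)
        self._neighbours[target].add(source)

    def common_friends(self, source, target):
        # .get avoids creating empty entries for unknown characters.
        first = self._neighbours.get(source, set())
        second = self._neighbours.get(target, set())
        return {"common_friends": sorted(first & second)}


graph = InteractionGraph()
# The same four interactions posted with curl above.
for source, target in [("c1", "c2"), ("c3", "c2"), ("c1", "c4"), ("c3", "c4")]:
    graph.add_interaction(source, target, weight=3, book=4)

print(graph.common_friends("c1", "c3"))
```

Both c1 and c3 are connected to c2 and c4, so the intersection of their neighbour sets yields the same two names as the API response.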

Development

From the repository root directory:

Install dependencies

make requirements

Code Style

Check code style:

make style-check

Apply code style with black

make apply-style

Check code quality with flake8

make quality-check

Testing and Coverage

Unit tests:

make unit-tests

Integration tests:

make integration-tests

All tests:

make tests

License: MIT License