Fragmenty

What is the project?

This is a small side project that crawls Telegram's Fragment platform to extract data about phone numbers. It provides a RESTful API, a WebSocket API, and a chart-based visualization of the data.

The goal of this project is to extract data and basic insights about Telegram number auctions, and to learn more about the Play framework, Scala, Terraform, and AWS.

Stack:

Scrapy (Python) for the crawler
Play framework (Scala) for the API server
Plotly for data visualization
MongoDB for data persistence
Terraform for infrastructure provisioning
AWS for cloud infrastructure
MongoDB Atlas as the managed cloud database service

Infrastructure Architecture

This project uses Amazon Web Services (AWS) for infrastructure provisioning. The infrastructure is organized into different components, with each component residing in its own directory under the fragmenty-infra directory.

Components:

Elastic Container Service (ECS) - Deploy and manage the containerized applications
MongoDB Atlas - Host the MongoDB instance for data persistence

Elastic Container Service (ECS)

The ECS infrastructure is set up using Terraform and includes the following resources:

Elastic Container Registry (ECR) for storing container images
ECS Cluster, ECS Service, and ECS Task Definition for running the containerized applications
AWS Lambda for running the Scrapy crawler periodically
Application Load Balancer (ALB) for distributing traffic to the ECS tasks
Route 53 for managing DNS records
AWS Certificate Manager (ACM) for SSL certificate provisioning

MongoDB Atlas

The MongoDB Atlas infrastructure is also set up using Terraform and consists of the following resources:

MongoDB Atlas Cluster
MongoDB Atlas Database Users

Visualized Terraform graph

Deployment Workflow

The deployment process is automated using Terraform. The external.tf file is used to extract the latest Git commit SHA for the spider and api modules. These SHAs are used as container image tags. Terraform uses container_build_push.tf to build and push the container images to the ECR. The ecs.tf file contains the resources required to run the containerized applications on ECS.
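As a rough illustration of the tagging scheme described above, the following Python sketch resolves a module's latest commit SHA and builds an image reference from it. It is not the project's Terraform code: the helper names and the use of `git rev-parse HEAD` are assumptions made for illustration, and the registry host is a placeholder.

```python
import subprocess


def latest_commit_sha(module_dir: str) -> str:
    """Return the SHA of the commit currently checked out in module_dir.

    Assumption: `git rev-parse HEAD` is one plausible way to obtain the SHA;
    the project's external.tf may use a different command.
    """
    result = subprocess.run(
        ["git", "-C", module_dir, "rev-parse", "HEAD"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def image_uri(registry: str, repository: str, sha: str) -> str:
    """Build an image reference that uses the commit SHA as its tag."""
    if not sha:
        raise ValueError("commit SHA must not be empty")
    return f"{registry}/{repository}:{sha}"


if __name__ == "__main__":
    # Hypothetical registry host; replace with the real ECR registry.
    registry = "<account-id>.dkr.ecr.<region>.amazonaws.com"
    for module in ("fragmenty-api", "fragmenty-spider"):
        print(image_uri(registry, module, latest_commit_sha(module)))
```

Tagging by commit SHA means each new commit in a submodule yields a distinct image tag, which makes it straightforward to tell which source revision a running container was built from.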

The Lambda function, defined in lambda.tf, is responsible for running the Scrapy crawler periodically. The function is triggered by a CloudWatch Event Rule that specifies the desired frequency.
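The scheduled run can be pictured with a minimal handler sketch. This is not the project's Lambda code: it assumes the function starts the crawl by invoking the Scrapy command line, and the spider name `fragment` is hypothetical. The `run` parameter exists only so the handler can be exercised without Scrapy installed.

```python
import subprocess
import sys

# Hypothetical spider name; the real name lives in the fragmenty-spider module.
SPIDER_NAME = "fragment"


def build_crawl_command(spider_name: str) -> list:
    """Build the command line equivalent to `scrapy crawl <spider_name>`."""
    return [sys.executable, "-m", "scrapy", "crawl", spider_name]


def handler(event, context, run=subprocess.run):
    """Entry point invoked on each scheduled event.

    The scheduled event payload is ignored here; each invocation simply
    starts one crawl and reports the process exit code.
    """
    command = build_crawl_command(SPIDER_NAME)
    completed = run(command, check=False)
    return {"spider": SPIDER_NAME, "exit_code": completed.returncode}
```

Returning the exit code rather than raising keeps the sketch simple; a real function would probably raise on a non-zero exit code so that failed crawls show up as failed invocations.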

The loadbalancer.tf file defines an Application Load Balancer (ALB) that routes traffic to the ECS tasks. The route53.tf file sets up the custom domain name in Route 53 and the SSL certificate, which is provisioned through AWS Certificate Manager (ACM).

Git Submodules

This project includes two Git submodules:

fragmenty-api - This submodule contains the source code for the API server, which is built using the Play framework and Scala. The fragmenty-api directory contains a Dockerfile for building the container image, configuration files, and the application's source code.
fragmenty-spider - This submodule contains the source code for the Scrapy crawler that extracts data from Telegram's Fragment platform. The fragmenty-spider directory contains a Dockerfile for building the container image, a build script, a sample environment file, and the Scrapy spider's source code.

These submodules are automatically checked out when the main repository is cloned with the --recurse-submodules option:

git clone --recurse-submodules https://github.com/Maders/fragmenty.git

Useful Commands

To initialize the working directory, run the following command in each component directory under fragmenty-infra:

terraform init

To apply the infrastructure changes, run the following command in each component directory under fragmenty-infra:

terraform apply

To destroy the infrastructure resources, run the following command in each component directory under fragmenty-infra:

terraform destroy
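Because the same command has to be repeated per component, a small helper can list the invocations instead of changing directories by hand. The sketch below is an illustration only: the function names are hypothetical, it assumes each component is a direct subdirectory of fragmenty-infra containing `.tf` files, and it relies on Terraform's global `-chdir` option. It only builds the command lines and does not run them, since the right order between components is not stated here.

```python
from pathlib import Path

ALLOWED_ACTIONS = {"init", "apply", "destroy"}


def terraform_command(component_dir, action: str) -> list:
    """Build a Terraform command that runs `action` inside component_dir."""
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported action: {action}")
    return ["terraform", f"-chdir={component_dir}", action]


def commands_for_components(infra_root, action: str) -> list:
    """Return one command per subdirectory of infra_root that holds .tf files."""
    root = Path(infra_root)
    component_dirs = sorted(
        path for path in root.iterdir()
        if path.is_dir() and any(path.glob("*.tf"))
    )
    return [terraform_command(path, action) for path in component_dirs]


if __name__ == "__main__":
    for command in commands_for_components("fragmenty-infra", "init"):
        print(" ".join(command))
```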
