lcbm / cs-data-ingestion

πŸ“ project for one of my graduation's class


πŸ“ Data Ingestion Proof of Concept

This repository contains the specification for the third deliverable of the 'Projects' class. Currently, the services are organized as a Docker Swarm stack in compose.yml, and the infrastructure is described in Terraform files under terraform/.

Development

To install the development pre-requisites, please follow the instructions in the links below:

Installing development dependencies

First, change your current working directory to the project's root directory and bootstrap the project:

# change current working directory
$ cd <path/to/cs-data-ingestion>

# bootstraps development and project dependencies
$ make bootstrap

NOTE: By default, poetry creates and manages virtual environments to install project dependencies -- meaning that it will work isolated from your global Python installation. This avoids conflicts with other packages installed in your system.
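
If you need to run project commands inside that managed environment, poetry offers a couple of handy entry points (a quick sketch; the python --version call below is just an example command, and poetry shell may require the shell plugin on newer poetry versions):

# run a single command inside the project's virtual environment
$ poetry run python --version

# or spawn a shell session inside the virtual environment
$ poetry shell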

Deploying the Stack

If you wish to deploy the stack locally, jump to the Docker section. If you wish to deploy the services to AWS, on the other hand, continue to the Terraform section.

Terraform

Terraform is a tool for building, changing, and versioning infrastructure safely and efficiently. Terraform can manage existing and popular service providers by generating an execution plan describing what it will do to reach the desired state (described in the project's Terraform files) and then executing that plan to build the described infrastructure. As the configuration changes, Terraform is able to determine what changed and create incremental execution plans that can be applied. For this project, the infrastructure is deployed to AWS.

Configuring AWS Credentials

Follow the instructions in the AWS CLI documentation to configure your AWS account locally. After that, update the profile variable in the project's Terraform files to point to your account's profile.
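
For reference, a named profile can be created locally with the AWS CLI; the profile name below is only an example, so use whatever name the Terraform profile variable expects:

# interactively stores credentials for a named profile under ~/.aws/
$ aws configure --profile example-profile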

Deploying Infrastructure

After you're done configuring your AWS profiles, change your current working directory to where Terraform files are located and initialize it:

# change current working directory
$ cd terraform

# prepares the current working directory for use
$ terraform init
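
Optionally, you can preview the execution plan described above without changing any infrastructure; the variable values below mirror the apply command in the next step and are placeholders for your own key pair:

# shows the execution plan without applying it
$ terraform plan -var 'key_name=key' -var 'public_key_path=~/.ssh/key.pub'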

Now, apply the changes required to reach the desired state of the configuration described in the Terraform files. Make sure to correctly reference your SSH Key Pair or else Terraform won't be able to deploy the project's services:

# applies required changes and passes the SSH key pair as parameters
$ terraform apply -var 'key_name=key' -var 'public_key_path=~/.ssh/key.pub'

Note: Make sure the output for SSH Agent is true: SSH Agent: true. In case it isn't, please run $ eval "$(ssh-agent -s)" and $ ssh-add ~/.ssh/key and try again. Also, key here should be the private key that matches public_key_path (i.e., the same path without the .pub extension).
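
Spelled out, the commands from the note above look like this (assuming your key pair lives at ~/.ssh/key and ~/.ssh/key.pub):

# start an SSH agent in the current shell
$ eval "$(ssh-agent -s)"

# add the private key to the agent
$ ssh-add ~/.ssh/key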

At this point, if the project was deployed correctly, you should be able to access the following resources:

  • Airflow Webserver UI at http://<aws_instance.web.public_ip>:8080
  • Flask frontend at http://<aws_instance.web.public_ip>:5000

Note: <aws_instance.web.public_ip> is the final output of the $ terraform apply ... command.
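
If the Terraform configuration defines an output for the instance's public IP (an assumption; check the project's Terraform files), you can also print it again later without re-applying:

# prints the outputs recorded in the current Terraform state
$ terraform output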

Besides the available resources, you may also SSH into the deployed machine at any time:

# connect to the provisioned instance via SSH (using the private key)
$ ssh -i ~/.ssh/key ubuntu@<aws_instance.web.public_ip>

In case you are having problems, you may want to look at Hashicorp's Terraform AWS Provider Documentation.

Wrapping up

Once you're done, you may remove what was created by terraform apply:

# change current working directory
$ cd terraform

# destroys the Terraform-managed infrastructure
$ terraform destroy

Docker

Since the stack is organized as a Docker Swarm stack, Docker must be installed on the deployment machine.

NOTE: If you're using a Linux system, please take a look at Docker's post-installation steps for Linux!

Setup

Once you have Docker installed, pull the Docker images of the services used by the stack:

# fetches services' docker images
$ make docker-pull

Now, build the missing Docker images with the following command:

# builds services' docker images
$ make docker-build

NOTE: In order to build development images, use the $ make docker-build-dev command instead!

Finally, update the env.d files for each service with the appropriate configurations, credentials, and any other necessary information.
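
Assuming the per-service environment files live in an env.d/ directory at the project root (an assumption based on the naming above), you can list the files that need editing with:

# lists the per-service environment files to be edited
$ ls env.d/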

NOTE: in order to generate a fernet key for Airflow, please take a look here.
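
One common way to generate a fernet key is with the cryptography package, which Airflow itself depends on (a sketch; run it in any Python environment that has cryptography installed):

# generates and prints a random fernet key for Airflow
$ python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())"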

Initialize Swarm mode

In your deployment machine, initialize Docker Swarm mode:

# initializes a swarm with the current node as a manager
$ docker swarm init

Note: For more information on what Swarm is and its key concepts, please refer to Docker's documentation.
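
You can verify that the node is now a swarm manager before deploying anything:

# lists the nodes in the swarm (only works on a manager node)
$ docker node ls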

Deploying services

Now that the deployment machine is in swarm mode, deploy the stack:

# deploys/updates the stack from the specified file
$ docker stack deploy -c compose.yml cs-data-ingestion

Verifying the Stack's Status

Check if all the services are running and have exactly one replica:

# list the services in the cs-data-ingestion stack
$ docker stack services cs-data-ingestion

You should see something like this:

ID                  NAME                            MODE                REPLICAS            IMAGE                               PORTS
9n8ldih68jnk        cs-data-ingestion_redis               replicated          1/1                 bitnami/redis:6.0
f49nmgkv3v9i        cs-data-ingestion_airflow             replicated          1/1                 bitnami/airflow:1.10.13             *:8080->8080/tcp
fxe80mcl98si        cs-data-ingestion_postgresql          replicated          1/1                 bitnami/postgresql:13.1.0
ii6ak931z3so        cs-data-ingestion_airflow-scheduler   replicated          1/1                 bitnami/airflow-scheduler:1.10.13
vaa3lkoq133d        cs-data-ingestion_airflow-worker      replicated          1/1                 bitnami/airflow-worker:1.10.13
ipsdstxfvnpl        cs-data-ingestion_frontend            replicated          1/1                 cs-data-ingestion:frontend          *:5000->5000/tcp

At this point, the following resources will be available to you:

  • Airflow Webserver UI is available at http://localhost:8080
  • Flask frontend is available at http://localhost:5000/v1/render/images

NOTE: In case localhost doesn't work, you may try http://0.0.0.0:<port> instead.
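
As a quick smoke test, you can also hit the frontend endpoint from the command line (assuming curl is installed):

# checks that the Flask frontend is responding
$ curl -i http://localhost:5000/v1/render/images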

Logging

In order to check a service's logs, use the following command:

# fetch the logs of a service
$ docker service logs <service_name>

NOTE: You may also follow the log output in real time with the --follow option (e.g. docker service logs --follow cs-data-ingestion_airflow). For more information on service logs, refer to Docker's documentation.

Wrapping up

Once you're done, you may remove the stack and leave the swarm:

# removes the cs-data-ingestion stack from swarm
$ docker stack rm cs-data-ingestion

# leaves the swarm
$ docker swarm leave

NOTE: All the data created by the stack services will be lost. For more information on swarm commands, refer to Docker's documentation.

Contributing

We are always looking for contributors of all skill levels! If you're looking to ease your way into the project, try out a good first issue.

If you are interested in helping contribute to the project, please take a look at our Contributing Guide. Also, feel free to drop in our community chat and say hi. 👋

Also, thank you to all the people who already contributed to the project!

License

Copyright Β© 2020-present, CS Data Ingestion Contributors. This project is ISC licensed.
