t0rn Collator Infrastructure

Architecture

Key components:

VPC with one private and one public subnet
t0rn-collator in the private subnet and a NAT gateway (NatGW) in the public subnet
Auto Scaling group (ASG) with min and max instances both set to 1 for t0rn-collator
Instance configuration via user data
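
As a rough illustration of the ASG and user data pieces listed above, a minimal Terraform sketch could look like the following. The variable names, the resource names, and the `userdata.sh` file are assumptions made for this example, not the repository's actual code.

```hcl
# Hypothetical inputs; the real component may wire these differently.
variable "private_subnet_id" {
  type        = string
  description = "ID of the private subnet that hosts the collator"
}

variable "ami_id" {
  type        = string
  description = "AMI used for the collator instance"
}

resource "aws_launch_template" "collator" {
  name_prefix   = "t0rn-collator-"
  image_id      = var.ami_id
  instance_type = "t3a.medium"

  # Launch templates expect base64-encoded user data.
  # userdata.sh is a placeholder for the bootstrap script.
  user_data = base64encode(file("${path.module}/userdata.sh"))
}

resource "aws_autoscaling_group" "collator" {
  name_prefix         = "t0rn-collator-"
  min_size            = 1
  max_size            = 1
  desired_capacity    = 1
  vpc_zone_identifier = [var.private_subnet_id]

  launch_template {
    id      = aws_launch_template.collator.id
    version = "$Latest"
  }
}
```

Pinning min and max to 1 means the ASG is used for self-healing rather than scaling: if the instance is terminated, a replacement is launched and configured by the same user data.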

Atmos was used to provide a DRY solution

The atmos CLI is a universal tool for DevOps and cloud automation. It can deploy and destroy Terraform and helmfile components, and run workflows that bootstrap or tear down all resources in an account.

atmos includes workflows for:

Provisioning large, multi-account Terraform environments
Deploying helm charts to Kubernetes clusters with helmfile
Executing helm commands on Kubernetes clusters
Executing kubectl commands on Kubernetes clusters

It's a very good and versatile tool.

Note: More documentation

Installing Atmos

macOS: brew install atmos
Linux

For more options, see the Atmos docs.

Basic Architecture

Note: More documentation

TLDR Bootstrapping infrastructure

The t0rn.yaml workflow provisions the two components this setup depends on: vpc and t0rn-collator.

atmos workflow -f t0rn.yaml plan

If the plan looks good, apply it (note that planning t0rn-collator fails until the VPC exists, because the collator depends on it):

atmos workflow -f t0rn.yaml apply

When the time comes to tear everything down:

atmos workflow -f t0rn.yaml destroy

Important choices

atmos makes IaC DRY and very easy to manage and extend
Terraform state is kept local to avoid introducing additional complexity
The t0rn-collator image was rebuilt to mitigate issues with the entrypoint and the libssl library (https://hub.docker.com/repository/docker/3h4xx/t0rn-collator)
The t0rn-collator image is run directly by Docker with a restart policy. The restart policy is not foolproof and may fail to restart the container under certain circumstances, but the most common case (a server restart) was tested and the container runs fine afterwards
t0rn-collator is deployed in a private network with an empty SG ingress; outbound traffic goes via NatGW
eu-central-1 was picked as the deployment region because it is close to other nodes, which should lower latency during sync
t3a.medium is an arbitrary choice, made only to limit development costs
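
To make the "empty SG ingress" choice concrete, here is a standalone sketch of a security group with no inbound rules and unrestricted outbound traffic. The variable and resource names are illustrative assumptions, not taken from the repository.

```hcl
# Hypothetical input: the ID of the VPC created by the vpc component.
variable "vpc_id" {
  type        = string
  description = "VPC in which the collator security group is created"
}

resource "aws_security_group" "collator" {
  name_prefix = "t0rn-collator-"
  vpc_id      = var.vpc_id

  # No ingress blocks are declared, so no inbound connection
  # can be opened to instances using this group.

  # Terraform removes the AWS default allow-all egress rule,
  # so outbound access has to be declared explicitly.
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
```

With this shape, the collator can still reach peers and pull its image through the NAT gateway, while nothing outside can initiate a connection to it.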

Security and SRE TODO

AWS multi-account setup where each environment has its own account
VPC Flow Logs enabled
Deployment should span subnets in multiple AZs (not done due to the cost of NatGW); otherwise it is not fault tolerant
Container orchestration should be introduced
Monitoring should be added, since the collator already exposes a Prometheus exporter
Each provisioned t0rn-collator starts blockchain synchronization from scratch; this can be improved by using a baked AMI
CI/CD for validation, linting, and deployment
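
As one possible starting point for the VPC Flow Logs item, the standalone sketch below sends flow logs to an S3 bucket, which avoids the extra IAM role that a CloudWatch Logs destination would need. The bucket prefix and all names are assumptions for illustration only.

```hcl
# Hypothetical input: the ID of the VPC to monitor.
variable "vpc_id" {
  type        = string
  description = "VPC for which flow logs are captured"
}

# Bucket that receives the flow log files (name prefix is a placeholder).
resource "aws_s3_bucket" "flow_logs" {
  bucket_prefix = "t0rn-vpc-flow-logs-"
}

resource "aws_flow_log" "vpc" {
  vpc_id               = var.vpc_id
  traffic_type         = "ALL" # capture both accepted and rejected traffic
  log_destination_type = "s3"
  log_destination      = aws_s3_bucket.flow_logs.arn
}
```

A real deployment would likely also want lifecycle rules and encryption on the bucket; those are left out to keep the sketch short.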
