deepakcr7ms7 / ETL-off-a-SQS-Queue


Fetch Rewards

Data Engineering Take Home: ETL off a SQS Queue

To run the code

  1. Clone this repo.
git clone https://github.com/prasadashu/data-engineering-fetch-rewards.git
  2. Go into the cloned repo.
cd data-engineering-fetch-rewards
  3. Run the make command to install dependencies.
make pip-install
  4. Run the make command to configure the AWS shell.
make aws-configure
  5. Pull and start the Docker containers.
make start
  6. Run the Python code to perform the ETL process (a rough sketch of this step follows the list).
make perform-etl
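For orientation, the block below is a minimal sketch of the kind of loop `make perform-etl` drives: read messages from the local SQS queue, mask the PII fields with base64, and insert rows into Postgres. The endpoint, queue URL, table, and column names are assumptions for illustration, not necessarily the repo's actual identifiers.

```python
import base64
import json

import boto3
import psycopg2

# Assumed localstack endpoint and queue URL -- adjust to the repo's actual config.
sqs = boto3.client(
    "sqs",
    endpoint_url="http://localhost:4566",
    region_name="us-east-1",
    aws_access_key_id="test",
    aws_secret_access_key="test",
)
QUEUE_URL = "http://localhost:4566/000000000000/login-queue"

conn = psycopg2.connect(host="localhost", port=5432, dbname="postgres",
                        user="postgres", password="postgres")

def mask(value: str) -> str:
    """Mask a PII field with base64 so it can be decoded again later."""
    return base64.b64encode(value.encode()).decode()

resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10)
with conn, conn.cursor() as cur:
    for msg in resp.get("Messages", []):
        body = json.loads(msg["Body"])
        # Hypothetical target table and columns.
        cur.execute(
            "INSERT INTO user_logins (user_id, masked_ip, masked_device_id) "
            "VALUES (%s, %s, %s)",
            (body["user_id"], mask(body["ip"]), mask(body["device_id"])),
        )
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
conn.close()
```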

Checking messages loaded in Postgres

  • To validate the messages loaded into Postgres, connect with psql (a Python alternative is sketched after this list).
psql -d postgres -U postgres -p 5432 -h localhost -W
  • Credentials and database information:

    • username=postgres
    • password=postgres
    • database=postgres
  • If the psql binary is not installed on Ubuntu-based distros, install it using the command below.

sudo apt install postgresql-client
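If you would rather check from Python than from psql, a quick row count works too; the user_logins table name is an assumption carried over from the sketch above.

```python
import psycopg2

# Same credentials as above; the table name is an assumed example.
conn = psycopg2.connect(host="localhost", port=5432, dbname="postgres",
                        user="postgres", password="postgres")
with conn, conn.cursor() as cur:
    cur.execute("SELECT count(*) FROM user_logins")
    print("rows loaded:", cur.fetchone()[0])
conn.close()
```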

Decoding masked PIIs

  • The ip and device_id fields are masked using base64 encoding (base64 is a reversible encoding, not encryption).
  • To recover the masked fields, we can use the command below, or the Python equivalent that follows.
echo -n "<sample_base64_encoded_string>" | base64 --decode
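The same round trip in Python, with a made-up sample value:

```python
import base64

masked_ip = base64.b64encode(b"192.168.0.10").decode()  # 'MTkyLjE2OC4wLjEw'
original = base64.b64decode(masked_ip).decode()         # '192.168.0.10'
print(masked_ip, "->", original)
```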

Questions:

1) Deployment in Production:

First, we need to set up the production environment. This may involve provisioning virtual machines or a container orchestration platform such as Kubernetes or Amazon ECS.

Then we can deploy the Docker image to the production environment. This can be done with a container orchestration platform such as Kubernetes or Amazon ECS, which can automatically manage the deployment, scaling, and monitoring of the containers.

2) Production-Ready Components:

I containerised the application in Docker, so that part is production-ready. However, to make the whole application production-ready, we would also need to add the following components:

Centralized logging: Using tools like ELK Stack, Splunk, or AWS CloudWatch to collect and analyze logs from different parts of the application can help identify and debug issues quickly.

Monitoring and alerting: Tools like Prometheus, Grafana, or New Relic can provide visibility into application performance, resource utilization, and other key metrics. Alerting can be set up to notify the team when metrics exceed predefined thresholds.

CI/CD pipeline: Automating the building, testing, and deployment of the application using tools like Jenkins, Travis CI, or GitLab CI/CD can help reduce manual errors and improve deployment speed.

Scalability: Implementing horizontal scaling using load balancers like HAProxy, Nginx, or Amazon ELB can help ensure high availability and handle spikes in traffic.

Security: Implementing security measures like encryption, role-based access control, and web application firewalls can help protect against threats like data breaches and DDoS attacks.

Disaster recovery: Implementing backup and recovery processes, including data replication and automated failover, can help ensure business continuity in the event of a disaster.

Performance optimization: Regular performance testing and optimization, including database tuning and resource allocation, can help ensure the application is performing efficiently.

Compliance: Ensuring the application is compliant with relevant regulations and standards, such as HIPAA, PCI DSS, and GDPR, can help mitigate legal and financial risks.

Documentation and training: Providing documentation and training for support and maintenance teams can help ensure the application is well-understood and properly managed over time.

3) Scaling with a Growing Dataset:

To scale this application with a growing dataset, we could take the following approaches, depending on the environment:

Increase the number of ETL worker instances: As the size of the dataset grows, you may need to increase the number of worker instances that are processing messages from the SQS Queue. This can be achieved by either launching additional EC2 instances or scaling out the containers running the ETL workers in a containerized environment.

Use Autoscaling: Autoscaling can be used to automatically adjust the number of worker instances based on the size of the queue. When the number of messages in the queue grows, additional worker instances can be launched to handle the increased load. When the queue size decreases, the number of worker instances can be scaled down.
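As a rough sketch of queue-depth-based scaling, the snippet below reads the SQS backlog and adjusts an ECS service's desired count. The queue URL, cluster/service names, and thresholds are made up, and in practice this would usually be a CloudWatch alarm driving an Application Auto Scaling policy rather than hand-rolled code.

```python
import boto3

# Hypothetical names and thresholds, for illustration only.
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/login-queue"
CLUSTER, SERVICE = "etl-cluster", "etl-worker"

sqs = boto3.client("sqs")
ecs = boto3.client("ecs")

attrs = sqs.get_queue_attributes(
    QueueUrl=QUEUE_URL, AttributeNames=["ApproximateNumberOfMessages"]
)
backlog = int(attrs["Attributes"]["ApproximateNumberOfMessages"])

# Scale the worker service roughly in proportion to the backlog
# (about one worker per 1000 queued messages, capped at 10).
desired = min(10, max(1, backlog // 1000))
ecs.update_service(cluster=CLUSTER, service=SERVICE, desiredCount=desired)
```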

Implement a distributed ETL process: A distributed ETL process can be implemented to divide the processing load across multiple worker instances. This can be achieved using technologies like Apache Spark or Apache Flink, which can distribute the processing of data across a cluster of worker nodes.

Optimize the ETL process: The ETL process should be optimized to handle large amounts of data efficiently. This can involve optimizing queries, reducing the number of database or API calls, and using caching mechanisms to reduce the amount of data that needs to be processed.
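Continuing the hypothetical sketch from the "To run the code" section (reusing its sqs client, QUEUE_URL, conn, mask, and json imports), batching both the SQS reads and the database writes is one easy optimisation: fetch up to 10 messages per call and insert them in a single executemany.

```python
# Long-poll for up to 10 messages at a time instead of one request per message.
resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=10)

rows = []
for msg in resp.get("Messages", []):
    body = json.loads(msg["Body"])
    rows.append((body["user_id"], mask(body["ip"]), mask(body["device_id"])))

# One round trip to Postgres for the whole batch.
with conn, conn.cursor() as cur:
    cur.executemany(
        "INSERT INTO user_logins (user_id, masked_ip, masked_device_id) VALUES (%s, %s, %s)",
        rows,
    )
```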

Use a Data Pipeline service: Use a managed data pipeline service like AWS Glue or Azure Data Factory to automate and manage the ETL process. These services can automatically scale and optimize the processing of data based on the size of the queue.

4) How can PII be recovered later on?

Recovering masked PII is generally not possible, because the purpose of masking is to protect sensitive data from unauthorized access or exposure. In this project, however, the fields are only base64-encoded, so anyone with access to the data can recover them with a simple decode (see "Decoding masked PIIs" above). With a genuinely one-way mask, such as a salted hash, recovery would require retaining a separate, access-controlled mapping or key from masked values back to the originals; otherwise the original values could only leak through weaknesses such as key compromise, pattern analysis, or side-channel attacks.

5) Assumptions I made:

  • The data in the SQS Queue is in a consistent format.
  • The SQS Queue can handle the volume of data.
  • The ETL process can handle duplicates, missing data, and delayed messages.
  • The ETL process can handle errors and failures.
