The purpose of this project is to demonstrate how a company can use Natural Language Processing (NLP) on top of a modern data stack to improve the quality of its decision-making. Along the way, it exercises the following skills:
- Building scalable ETL pipelines for high-performance data processing.
- Proficiency in Python, using Spark (PySpark), Polars, and pandas.
- Data orchestration with Apache Airflow.
- Familiarity with AWS services, including S3, Kinesis, EMR, Lambda, Athena, Glue, IAM, and RDS.
- Understanding of storage formats such as Parquet, JSON, Avro, and Arrow.
- Working knowledge of databases, including MongoDB and Redshift.
- Understanding of the trade-offs between these storage formats and of schema design.
- Knowledge of building machine learning pipelines with tools such as Spark MLlib, TensorFlow, and scikit-learn.
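
To make the ETL bullet concrete, here is a minimal sketch of an extract-transform-load step in pandas. The dataset, column names, and sentiment rule are all hypothetical; in the actual pipeline the extract and load steps would read from and write to stores like S3 (e.g. Parquet via `pd.read_parquet` / `DataFrame.to_parquet`) rather than build the frame in memory.

```python
import pandas as pd

# Extract: a toy in-memory frame stands in for reading raw data
# (in practice, e.g. pd.read_parquet("s3://bucket/raw/reviews.parquet"))
raw = pd.DataFrame({
    "review": ["great product", "terrible support", "okay overall"],
    "rating": [5, 1, 3],
})

# Transform: derive a simple sentiment label from the star rating
# (hypothetical rule, used here only to illustrate the transform step)
def label_sentiment(rating: int) -> str:
    if rating >= 4:
        return "positive"
    if rating <= 2:
        return "negative"
    return "neutral"

raw["sentiment"] = raw["rating"].map(label_sentiment)

# Load: keep the cleaned frame in memory; a real pipeline would
# write it back out, e.g. cleaned.to_parquet("s3://bucket/clean/...")
cleaned = raw[["review", "sentiment"]]
print(cleaned["sentiment"].tolist())
```

The same shape (extract, pure transform function, load) carries over to Polars or Spark with only API changes.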
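
The machine learning bullet can be sketched with scikit-learn's `Pipeline`, which is the simplest of the listed tools to show self-contained. The texts and labels below are toy placeholders, not project data; a real NLP pipeline would train on the processed review corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy training data (hypothetical, for illustration only)
texts = ["love it", "hate it", "really love this", "really hate this"]
labels = ["pos", "neg", "pos", "neg"]

# Chain feature extraction and the classifier into one estimator,
# so fit/predict run the whole sequence end to end
clf = Pipeline([
    ("tfidf", TfidfVectorizer()),   # text -> sparse TF-IDF features
    ("model", LogisticRegression()),  # linear classifier on those features
])
clf.fit(texts, labels)

print(clf.predict(["love it"])[0])
```

Spark MLlib offers the same idea (`pyspark.ml.Pipeline` of stages), which is the natural choice once the data outgrows a single machine.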