Software Engineer-Data Engineer's repositories
divith-raju-Immigration-Data-Engineering
A Capstone Project that covers several aspects of Data Engineering (Data Exploration, Cleaning, Modeling, Pipelining, Processing)
divith-raju-OpenMetadata
Open Standard for Metadata. A Single place to Discover, Collaborate and Get your data right.
divith-aju-Hadoop-Pyspark-pipeline
This project demonstrates the creation of a scalable data processing pipeline for handling and analyzing log data from a hypothetical e-commerce platform. Leveraging Hadoop and PySpark, the pipeline is designed to process large volumes of log files, providing meaningful insights into user behavior, system performance, and sales metrics.
divith-raju-Building-Big-Data-Infrastucture-NoSQL-And-SQL
Big Data Platform on MongoDB Atlas and Heroku PostgreSQL
divith-raju-postgreSQL
Implementing PostgreSQL best practices for data engineers
Divithraju
Config files for my GitHub profile.
awesome-spark
A curated list of awesome Apache Spark packages and resources.
divith-raju-big-data-projects
divith-raju-big-data-tools
divith-raju-Customer-Sales-ETL-Pipeline
This ETL project demonstrates the development of a scalable data pipeline for customer sales analysis. It covers all essential steps, from data extraction to transformation and loading into a database, with Apache Airflow orchestrating the workflow.
divith-raju-Data-Mining
This project focuses on customer segmentation using data mining techniques, specifically K-Means clustering, to classify customers into distinct groups based on their purchasing behaviors. The goal is to analyze customer data and segment them into clusters for targeted marketing strategies and better customer relationship management.
divith-raju-ETL-Airflow-Project
This ETL pipeline project is a practical demonstration of my skills in data engineering and automation using Python and Apache Airflow. By integrating MySQL for data storage and leveraging Airflow for task orchestration, the project simulates a scalable and modular ETL solution often required in enterprise data workflows.
divith-raju-pipeline-hadoop-pyspark
This project presents a comprehensive data pipeline designed to predict customer churn using historical customer data. By leveraging Hadoop and PySpark, this pipeline efficiently processes large datasets, performs feature engineering, and trains a machine learning model to identify customers at risk of leaving.
divith-raju-Python
This repository highlights my ability to develop and integrate diverse Python solutions, ranging from API creation and data management to cloud service integration. Each project in this repository serves a specific purpose, demonstrating both fundamental concepts and practical applications that are essential in real-world software development.
divith-raju-Webapplication-Spark-memory-cal
The Spark Memory Configuration Calculator is designed to help data engineers and Spark developers quickly determine the optimal memory and core configurations for their Spark clusters. With this tool, you can avoid common pitfalls and ensure your cluster resources are used efficiently, leading to better performance and lower costs.
pyspark-example-project
Implementing best practices for PySpark ETL jobs and applications.
pyspark-examples
PySpark RDD, DataFrame, and Dataset examples in Python
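
As an illustration of the kind of arithmetic a tool like the Spark Memory Configuration Calculator listed above might perform, here is a minimal sketch. It is not taken from that repository. It assumes a widely used rule of thumb: reserve one core and 1 GB per node for the operating system, use about five cores per executor, leave one executor slot for the driver or application master, and size the memory overhead at 10% of the heap with a 384 MB floor (Spark's default for `spark.executor.memoryOverhead`). The names `ExecutorPlan` and `plan_executors` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExecutorPlan:
    executors: int            # candidate value for spark.executor.instances
    cores_per_executor: int   # candidate value for spark.executor.cores
    heap_gb: int              # candidate value for spark.executor.memory (GB)
    overhead_gb: float        # estimated spark.executor.memoryOverhead (GB)


def plan_executors(nodes: int, cores_per_node: int, mem_gb_per_node: int,
                   cores_per_executor: int = 5) -> ExecutorPlan:
    """Rule-of-thumb executor sizing; every constant here is an assumption."""
    # Reserve one core and 1 GB per node for the OS and cluster daemons.
    usable_cores = cores_per_node - 1
    usable_mem_gb = mem_gb_per_node - 1

    executors_per_node = usable_cores // cores_per_executor
    # Leave one executor slot for the driver / application master.
    total_executors = nodes * executors_per_node - 1
    if executors_per_node < 1 or total_executors < 1:
        raise ValueError("cluster too small for the requested cores per executor")

    mem_per_executor_gb = usable_mem_gb / executors_per_node
    # Heap plus overhead must fit in the per-executor share, where
    # overhead = max(0.384 GB, 10% of heap).
    heap_gb = int(min(mem_per_executor_gb / 1.10, mem_per_executor_gb - 0.384))
    if heap_gb < 1:
        raise ValueError("not enough memory per executor")
    overhead_gb = round(max(0.384, 0.10 * heap_gb), 2)

    return ExecutorPlan(total_executors, cores_per_executor, heap_gb, overhead_gb)


if __name__ == "__main__":
    # Hypothetical cluster: 10 worker nodes, 16 cores and 64 GB each.
    print(plan_executors(nodes=10, cores_per_node=16, mem_gb_per_node=64))
```

The result is only a starting point. Actual settings depend on the cluster manager, the workload, and how much off-heap memory the job uses.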