There are 30 repositories under the apachespark topic.
This is a repo with links to everything you'd ever want to learn about data engineering.
This repository will help you learn Databricks concepts through examples. It covers the important topics we need in real-life work as data engineers. We will be using PySpark and Spark SQL for development, and at the end of the course we also cover a few case studies.
type-class based data cleansing library for Apache Spark SQL
Code for blog at: https://www.startdataengineering.com/post/docker-for-de/
SparkSQL.jl enables Julia programs to work with Apache Spark data using just SQL.
FLaNK AI Weekly covering Apache NiFi, Apache Flink, Apache Kafka, Apache Spark, Apache Iceberg, Apache Ozone, Apache Pulsar, and more...
Repository for Lab “Distributed Big Data Analytics” (MA-INF 4223), University of Bonn
This repository contains all the projects and labs I worked on while pursuing professional certificate programs, specializations, and bootcamp. [Areas: Deep Learning, Machine Learning, Applied Data Science].
PySpark is a distributed data-processing library for Python that lets you process large volumes of data on clusters using the Apache Spark framework, offering high performance and a built-in set of tools for large-scale data analysis and handling.
Trigger spark-submit in Golang. A Go implementation of the well-known SparkLauncher.java.
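What a SparkLauncher-style wrapper essentially does is assemble a `spark-submit` command line and hand it to the OS. A hedged Python sketch of that idea (the paths, class name, and helper below are illustrative assumptions, not the repository's actual API):

```python
# Sketch: build a spark-submit command line, as a launcher wrapper would.
import subprocess

def build_spark_submit(app_jar, main_class, master="local[*]", app_args=()):
    """Assemble the argument list for launching a JVM Spark application."""
    cmd = ["spark-submit", "--master", master, "--class", main_class, app_jar]
    cmd.extend(app_args)
    return cmd

cmd = build_spark_submit("target/app.jar", "com.example.Main",
                         app_args=["--date", "2024-01-01"])
print(cmd)

# To actually launch (requires a Spark installation on PATH):
# subprocess.run(cmd, check=True)
```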
Connect to SQL Server using Apache Spark
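A sketch of the JDBC options Spark needs to read from SQL Server. The server, database, table, and credentials below are placeholder assumptions, and the Microsoft JDBC driver JAR must be on Spark's classpath:

```python
# Illustrative JDBC configuration for reading SQL Server from Spark.
jdbc_options = {
    "url": "jdbc:sqlserver://myserver.example.com:1433;databaseName=sales",
    "dbtable": "dbo.orders",
    "user": "spark_reader",
    "password": "***",  # placeholder; use a secret manager in practice
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
}

# With a live SparkSession this becomes:
# df = spark.read.format("jdbc").options(**jdbc_options).load()
print(sorted(jdbc_options))
```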
Example usages of the cleanframes library
Link prediction is about predicting future connections in a graph. In this project, the task is to predict whether two authors will collaborate on a future paper, given the graph of authors who have co-authored at least one paper together.
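One classic link-prediction signal is the common-neighbours count: two authors who share many collaborators are more likely to collaborate next. A minimal illustration (the toy graph and the choice of scoring function are assumptions; the project may use different features):

```python
# Score non-adjacent author pairs by how many co-authors they share.
from itertools import combinations

# Adjacency: author -> set of co-authors.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b"},
    "d": {"b"},
}

def common_neighbors(g, u, v):
    return len(g[u] & g[v])

# Higher score = more likely future link.
scores = {
    (u, v): common_neighbors(graph, u, v)
    for u, v in combinations(sorted(graph), 2)
    if v not in graph[u]  # only pairs not already connected
}
print(scores)  # {('a', 'd'): 1, ('c', 'd'): 1}
```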
This GitHub repository contains a detailed document on the basics of the Scala language.
A Capstone Project that covers several aspects of Data Engineering (Data Exploration, Cleaning, Modeling, Pipelining, Processing)
Here you will find the demo code for my Data+AI 2020 talk about customizing the Apache Spark state store.
Use this project to join data from multiple CSV files. Currently the project supports one-to-one and one-to-many joins. It also shows how to use a Kafka producer efficiently with Spark.
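The two join shapes mentioned above can be sketched in plain Python on in-memory CSV rows rather than Spark DataFrames (the column names are illustrative assumptions):

```python
# One-to-many inner join: each customer matches every order carrying its id.
import csv
import io

customers = "id,name\n1,alice\n2,bob\n"
orders = "order_id,customer_id,total\n10,1,99.5\n11,1,15.0\n12,2,42.0\n"

cust_rows = list(csv.DictReader(io.StringIO(customers)))
order_rows = list(csv.DictReader(io.StringIO(orders)))

joined = [
    {**c, **o}
    for c in cust_rows
    for o in order_rows
    if o["customer_id"] == c["id"]
]
print(len(joined))  # 3 rows: alice matches two orders, bob one
```

A one-to-one join is the special case where each key appears at most once on both sides; in Spark the same semantics come from `df1.join(df2, on=..., how="inner")`.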
This is a Jupyter Notebook to practice Apache Spark in Google Colab, especially for the exam CCA Spark and Hadoop Developer Exam (CCA175).
Implementation of GraphFrames using PySpark in Eclipse IDE
Data Analysis of bank transaction data
Working with Apache Spark: some small tutorials and, finally, the implementation of a small project
Apache Spark project for Advanced Topics on Databases course
An end-to-end data engineering pipeline that orchestrates data ingestion, processing, and storage using Apache Airflow, Python, Apache Kafka, Apache Zookeeper, Apache Spark, and Cassandra. All components are containerized with Docker for easy deployment and scalability.
This repository showcases IPL data analysis using Apache Spark. The project demonstrates the power of Spark for data transformation, cleaning, SQL queries, and visualization, all performed with PySpark to handle large-scale data efficiently.
This is a distributed system that utilizes Apache Spark through Dataproc. We use the Spotify API to send song data to Apache Spark, which then forwards the information to Google Cloud Services. The system processes this data to recommend songs based on the extracted information.
Projects completed as part of the CSE 6332 CCBD course at UTA, covering distributed computing, data processing frameworks, and cloud platforms.
This comprehensive course is designed for beginners and experienced developers alike, providing an in-depth exploration of Apache Spark.
ETL data pipeline that processes Washington's EV data using Apache Spark, Docker, Snowflake, Airflow, and AWS services, and visualizes the transformed Parquet data through Tableau dashboards.
Sample project to run a Databricks job using a Java JAR and utilising UDFs.
Analysis and visualization of open-source street-level police data from two areas, Leicestershire and Northumbria, to derive data-driven insights