There are 22 repositories under the spark-rdd topic.
PySpark-Tutorial provides basic algorithms using PySpark
Big Data Modeling, MapReduce, Spark, PySpark @ Santa Clara University
Data cleaning, pre-processing, and analytics on a million movies using Spark and Scala.
Various stream and batch processing demos with Apache Spark in Scala 🚀
This project builds a scalable log analytics pipeline using the Lambda architecture for real-time and batch processing of NASA server logs.
Implementation of Girvan-Newman Algorithm to detect communities in graphs using Yelp dataset
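The core idea of Girvan-Newman is to repeatedly remove the edge with the highest betweenness (the edge that the most shortest paths cross), so that communities fall apart into separate components. A minimal pure-Python sketch of one such step on a toy graph (not the repo's PySpark/Yelp code; the graph and variable names are hypothetical):

```python
from collections import deque

# Toy undirected graph: two triangles joined by a single "bridge" edge (2, 3)
graph = {
    0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
    3: [2, 4, 5], 4: [3, 5], 5: [3, 4],
}

def edge_betweenness(g):
    """Brandes-style edge betweenness for an unweighted, undirected graph."""
    bet = {}
    for s in g:
        # BFS from s, tracking shortest-path counts (sigma) and predecessors
        dist, sigma = {s: 0}, {v: 0 for v in g}
        sigma[s] = 1
        preds = {v: [] for v in g}
        order, q = [], deque([s])
        while q:
            v = q.popleft()
            order.append(v)
            for w in g[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    q.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies back up the BFS tree
        delta = {v: 0.0 for v in g}
        for w in reversed(order):
            for v in preds[w]:
                c = sigma[v] / sigma[w] * (1 + delta[w])
                e = (min(v, w), max(v, w))
                bet[e] = bet.get(e, 0.0) + c
                delta[v] += c
    return {e: b / 2 for e, b in bet.items()}  # undirected: halve double count

def components(g):
    """Connected components via BFS."""
    seen, comps = set(), []
    for s in g:
        if s in seen:
            continue
        comp, q = {s}, deque([s])
        seen.add(s)
        while q:
            v = q.popleft()
            for w in g[v]:
                if w not in seen:
                    seen.add(w)
                    comp.add(w)
                    q.append(w)
        comps.append(comp)
    return comps

# One Girvan-Newman step: cut the highest-betweenness edge
bet = edge_betweenness(graph)
u, v = max(bet, key=bet.get)
graph[u].remove(v)
graph[v].remove(u)
comms = components(graph)
```

Removing the bridge edge splits the graph into the two triangles, which is exactly the community structure the algorithm is designed to expose.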
This project utilizes PySpark DataFrames and PySpark RDDs to implement item-based collaborative filtering. By calculating cosine similarity scores or identifying movies with the highest number of shared viewers, the system recommends 10 similar movies for a given target movie that align with users' preferences.
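The cosine-similarity step treats each movie as a vector of user ratings and compares vectors pairwise. A minimal pure-Python sketch of the idea (the ratings and function names here are hypothetical, not taken from the repo, which computes this distributed over PySpark):

```python
import math
from collections import defaultdict

# Hypothetical (user, movie, rating) triples
ratings = [
    ("u1", "A", 5.0), ("u1", "B", 3.0),
    ("u2", "A", 4.0), ("u2", "B", 2.0), ("u2", "C", 5.0),
    ("u3", "A", 1.0), ("u3", "C", 4.0),
]

# Build a rating vector per movie, keyed by user
vectors = defaultdict(dict)
for user, movie, r in ratings:
    vectors[movie][user] = r

def cosine(a, b):
    """Cosine similarity between two sparse rating vectors."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[u] * b[u] for u in shared)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

def similar_movies(target, n=10):
    """Top-n movies ranked by cosine similarity to the target."""
    scores = [(other, cosine(vectors[target], vectors[other]))
              for other in vectors if other != target]
    return sorted(scores, key=lambda t: t[1], reverse=True)[:n]
```

For example, `similar_movies("A")` ranks "B" above "C" here, since users who rated both A and B rated them in similar proportion.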
Apache Spark is a big data analysis framework.
This project utilizes PySpark RDDs and the breadth-first search (BFS) algorithm to find the shortest path and degrees of separation between two given Marvel superheroes based on their appearances together in the same comic books, empowering users to discover connections between their favourite superheroes in the Marvel universe.
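Degrees of separation is just the length of the shortest path in the co-appearance graph, which plain BFS finds. A small pure-Python sketch on a made-up graph (hero names and structure are illustrative only; the repo implements this as iterative BFS over Spark RDDs):

```python
from collections import deque

# Hypothetical co-appearance graph: hero -> heroes sharing a comic book
appearances = {
    "SPIDER-MAN": ["IRON MAN", "HULK"],
    "IRON MAN": ["SPIDER-MAN", "THOR"],
    "HULK": ["SPIDER-MAN", "THOR"],
    "THOR": ["IRON MAN", "HULK", "VALKYRIE"],
    "VALKYRIE": ["THOR"],
}

def shortest_path(graph, start, goal):
    """BFS returning the hop-by-hop path from start to goal, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for nxt in graph.get(node, []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None

path = shortest_path(appearances, "SPIDER-MAN", "VALKYRIE")
degrees = len(path) - 1  # number of hops = degrees of separation
```

Here Spider-Man reaches Valkyrie in three hops via Iron Man and Thor.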
Sample programs implemented using Apache Spark, developed around the concepts of Spark RDDs and Spark SQL DataFrames.
Spark and Hadoop exercises from a cloud computing course - AUT, fall 1402-1403
Projects based on Big Data.
In this project, we use Spark to visualize, manipulate, model, and stream historical flight-delay data using Spark RDD, Spark SQL, and Kafka.
Demonstration of basic data transformations using Spark RDD and Spark DataFrame in Scala
Collection of PySpark programs and projects demonstrating the use of Apache Spark's Python API for big data processing and analysis. It includes practical implementations such as logistic regression classification, data analysis on the Iris dataset, and basic PySpark operations like temperature conversion.
This program processes legal reports via Stanford CoreNLP and indexes them in Elasticsearch.
This repository demonstrates big data processing, visualization, and machine learning using tools such as Hadoop, Spark, Kafka, and Python.
The Chicago Energy Usage Analysis project aims to explore energy consumption patterns in Chicago using big data techniques. Leveraging Apache Spark, it processes a dataset of 67,051 records to provide actionable insights for urban planning and energy efficiency initiatives.
A POC written in Java using the Spring Framework, which uses Apache Spark to read a file from Amazon S3 and count the number of lines in the file.
Example Spark project using Parquet as a columnar store with Thrift objects.
Notes for the Coursera course Big Data Essentials: HDFS, MapReduce and Spark RDD.
The goal is to train a linear regression model to predict Deerfoot commute times given weather and accident conditions, using Spark RDD and MLlib.
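For a single predictor, linear regression has a closed form that makes the idea easy to see before reaching for MLlib. A pure-Python sketch on hypothetical toy data (the numbers and names are invented for illustration; the repo fits its model distributed via Spark):

```python
# Hypothetical toy data: snowfall (cm) vs. Deerfoot commute time (minutes)
snowfall = [0.0, 1.0, 2.0, 3.0]
commute = [30.0, 35.0, 40.0, 45.0]

def fit_ols(xs, ys):
    """Closed-form ordinary least squares for a single predictor:
    slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

slope, intercept = fit_ols(snowfall, commute)
predicted = intercept + slope * 1.5  # expected commute at 1.5 cm of snow
```

On this toy data the fit recovers a slope of 5 minutes per cm of snow and a 30-minute baseline commute; MLlib's `LinearRegression` generalizes the same least-squares objective to many features.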