Realtime and Big Data Analytics

CSCI-GA.3033-008, NYU Courant Institute of Mathematical Sciences, Computer Science Department, Graduate Division, Spring 2013

Technologies

We use Apache Hadoop/MapReduce to analyze the data sets and extract patterns from them.
We use Apache HBase to store data that we later query from our UI.
We use Apache Mahout to cluster users with the K-Means algorithm (SequenceFiles of NamedVectors represent each user as User -> Weights of Tags of Songs Listened).
We use HDFS to hold the large amounts of data that cannot be handled when running Hadoop in local/standalone mode.
We use SQLite to store data that is not large in volume and to retrieve it from the UI.
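
To make the clustering step above more concrete, here is a minimal plain-Java sketch of K-Means over per-user tag-weight vectors. It is not the project's code and does not call Mahout or Hadoop: the class name TagKMeansSketch, the two example tags, the sample weights, and the choice to seed centroids with the first k users are all assumptions made for illustration. Mahout's own K-Means job runs the same assign-then-update loop at scale over SequenceFiles.

```java
import java.util.Arrays;

/** Illustrative K-Means (Lloyd's algorithm) over dense tag-weight vectors. */
final class TagKMeansSketch {

    /** Squared Euclidean distance between two equal-length vectors. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Index of the centroid closest to the given point. */
    static int nearest(double[] point, double[][] centroids) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++) {
            if (squaredDistance(point, centroids[c]) < squaredDistance(point, centroids[best])) {
                best = c;
            }
        }
        return best;
    }

    /**
     * Clusters the points into k groups and returns the cluster index of each point.
     * Assumption for this sketch: centroids are seeded with the first k points.
     */
    static int[] cluster(double[][] points, int k, int maxIterations) {
        int dims = points[0].length;
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = points[c].clone();
        }
        int[] assignment = new int[points.length];
        Arrays.fill(assignment, -1);

        for (int iter = 0; iter < maxIterations; iter++) {
            // Assignment step: attach every user to the nearest centroid.
            boolean changed = false;
            for (int p = 0; p < points.length; p++) {
                int c = nearest(points[p], centroids);
                if (c != assignment[p]) {
                    assignment[p] = c;
                    changed = true;
                }
            }
            if (!changed) {
                break; // converged: no user switched cluster
            }
            // Update step: move each centroid to the mean of its members.
            double[][] sums = new double[k][dims];
            int[] counts = new int[k];
            for (int p = 0; p < points.length; p++) {
                counts[assignment[p]]++;
                for (int d = 0; d < dims; d++) {
                    sums[assignment[p]][d] += points[p][d];
                }
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // keep an empty cluster's centroid where it is
                }
                for (int d = 0; d < dims; d++) {
                    centroids[c][d] = sums[c][d] / counts[c];
                }
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        // Hypothetical users; the columns are weights for the tags {rock, jazz}.
        double[][] users = { {0.9, 0.1}, {0.1, 0.9}, {0.8, 0.2}, {0.2, 0.8} };
        System.out.println(Arrays.toString(cluster(users, 2, 10)));
    }
}
```

With the sample data, the two rock-heavy users end up in one cluster and the two jazz-heavy users in the other. A real run would use one dimension per tag and a less naive seeding strategy.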

Analytics

We try to answer the following from the Million Song Dataset:

What's the most listened-to song? (100% Completed)
Who's the most listened-to artist? (100% Completed)
What's an artist's top songs? (100% Completed)
Plot a graph of the artist's song energies (0 - 100) vs. number of songs. (100% Completed)
What are an artist's similar artists? (100% Completed)
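
Questions such as "most listened-to song" and "top songs" reduce to summing play counts per key and then ranking the totals. The plain-Java sketch below illustrates that idea without Hadoop. It is not the project's MapReduce job: the class name PlayCountSketch, the tab-separated "user, song, plays" line layout, and the sample song IDs are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative play-count aggregation, mirroring a map step and a reduce step. */
final class PlayCountSketch {

    /** Sums plays per song from lines of the assumed form "user<TAB>song<TAB>plays". */
    static Map<String, Long> totalPlays(List<String> lines) {
        Map<String, Long> totals = new HashMap<>();
        for (String line : lines) {
            String[] fields = line.split("\t");
            if (fields.length != 3) {
                continue; // skip lines that do not match the assumed layout
            }
            // "Map" emits (song, plays); "reduce" adds the plays up per song.
            totals.merge(fields[1], Long.parseLong(fields[2].trim()), Long::sum);
        }
        return totals;
    }

    /** Returns up to n song IDs by descending total plays; ties are broken by song ID. */
    static List<String> topSongs(Map<String, Long> totals, int n) {
        List<Map.Entry<String, Long>> entries = new ArrayList<>(totals.entrySet());
        entries.sort((a, b) -> {
            int byPlays = Long.compare(b.getValue(), a.getValue());
            return byPlays != 0 ? byPlays : a.getKey().compareTo(b.getKey());
        });
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }

    public static void main(String[] args) {
        // Hypothetical listening records, not taken from the dataset.
        List<String> lines = List.of("u1\tSONG_A\t3", "u2\tSONG_B\t5", "u3\tSONG_A\t4");
        Map<String, Long> totals = totalPlays(lines);
        System.out.println(topSongs(totals, 1)); // the single most listened-to song
        System.out.println(topSongs(totals, 2)); // top two songs overall
    }
}
```

Keying the totals on an artist ID instead of a song ID would give the most listened-to artist, and filtering the input to one artist before ranking would give that artist's top songs.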

About

This is the final project for the Master's-level course Realtime and Big Data Analytics at New York University. In this project we uncover hidden patterns in, and extract information from, datasets provided by the Million Song Dataset, Last.fm, MusicBrainz and EchoNest.

Languages

Language: Java 100.0%