Clustering and Unsupervised Learning

Datasets and example code for Lesson 3 in Oracle Academy's Data Science Bootcamp. Raw data for the Hadoop-based canopy clustering is in cluster_sample.xml. Raw data for clustering in R is in cluster_set_clean.dat. Commands for the Hadoop and Hive clustering example are in canopy_commands.txt. Commands for the R clustering example are in clustering.R.

For this lesson:

1. Copy cluster_sample.xml into HDFS.
2. Follow the steps in canopy_commands.txt:
   - Create a dataset for canopy clustering.
   - Create a dataset for k-means clustering in R.
3. Compile and run the canopy clusterer.
4. Load the R dataset into R and experiment with clustering in R.
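
For orientation before running the Hadoop job, here is a minimal single-machine sketch of the canopy clustering idea in Java. It is not the code in this repository and does not use Hadoop. The class name CanopySketch, the method canopies, the thresholds t1 (loose) and t2 (tight), and the use of Euclidean distance are all assumptions made for illustration. The sketch takes the first remaining point as a canopy center, adds every remaining point within t1 to that canopy, and drops points within t2 from further consideration, so points between t2 and t1 may end up in more than one canopy.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Illustrative only: all names here are hypothetical, not taken from the lesson code.
class CanopySketch {

    /** One canopy: a center plus every candidate point within the loose threshold t1. */
    static final class Canopy {
        final double[] center;
        final List<double[]> members = new ArrayList<>();

        Canopy(double[] center) {
            this.center = center;
        }
    }

    /** Euclidean distance (an assumption; a cheaper approximate metric is common in practice). */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Builds canopies with loose threshold t1 and tight threshold t2, where t1 >= t2. */
    static List<Canopy> canopies(List<double[]> points, double t1, double t2) {
        if (t1 < t2) {
            throw new IllegalArgumentException("t1 must be >= t2");
        }
        List<double[]> remaining = new ArrayList<>(points); // copy so the input is not modified
        List<Canopy> result = new ArrayList<>();
        while (!remaining.isEmpty()) {
            // Take the first remaining point as the next canopy center.
            double[] center = remaining.remove(0);
            Canopy canopy = new Canopy(center);
            canopy.members.add(center);
            Iterator<double[]> it = remaining.iterator();
            while (it.hasNext()) {
                double[] p = it.next();
                double d = distance(center, p);
                if (d <= t1) {
                    canopy.members.add(p); // loosely close: joins this canopy
                }
                if (d <= t2) {
                    it.remove(); // tightly close: can no longer start or join another canopy
                }
            }
            result.add(canopy);
        }
        return result;
    }

    public static void main(String[] args) {
        List<double[]> points = Arrays.asList(
                new double[] {0.0, 0.0},
                new double[] {0.5, 0.0},
                new double[] {2.0, 0.0},
                new double[] {10.0, 10.0},
                new double[] {10.5, 10.0});
        for (Canopy c : canopies(points, 3.0, 1.0)) {
            System.out.println(Arrays.toString(c.center) + " -> " + c.members.size() + " member(s)");
        }
    }
}
```

Canopy centers found this way are often used as a cheap first pass, for example to suggest a number of clusters or starting centers for a k-means run such as the one in the R part of this lesson.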
