Datasets and example code for Lesson 3 of Oracle Academy's Data Science Bootcamp.

Files:
- cluster_sample.xml: raw data for the Hadoop-based canopy clustering
- cluster_set_clean.dat: raw data for clustering in R
- canopy_commands.txt: commands for the Hadoop and Hive clustering example
- clustering.R: commands for the R clustering example

For this lesson:
- Copy cluster_sample.xml into HDFS
- Follow the steps in canopy_commands.txt
- Create a dataset for canopy clustering
- Create a dataset for k-means clustering in R
- Compile and run the canopy clusterer
- Load the R dataset into R and experiment with clustering in R
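
The first step in the list, copying cluster_sample.xml into HDFS, can be sketched roughly as below. This is a minimal sketch, not the lesson's official procedure: the helper name `stage_cluster_sample`, the `HDFS_CMD` override, and the target directory `canopy/input` are all assumptions made for illustration. canopy_commands.txt may expect a different HDFS path, so check it before running.

```shell
# Hypothetical helper: stage the lesson's raw XML into HDFS.
# HDFS_DIR is an assumed target directory (relative to your HDFS home);
# adjust it to whatever path canopy_commands.txt actually uses.
HDFS_CMD="${HDFS_CMD:-hdfs}"
HDFS_DIR="${HDFS_DIR:-canopy/input}"

stage_cluster_sample() {
    # Create the target directory (no error if it already exists),
    # upload the file, then list the directory to confirm the upload.
    "$HDFS_CMD" dfs -mkdir -p "$HDFS_DIR" &&
    "$HDFS_CMD" dfs -put cluster_sample.xml "$HDFS_DIR/" &&
    "$HDFS_CMD" dfs -ls "$HDFS_DIR"
}
```

Run `stage_cluster_sample` from the directory that contains cluster_sample.xml. Setting `HDFS_CMD=echo` first gives a dry run that prints the three `dfs` invocations instead of executing them, which is handy for checking the path before touching the cluster.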