Sabareesh19 / Sort-on-Hadoop-Spark

Sorting of large dataset files (up to 80GB) using Hadoop MapReduce and Apache Spark in Java, with jobs scheduled on a 4-node virtual cluster via the SLURM scheduler and bash scripting.


Username: A20396634
Name: Sabareesh Suresh

Follow these instructions to execute HadoopSort.java:

javac -classpath $(hadoop classpath) HadoopSort.java   # compile HadoopSort.java against the Hadoop classpath
jar cvf HadoopSort.jar *.class   # bundle all class files into a single jar
hadoop jar HadoopSort.jar HadoopSort /input/data-8GB /user/ssuresh14/output   # use a new output directory name for every run; HDFS will not overwrite an existing one
hadoop jar /opt/hadoop-2.9.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.0.jar teravalidate /user/ssuresh14/output /user/ssuresh14/report   # validate that the output is globally sorted
hadoop fs -get /user/ssuresh14/report/part-r-00000   # copy the validation report to the local filesystem

For the 20GB and 80GB Hadoop sorts, substitute /input/data-20GB or /input/data-80GB for the input path (and use a fresh output directory each time).
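HadoopSort.java itself is not reproduced in this README. As a rough sketch only, a minimal MapReduce sort along these lines would rely on the framework's shuffle, which sorts records by key between the map and reduce phases; the identity mapper/reducer and the tab-separated key/value parsing below are assumptions, not the repository's actual code.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Sketch of a MapReduce sort: the shuffle phase sorts by key, so the
// mapper and reducer just pass records through unchanged.
public class HadoopSort {

    public static class SortMapper extends Mapper<Text, Text, Text, Text> {
        @Override
        protected void map(Text key, Text value, Context ctx)
                throws java.io.IOException, InterruptedException {
            ctx.write(key, value);   // emit unchanged; the framework sorts by key
        }
    }

    public static class SortReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context ctx)
                throws java.io.IOException, InterruptedException {
            for (Text v : values) {
                ctx.write(key, v);   // keys arrive here in sorted order
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "HadoopSort");
        job.setJarByClass(HadoopSort.class);
        job.setInputFormatClass(KeyValueTextInputFormat.class);  // tab-separated key/value lines (assumption)
        job.setMapperClass(SortMapper.class);
        job.setReducerClass(SortReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. /input/data-8GB
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

With a single reducer the one output file is totally ordered; across multiple reducers a TotalOrderPartitioner would additionally be needed for teravalidate to pass, since it checks global order across part files.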

To simplify the HadoopSort runs, submit the SLURM batch files instead (a hypothetical sketch of one follows this list):
1. sbatch hadoopsort8GB.slurm
2. sbatch hadoopsort20GB.slurm
3. sbatch hadoopsort80GB.slurm
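The batch files themselves are not shown in this README. As a purely hypothetical sketch (the directive values and output path are assumptions, not the repository's actual script), hadoopsort8GB.slurm might look like:

#!/bin/bash
#SBATCH --job-name=hadoopsort8GB        # name shown in the SLURM queue
#SBATCH --nodes=4                       # the 4-node virtual cluster described above
#SBATCH --output=hadoopsort8GB.%j.out   # scheduler log; %j expands to the job id

# Hypothetical fresh output directory per run, avoiding HDFS "already exists" errors
hadoop jar HadoopSort.jar HadoopSort /input/data-8GB /user/ssuresh14/output-$SLURM_JOB_ID

The 20GB and 80GB variants would differ only in the input path and job name.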



Follow these instructions to execute SparkSort.java:

$ javac -classpath /opt/spark-2.3.0-bin-hadoop2.7/jars/spark-core_2.11-2.3.0.jar:/opt/spark-2.3.0-bin-hadoop2.7/jars/spark-sql_2.11-2.3.0.jar:/opt/spark-2.3.0-bin-hadoop2.7/jars/scala-compiler-2.11.8.jar:/opt/spark-2.3.0-bin-hadoop2.7/jars/scala-library-2.11.8.jar SparkSort.java
$ jar cvf SparkSort.jar SparkSort*.class # include only the class files for the Spark implementation
$ spark-submit --class SparkSort --master yarn --deploy-mode client --driver-memory 1g --executor-memory 1g --executor-cores 1 --num-executors 1 SparkSort.jar /input/data-8GB /user/ssuresh14/outputspark9
$ hadoop jar /opt/hadoop-2.9.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.0.jar teravalidate /user/ssuresh14/outputspark9 /user/ssuresh14/reportspark9
$ hadoop fs -get /user/ssuresh14/reportspark9/part-r-00000
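As with the Hadoop job, SparkSort.java itself is not reproduced here. A minimal sketch using the Java RDD API could look like the following; the 10-character key prefix is an assumption based on the teragen record layout that teravalidate checks, not a detail confirmed by the repository.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

// Sketch of a distributed sort: key each line, then let sortByKey
// range-partition and sort the data across the cluster.
public class SparkSort {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("SparkSort");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> lines = sc.textFile(args[0]);   // e.g. /input/data-8GB

        JavaPairRDD<String, String> sorted = lines
                .mapToPair(line -> new Tuple2<>(line.substring(0, 10), line)) // assumed 10-byte key
                .sortByKey(true);                       // ascending, cluster-wide shuffle sort

        sorted.values().saveAsTextFile(args[1]);        // e.g. /user/ssuresh14/outputspark9
        sc.stop();
    }
}

Because sortByKey uses a range partitioner, the part files in the output directory are sorted both internally and relative to one another, which is what teravalidate verifies.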

To simplify the SparkSort runs, submit the corresponding SLURM batch files:
1. sbatch sparksort8GB.slurm
2. sbatch sparksort20GB.slurm
3. sbatch sparksort80GB.slurm

The Hadoop and Spark output files are written to the output directories named in the respective SLURM files.



Languages

Language: Shell 53.0%, Java 47.0%