caroljmcdonald / mapr-sparkml-streaming-uber

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Install and fire up the Sandbox using the instructions here: http://maprdocs.mapr.com/home/SandboxHadoop/c_sandbox_overview.html. 

Use an SSH client such as Putty (Windows) or Terminal (Mac) to login. See below for an example:
use userid: user01 and password: mapr.

For VMWare use:  $ ssh user01@ipaddress 

For Virtualbox use:  $ ssh user01@127.0.0.1 -p 2222 

You can build this project with Maven using IDEs like Eclipse or NetBeans, and then copy the JAR file to your MapR Sandbox, or you can install Maven on your sandbox and build from the Linux command line, 
for more information on maven, eclipse or netbeans use google search. 

After building the project on your laptop, you can use scp to copy your JAR file from the project target folder to the MapR Sandbox:

use userid: user01 and password: mapr.
For VMWare use:  $ scp  nameoffile.jar  user01@ipaddress:/user/user01/. 

For Virtualbox use:  $ scp -P 2222 nameoffile.jar  user01@127.0.0.1:/user/user01/.  

Copy the data file from the project data folder to the sandbox using scp to this directory /user/user01/data/uber.csv on the sandbox:

For Virtualbox use:  $ scp -P 2222 data/uber.csv  user01@127.0.0.1:/user/user01/data/. 

This example runs on MapR 5.2.1 with Spark 2.1  

By default, Spark classpath doesn't contain spark-streaming-kafka-producer_2.11-2.1.0-mapr-1703.jar and spark-streaming-kafka-0-9_2.11-2.1.0-mapr-1703.jar
which you can get here:  
http://repository.mapr.com/nexus/content/groups/mapr-public/org/apache/spark/spark-streaming-kafka-0-9_2.11/2.1.0-mapr-1703/spark-streaming-kafka-0-9_2.11-2.1.0-mapr-1703.jar

http://repository.mapr.com/nexus/content/groups/mapr-public/org/apache/spark/spark-streaming-kafka-producer_2.11/spark-streaming-kafka-producer_2.11-2.1.0-mapr-1703.jar


There are 3 ways:
1. Build your application with spark-streaming-kafka-0-9_2.11.jar as dependencies. 
2. Manually download the jars and put to SparkHome/jars folder.
3. Manually download the jars and pass the path to spark submit script with --jars option
read more here http://maprdocs.mapr.com/home/Spark/Spark_IntegrateMapRStreams.html

note use:  mapr-sparkml-streaming-uber-1.0-jar-with-dependencies.jar if you did not update the spark classpath

Step 0: Create the topics to read from (ubers) and write to (uberp)  in MapR streams:

maprcli stream create -path /user/user01/stream -produceperm p -consumeperm p -topicperm p
maprcli stream topic create -path /user/user01/stream -topic ubers  
maprcli stream topic create -path /user/user01/stream -topic uberp  

to get info on the ubers topic :
maprcli stream topic info -path /user/user01/stream -topic ubers

to delete topics:
maprcli stream topic delete -path /user/user01/stream -topic ubers  
maprcli stream topic delete -path /user/user01/stream -topic uberp  
____________________________________________________________________

Step 1:  Run the Spark k-means program which will create and save the machine learning model: 

spark-submit --class com.sparkml.uber.ClusterUber --master local[2]  mapr-sparkml-streaming-uber-1.0.jar 

you can also copy paste  the code from  ClusterUber.scala into the spark shell 

$spark-shell --master local[2]
 
 - For Yarn you should change --master parameter to yarn-client - "--master yarn-client"

_________________________________________________________________________________  

Alternative Step 1: 

If you want to skip the machine learning and publishing part, the JSON results are in a file in the directory data/cluster.txt
scp this file to /user/user01/data/cluster.txt

Then run the Java producer to produce messages with the topic and data file arguments:

java -cp mapr-sparkml-streaming-uber-1.0.jar:`mapr classpath` com.streamskafka.uber.MsgProducer /user/user01/stream:uberp /user/user01/data/cluster.txt

To run the MapR Streams Java consumer  with the topic to read from (uberp):

java -cp mapr-sparkml-streaming-uber-1.0.jar:`mapr classpath` com.streamskafka.uber.MsgConsumer /user/user01/stream:uberp 

Then you can skip to Step 4 

____________________________________________________________________

After creating and saving the k-means model you can run the Streaming code:

Step 2: Run the MapR Streams Java producer to produce messages, run the Java producer with the topic (ubers) and data file arguments (uber.csv):

java -cp mapr-sparkml-streaming-uber-1.0.jar:`mapr classpath` com.streamskafka.uber.MsgProducer /user/user01/stream:ubers /user/user01/data/uber.csv

Optional: run the MapR Streams Java consumer to see what was published :

java -cp mapr-sparkml-streaming-uber-1.0.jar:`mapr classpath` com.streamskafka.uber.MsgConsumer /user/user01/stream:ubers


Step 3:  Run the Spark Consumer Producer with the arguments: path to model, subscribe topic, publish topic :
(in separate consoles if you want to run at the same time)

spark-submit --class com.sparkkafka.uber.SparkKafkaConsumerProducer --master local[2] \
 mapr-sparkml-streaming-uber-1.0-jar-with-dependencies.jar /user/user01/data/savemodel  /user/user01/stream:ubers /user/user01/stream:uberp

____________________________________________________________________


Step 4: Run the Spark Consumer which consumes the enriched messages with the topic to read from (uberp):

spark-submit --class com.sparkkafka.uber.SparkKafkaConsumer --master local[2] mapr-sparkml-streaming-uber-1.0.jar /user/user01/stream:uberp

____________________________________________________________________


Step 5:  Spark Streaming writing to MapR-DB HBase

Configure the HBase connector as documented here
http://maprdocs.mapr.com/home/Spark/ConfigureSparkHBaseConnector.html

start the hbase shell and create a table: 
hbase shell
create '/user/user01/db/uber', {NAME=>'data'}

Run the code to Read from MapR Streams and write to MapR-DB HBAse

spark-submit --class com.sparkkafka.uber.SparkKafkaConsumerWriteHBase --master local[2] \
mapr-sparkml-streaming-uber-1.0-jar-with-dependencies.jar /user/user01/stream:uberp  "/user/user01/db/uber"

start the hbase shell and scan to see results: 
hbase shell
scan '/user/user01/db/uber' , {'LIMIT' => 5}

____________________________________________________________________


Step 6:  Spark Reading from  MapR-DB HBase


Run the code to Read from  MapR-DB HBAse into a spark dataframe

spark-submit --class com.sparkkafka.uber.SparkHBaseReadDF --master local[2] \
mapr-sparkml-streaming-uber-1.0-jar-with-dependencies.jar "/user/user01/db/uber"

_________________________________________________________________________________

to delete the ubers topic after using :
maprcli stream topic delete -path /user/user01/stream -topic ubers

_________________________________________________________________________________

Other examples included:  
from https://github.com/mapr/spark/tree/2.1.0-mapr-1703/examples/src/main/scala/org/apache/spark/examples/streaming

Spark Streaming Consuming from MapR-Streams:

spark-submit --class com.sparkkafka.uber.V09DirectKafkaWordCount --master local[2] mapr-sparkml-streaming-uber-1.0.jar /user/user01/stream:ubers

Spark Streaming  publishing to  MapR-Streams:
spark-submit --class com.sparkkafka.uber.KafkaProducerExample  --master local[2] mapr-sparkml-streaming-uber-1.0.jar /user/user01/stream:ubert




About


Languages

Language:Scala 87.8%Language:Java 12.2%