In bootstrap_action.sh, replace the S3 location in the following command with your desired S3 location:
aws s3 cp s3://<s3-bucket>/zeppelin-setup/resources/ zeppelinsetup --recursive
Push the folder setupZeppelin to your desired S3 location.
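As a rough sketch of the substitution described above, the placeholder in bootstrap_action.sh could be replaced with sed. The bucket name my-zeppelin-bucket is a made-up example, and the snippet creates a stand-in copy of the script if none exists in the current directory, so that it can be tried in isolation.

```shell
#!/bin/sh
# Hypothetical bucket name (an assumption for illustration); use your own.
BUCKET="my-zeppelin-bucket"
SCRIPT="bootstrap_action.sh"

# Demo only: if the real script is not in the current directory,
# create a stand-in containing the command quoted above.
if [ ! -f "$SCRIPT" ]; then
    echo 'aws s3 cp s3://<s3-bucket>/zeppelin-setup/resources/ zeppelinsetup --recursive' > "$SCRIPT"
fi

# Replace the <s3-bucket> placeholder in place, keeping a .bak backup.
sed -i.bak "s|<s3-bucket>|${BUCKET}|g" "$SCRIPT"

# Show the resulting copy command for a quick visual check.
grep 'aws s3 cp' "$SCRIPT"
```

The .bak file keeps the original line, so the change is easy to revert if the bucket name turns out to be wrong.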
- Set Zeppelin to use S3-backed notebooks with Spark on Amazon EMR
- Set Anaconda as the default Python interpreter in Zeppelin
- Set up Shiro authentication in Zeppelin
Make sure you have the following resources before beginning:
- The AWS Command Line Interface (AWS CLI) installed
- An SSH client
- A key pair in the region where you'll launch the Zeppelin instance
- An S3 bucket in the same region to store your Zeppelin notebooks and to transfer files from EMR to your Zeppelin instance
- IAM permissions to create S3 buckets, launch EC2 instances, and create EMR clusters
The first step is to set up an EMR cluster.
- On the Amazon EMR console, choose Create cluster.
- Choose Go to advanced options and enter the following options:
  - Vendor: Amazon
  - Make sure Hadoop, Zeppelin, Ganglia, and Spark are selected.
  - Add the bootstrap action.
  - In the Add steps section, for Step type, choose Custom JAR, and select Configure.
    1. Change the name to "custom step".
    2. In JAR location, add s3://region.elasticmapreduce/libs/script-runner/script-runner.jar, replacing region with the region in which you've created your EMR cluster. For example, if your region is eu-west-1, the JAR location is s3://eu-west-1.elasticmapreduce/libs/script-runner/script-runner.jar. The script runner allows you to run a script at any time during the step process.
    3. In Arguments, add s3://ah-aim-dn-applications/setupZeppelin/step.sh.
- Choose Add, then Next.
- On the Hardware Configuration page, select your VPC and the subnet where you want to launch the cluster, keep the default selection of one master and two core nodes of m3.xlarge, and choose Next.
- On the General Options page, give your cluster a name (e.g., Spark-Cluster) and choose Next.
- On the Security Options page, for EC2 key pair, select a key pair. Keep all other settings at the default values and choose Create cluster.
Your three-node cluster takes a few moments to start up. Your cluster is ready when the cluster status is Waiting.
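The script-runner JAR location from the steps above depends only on the region, so it can be assembled in a shell variable. The following is a minimal sketch: eu-west-1 is the example region from the text, the <s3-bucket> placeholder stands for wherever you pushed setupZeppelin, and the step string is an assumption about how the same step would look in the AWS CLI shorthand for --steps, not something taken from this guide.

```shell
#!/bin/sh
# Example region from the text; set this to the region of your cluster.
REGION="eu-west-1"
# Placeholder location of step.sh (assumption: wherever you pushed setupZeppelin).
STEP_SCRIPT="s3://<s3-bucket>/setupZeppelin/step.sh"

# The script-runner JAR lives in a region-specific bucket.
JAR_LOCATION="s3://${REGION}.elasticmapreduce/libs/script-runner/script-runner.jar"

# Assumed AWS CLI shorthand for the same custom JAR step (for use with --steps).
STEP="Type=CUSTOM_JAR,Name=custom step,Jar=${JAR_LOCATION},Args=[${STEP_SCRIPT}]"

echo "JAR location: ${JAR_LOCATION}"
echo "Step: ${STEP}"
```

The printed step string could then be passed as a single quoted argument, for example to aws emr add-steps --cluster-id <cluster-id> --steps "$STEP", where <cluster-id> is a placeholder for your own cluster ID.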
Apache Shiro is a powerful and easy-to-use Java security framework that performs authentication, authorization, cryptography, and session management.
The zeppelin-env.sh and zeppelin-site.xml files are already updated and stored in the resources directory. If you'd like to configure them again, use the following instructions.
- EMR works through an IAM profile, so you don't need to store AWS credentials on EMR.
- To store notebooks in S3, first create an S3 bucket with the following folder structure:

      bucket_name/
          username/
              notebook/

- Then use either of the following methods:
  - Set the environment variables in zeppelin-env.sh (note that there must be no spaces around the equals sign):

        export ZEPPELIN_NOTEBOOK_S3_BUCKET=bucket_name
        export ZEPPELIN_NOTEBOOK_S3_USER=username
  - In zeppelin-site.xml, uncomment the following properties, replacing the value of zeppelin.notebook.s3.user with username and the value of zeppelin.notebook.s3.bucket with bucket_name:

        <!-- Amazon S3 notebook storage -->
        <!-- Creates the following directory structure: s3://{bucket}/{username}/{notebook-id}/note.json -->
        <property>
          <name>zeppelin.notebook.s3.user</name>
          <value>username</value>
          <description>user name for s3 folder structure</description>
        </property>
        <property>
          <name>zeppelin.notebook.s3.bucket</name>
          <value>bucket_name</value>
          <description>bucket name for notebook storage</description>
        </property>
        <property>
          <name>zeppelin.notebook.s3.endpoint</name>
          <value>s3.amazonaws.com</value>
          <description>endpoint for s3 bucket</description>
        </property>
        <property>
          <name>zeppelin.notebook.storage</name>
          <value>org.apache.zeppelin.notebook.repo.S3NotebookRepo</value>
          <description>notebook persistence layer implementation</description>
        </property>
- Note: services on EMR use upstart, and the supported way to restart them is to use sudo stop <service name>; sudo start <service name> (the start and stop commands are in /sbin, which is in the PATH by default).
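The environment-variable method above can be sketched as a few shell lines. This is only an illustration under assumptions: bucket_name and username are the placeholders used above, ZEPPELIN_ENV is a hypothetical variable that defaults to a local zeppelin-env.sh so the snippet can be tried safely (on the cluster you would point it at the real file), and the service name zeppelin in the commented restart commands is an assumption.

```shell
#!/bin/sh
# Hypothetical path variable; defaults to a local file for a safe trial run.
ZEPPELIN_ENV="${ZEPPELIN_ENV:-./zeppelin-env.sh}"

# Placeholders from the text; substitute your own bucket and user folder.
S3_BUCKET="bucket_name"
S3_USER="username"

# Append the exports. Note: no spaces around '=' in shell assignments.
cat >> "$ZEPPELIN_ENV" <<EOF
export ZEPPELIN_NOTEBOOK_S3_BUCKET=${S3_BUCKET}
export ZEPPELIN_NOTEBOOK_S3_USER=${S3_USER}
EOF

# Source the file to confirm that it parses and the variables are set.
. "$ZEPPELIN_ENV"
echo "Notebooks will be stored under s3://${ZEPPELIN_NOTEBOOK_S3_BUCKET}/${ZEPPELIN_NOTEBOOK_S3_USER}/"

# On the cluster, restart the service afterwards (service name is an assumption):
# sudo stop zeppelin; sudo start zeppelin
```

Sourcing the file right after editing it is a cheap way to catch syntax slips, such as spaces around the equals sign, before restarting the service.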