TDP Getting Started
This repository provides a working directory from which you run deployment commands, either against a predefined virtual infrastructure managed with Vagrant or against your own infrastructure. You can customize the infrastructure and deploy each component of your cluster with a single command per component.
Setup script
To help users set up this getting-started environment, a setup script is provided at `scripts/setup.sh`.
Features and requirements
The `-e` option is used to enable features; for example, `-e extras` enables TDP Collection Extras. Each feature has its own requirements, so if you enable a feature, you MUST install the feature requirements BEFORE launching the setup script.
Common requirements are:
- Python >= 3.6 with the virtual env package (i.e. `python3-venv`)
- Unzip (to execute the setup scripts)
- `jq` (required to execute the helper script)
Python requirements like Ansible and Mitogen are listed in the file `requirements.txt`. The virtual environment is populated with these requirements. Therefore, you should not install them by yourself outside of the virtual environment. Only versions described in `requirements.txt` are supported.
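After running the setup script, you can check which pinned versions were actually installed in the virtual environment (the `./venv` path is the one the script creates):
# List the pinned versions installed in the virtual environment
./venv/bin/pip freeze | grep -Ei 'ansible|mitogen'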
Specific feature requirements are:
- `vagrant`: see requirements in https://github.com/TOSIT-IO/tdp-vagrant
- `ui`: see requirements in https://github.com/TOSIT-IO/tdp-ui
Release
The `-r` option selects the `stable` or `latest` version for all features. `stable` is recommended if you are new to TDP; we try our best to keep the `stable` release working. `latest` is recommended when you want to contribute to TDP.
Clean mode
By default, the setup script does not delete your custom configuration. For example, the `.env` file is generated by the setup script and is read by `tdp-lib` and `tdp-server`; you can change it to your preference and rerun the setup script without losing your modifications.
If you want a clean configuration (i.e. remove your custom changes), you can add the `-c` flag and the setup script will remove and recreate the files, symlinks, and directories it created.
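For example, a minimal sketch of resetting the generated files while keeping the same feature set (adjust the `-e` options to your own selection):
# Remove and recreate everything the setup script manages
./scripts/setup.sh -c -e vagrant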
Quick Start
The steps below deploy a TDP cluster with all features (see the line with `./scripts/setup.sh` and multiple `-e` options), so you MUST install all the requirements. If you only want specific features, modify the `./scripts/setup.sh` line.
If Vagrant is enabled, the Ansible `host.ini` file will be generated using the `hosts` variable in `tdp-vagrant/vagrant.yml`.
Prerequisites
# Clone project from version control
git clone https://github.com/TOSIT-IO/tdp-getting-started.git
# Move into project dir
cd tdp-getting-started
# Setup local env with stable tdp-collection (mandatory), tdp-lib (mandatory), tdp-server, tdp-ui, tdp-collection-extras, tdp-observability, tdp-collection-prerequisites, and vagrant
# Modify the line below for your needs
./scripts/setup.sh -e server -e ui -e extras -e observability -e prerequisites -e vagrant
# Activate Python virtual env and set environment variables
source ./venv/bin/activate && source .env
# To enable mitogen (optional)
export ANSIBLE_STRATEGY_PLUGINS="$(python -c 'import os,ansible_mitogen; print(os.path.dirname(ansible_mitogen.__file__))')/plugins/strategy"
export ANSIBLE_STRATEGY="mitogen_linear"
# Launch VMs
vagrant up
# Configure TDP Collection Prerequisites
ansible-playbook ansible_collections/tosit/tdp_prerequisites/playbooks/all.yml
There are four ways to deploy a TDP cluster: with the TDP UI, the TDP Server API, the TDP Lib CLI, or Ansible playbooks.
Deploy with TDP UI
# Open a new terminal and activate python virtual env
source ./venv/bin/activate && source .env
# Start tdp-server
uvicorn tdp_server.main:app --reload
# Open a new terminal and start tdp-ui
npm --prefix ./tdp-ui run dev
In the UI, click on "Deployments", "New deployment", "Deploy from the DAG", "Preview" (by default, all services are deployed), then "Deploy".
You can follow the deployment on the "Deployments" page. Wait for the deployment to complete.
# Configure HDFS user home directories
ansible-playbook ansible_collections/tosit/tdp/playbooks/utils/hdfs_user_homes.yml
# Configure Ranger policies
ansible-playbook ansible_collections/tosit/tdp/playbooks/utils/ranger_policies.yml
Deploy with TDP Server API
# Open a new terminal and activate python virtual env
source ./venv/bin/activate && source .env
# Start tdp-server
uvicorn tdp_server.main:app --reload
# Deploy TDP cluster core and extras services
curl -X POST http://localhost:8000/api/v1/deploy/dag
# You can see the log in the tdp-server output (the terminal where uvicorn is running)
# Wait deployment
while ! curl -s http://localhost:8000/api/v1/deploy/status | grep -q "no deployment on-going"; do sleep 10; done
# Configure HDFS user home directories
ansible-playbook ansible_collections/tosit/tdp/playbooks/utils/hdfs_user_homes.yml
# Configure Ranger policies
ansible-playbook ansible_collections/tosit/tdp/playbooks/utils/ranger_policies.yml
Deploy with TDP Lib CLI
# Deploy TDP cluster core and extras services
tdp deploy
# Configure HDFS user home directories
ansible-playbook ansible_collections/tosit/tdp/playbooks/utils/hdfs_user_homes.yml
# Configure Ranger policies
ansible-playbook ansible_collections/tosit/tdp/playbooks/utils/ranger_policies.yml
Deploy with Ansible playbook
# Deploy TDP cluster core services
ansible-playbook ansible_collections/tosit/tdp/playbooks/meta/all.yml
# Deploy extras services
ansible-playbook ansible_collections/tosit/tdp_extra/playbooks/meta/livy.yml
ansible-playbook ansible_collections/tosit/tdp_extra/playbooks/meta/livy-spark3.yml
ansible-playbook ansible_collections/tosit/tdp_extra/playbooks/meta/zookeeper-kafka.yml
ansible-playbook ansible_collections/tosit/tdp_extra/playbooks/meta/kafka.yml
# Deploy observability services
ansible-playbook ansible_collections/tosit/tdp_observability/playbooks/meta/prometheus.yml
ansible-playbook ansible_collections/tosit/tdp_observability/playbooks/meta/grafana.yml
# Configure HDFS user home directories
ansible-playbook ansible_collections/tosit/tdp/playbooks/utils/hdfs_user_homes.yml
# Configure Ranger policies
ansible-playbook ansible_collections/tosit/tdp/playbooks/utils/ranger_policies.yml
Web UI Links
- HDFS NN Master 01
- HDFS NN Master 02
- YARN RM Master 01
- YARN RM Master 02
- MapReduce Job History Server
- HBase Master 01
- HBase Master 02
- Spark History Server
- Spark3 History Server
- Ranger Admin
- JupyterHub
Note: All the WebUIs are Kerberized. You need a working Kerberos client on your host, the KDC configured in your `/etc/krb5.conf` file, and a valid ticket. You can also access the WebUIs through Knox.
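A minimal sketch of obtaining a ticket from your host, assuming the realm `REALM.TDP` used elsewhere in this guide and that a `tdp_user` principal exists in the KDC (both are assumptions; use your actual realm and principal):
# Request and inspect a Kerberos ticket from your host
kinit tdp_user@REALM.TDP
klist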
Customised Deployment
Each of the sections below includes a high-level explanation of a possible step of a TDP deployment.
Environment Setup
Execute the `setup.sh` script to create the needed project directories and clone the stable or latest Ansible TDP collections. It also downloads the TDP binaries from their GitHub releases (e.g., Hadoop).
Note: The list of TDP binaries needed for deployment is maintained in the `scripts/tdp-release-uris.txt` file.
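You can inspect this list to see which binaries will be downloaded:
# Show the release URIs fetched by the setup script
cat scripts/tdp-release-uris.txt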
# Get stable tdp-collection
./scripts/setup.sh
# Get latest tdp-collection, tdp-collection-extras, tdp-observability, tdp-collection-prerequisites, and vagrant
./scripts/setup.sh -e extras -e observability -e prerequisites -e vagrant -r latest
Configure infrastructure
Use TDP Vagrant
To use `tdp-vagrant` it is necessary to pass the `-e vagrant` option when running `setup.sh`.
You can define a `vagrant.yml` file to update the machine resources according to your machine's RAM and core count (3 GB of RAM and 4 cores is ideal for the master nodes). The file `tdp-vagrant/vagrant.yml` contains the default values.
cp tdp-vagrant/vagrant.yml .
# Now you can edit ./vagrant.yml
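To review how your local copy diverges from the tracked defaults (a convenience, not a required step):
# Compare your overrides with the submodule defaults
diff -u tdp-vagrant/vagrant.yml vagrant.yml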
Important: Do not modify `tdp-vagrant/vagrant.yml`; keeping it unchanged makes the git submodule easier to update. The Vagrantfile reads `vagrant.yml` in the current directory and falls back to `tdp-vagrant/vagrant.yml`.
Start the VMs with the `vagrant` command.
vagrant up
For TDP Vagrant usage see https://github.com/TOSIT-IO/tdp-vagrant.
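Once the VMs are up, you can check that Ansible reaches every node; this sketch assumes the virtual environment and the `.env` file generated during setup are sourced in your shell:
# Verify Ansible connectivity to all inventory hosts
ansible all -m ping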
Note: The `helper.sh` script can generate the list of hosts in the cluster. Add the generated lines to your `/etc/hosts` file to resolve the local nodes from your shell or browser.
./scripts/helper.sh -h
Configure prerequisites
To use `tdp-collection-prerequisites` it is necessary to pass the `-e prerequisites` option when running `setup.sh`.
ansible-playbook ansible_collections/tosit/tdp_prerequisites/playbooks/all.yml
This playbook deploys the following services: Chrony, a CA, an LDAP server, a KDC, and a PostgreSQL instance.
For TDP prerequisites usage see https://github.com/TOSIT-IO/tdp-collection-prerequisites.
Core Services Deployment
TDP Lib command
tdp deploy
This command deploys all core and extra (if enabled during setup) services.
For TDP Lib usage see https://github.com/TOSIT-IO/tdp-lib.
Main playbook
ansible-playbook ansible_collections/tosit/tdp/playbooks/meta/all.yml
This playbook deploys the following services: Exporter, ZooKeeper, Hadoop core (HDFS, YARN, MapReduce), Ranger, Hive, Spark (2 and 3), HBase, and Knox. It does not deploy the extra services (see Extras Services Deployment to deploy them).
For TDP usage see https://github.com/TOSIT-IO/tdp-collection.
Exporter
ansible-playbook ansible_collections/tosit/tdp/playbooks/meta/exporter.yml
Zookeeper
Deploys Apache ZooKeeper to the `[zk]` Ansible group and starts a 3-node ZooKeeper quorum.
ansible-playbook ansible_collections/tosit/tdp/playbooks/meta/zookeeper.yml
Run `echo stat | nc localhost 2181` from any node in the `[zk]` group to see its ZooKeeper status.
Ranger
Deploys Ranger to the `[ranger_admin]` Ansible group.
Note that any changes to the `[ranger_admin]` hosts should also be reflected in the `[hadoop_client]` group.
ansible-playbook ansible_collections/tosit/tdp/playbooks/meta/ranger.yml
The Ranger UI can be accessed at the address `https://<master-02.tdp ip>:6182/login.jsp` with the user `admin` and password `RangerAdmin123` (assuming the default `ranger_admin_password` parameter). You may need to import the `root.pem` certificate authority into your browser or accept the SSL exception.
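As a quick liveness check, you can also query the standard Ranger public REST API with the same credentials (the endpoint below is the stock Ranger v2 policy API; `-k` skips certificate verification):
# List the Ranger policies over the REST API
curl -k -u admin:RangerAdmin123 "https://master-02.tdp:6182/service/public/v2/api/policy"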
Launch HDFS, YARN & MapReduce
Launches HDFS and YARN, and deploys the MapReduce clients.
ansible-playbook ansible_collections/tosit/tdp/playbooks/meta/hadoop.yml
ansible-playbook ansible_collections/tosit/tdp/playbooks/meta/hdfs.yml
ansible-playbook ansible_collections/tosit/tdp/playbooks/meta/yarn.yml
ansible-playbook ansible_collections/tosit/tdp/playbooks/utils/hdfs_user_homes.yml
`tdp_user` can access and write to its HDFS user directory:
# From edge-01.tdp
sudo su - tdp_user
kinit -ki
echo "This is the first line." | hdfs dfs -put - /user/tdp_user/test-file.txt
echo "This is the second (appended) line." | hdfs dfs -appendToFile - /user/tdp_user/test-file.txt
hdfs dfs -cat /user/tdp_user/test-file.txt
Hive
Deploys Hive to the `[hive_s2]` Ansible group. The HDFS filesystem is created and the service is launched.
ansible-playbook ansible_collections/tosit/tdp/playbooks/meta/hive.yml
To interact with Hive, use the `beeline` CLI:
# From edge-01.tdp
sudo su - tdp_user
kinit -ki
export hive_truststore_password='Truststore123!'
# Connect to a random HiveServer2 using ZooKeeper
beeline -u "jdbc:hive2://master-01.tdp:2181,master-02.tdp:2181,master-03.tdp:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2;sslTrustStore=/etc/ssl/certs/truststore.jks;trustStorePassword=${hive_truststore_password}"
# Or directly to a HiveServer2
beeline -u "jdbc:hive2://master-03.tdp:10001/;principal=hive/_HOST@REALM.TDP;transportMode=http;httpPath=cliservice;ssl=true;sslTrustStore=/etc/ssl/certs/truststore.jks;trustStorePassword=${hive_truststore_password}"
# You can also use `beeline` alone which will default to the ZooKeeper mode
beeline
From the Beeline shell:
-- Create the database
CREATE DATABASE IF NOT EXISTS tdp_user LOCATION '/user/tdp_user/warehouse/tdp_user.db';
USE tdp_user;
-- Examine the database
SHOW DATABASES;
SHOW TABLES;
-- Modify the database
CREATE TABLE IF NOT EXISTS table1 (
  col1 INT COMMENT 'Integer Column',
  col2 STRING COMMENT 'String Column'
);
-- Examine the database
SHOW TABLES;
-- Modify the database table
INSERT INTO TABLE table1 VALUES (1, 'one'), (2, 'two');
-- Examine the database table
SELECT * FROM table1;
Spark
Deploys Spark installations to the `[spark_hs]` and `[spark_client]` Ansible groups.
ansible-playbook ansible_collections/tosit/tdp/playbooks/meta/spark.yml
To submit a Spark application:
# From edge-01.tdp
sudo su - tdp_user
kinit -ki
# Run a spark application locally
spark-submit --class org.apache.spark.examples.SparkPi --master local[4] /opt/tdp/spark/examples/jars/spark-examples_2.11-2.3.5-TDP-0.1.0-SNAPSHOT.jar 100
# Or spark-submit a spark application to yarn
spark-submit --class org.apache.spark.examples.SparkPi --master yarn /opt/tdp/spark/examples/jars/spark-examples_2.11-2.3.5-TDP-0.1.0-SNAPSHOT.jar 100
Note: Other Spark CLIs are available: `pyspark`, `spark-shell`, `spark-sql`.
Spark 3
Deploys Spark 3 installations to the `[spark3_hs]` and `[spark3_client]` Ansible groups.
ansible-playbook ansible_collections/tosit/tdp/playbooks/meta/spark3.yml
Spark 3 is installed alongside Spark 2 and can be used exactly the same way. The Spark 3 CLIs are: `spark3-submit`, `spark3-shell`, `spark3-sql`, `pyspark3`.
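For example, the SparkPi job from the previous section can be submitted with Spark 3; the examples jar path below is an assumption, adjust it to the jar shipped with your Spark 3 installation:
# From edge-01.tdp, as tdp_user with a valid ticket
spark3-submit --class org.apache.spark.examples.SparkPi --master yarn /opt/tdp/spark3/examples/jars/spark-examples_*.jar 100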
HBase
Deploys HBase masters, regionservers, REST servers, and clients to the `[hbase_master]`, `[hbase_rs]`, `[hbase_rest]`, and `[hbase_client]` Ansible groups respectively.
ansible-playbook ansible_collections/tosit/tdp/playbooks/meta/hbase.yml
As `tdp_user` on `edge-01`, obtain a Kerberos TGT with the command `kinit -ki` and access the HBase shell with the command `hbase shell`.
Commands such as those below can be used to test your HBase deployment:
# List tables and namespaces
list
list_namespace
# Create a table with one column family named 'cf'
create 'tdp_user_table', 'cf'
# Write a value, then read the table back
put 'tdp_user_table', 'row1', 'cf:testColumn', 'testValue'
scan 'tdp_user_table'
# Disable and drop the table
disable 'tdp_user_table'
drop 'tdp_user_table'
Knox
Deploys Knox Gateway on the `[knox]` Ansible group:
ansible-playbook ansible_collections/tosit/tdp/playbooks/meta/knox.yml
You can then access the WebUIs of the TDP services through Knox:
- HDFS NN
- YARN RM
- MapReduce Job History Server
- HBase Master
- Spark History Server
- Spark3 History Server
- Ranger Admin
Note: You can log in to Knox using the `tdp_user` that is created in the next step.
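A minimal sketch of reaching WebHDFS through Knox with cURL; the gateway port and the `tdpldap` topology name are assumptions, check your Knox configuration for the actual values:
# List the HDFS root directory through the Knox gateway (prompts for the tdp_user password)
curl -k -u tdp_user 'https://edge-01.tdp:8443/gateway/tdpldap/webhdfs/v1/?op=LISTSTATUS'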
Extras Services Deployment
Livy
Deploys Livy Server on the `[livy_server]` group hosts:
ansible-playbook ansible_collections/tosit/tdp_extra/playbooks/meta/livy.yml
The Livy Server can be accessed at https://edge-01.tdp:8998. After deployment, one can create a Spark session and interact with it through cURL:
# From edge-01.tdp
sudo su - tdp_user
kinit -ki
# Create a session
curl -k -u : --negotiate -X POST https://edge-01.tdp:8998/sessions \
-d '{"kind": "pyspark"}' -H 'Content-Type: application/json'
# Get the session status (wait until it is "idle")
curl -k -u : --negotiate -X GET https://edge-01.tdp:8998/sessions
# Submit a snippet of code to the session
curl -k -u : --negotiate -X POST https://edge-01.tdp:8998/sessions/0/statements \
-d '{"code": "1 + 1"}' -H 'Content-Type: application/json'
# Get the statement result
curl -k -u : --negotiate -X GET https://edge-01.tdp:8998/sessions/0/statements/0
Livy for Spark 3
Another Livy server is deployed for Spark 3 on the `[livy-spark3_server]` group hosts:
ansible-playbook ansible_collections/tosit/tdp_extra/playbooks/meta/livy-spark3.yml
The default port is different from the regular Livy server's: `8999` instead of `8998`.
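The session commands from the previous section work the same way against this second server, for example:
# From edge-01.tdp, with a valid Kerberos ticket
curl -k -u : --negotiate -X GET https://edge-01.tdp:8999/sessions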
Zookeeper Kafka
Deploys Apache ZooKeeper to the `[zk_kafka]` Ansible group and starts a 3-node ZooKeeper quorum dedicated to Kafka.
ansible-playbook ansible_collections/tosit/tdp_extra/playbooks/meta/zookeeper-kafka.yml
Kafka
Deploys a Kafka cluster on the `[kafka_broker]` group hosts:
ansible-playbook ansible_collections/tosit/tdp_extra/playbooks/meta/kafka.yml
The Kafka CLIs are available on the edge node for all users, and the client properties files are in `/etc/kafka/conf/*.properties`. After deployment, one can interact with Kafka from `edge-01.tdp`:
# From edge-01.tdp
sudo su - tdp_user
kinit -ki
# Create a topic (depending on your Kafka version and client configuration,
# you may also need to pass --bootstrap-server <broker-host>:9092)
kafka-topics.sh --create --topic test-topic \
--command-config /etc/kafka/conf/client.properties
# Write messages to the topic
kafka-console-producer.sh --topic test-topic \
--producer.config /etc/kafka/conf/producer.properties
>Hello there
>I am writing messages to a Kafka topic
>How cool is that?
>^C # CTRL+C to leave the console producer
# Read all messages from the topic
kafka-console-consumer.sh --topic test-topic --from-beginning \
--consumer.config /etc/kafka/conf/consumer.properties
Utils
Configure HDFS user home directories
Create, update, or remove HDFS user home directories.
ansible-playbook ansible_collections/tosit/tdp/playbooks/utils/hdfs_user_homes.yml
Additional users can be added to the Ansible variable `hdfs_user_homes` if required.
When adding users after the Ranger Usersync deployment, you will need to add or update the Ranger policies that include these new users. You must wait for Ranger Usersync to poll the users from LDAP, or you can restart Ranger Usersync using the following playbook:
ansible-playbook ansible_collections/tosit/tdp/playbooks/ranger_usersync_restart.yml
Configure Ranger policies
Create, update, or remove Ranger policies.
ansible-playbook ansible_collections/tosit/tdp/playbooks/utils/ranger_policies.yml
Additional policies can be added to the Ansible variable `ranger_policies` if required.