
hive-dev-box

why?

To provide an easily accessible environment for running and developing Hive.

Testability aspect

There are sometimes bug reports against earlier releases, but verifying them can be problematic - running and switching between versions is cumbersome. I was using a Vagrant-based box which was useful for doing this...

patch development processes

I've been working on Hive (and sometimes on other projects) over the last couple of years - and since QA runs may only come back after 8-12 hours, I work on multiple patches simultaneously. However, working on several patches at once has its own problems:

Here are the approaches I was using earlier:

  • basic approach: use a single workspace - and switch the branch...
    • unquestionably this is the simplest
    • after switching the branch - a full rebuild is necessary
  • 1 for each: use multiple copies of hive - with isolated maven caches
    • pro:
      • capability to run maven commands simultaneously on multiple patches
    • con:
      • one of the patches has to be "active" for the IDE to be able to use it
      • it falls short when a single patch spans multiple projects at the same time (hive+tez+hadoop)
      • after some time it eats up space...
  • dockerized/virtualized development environment
    • pro:
      • everything is isolated
      • because I'm no longer bound to my native environment, I may change a lot of things without interfering with anything else
      • easier to "clean up" after submitting the patch (just delete the container)
      • ability to have IDEs running for multiple patches at the same time
    • con:
      • isolated environment; configuration changes might get lost
      • may waste disk space...

What's the goal of this?

The aim of this project is to provide an easier way to test-drive Hive releases:

  • running releases:
    • upstream apache releases
    • HDP/CDP/CDH releases
    • in-development builds
  • provide an environment for developing Hive patches

Getting started - with running off-the-shelf releases

# build and launch the hive-dev-box container
./run.bash 
# after building the container you will get a prompt inside it
# initialize the metastore with
reinit_metastore
# everything should be ready to launch hive
hive_launch
# exit with CTRL+A CTRL+\ to kill all processes
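
For example, to test-drive a specific upstream release instead of the default one, the versions can be selected before launching (the sw command is described further below; the version numbers here are only illustrative):

# pick the component versions to run (illustrative versions)
sw hive 3.1.1
sw hadoop 3.1.0
# wipe and re-populate the metastore for the selected version
reinit_metastore
# start hive
hive_launch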

Getting started - with patch development

make X11 forwarding work (once)
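
The container runs graphical tools (eclipse via dev_eclipse), so the host X server has to accept connections from it. The exact wiring depends on how run.bash passes the display through; a minimal sketch for a Linux host, assuming the X11 socket is shared with the container, would be:

# allow local clients (including the container) to connect to the host X server
xhost +local:
# DISPLAY must be set in the shell from which run.bash is started
echo $DISPLAY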

artifactory cache (once)

Every container will be downloading almost the same artifacts, so employing an artifact cache "makes sense" in this case :D

# start artifactory instance
./start_artifactory.bash

You will have to manually configure this instance (once).

It will be available at http://127.0.0.1:8081/ - use admin/password to log in.

  • make sure anonymous access is enabled
    • left menu bar: Admin menu: Security / Security configuration > "allow anonymous access" is enabled
  • add some remote repositories
    • left menu bar: Admin menu: Repositories / Remote
      • add maven central / etc
      • or some caching mirror repository if you know one
  • add the wonder virtual repository
    • left menu bar: Admin menu: Repositories / Virtual
      • make sure to use the name "wonder" for it
      • add the remote repos to it

This instance will be linked to the running development environment automatically
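
Once the "wonder" virtual repository exists, a quick sanity check from the host is to resolve a well-known artifact through it; the example below assumes Artifactory's default /artifactory context path:

# the virtual repository should serve artifacts proxied from the remote repos
curl -sI http://127.0.0.1:8081/artifactory/wonder/org/apache/hive/hive-exec/3.1.1/hive-exec-3.1.1.pom | head -n 1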

set properties (once) (optional)

Add an export to your .bashrc or similar, like:

export HIVE_DEV_BOX_HOST_DIR=$HOME/hive-dev-box

The dev environment will assume that you are working on upstream patches, and will always open a new branch forked from master. If you skip this, things may not work and you will be left to do these steps yourself; in case you are using the HIVE_SOURCES env variable you might not need to set it anyway.

# make sure to load the new env variables for bash
. .bashrc
# and also create the host dir beforehand
mkdir $HIVE_DEV_BOX_HOST_DIR
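
If you would rather keep the Hive checkout on the host (see the filesystem layout section: /home/dev/hive gets mapped to the directory given by HIVE_SOURCES), a sketch of that setup looks like the following - the path and the export-based way of passing the variable are assumptions:

# keep the sources on the host; /home/dev/hive inside the container maps here (illustrative path)
export HIVE_SOURCES=$HOME/src/hive
./run.bash HIVE-12121-asd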

launch - with sources stored inside container

# invoking with an argument names the container; it will also be the preferred name for the workspace and the development branch
./run.bash HIVE-12121-asd
# when the terminal comes up
# issuing the following command will clone the sources based on your srcs dsl
srcs hive
# enter the hive dir and create a local branch based on your requirements
cd hive
git branch `hostname` apache/master
# if you need...patch the sources:
cdpd-patcher hive
#  run a full rebuild
rebuild
# you may run eclipse
dev_eclipse

A shorter version exists for initializing upstream patch development:

./run.bash HIVE-12121-asd
# this will clone the sources, create a branch named after the container's hostname, run a rebuild and open eclipse
hive_patch_development

filesystem layout

Beyond the "obvious" /bin and /lib folders, there are some which might make it clearer how this works:

  • /work
    • used to store downloaded and expanded artifacts
    • if you switch to, say, apache hive 3.1.1 and then to some other version, you shouldn't need to wait for the download and expansion of it again...
    • this is mounted as a docker volume and shared between the containers
    • files under /work are not changed
  • /active
    • the /work folder may contain a number of versions of the same component
    • symbolic links point to the versions actually in use
    • at any point, ls -l /active gives a brief overview of the active components (see the example after this list)
  • /home/dev
    • this is the development home
  • /home/dev/hive
    • the Hive sources; in case HIVE_SOURCES is set at launch time, this folder will be mapped to that directory on the host
  • /home/dev/host
    • this is a directory shared with the host; it can be used to exchange files (something.patch)
    • will also contain the workspace "template"
    • the bin directory under this folder will be linked as /home/dev/bin so that scripts can be shared between the containers and the host
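
A hypothetical snapshot of /active could look like the following; the exact entries and /work paths are only meant to illustrate the symlink scheme:

# which versions are currently active?
ls -l /active
# hadoop -> /work/hadoop/hadoop-3.1.0
# hive   -> /work/hive/apache-hive-3.1.1-bin
# tez    -> /work/tez/apache-tez-0.8.4-bin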

hdb - easier access to running multiple envs

  • run NAME
    • starts a new container with NAME - without attaching to it
  • enter NAME
    • enters the container
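
A typical session with two independent environments might look like this (the container names are just examples):

# start two environments in the background
hdb run HIVE-12121-asd
hdb run HIVE-24242-xyz
# attach to one of them
hdb enter HIVE-12121-asd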

installation:

# create a symlink to hive-dev-box/hdb from a location on your PATH, e.g. $HOME/bin
ln -s $PWD/hdb $HOME/bin/hdb
# enable bash_completion for hdb
# add the following line to .bashrc
. <($HOME/bin/hdb bash_completion)

sw - switch between versions of things

# use hadoop 3.1.0
sw hadoop 3.1.0
# use hive 2.3.5
sw hive 2.3.5
# use tez 0.8.4
sw tez 0.8.4

reinit_metastore [type]

  • optionally switch to a different metastore implementation
  • wipe it clean
  • populate schema and load sysdb
reinit_metastore [derby|postgres|mysql]
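
For example, to switch the dev environment to a PostgreSQL-backed metastore and start Hive against it:

# switch to postgres, wipe it and reload the schema + sysdb
reinit_metastore postgres
# start hive against the freshly initialized metastore
hive_launch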
