
hive-dev-box

why?

To provide an easily accessible environment for running and developing Hive.

Testability aspect

There are sometimes bug reports against earlier releases, but verifying them can be problematic - running and switching between versions is cumbersome. I was using a Vagrant-based box which was useful for doing this...

patch development processes

I've been working on Hive (and sometimes on other projects) over the last couple of years - and since QA runs may only come back after 8-12 hours, I work on multiple patches simultaneously. However, working on several patches at once has its own problems:

Here are the approaches I was using earlier:

  • basic approach: use a single workspace - and switch the branch...
    • unquestionably this is the simplest
    • after switching the branch - a full rebuild is necessary
  • 1 for each: use multiple copies of hive - with isolated maven caches
    • pro:
      • capability to run maven commands simultaneously on multiple patches
    • con:
      • one of the patches has to be "active" for the IDE to be able to use it
      • it falls short when a single patch spans multiple projects at the same time (hive+tez+hadoop)
      • after some time it eats up space...
  • dockerized/virtualized development environment
    • pro:
      • everything is isolated
      • because I'm no longer bound to my native environment, I may change a lot of things without interfering with anything else
      • easier to "clean up" after submitting the patch (just delete the container)
      • ability to have IDEs running for multiple patches at the same time
    • con:
      • isolated environment; configuration changes might get lost
      • may waste disk space...

What's the goal of this?

The aim of this project is to provide an easier way to test-drive Hive releases:

  • running releases:
    • upstream apache releases
    • HDP/CDP/CDH releases
    • in-development builds
  • provide an environment for developing Hive patches

Getting started - with running off-the-shelf releases

# build and launch the hive-dev-box container
./run.bash 
# after building the container you will get a prompt inside it
# initialize the metastore with
reinit_metastore
# everything should be ready to launch hive
hive_launch
# exit with CTRL+A CTRL+\ to kill all processes
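
For example, to test-drive a specific upstream release instead of the default one, the versions can be selected before launching (the sw command is described further below; the version numbers here are only illustrative):

# pick the component versions to run (illustrative versions)
sw hive 3.1.1
sw hadoop 3.1.0
# wipe and re-populate the metastore for the selected version
reinit_metastore
# start hive
hive_launch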

Getting started - with patch development

make X11 forwarding work (once)
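
The container runs graphical tools (eclipse via dev_eclipse), so the host X server has to accept connections from it. The exact wiring depends on how run.bash passes the display through; a minimal sketch for a Linux host, assuming the X11 socket is shared with the container, would be:

# allow local clients (including the container) to connect to the host X server
xhost +local:
# DISPLAY must be set in the shell from which run.bash is started
echo $DISPLAY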

artifactory cache (once)

Every container will be downloading almost the same artifacts, so employing an artifact cache "makes sense" in this case :D

# start artifactory instance
./start_artifactory.bash

You will have to manually configure this instance (once).

It will be available at http://127.0.0.1:8081/ - use admin/password to log in.

  • make sure anonymous access is enabled
    • left menu bar: Admin menu: Security / Security configuration > "allow anonymous access" is enabled
  • add some remote repositories
    • left menu bar: Admin menu: Repositories / Remote
      • add maven central / etc
      • or some caching mirror repository if you know one
  • add the wonder virtual repository
    • left menu bar: Admin menu: Repositories / Virtual
      • make sure to use the name "wonder" for it
      • add the remote repos to it

This instance will be linked to the running development environment automatically
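
Once the "wonder" virtual repository exists, a quick sanity check from the host is to resolve a well-known artifact through it; the example below assumes Artifactory's default /artifactory context path:

# the virtual repository should serve artifacts proxied from the remote repos
curl -sI http://127.0.0.1:8081/artifactory/wonder/org/apache/hive/hive-exec/3.1.1/hive-exec-3.1.1.pom | head -n 1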

set properties (once) (optional)

Add an export to your .bashrc or similar, like:

export HIVE_DEV_BOX_HOST_DIR=$HOME/hive-dev-box

The dev environment will assume that you are working on upstream patches, and will always open a new branch forked from master. If you skip this, things may not work and you will be left to do these steps yourself; in case you are using the HIVE_SOURCES env variable you might not need to set it anyway.

# make sure to load the new env variables for bash
. .bashrc
# and also create the host dir beforehand
mkdir $HIVE_DEV_BOX_HOST_DIR
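
If you would rather keep the Hive checkout on the host (see the filesystem layout section: /home/dev/hive gets mapped to the directory given by HIVE_SOURCES), a sketch of that setup looks like the following - the path and the export-based way of passing the variable are assumptions:

# keep the sources on the host; /home/dev/hive inside the container maps here (illustrative path)
export HIVE_SOURCES=$HOME/src/hive
./run.bash HIVE-12121-asd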

launch - with sources stored inside container

# invoking with an argument names the container; it will also be the preferred name for the workspace and the development branch
./run.bash HIVE-12121-asd
# when the terminal comes up
# issuing the following command will clone the sources based on your srcs dsl
srcs hive
# enter the hive dir and create a local branch based on your requirements
cd hive
git branch `hostname` apache/master
# if you need...patch the sources:
cdpd-patcher hive
#  run a full rebuild
rebuild
# you may run eclipse
dev_eclipse

A shorter version exists for initializing upstream patch development:

./run.bash HIVE-12121-asd
# this will clone the sources, create a branch named after the container's hostname, run a rebuild and open eclipse
hive_patch_development

filesystem layout

Beyond the "obvious" /bin and /lib folders, there are some which might make it clearer how this works:

  • /work
    • used to store downloaded and expanded artifacts
    • if you switch to, say, apache hive 3.1.1 and then to some other version, you shouldn't need to wait for the download and expansion of it again...
    • this is mounted as a docker volume and shared between the containers
    • files under /work are not changed
  • /active
    • the /work folder may contain a number of versions of the same component
    • symbolic links point to the versions actually in use
    • at any point, ls -l /active gives a brief overview of the active components (see the example after this list)
  • /home/dev
    • this is the development home
  • /home/dev/hive
    • the Hive sources; in case HIVE_SOURCES is set at launch time, this folder will be mapped to that directory on the host
  • /home/dev/host
    • this is a directory shared with the host; it can be used to exchange files (something.patch)
    • will also contain the workspace "template"
    • the bin directory under this folder will be linked as /home/dev/bin so that scripts can be shared between the containers and the host
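
A hypothetical snapshot of /active could look like the following; the exact entries and /work paths are only meant to illustrate the symlink scheme:

# which versions are currently active?
ls -l /active
# hadoop -> /work/hadoop/hadoop-3.1.0
# hive   -> /work/hive/apache-hive-3.1.1-bin
# tez    -> /work/tez/apache-tez-0.8.4-bin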

hdb - easier access to running multiple envs

  • run NAME
    • starts a new container with NAME - without attaching to it
  • enter NAME
    • enters the container
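
A typical session with two independent environments might look like this (the container names are just examples):

# start two environments in the background
hdb run HIVE-12121-asd
hdb run HIVE-24242-xyz
# attach to one of them
hdb enter HIVE-12121-asd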

installation:

# create a symlink to hive-dev-box/hdb from a location on your PATH, e.g. $HOME/bin
ln -s $PWD/hdb $HOME/bin/hdb
# enable bash_completion for hdb
# add the following line to .bashrc
. <($HOME/bin/hdb bash_completion)

sw - switch between versions of things

# use hadoop 3.1.0
sw hadoop 3.1.0
# use hive 2.3.5
sw hive 2.3.5
# use tez 0.8.4
sw tez 0.8.4

reinit_metastore [type]

  • optionally switch to a different metastore implementation
  • wipe it clean
  • populate schema and load sysdb
reinit_metastore [derby|postgres|mysql]
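
For example, to switch the dev environment to a PostgreSQL-backed metastore and start Hive against it:

# switch to postgres, wipe it and reload the schema + sysdb
reinit_metastore postgres
# start hive against the freshly initialized metastore
hive_launch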
