tlepple / mac-cdh-discovery-bundle-builder

Discovery Bundle Builder

Overview

This Discovery Tool is a lightweight automation package that can run against a CDH or CDP cluster to produce a "Discovery Bundle" that is useful for CDP migration planning. It uses CM/cluster APIs to gather the following data:

  • Cluster specs & layout
  • Cluster configurations
  • Data & table details
  • Resource usage across components (via metrics)
  • Workload logs (in a format ready for WXM direct upload)
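The bundle builder drives these collections for you, but for reference the same kind of cluster information can be explored manually through the Cloudera Manager REST API. A minimal sketch (the API version v19 and port 7180 are assumptions and depend on your CM release):

# Sketch only: list clusters via the CM REST API with the same credentials the tool uses
curl -s -u "<cm_admin_username>:<cm_admin_password>" \
  "http://<cm-hostname>:7180/api/v19/clusters" | python -m json.tool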

Prerequisites

In order to execute the discovery bundle tool, the following prerequisites have to be met:

  • Deploy the project to one of the cluster nodes
  • If the node does not have internet access, first download the project to a temporary location with access and follow the steps detailed below
  • The Cloudera Manager server, Sentry database, and Hive Metastore database must be reachable from the node
  • Credentials for Cloudera Manager that are able to call CM API endpoints
  • Hadoop clients (hdfs CLI) need to be installed on the node
  • Kerberos libraries (kinit) need to be installed if Kerberos is enabled on the cluster
  • An HDFS supergroup member principal is needed for kinit
  • Python >= 3.6.8 and virtualenv need to be installed
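A quick way to sanity-check these prerequisites on the chosen node before starting (a sketch; adjust to your environment):

# Sketch: confirm the required tooling is present on the node
command -v hdfs   || echo "hdfs client missing"
command -v kinit  || echo "kinit missing (only required on Kerberized clusters)"
command -v git    || echo "git missing (needed to download the project)"
python3 --version                                    # expect >= 3.6.8
python3 -m pip show virtualenv >/dev/null 2>&1 || echo "virtualenv missing"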

Steps to run the project

Download the project

Internet access is needed to download the project dependencies.

  • If the node where you plan to execute the script has internet access, download the project directly.
  • If you plan to execute the script in an air-gapped environment, first download the project and its dependencies to a temporary node with internet access, then deploy them to the final destination.

Install Python 3.7 and virtualenv if they are not available on the cluster node (Python 3.6.8 is also tested):

sudo -i

yum install -y gcc openssl-devel bzip2-devel libffi-devel zlib-devel xz-devel && 
cd /usr/src && 
wget https://www.python.org/ftp/python/3.7.11/Python-3.7.11.tgz && 
tar xzf Python-3.7.11.tgz && 
cd Python-3.7.11 && 
./configure --enable-optimizations && 
make altinstall && 
rm -f /usr/src/Python-3.7.11.tgz && 
/usr/local/bin/python3.7 -V &&
cd &&
echo "Python installation successfully finished" &&
/usr/local/bin/python3.7 -m pip install --user virtualenv

On a Node with internet access


Download the project to /opt


cd /opt
yum -y install git
git clone <repo_url>
git clone https://github.com/tlepple/mac-cdh-discovery-bundle-builder.git

Download the project dependencies; wheelhouse.tar.gz is created as a result.


cd /opt/mac-cdh-discovery-bundle-builder
/usr/local/bin/python3.7 -m venv .venv
source .venv/bin/activate


./download_dependencies.sh
pip install -r requirements.txt
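The exact behaviour of download_dependencies.sh is defined by the repository; as an assumption, an offline wheelhouse of this kind is typically produced with pip download, roughly like the following sketch:

# Sketch only (not the repository script): how a wheelhouse.tar.gz is commonly built
pip download -r requirements.txt -d wheelhouse   # fetch wheels/sdists for offline installation
cp requirements.txt wheelhouse/
tar -czf wheelhouse.tar.gz wheelhouse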


On a Node without internet access

Log in to a node with internet access and download the project


cd /tmp
yum -y install git
git clone <repo_url>
git clone https://github.com/tlepple/mac-cdh-discovery-bundle-builder.git
cd /tmp/mac-cdh-discovery-bundle-builder/

Important: the same Python version should be used to resolve the dependencies on the temporary node as on the target node.

Download the project dependencies; wheelhouse.tar.gz is created as a result.

cd /tmp/mac-cdh-discovery-bundle-builder
/usr/local/bin/python3.7 -m venv .venv
source .venv/bin/activate


./download_dependencies.sh

Copy the project to the final destination

rsync  -Paz --exclude={'.git','.venv'}  /tmp/mac-cdh-discovery-bundle-builder <target-node>:/opt

On the target node

Go to the project directory:

cd /opt/mac-cdh-discovery-bundle-builder/

Create a new virtual environment inside the project directory:

/usr/local/bin/python3.7 -m venv .venv
source .venv/bin/activate
  • For environments without internet access, use the prepackaged dependencies:
 tar -zxf wheelhouse.tar.gz
 pip install -r wheelhouse/requirements.txt --no-index --find-links wheelhouse


Run the following steps regardless of whether the node has internet access


Configure credentials to call the Cloudera Manager API

  • Set the Cloudera Manager credentials in config.ini.
  • Provide the path to the JDBC driver for the HMS and Sentry databases. It is usually located under /usr/share/java/.
vi /opt/mac-cdh-discovery-bundle-builder/mac-discovery-bundle-builder/config/config.ini

#Edit the file
[credentials]
cm_user=<cm_admin_username>
cm_password=<cm_admin_password>
[database]
db_driver_path=<jdbc-connector-path>

Example:

vi /opt/mac-cdh-discovery-bundle-builder/mac-discovery-bundle-builder/config/config.ini

[credentials]
cm_user=admin
cm_password=Supersecret1
[database]
db_driver_path=/usr/share/java/postgresql-connector-java.jar
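Since config.ini holds plaintext credentials, it is worth restricting its permissions, and listing /usr/share/java/ first helps avoid a typo in db_driver_path (a sketch; jar names vary by database backend):

# Restrict access to the credentials file
chmod 600 /opt/mac-cdh-discovery-bundle-builder/mac-discovery-bundle-builder/config/config.ini
# Locate the JDBC driver jar for the HMS/Sentry backend (e.g. postgresql*.jar, mysql-connector-java.jar, ojdbc*.jar)
ls /usr/share/java/ | grep -Ei 'postgres|mysql|ojdbc'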

In a kerberized environment you should kinit with a principal that is a member of the HDFS supergroup:

kinit -kt <PATH_TO_HDFS_SUPERGROUP_KEYTAB> <HDFS_PRINCIPAL>
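If you do not already have a suitable keytab at hand, on CM-managed nodes the hdfs service keytab is typically found under the Cloudera Manager agent process directory. The paths, principal, and realm below are placeholders for your environment:

# Example only: locate a CM-managed hdfs keytab, obtain a ticket, and confirm supergroup membership
find /var/run/cloudera-scm-agent/process/ -name hdfs.keytab 2>/dev/null | head -1
kinit -kt /path/to/hdfs.keytab hdfs/$(hostname -f)@EXAMPLE.COM
klist          # verify the ticket was granted
hdfs groups    # verify the authenticated user belongs to the HDFS supergroup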

In a non-kerberized environment you should use a username that is a member of the HDFS supergroup:

export HADOOP_USER_NAME=hdfs
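Either way, a quick check that the chosen identity really has superuser-level HDFS access before starting the collection (sketch):

# Verify HDFS access for the chosen identity (works after kinit or with HADOOP_USER_NAME)
hdfs dfs -ls / >/dev/null && echo "HDFS access OK"
hdfs dfsadmin -report | head -5    # succeeds only with superuser/supergroup privileges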

Run the discovery bundle tool

Use the following command to execute the collection:

cd /opt/mac-cdh-discovery-bundle-builder
./discovery_bundle_builder.sh --cm-host http(s)://<cm-hostname>:<cm-port> --time-range=7 --output-dir /tmp/discovery_bundle

For example:

./discovery_bundle_builder.sh --cm-host http://54.172.241.3:7180 --time-range=7 --output-dir /tmp/discovery_bundle

To execute a selected module:

./discovery_bundle_builder.sh --cm-host http(s)://<cm-hostname>:<cm-port> --time-range=7 --module cm_api --output-dir /tmp/discovery_bundle

Available modules:

  • cm_metrics
    • Fetches service-specific metrics from CM
  • cm_api
    • Fetches information about the workload cluster
  • diagnostic_bundle
    • Downloads the diagnostic bundle and fetches hardware-related information from it
  • hdfs_report
    • Downloads the FS image and creates a report of the directories
  • hive_metastore
    • Communicates with the Hive Metastore and fetches the table information
  • sentry_extractor
    • Collects information about the Sentry policies
  • mapreduce_extractor
    • Fetches the MapReduce job history logs from HDFS. If the cluster is secured, you must kinit with an HDFS supergroup member (or a user with permission to access the MapReduce job history logs in HDFS) before executing the script. Hive queries are collected as part of the MapReduce jobs. Results are collected in a WXM-compatible format.
  • spark_extractor
    • Fetches the Spark job history logs from HDFS. If the cluster is secured, you must kinit with an HDFS supergroup member (or a user with permission to access the Spark event log directory in HDFS) before executing the script. Results are collected in a WXM-compatible format.
  • impala_profiles
    • The diagnostic_bundle module must be executed first. This module collects the Impala profiles from the diagnostic bundles in a WXM-compatible format.
  • all
    • The default module; executes all of the modules above.
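If you want to re-run only some of the collections (for example after a failure), the documented --module flag can simply be invoked once per module. A sketch that keeps diagnostic_bundle ahead of impala_profiles:

# Sketch: run a subset of modules one at a time with the documented CLI flags
cd /opt/mac-cdh-discovery-bundle-builder
for m in cm_api cm_metrics diagnostic_bundle impala_profiles; do
  ./discovery_bundle_builder.sh --cm-host http://<cm-hostname>:<cm-port> --time-range=7 --module "$m" --output-dir /tmp/discovery_bundle
done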

Configurable parameters:

Options:
  -h, --help            show this help message and exit
  --module=<module>     Select a module to be executed. Defaults to all
  --cm-host=<cm_host>   Cloudera Manager host.
  --output-dir=<output_dir>
                        Output of the discovery bundle. Defaults to
                        /tmp/discovery_bundle
  --time-range=<time_range_in_days>
                        Time range in days to collect metrics. Defaults to 45
                        days.
  --disable-redaction   Option to disable redaction. If option not set, it
                        defaults to redacting sensitive values.
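For example, the options above can be combined into a single run that collects 30 days of metrics into a custom directory with redaction disabled (values are illustrative):

# Illustrative run combining the documented options
./discovery_bundle_builder.sh \
  --cm-host http://<cm-hostname>:<cm-port> \
  --time-range=30 \
  --output-dir /data/discovery_bundle \
  --disable-redaction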

About redaction

Redaction is a process that obscures data. It helps organizations comply with government and industry regulations, such as PCI (Payment Card Industry) and HIPAA, by making personally identifiable information (PII) unreadable except to those whose jobs require such access. With regard to the Cloudera Manager API, the exported configuration may contain passwords and other sensitive information. Cloudera clusters implement some redaction features by default, while other features are configurable and require administrators to specifically enable them. For more information, see How to Enable Sensitive Data Redaction.

By default, the Discovery Bundle toolkit enables redaction, even if your cluster configuration has not been set up in this way. This guarantees that API calls to Cloudera Manager for configuration data do not include sensitive information.

If you prefer to switch off redaction, you can set the --disable-redaction parameter.

About

License: Apache License 2.0


Languages

Python 99.5%, Shell 0.5%