dojowahi / hive_log_scraper

Scrape Hive logs to extract tables queried by Hive users

Hive Log Scraper

get_hive_tbl_list.sh

When a Hive warehouse holds a large number of tables, it helps to know which tables users query most often, so that more time can be spent on those tables than on the rest. The code in this repository is built around a shell script that copies the Hive logs to a temporary location; a Python script then parses the copied logs to produce a list of all tables queried.
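As an illustration of the "most used tables" idea, the ranking step could be sketched as below. This is not the repository's code: the function name `rank_tables` and the input shape (one list of table names per query) are assumptions made for the example.

```python
from collections import Counter


def rank_tables(tables_per_query):
    """Rank tables by the number of queries that reference them.

    tables_per_query: iterable of lists, one list of table names per query
    (hypothetical input shape, chosen for this sketch).
    """
    counts = Counter()
    for tables in tables_per_query:
        # Count a table once per query, even if the query names it twice.
        counts.update(set(tables))
    # most_common() returns (table, count) pairs, highest count first.
    return counts.most_common()


if __name__ == "__main__":
    sample = [["orders", "customers"], ["orders"], ["orders", "orders"]]
    for table, hits in rank_tables(sample):
        print(table, hits)
```

Counting each table once per query keeps a single self-join from inflating a table's rank; counting every occurrence instead would be an equally defensible choice.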

get_hive_query.sh

To review all the queries executed in Hive along with their execution time, use this script: it captures every successful query together with the time it took.
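The pairing of a query with its elapsed time can be sketched in Python as follows. This is only an illustration, not a port of get_hive_query.sh. It assumes the log contains lines of the form `Executing command(queryId=...): <sql>` and `Completed executing command(queryId=...); Time taken: <n> seconds`; verify those shapes against your own Hive logs. The function name `completed_queries` is hypothetical.

```python
import re

# Assumed log line shapes; adjust the patterns if your Hive logs differ.
START_RE = re.compile(r"Executing command\(queryId=([^)]+)\): (.*)")
DONE_RE = re.compile(
    r"Completed executing command\(queryId=([^)]+)\); Time taken: ([\d.]+) seconds"
)


def completed_queries(lines):
    """Return (query_id, sql, seconds) for each query with a completion line."""
    started = {}   # query_id -> first line of the SQL text
    results = []
    for line in lines:
        match = START_RE.search(line)
        if match:
            started[match.group(1)] = match.group(2).strip()
            continue
        match = DONE_RE.search(line)
        if match and match.group(1) in started:
            query_id = match.group(1)
            results.append((query_id, started.pop(query_id), float(match.group(2))))
    return results
```

The sketch keeps only the first line of a multi-line statement, and it treats a completion line as the end of the query without checking whether the query failed; filtering out failed queries would need an additional check on the surrounding log lines.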

sql_parse_test.py

The Python script parses any SQL passed to it and returns the table names it references.
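A minimal sketch of table-name extraction is shown below. It is not the implementation in sql_parse_test.py; it is a simplified regex-based stand-in that assumes table names appear directly after `FROM` or `JOIN`. The function name `extract_tables` is hypothetical.

```python
import re

# Capture the identifier after FROM or JOIN, allowing db.table and backticks.
_TABLE_RE = re.compile(r"\b(?:from|join)\s+([`\w.]+)", re.IGNORECASE)


def extract_tables(sql):
    """Return the sorted, de-duplicated, lower-cased table names found in sql."""
    names = {name.replace("`", "").lower() for name in _TABLE_RE.findall(sql)}
    return sorted(names)


if __name__ == "__main__":
    query = "SELECT a.id FROM sales.orders a JOIN customers c ON a.cid = c.id"
    print(extract_tables(query))
```

A regex like this will also pick up CTE names and any `from` that appears inside a string literal or comment, so a real SQL parser is the safer choice for production use.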

Prerequisites

Create directories for the scripts, for the logs, and for a temporary location to which the Hive logs will be copied. The copied logs are deleted at the end of the process. Once the directories exist, update the parameters in global_var.sh accordingly and you should be all set.

  • All tests were done on Hive 2.3.3
  • GNU parallel needs to be installed on the machine
  • property.hive.log.level must be set to INFO in hive-log4j2.properties

About

License: GNU General Public License v3.0


Languages

Shell 80.0%, Python 20.0%