XavientInformationSystems / DataDumpUtility


Data Dump Utility

This Spark-based, command-line-driven utility can be used to fetch and store data across various source and destination file systems, including S3, Google Cloud Storage (gs), HDFS, and the local file system. When the destination is S3, it can also create a table over the stored data in Amazon Athena.

Available Command Line Options

In addition to the regular Spark command-line options, this utility provides switches that supply the information needed to retrieve and store data on the specified file system. These are:

Generic Options

  • s : Source location
  • d : Destination location. Defaults to the source location (s) with the format (f) appended
  • f : Destination data format. Defaults to ORC
  • e : External schema location. If not provided, the schema is inferred from the source file headers (see the sketch after this list)
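For orientation, the generic switches map onto a plain Spark read/write flow roughly like the sketch below. This is a minimal illustration, not the utility's actual source: the object name is invented, and the default destination being the source path with the format appended is our reading of the option description.

import org.apache.spark.sql.SparkSession

object GenericFlowSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("DataDumpUtilitySketch").getOrCreate()

    val source      = "test.csv"              // -s
    val format      = "orc"                   // -f, defaults to ORC
    val destination = source + "." + format   // -d, assumed default: source plus format

    // With no -e, infer the schema from the source file headers.
    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv(source)

    df.write.format(format).save(destination)
    spark.stop()
  }
}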

S3 Related Options

  • s3ak : AWS access key
  • s3sk : AWS secret key
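These keys presumably end up in Spark's Hadoop configuration before the write. A minimal sketch, assuming the s3a connector's property names (the utility itself may target the s3 or s3n scheme instead, and the method name here is ours):

import org.apache.spark.sql.{DataFrame, SparkSession}

object S3Sketch {
  // Wire -s3ak / -s3sk into the Hadoop configuration used by the S3 connector.
  def writeToS3(spark: SparkSession, df: DataFrame,
                accessKey: String, secretKey: String, bucketPath: String): Unit = {
    val conf = spark.sparkContext.hadoopConfiguration
    conf.set("fs.s3a.access.key", accessKey)  // -s3ak
    conf.set("fs.s3a.secret.key", secretKey)  // -s3sk
    df.write.format("orc").save(bucketPath)   // e.g. s3a://bucket/clientdata
  }
}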

Google Cloud Related Options

  • gsi : Google project ID
  • gss : Service account for GCS
  • gsp : Path to the P12 key file
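These values would typically be handed to the GCS connector through the Hadoop configuration as well. A minimal sketch, assuming the connector's P12 service-account property names (these vary across connector versions, and the method name is ours):

import org.apache.spark.sql.{DataFrame, SparkSession}

object GcsSketch {
  // Wire -gsi / -gss / -gsp into the Hadoop configuration used by the GCS connector.
  def writeToGcs(spark: SparkSession, df: DataFrame, projectId: String,
                 serviceAccount: String, p12Path: String, dest: String): Unit = {
    val conf = spark.sparkContext.hadoopConfiguration
    conf.set("fs.gs.project.id", projectId)                      // -gsi
    conf.set("fs.gs.auth.service.account.enable", "true")
    conf.set("fs.gs.auth.service.account.email", serviceAccount) // -gss
    conf.set("fs.gs.auth.service.account.keyfile", p12Path)      // -gsp
    df.write.format("parquet").save(dest)                        // e.g. gs://bucket/path
  }
}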

Athena Related Options

  • adb : Athena database
  • at : Athena table name
  • as : Athena staging directory
  • act : Create table - true or false. Defaults to false
  • acs : Athena connection string
  • p : Create partitioned data - true or false
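Table creation would go through the Athena JDBC driver referenced in the examples below. A minimal sketch, assuming the AthenaJDBC41-1.0.0 driver class and its documented connection properties; the method name and column list are purely illustrative:

import java.sql.DriverManager
import java.util.Properties

object AthenaSketch {
  // Use -acs / -adb / -at / -as to register an external table over the S3 output.
  def createTable(connString: String, accessKey: String, secretKey: String,
                  stagingDir: String, db: String, table: String, location: String): Unit = {
    Class.forName("com.amazonaws.athena.jdbc.AthenaDriver") // AthenaJDBC41 driver class
    val props = new Properties()
    props.setProperty("user", accessKey)            // AWS access key
    props.setProperty("password", secretKey)        // AWS secret key
    props.setProperty("s3_staging_dir", stagingDir) // -as
    val conn = DriverManager.getConnection(connString, props) // -acs
    try {
      conn.createStatement().execute(
        s"""CREATE EXTERNAL TABLE IF NOT EXISTS $db.$table (id string, name string)
           |STORED AS ORC
           |LOCATION '$location'""".stripMargin)
    } finally conn.close()
  }
}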

How to

Build the application

Unzip the project and perform a Maven build in its root directory:

mvn clean package

Use with the generic options

 spark-submit --class com.xavient.datadump.StoreData target/DataDumpUtility-0.0.1-SNAPSHOT-jar-with-dependencies.jar -s test.csv -f parquet -d destinationDirectory -e hdfs://<<pathToExternalSchema>>

Or it can be used without the destination, format, or external schema options:

 spark-submit --class com.xavient.datadump.StoreData target/DataDumpUtility-0.0.1-SNAPSHOT-jar-with-dependencies.jar -s test.csv

For the S3 file system

spark-submit --jars=AthenaJDBC41-1.0.0.jar --master yarn --class com.xavient.datadump.StoreData DataDumpUtility-0.0.1-SNAPSHOT-jar-with-dependencies.jar -s clientdata -d s3://<<bucketPath>> -s3ak <<AccessKey>> -s3sk <<SecretKey>>

With S3 as the destination, the utility can also create an Athena table by passing the Athena-related options. The Athena JDBC driver jar (AthenaJDBC41-1.0.0.jar) can be downloaded from AWS.

spark-submit --jars=AthenaJDBC41-1.0.0.jar --master yarn --class com.xavient.datadump.StoreData DataDumpUtility-0.0.1-SNAPSHOT-jar-with-dependencies.jar -s clientdata -d s3://<<bucketPath>> -s3ak <<AccessKey>> -s3sk <<SecretKey>> -act true -at <<table_name>> -adb <<Existing_dbname_name>> -acs jdbc:awsathena://<<Athena URL>>:443/ -as s3://<<temp_bucketPath>>

A partitioned table can also be created by setting the "p" switch to true:

spark-submit --jars=AthenaJDBC41-1.0.0.jar --master yarn --class com.xavient.datadump.StoreData DataDumpUtility-0.0.1-SNAPSHOT-jar-with-dependencies.jar -s clientdata -d s3://<<bucketPath>> -s3ak <<AccessKey>> -s3sk <<SecretKey>> -act true -at finalTest -adb sampledb -acs jdbc:awsathena://<<Athena URL>>:443/ -as s3://<<temp_bucketPath>> -p true

If the created table is partitioned, execute the following command in the Athena console before querying the data:

MSCK REPAIR TABLE  <<dbname>>.<<tablename>>
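MSCK REPAIR TABLE is needed because Spark writes each partition as its own directory under the destination path, and Athena does not see those directories as partitions until they are registered in its catalog. A minimal sketch of what the partitioned write plausibly looks like on the Spark side; the method and partition column names are hypothetical:

import org.apache.spark.sql.{DataFrame, SaveMode}

object PartitionSketch {
  // With -p true, the output is laid out as one directory per partition
  // value (e.g. .../load_date=2017-01-01/). MSCK REPAIR TABLE then
  // registers those directories as partitions in Athena.
  def writePartitioned(df: DataFrame, dest: String): Unit =
    df.write
      .mode(SaveMode.Overwrite)
      .partitionBy("load_date") // hypothetical partition column
      .format("orc")
      .save(dest)
}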

For the Google Cloud Storage file system

spark-submit --master yarn --class com.xavient.datadump.StoreData target/DataDumpUtility-0.0.1-SNAPSHOT-jar-with-dependencies.jar  -s clientdata   -d gs://<<destination>>  -gsi <<google project id >>  -gss <<google service account>> -gsp <<path to .p12 file >> -f parquet
