sumanthn / adhoc-query

Adhoc query on structured data with Hive Presto

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Adhoc query on structured data

Fire SQL queries on accesslog data; check out factdatagenerator for more on generating access log storage - ORC compressed with snappy Query Engine - Presto(https://prestodb.io/) Unlike the other data processing & query processing engines, this is a different challenge since the data cannot be modelled as required for the queries , since queries are unknown.


Data-storage

create  table factdata(
accessUrl string,
responseStatusCode int,
responseTime int,
accessTimestamp string,
requestVerb string,
requestSize int,
dataExchangeSize int,
serverIp string,
clientIp string,
clientId string,
sessionId string,
userAgentDevice string,
UserAgentType string,
userAgentFamily string,
userAgentOSFamily string,
userAgentVersion string,
userAgentOSVersion string,
city string,
country string,
region string,
minOfDay int,
hourOfDay int,
dayOfWeek int,
monthOfYear int


) 
PARTITIONED BY (day string)
stored as orc tblproperties ("orc.compress"="SNAPPY", "orc.stripe.size"="67108864","orc.row.index.stride"="50000") ;

Generate the raw data using factgenerator and place in a temporary storage say /stage

Load data into Hive

load data inpath '/stage/fdata_20150401_20150402_5000.orc' into table factdata  PARTITION (day='20150401');

....

Querying Start hive meta server Setup presto with hive catalog Query from presto-cli or use presto-jdbc Presto supports ANSI -SQL

Results On a 4 node server [4 cores, 16G & 4T each]backed by hdfs & query engine as presto 5000 tps files for 1 month data ~12 Billion records

Results
Query Type Projections Group By Order By Filter Timespan Response time minutes
1 1 1 0 3 days 1
4 2 2 2 1 week 3
5 5 5 0 1 week 5.4
3 0 3 3 2 weeks 2

Learnings

  • Read as less as you can - ORC is your friend in reading only columns required
  • Read sequentially , no index but do skip over the files in ORC using min & max for each stripe or skip over partitions
  • Read in parallel
  • Try to do as much as possible in-memory

Schema Evolution Query over different ORC file sets with different schema works without a glitch

TODO: use TimeStamp type for accessTimestamp

About

Adhoc query on structured data with Hive Presto