hbutani / spark-druid-olap

Sparkline BI Accelerator provides fast ad-hoc query capability over Logical Cubes. It has been folded into our SNAP Platform (http://bit.ly/2oBJSpP), an integrated BI platform on Apache Spark.

Home Page: http://sparklinedata.com/


aggregation when joining with another table

redlion99 opened this issue

I have an in-memory (cached) table click_cached, and I try to join this table with a Druid table cl_events_test and aggregate through Druid, like this:

select count(1), cast(cl_events_test.timestamp as date) as theday from cl_events_test, click_cached where click_cached.customerId=cl_events_test.customerId group by cast(cl_events_test.timestamp as date)

But I found that the Druid index is not used in this case.

explain select count(1),cast(cl_events_test.timestamp as date) as theday from cl_events_test, click_cached where click_cached.customerId=cl_events_test.customerId group by cast(cl_events_test.timestamp as date);
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
| plan |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
| == Physical Plan == |
| TungstenAggregate(key=[cast(timestamp#318 as date)#473], functions=[(count(1),mode=Final,isDistinct=false)], output=[_c0#456L,theday#448]) |
| +- TungstenExchange hashpartitioning(cast(timestamp#318 as date)#473,200), None |
| +- TungstenAggregate(key=[cast(timestamp#318 as date) AS cast(timestamp#318 as date)#473], functions=[(count(1),mode=Partial,isDistinct=false)], output=[cast(timestamp#318 as date)#473,count#475L]) |
| +- Project [timestamp#318] |
| +- BroadcastHashJoin [customerId#316L], [customerId#453L], BuildRight |
| :- Project [timestamp#318,customerId#316L] |
| : +- Scan DruidRelationInfo(fullName = DruidRelationName(cl_events_test,10.25.2.91,cl_events_test), sourceDFName = cl_events_base, |
| timeDimensionCol = timestamp, |
| options = DruidRelationOptions(1000000,100000,true,true,true,30000,true,/druid,true,false,1,true,None))[event#313,targetId#314,targetName#315,customerId#316L,source#317,timestamp#318] |
| +- InMemoryColumnarTableScan [customerId#453L], InMemoryRelation [_c0#323L,theday#322,customerId#453L], true, 10000, StorageLevel(true, true, false, true, 1), Project [alias-2#325L AS _c0#323L,cast(alias-1#324 as date) AS theday#322,cast(customerId#316 as bigint) AS customerId#316L], Some(click_cached) |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+

When you define the Druid datasource, make sure you set up the Star Schema relations. See the StarSchemaBaseTest.beforeAll method for an example of setting up the Star Schema metadata for the TPCH dataset. JoinTest has examples of Star Join queries being pushed to Druid.
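For reference, the starSchema metadata has roughly this shape. This is only a minimal sketch with placeholder table and column names (not the actual TPCH test fixtures); the concrete definition attempted for this datasource is in the next comment:

CREATE TABLE druid_fact_table
USING org.sparklinedata.druid
OPTIONS (
  sourceDataframe "fact_base_table",
  timeDimensionColumn "timestamp",
  druidDatasource "fact_datasource",
  druidHost "localhost",
  columnMapping '{ }',
  functionalDependencies '[]',
  starSchema '{
    "factTable" : "druid_fact_table",
    "relations" : [ {
      "leftTable" : "druid_fact_table",
      "rightTable" : "dimension_table",
      "relationType" : "n-1",
      "joinCondition" : [ { "leftAttribute" : "dim_fk", "rightAttribute" : "dim_pk" } ]
    } ]
  }')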

I defined a starSchema like this:

CREATE TABLE cl_events_test_base (
  event string,
  targetId string,
  targetName string,
  customerId bigint,
  source string,
  tenantId string,
  timestamp string
)
USING com.databricks.spark.csv
OPTIONS (path "/opt/events.csv",
  header "false", delimiter ",")

CREATE TABLE cl_events_test_1
USING org.sparklinedata.druid
OPTIONS (
  sourceDataframe "cl_events_test_base",
  timeDimensionColumn "timestamp",
  druidDatasource "cl_events_test",
  druidHost "10.25.2.91",
  zkQualifyDiscoveryNames "false",
  queryHistoricalServers "true",
  numSegmentsPerHistoricalQuery "1",
  columnMapping '{ }',
  functionalDependencies '[]',
  starSchema '{
    "factTable" : "cl_events_test_1",
    "relations" : [ {
      "leftTable" : "cl_events_test_1",
      "rightTable" : "click_cached",
      "relationType" : "n-1",
      "joinCondition" : [ { "leftAttribute" : "customerId", "rightAttribute" : "c_customerId" } ]
    } ]
  }')

And the SQL query I used is:

select count(1),cast(cl_events_test_1.timestamp as date) as theday from cl_events_test_1, click_cached where click_cached.c_customerId=cl_events_test_1.customerId group by cast(cl_events_test_1.timestamp as date)

The sourceDataframe cl_events_test_base is empty; I assume the raw data set would not be used anyway because this is an aggregate query.

explain select count(1),cast(cl_events_test_1.timestamp as date) as theday from cl_events_test_1, click_cached where click_cached.c_customerId=cl_events_test_1.customerId group by cast(cl_events_test_1.timestamp as date);
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
| plan |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
| == Physical Plan == |
| TungstenAggregate(key=[cast(timestamp#591 as date)#632], functions=[(count(1),mode=Final,isDistinct=false)], output=[_c0#615L,theday#613]) |
| +- TungstenExchange hashpartitioning(cast(timestamp#591 as date)#632,200), None |
| +- TungstenAggregate(key=[cast(timestamp#591 as date) AS cast(timestamp#591 as date)#632], functions=[(count(1),mode=Partial,isDistinct=false)], output=[cast(timestamp#591 as date)#632,count#634L]) |
| +- Project [timestamp#591] |
| +- BroadcastHashJoin [customerId#588L], [c_customerId#546L], BuildRight |
| :- Project [timestamp#591,customerId#588L] |
| : +- Scan DruidRelationInfo(fullName = DruidRelationName(cl_events_test_1,10.25.2.91,cl_events_test), sourceDFName = cl_events_test_base, |
| timeDimensionCol = timestamp, |
| options = DruidRelationOptions(1000000,100000,true,true,true,30000,true,/druid,true,false,1,true,None))[event#585,targetId#586,targetName#587,customerId#588L,source#589,tenantId#590,timestamp#591] |
| +- InMemoryColumnarTableScan [c_customerId#546L], InMemoryRelation [_c0#547L,theday#545,c_customerId#546L], true, 10000, StorageLevel(true, true, false, true, 1), Project [alias-2#549L AS _c0#547L,cast(alias-1#548 as date) AS theday#545,cast(customerId#486 as bigint) AS customerId#486L AS c_customerId#546L], Some(click_cached) |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+

Can you do the following:

  1. What is the explain for
     explain select count(1),cast(cl_events_test_1.timestamp as date) as theday from cl_events_test_1 group by cast(cl_events_test_1.timestamp as date);

     I want you to verify that this query is being pushed to Druid. If not, can you send us your indexing spec?

  2. Before running explain on your original query, issue
     set spark.sparklinedata.debug.transformations=true
     and send us what is logged when you run the explain. (Both steps are sketched together below.)
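Something like the following in the same beeline session would cover both requests; this is just the two steps above in order, with output omitted:

-- step 1: explain the non-join query on its own
explain select count(1), cast(cl_events_test_1.timestamp as date) as theday
from cl_events_test_1
group by cast(cl_events_test_1.timestamp as date);

-- step 2: turn on transformation logging, then explain the original join query
set spark.sparklinedata.debug.transformations=true;
explain select count(1), cast(cl_events_test_1.timestamp as date) as theday
from cl_events_test_1, click_cached
where click_cached.c_customerId = cl_events_test_1.customerId
group by cast(cl_events_test_1.timestamp as date);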

Here are the explain outputs; the session below first repeats the explain for the original join query, then shows the explain for select count(1),cast(cl_events_test_1.timestamp as date) as theday from cl_events_test_1 group by cast(cl_events_test_1.timestamp as date);

explain select count(1),cast(cl_events_test_1.timestamp as date) as theday from cl_events_test_1, click_cached where click_cached.c_customerId=cl_events_test_1.customerId group by cast(cl_events_test_1.timestamp as date);
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
| plan |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
| == Physical Plan == |
| TungstenAggregate(key=[cast(timestamp#705 as date)#723], functions=[(count(1),mode=Final,isDistinct=false)], output=[_c0#706L,theday#697]) |
| +- TungstenExchange hashpartitioning(cast(timestamp#705 as date)#723,200), None |
| +- TungstenAggregate(key=[cast(timestamp#705 as date) AS cast(timestamp#705 as date)#723], functions=[(count(1),mode=Partial,isDistinct=false)], output=[cast(timestamp#705 as date)#723,count#725L]) |
| +- Project [timestamp#705] |
| +- BroadcastHashJoin [customerId#702L], [c_customerId#647L], BuildRight |
| :- Project [timestamp#705,customerId#702L] |
| : +- Scan DruidRelationInfo(fullName = DruidRelationName(cl_events_test_1,10.25.2.91,cl_events_test), sourceDFName = cl_events_test_base, |
| timeDimensionCol = timestamp, |
| options = DruidRelationOptions(1000000,100000,true,true,true,30000,true,/druid,true,false,1,true,None))[event#699,targetId#700,targetName#701,customerId#702L,source#703,tenantId#704,timestamp#705] |
| +- InMemoryColumnarTableScan [c_customerId#647L], InMemoryRelation [_c0#660L,theday#646,c_customerId#647L], true, 10000, StorageLevel(true, true, false, true, 1), Project [alias-2#662L AS _c0#660L,cast(alias-1#661 as date) AS theday#646,cast(customerId#657 as bigint) AS customerId#657L AS c_customerId#647L], Some(click_cached) |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
11 rows selected (0.23 seconds)
0: jdbc:hive2://localhost:10000/> explain select count(1),cast(cl_events_test_1.timestamp as date) as theday from cl_events_test_1 group by cast(cl_events_test_1.timestamp as date);
+-----------------------------------------------------------------------------+--+
| plan |
+-----------------------------------------------------------------------------+--+
| == Physical Plan == |
| Project [alias-2#730L AS _c0#728L,cast(alias-1#729 as date) AS theday#726] |
| +- Scan DruidQuery(1805079266): { |
| "q" : { |
| "jsonClass" : "GroupByQuerySpec", |
| "queryType" : "groupBy", |
| "dataSource" : "cl_events_test", |
| "dimensions" : [ { |
| "jsonClass" : "ExtractionDimensionSpec", |
| "type" : "extraction", |
| "dimension" : "__time", |
| "outputName" : "alias-1", |
| "extractionFn" : { |
| "jsonClass" : "TimeFormatExtractionFunctionSpec", |
| "type" : "timeFormat", |
| "format" : "YYYY-MM-dd", |
| "timeZone" : "Asia/Shanghai", |
| "locale" : "en_US" |
| } |
| } ], |
| "granularity" : "all", |
| "aggregations" : [ { |
| "jsonClass" : "FunctionAggregationSpec", |
| "type" : "longSum", |
| "name" : "alias-2", |
| "fieldName" : "count" |
| } ], |
| "intervals" : [ "2016-06-06T02:00:00.000Z/2016-08-22T22:00:01.000Z" ] |
| }, |
| "useSmile" : true, |
| "queryHistoricalServer" : false, |
| "numSegmentsPerQuery" : -1, |
| "intervalSplits" : [ { |
| "start" : 1465178400000, |
| "end" : 1471903201000 |
| } ], |
| "outputAttrSpec" : [ { |
| "exprId" : { |
| "id" : 729, |
| "jvmId" : { } |
| }, |
| "name" : "alias-1", |
| "dataType" : { }, |
| "tf" : "toString" |
| }, { |
| "exprId" : { |
| "id" : 730, |
| "jvmId" : { } |
| }, |
| "name" : "alias-2", |
| "dataType" : { }, |
| "tf" : "toLong" |
| } ] |
| }[alias-1#729,alias-2#730L] |
+-----------------------------------------------------------------------------+--+

Spark log when running the original explain:

16/08/29 16:38:11 INFO thriftserver.SparkExecuteStatementOperation: Running query 'explain select count(1),cast(cl_events_test_1.timestamp as date) as theday from cl_events_test_1, click_cached where click_cached.c_customerId=cl_events_test_1.customerId group by cast(cl_events_test_1.timestamp as date)' with fb1ea514-3983-40f7-8233-8f3db8c414d7
16/08/29 16:38:11 INFO parse.ParseDriver: Parsing command: explain select count(1),cast(cl_events_test_1.timestamp as date) as theday from cl_events_test_1, click_cached where click_cached.c_customerId=cl_events_test_1.customerId group by cast(cl_events_test_1.timestamp as date)
16/08/29 16:38:12 INFO parse.ParseDriver: Parse Completed
16/08/29 16:38:12 INFO metastore.HiveMetaStore: 4: get_table : db=default tbl=cl_events_test_1
16/08/29 16:38:12 INFO HiveMetaStore.audit: ugi=anonymous ip=unknown-ip-addr cmd=get_table : db=default tbl=cl_events_test_1
16/08/29 16:38:12 INFO metastore.HiveMetaStore: 4: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
16/08/29 16:38:12 INFO metastore.ObjectStore: ObjectStore, initialize called
16/08/29 16:38:12 INFO DataNucleus.Query: Reading in results for query "org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used is closing
16/08/29 16:38:12 INFO metastore.MetaStoreDirectSql: Using direct SQL, underlying DB is DERBY
16/08/29 16:38:12 INFO metastore.ObjectStore: Initialized ObjectStore
16/08/29 16:38:12 INFO metastore.HiveMetaStore: 4: get_table : db=default tbl=cl_events_test_base
16/08/29 16:38:12 INFO HiveMetaStore.audit: ugi=anonymous ip=unknown-ip-addr cmd=get_table : db=default tbl=cl_events_test_base
16/08/29 16:38:12 INFO storage.MemoryStore: Block broadcast_4 stored as values in memory (estimated size 210.0 KB, free 211.9 KB)
16/08/29 16:38:12 INFO storage.MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 19.6 KB, free 231.5 KB)
16/08/29 16:38:12 INFO storage.BlockManagerInfo: Added broadcast_4_piece0 in memory on localhost:53388 (size: 19.6 KB, free: 511.1 MB)
16/08/29 16:38:12 INFO spark.SparkContext: Created broadcast 4 from textFile at CsvRelation.scala:66
16/08/29 16:38:12 INFO thriftserver.SparkExecuteStatementOperation: Result Schema: List(plan#110)

Thanks. The non-join query is being pushed to Druid. I don't see why the join is not.

When you set spark.sparklinedata.debug.transformations=true, you should see DruidPlanner log lines; see the example below. Can you send me the DruidPlanner log lines?

Also, which version are you using?

16/08/29 15:53:07 INFO DruidPlanner: aggregate transform invoked:
Input DruidQueryBuilders : null 
Input LogicalPlan : Project [l_returnflag#337,l_extendedprice#334,ps_supplycost#349,ps_availqty#348]
+- Relation[o_orderkey#321,o_custkey#322,o_orderstatus#323,o_totalprice#324,o_orderdate#325,o_orderpriority#326,o_clerk#327,o_shippriority#328,o_comment#329,l_partkey#330,l_suppkey#331,l_linenumber#332,l_quantity#333,l_extendedprice#334,l_discount#335,l_tax#336,l_returnflag#337,l_linestatus#338,l_shipdate#339,l_commitdate#340,l_receiptdate#341,l_shipinstruct#342,l_shipmode#343,l_comment#344,order_year#345,ps_partkey#346,ps_suppkey#347,ps_availqty#348,ps_supplycost#349,ps_comment#350,s_name#351,s_address#352,s_phone#353,s_acctbal#354,s_comment#355,s_nation#356,s_region#357,p_name#358,p_mfgr#359,p_brand#360,p_type#361,p_size#362,p_container#363,p_retailprice#364,p_comment#365,c_name#366,c_address#367,c_phone#368,c_acctbal#369,c_mktsegment#370,c_comment#371,c_nation#372,c_region#373] DruidRelationInfo(fullName = DruidRelationName(orderLineItemPartSupplier,localhost,tpch), sourceDFName = orderLineItemPartSupplierBase,
timeDimensionCol = l_shipdate,
options = DruidRelationOptions(1000000,100000,true,true,true,30000,true,/druid,false,true,2147483647,true,push_none,org.sparklinedata.druid.NoneGranularity$@14fe2c49,true,100000,Some(1)))
Output DruidQueryBuilders : List()

I'm using version 0.0.2. I tried to change the configuration by editing spark-defaults.conf, but I did not find the DruidPlanner log you mentioned. Could you tell me how to make this setting work?

cat conf/spark-defaults.conf
spark.sparklinedata.debug.transformations true
root@spark-server:/opt/spark-1.6.0#

Yes, on 0.2.x this setting doesn't take effect when set in the properties file.
Explicitly call set in your session:

set spark.sparklinedata.debug.transformations=true

Any update on this?

We have released 0.3.0; you should upgrade to it.