Aggregation when joining with another table
redlion99 opened this issue
I have an in-memory table, click_cached, and I try to join it with a Druid table, cl_events_test, and aggregate on the Druid side like this:
select count(1), cast(cl_events_test.timestamp as date) as theday
from cl_events_test, click_cached
where click_cached.customerId = cl_events_test.customerId
group by cast(cl_events_test.timestamp as date)
But I found that the Druid index is not used in this case.
explain select count(1),cast(cl_events_test.timestamp as date) as theday from cl_events_test, click_cached where click_cached.customerId=cl_events_test.customerId group by cast(cl_events_test.timestamp as date);
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
| plan |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
| == Physical Plan == |
| TungstenAggregate(key=[cast(timestamp#318 as date)#473], functions=[(count(1),mode=Final,isDistinct=false)], output=[_c0#456L,theday#448]) |
| +- TungstenExchange hashpartitioning(cast(timestamp#318 as date)#473,200), None |
| +- TungstenAggregate(key=[cast(timestamp#318 as date) AS cast(timestamp#318 as date)#473], functions=[(count(1),mode=Partial,isDistinct=false)], output=[cast(timestamp#318 as date)#473,count#475L]) |
| +- Project [timestamp#318] |
| +- BroadcastHashJoin [customerId#316L], [customerId#453L], BuildRight |
| :- Project [timestamp#318,customerId#316L] |
| : +- Scan DruidRelationInfo(fullName = DruidRelationName(cl_events_test,10.25.2.91,cl_events_test), sourceDFName = cl_events_base, |
| timeDimensionCol = timestamp, |
| options = DruidRelationOptions(1000000,100000,true,true,true,30000,true,/druid,true,false,1,true,None))[event#313,targetId#314,targetName#315,customerId#316L,source#317,timestamp#318] |
| +- InMemoryColumnarTableScan [customerId#453L], InMemoryRelation [_c0#323L,theday#322,customerId#453L], true, 10000, StorageLevel(true, true, false, true, 1), Project [alias-2#325L AS _c0#323L,cast(alias-1#324 as date) AS theday#322,cast(customerId#316 as bigint) AS customerId#316L], Some(click_cached) |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
When you define the Druid datasource, make sure you set up the Star Schema relations. See the StarSchemaBaseTest:beforeAll method for an example of setting up the Star Schema metadata for the TPCH dataset. JoinTest has examples of Star Join queries being pushed to Druid.
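For reference, the starSchema option is a JSON document that names the fact table and its relations; a minimal shape for a single n-1 relation looks roughly like this (fact_table, dim_table, and the column names are placeholders):

starSchema '{
  "factTable": "fact_table",
  "relations": [ {
    "leftTable": "fact_table",
    "rightTable": "dim_table",
    "relationType": "n-1",
    "joinCondition": [ { "leftAttribute": "fk_column", "rightAttribute": "pk_column" } ]
  } ]
}'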
I defined a starSchema like this:
CREATE TABLE cl_events_test_base (
  event string,
  targetId string,
  targetName string,
  customerId bigint,
  source string,
  tenantId string,
  timestamp string
)
USING com.databricks.spark.csv
OPTIONS (path "/opt/events.csv", header "false", delimiter ",")

CREATE TABLE cl_events_test_1
USING org.sparklinedata.druid
OPTIONS (
  sourceDataframe "cl_events_test_base",
  timeDimensionColumn "timestamp",
  druidDatasource "cl_events_test",
  druidHost "10.25.2.91",
  zkQualifyDiscoveryNames "false",
  queryHistoricalServers "true",
  numSegmentsPerHistoricalQuery "1",
  columnMapping '{ }',
  functionalDependencies '[ ]',
  starSchema '{
    "factTable": "cl_events_test_1",
    "relations": [ {
      "leftTable": "cl_events_test_1",
      "rightTable": "click_cached",
      "relationType": "n-1",
      "joinCondition": [ { "leftAttribute": "customerId", "rightAttribute": "c_customerId" } ]
    } ]
  }'
)
And the SQL query I used is:
select count(1), cast(cl_events_test_1.timestamp as date) as theday
from cl_events_test_1, click_cached
where click_cached.c_customerId = cl_events_test_1.customerId
group by cast(cl_events_test_1.timestamp as date)
The content of the sourceDataframe cl_events_test_base is empty; I assume the raw data set would not be used, because this is an aggregate query.
explain select count(1),cast(cl_events_test_1.timestamp as date) as theday from cl_events_test_1, click_cached where click_cached.c_customerId=cl_events_test_1.customerId group by cast(cl_events_test_1.timestamp as date);
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
| plan |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
| == Physical Plan == |
| TungstenAggregate(key=[cast(timestamp#591 as date)#632], functions=[(count(1),mode=Final,isDistinct=false)], output=[_c0#615L,theday#613]) |
| +- TungstenExchange hashpartitioning(cast(timestamp#591 as date)#632,200), None |
| +- TungstenAggregate(key=[cast(timestamp#591 as date) AS cast(timestamp#591 as date)#632], functions=[(count(1),mode=Partial,isDistinct=false)], output=[cast(timestamp#591 as date)#632,count#634L]) |
| +- Project [timestamp#591] |
| +- BroadcastHashJoin [customerId#588L], [c_customerId#546L], BuildRight |
| :- Project [timestamp#591,customerId#588L] |
| : +- Scan DruidRelationInfo(fullName = DruidRelationName(cl_events_test_1,10.25.2.91,cl_events_test), sourceDFName = cl_events_test_base, |
| timeDimensionCol = timestamp, |
| options = DruidRelationOptions(1000000,100000,true,true,true,30000,true,/druid,true,false,1,true,None))[event#585,targetId#586,targetName#587,customerId#588L,source#589,tenantId#590,timestamp#591] |
| +- InMemoryColumnarTableScan [c_customerId#546L], InMemoryRelation [_c0#547L,theday#545,c_customerId#546L], true, 10000, StorageLevel(true, true, false, true, 1), Project [alias-2#549L AS _c0#547L,cast(alias-1#548 as date) AS theday#545,cast(customerId#486 as bigint) AS customerId#486L AS c_customerId#546L], Some(click_cached) |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
Can you do the following:
- What is the explain for
explain select count(1),cast(cl_events_test_1.timestamp as date) as theday from cl_events_test_1 group by cast(cl_events_test_1.timestamp as date);
I want you to verify that this query is being pushed to Druid. If not, can you send us your indexing spec?
- Before running explain on your original query, issue
set spark.sparklinedata.debug.transformations=true
and send us what is logged when you run explain (see the example session below).
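For example, in the same beeline session (a sketch reusing the join query from above):

set spark.sparklinedata.debug.transformations=true;
explain select count(1), cast(cl_events_test_1.timestamp as date) as theday
from cl_events_test_1, click_cached
where click_cached.c_customerId = cl_events_test_1.customerId
group by cast(cl_events_test_1.timestamp as date);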
Here are the explains, first for the original join query again and then for the non-join query:
explain select count(1),cast(cl_events_test_1.timestamp as date) as theday from cl_events_test_1, click_cached where click_cached.c_customerId=cl_events_test_1.customerId group by cast(cl_events_test_1.timestamp as date);
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
| plan |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
| == Physical Plan == |
| TungstenAggregate(key=[cast(timestamp#705 as date)#723], functions=[(count(1),mode=Final,isDistinct=false)], output=[_c0#706L,theday#697]) |
| +- TungstenExchange hashpartitioning(cast(timestamp#705 as date)#723,200), None |
| +- TungstenAggregate(key=[cast(timestamp#705 as date) AS cast(timestamp#705 as date)#723], functions=[(count(1),mode=Partial,isDistinct=false)], output=[cast(timestamp#705 as date)#723,count#725L]) |
| +- Project [timestamp#705] |
| +- BroadcastHashJoin [customerId#702L], [c_customerId#647L], BuildRight |
| :- Project [timestamp#705,customerId#702L] |
| : +- Scan DruidRelationInfo(fullName = DruidRelationName(cl_events_test_1,10.25.2.91,cl_events_test), sourceDFName = cl_events_test_base, |
| timeDimensionCol = timestamp, |
| options = DruidRelationOptions(1000000,100000,true,true,true,30000,true,/druid,true,false,1,true,None))[event#699,targetId#700,targetName#701,customerId#702L,source#703,tenantId#704,timestamp#705] |
| +- InMemoryColumnarTableScan [c_customerId#647L], InMemoryRelation [_c0#660L,theday#646,c_customerId#647L], true, 10000, StorageLevel(true, true, false, true, 1), Project [alias-2#662L AS _c0#660L,cast(alias-1#661 as date) AS theday#646,cast(customerId#657 as bigint) AS customerId#657L AS c_customerId#647L], Some(click_cached) |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
11 rows selected (0.23 seconds)
0: jdbc:hive2://localhost:10000/> explain select count(1),cast(cl_events_test_1.timestamp as date) as theday from cl_events_test_1 group by cast(cl_events_test_1.timestamp as date);
+-----------------------------------------------------------------------------+--+
| plan |
+-----------------------------------------------------------------------------+--+
| == Physical Plan == |
| Project [alias-2#730L AS _c0#728L,cast(alias-1#729 as date) AS theday#726] |
| +- Scan DruidQuery(1805079266): { |
| "q" : { |
| "jsonClass" : "GroupByQuerySpec", |
| "queryType" : "groupBy", |
| "dataSource" : "cl_events_test", |
| "dimensions" : [ { |
| "jsonClass" : "ExtractionDimensionSpec", |
| "type" : "extraction", |
| "dimension" : "__time", |
| "outputName" : "alias-1", |
| "extractionFn" : { |
| "jsonClass" : "TimeFormatExtractionFunctionSpec", |
| "type" : "timeFormat", |
| "format" : "YYYY-MM-dd", |
| "timeZone" : "Asia/Shanghai", |
| "locale" : "en_US" |
| } |
| } ], |
| "granularity" : "all", |
| "aggregations" : [ { |
| "jsonClass" : "FunctionAggregationSpec", |
| "type" : "longSum", |
| "name" : "alias-2", |
| "fieldName" : "count" |
| } ], |
| "intervals" : [ "2016-06-06T02:00:00.000Z/2016-08-22T22:00:01.000Z" ] |
| }, |
| "useSmile" : true, |
| "queryHistoricalServer" : false, |
| "numSegmentsPerQuery" : -1, |
| "intervalSplits" : [ { |
| "start" : 1465178400000, |
| "end" : 1471903201000 |
| } ], |
| "outputAttrSpec" : [ { |
| "exprId" : { |
| "id" : 729, |
| "jvmId" : { } |
| }, |
| "name" : "alias-1", |
| "dataType" : { }, |
| "tf" : "toString" |
| }, { |
| "exprId" : { |
| "id" : 730, |
| "jvmId" : { } |
| }, |
| "name" : "alias-2", |
| "dataType" : { }, |
| "tf" : "toLong" |
| } ] |
| }[alias-1#729,alias-2#730L] |
+-----------------------------------------------------------------------------+--+
Spark log when running the original explain:
16/08/29 16:38:11 INFO thriftserver.SparkExecuteStatementOperation: Running query 'explain select count(1),cast(cl_events_test_1.timestamp as date) as theday from cl_events_test_1, click_cached where click_cached.c_customerId=cl_events_test_1.customerId group by cast(cl_events_test_1.timestamp as date)' with fb1ea514-3983-40f7-8233-8f3db8c414d7
16/08/29 16:38:11 INFO parse.ParseDriver: Parsing command: explain select count(1),cast(cl_events_test_1.timestamp as date) as theday from cl_events_test_1, click_cached where click_cached.c_customerId=cl_events_test_1.customerId group by cast(cl_events_test_1.timestamp as date)
16/08/29 16:38:12 INFO parse.ParseDriver: Parse Completed
16/08/29 16:38:12 INFO metastore.HiveMetaStore: 4: get_table : db=default tbl=cl_events_test_1
16/08/29 16:38:12 INFO HiveMetaStore.audit: ugi=anonymous ip=unknown-ip-addr cmd=get_table : db=default tbl=cl_events_test_1
16/08/29 16:38:12 INFO metastore.HiveMetaStore: 4: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
16/08/29 16:38:12 INFO metastore.ObjectStore: ObjectStore, initialize called
16/08/29 16:38:12 INFO DataNucleus.Query: Reading in results for query "org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used is closing
16/08/29 16:38:12 INFO metastore.MetaStoreDirectSql: Using direct SQL, underlying DB is DERBY
16/08/29 16:38:12 INFO metastore.ObjectStore: Initialized ObjectStore
16/08/29 16:38:12 INFO metastore.HiveMetaStore: 4: get_table : db=default tbl=cl_events_test_base
16/08/29 16:38:12 INFO HiveMetaStore.audit: ugi=anonymous ip=unknown-ip-addr cmd=get_table : db=default tbl=cl_events_test_base
16/08/29 16:38:12 INFO storage.MemoryStore: Block broadcast_4 stored as values in memory (estimated size 210.0 KB, free 211.9 KB)
16/08/29 16:38:12 INFO storage.MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 19.6 KB, free 231.5 KB)
16/08/29 16:38:12 INFO storage.BlockManagerInfo: Added broadcast_4_piece0 in memory on localhost:53388 (size: 19.6 KB, free: 511.1 MB)
16/08/29 16:38:12 INFO spark.SparkContext: Created broadcast 4 from textFile at CsvRelation.scala:66
16/08/29 16:38:12 INFO thriftserver.SparkExecuteStatementOperation: Result Schema: List(plan#110)
Thanks. The non-join query is being pushed to Druid; I don't see why the join is not.
When you set spark.sparklinedata.debug.transformations=true, you should see DruidPlanner log lines, for example see below. Can you send me the DruidPlanner log lines?
Also, which version are you using?
16/08/29 15:53:07 INFO DruidPlanner: aggregate transform invoked:
Input DruidQueryBuilders : null
Input LogicalPlan : Project [l_returnflag#337,l_extendedprice#334,ps_supplycost#349,ps_availqty#348]
+- Relation[o_orderkey#321,o_custkey#322,o_orderstatus#323,o_totalprice#324,o_orderdate#325,o_orderpriority#326,o_clerk#327,o_shippriority#328,o_comment#329,l_partkey#330,l_suppkey#331,l_linenumber#332,l_quantity#333,l_extendedprice#334,l_discount#335,l_tax#336,l_returnflag#337,l_linestatus#338,l_shipdate#339,l_commitdate#340,l_receiptdate#341,l_shipinstruct#342,l_shipmode#343,l_comment#344,order_year#345,ps_partkey#346,ps_suppkey#347,ps_availqty#348,ps_supplycost#349,ps_comment#350,s_name#351,s_address#352,s_phone#353,s_acctbal#354,s_comment#355,s_nation#356,s_region#357,p_name#358,p_mfgr#359,p_brand#360,p_type#361,p_size#362,p_container#363,p_retailprice#364,p_comment#365,c_name#366,c_address#367,c_phone#368,c_acctbal#369,c_mktsegment#370,c_comment#371,c_nation#372,c_region#373] DruidRelationInfo(fullName = DruidRelationName(orderLineItemPartSupplier,localhost,tpch), sourceDFName = orderLineItemPartSupplierBase,
timeDimensionCol = l_shipdate,
options = DruidRelationOptions(1000000,100000,true,true,true,30000,true,/druid,false,true,2147483647,true,push_none,org.sparklinedata.druid.NoneGranularity$@14fe2c49,true,100000,Some(1)))
Output DruidQueryBuilders : List()
I'm using version 0.0.2. I tried to change the configuration by editing spark-defaults.conf, but I did not find the DruidPlanner log you mentioned. Could you tell me how to make this setting work?
cat conf/spark-defaults.conf
spark.sparklinedata.debug.transformations true
root@spark-server:/opt/spark-1.6.0#
Yes, in 0.2.x this setting doesn't take effect when set in the properties file.
Explicitly call set in your session:
set spark.sparklinedata.debug.transformations=true
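You can read the setting back in the same session to confirm it took effect (standard Spark SQL behavior):

set spark.sparklinedata.debug.transformations;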
Any update on this?
We have released 0.3.0; you should upgrade to it.