Aggregation when joining with another table
redlion99 opened this issue
I have an in-memory table, click_cached, and I try to join it with a Druid table, cl_events_test, and aggregate on the Druid side like this:
select count(1), cast(cl_events_test.timestamp as date) as theday
from cl_events_test, click_cached
where click_cached.customerId = cl_events_test.customerId
group by cast(cl_events_test.timestamp as date)
But I found that the Druid index is not used in this case.
explain select count(1),cast(cl_events_test.timestamp as date) as theday from cl_events_test, click_cached where click_cached.customerId=cl_events_test.customerId group by cast(cl_events_test.timestamp as date);
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
| plan |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
| == Physical Plan == |
| TungstenAggregate(key=[cast(timestamp#318 as date)#473], functions=[(count(1),mode=Final,isDistinct=false)], output=[_c0#456L,theday#448]) |
| +- TungstenExchange hashpartitioning(cast(timestamp#318 as date)#473,200), None |
| +- TungstenAggregate(key=[cast(timestamp#318 as date) AS cast(timestamp#318 as date)#473], functions=[(count(1),mode=Partial,isDistinct=false)], output=[cast(timestamp#318 as date)#473,count#475L]) |
| +- Project [timestamp#318] |
| +- BroadcastHashJoin [customerId#316L], [customerId#453L], BuildRight |
| :- Project [timestamp#318,customerId#316L] |
| : +- Scan DruidRelationInfo(fullName = DruidRelationName(cl_events_test,10.25.2.91,cl_events_test), sourceDFName = cl_events_base, |
| timeDimensionCol = timestamp, |
| options = DruidRelationOptions(1000000,100000,true,true,true,30000,true,/druid,true,false,1,true,None))[event#313,targetId#314,targetName#315,customerId#316L,source#317,timestamp#318] |
| +- InMemoryColumnarTableScan [customerId#453L], InMemoryRelation [_c0#323L,theday#322,customerId#453L], true, 10000, StorageLevel(true, true, false, true, 1), Project [alias-2#325L AS _c0#323L,cast(alias-1#324 as date) AS theday#322,cast(customerId#316 as bigint) AS customerId#316L], Some(click_cached) |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
When you define the Druid datasource, make sure you set up the Star Schema relations. See the StarSchemaBaseTest:beforeAll method for an example of setting up the Star Schema metadata for the TPCH dataset. JoinTest has examples of Star Join queries being pushed to Druid.
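For reference, the starSchema option is a JSON document that names the fact table and its relations; a minimal shape for a single n-1 relation looks roughly like this (fact_table, dim_table, and the column names are placeholders):

starSchema '{
  "factTable": "fact_table",
  "relations": [ {
    "leftTable": "fact_table",
    "rightTable": "dim_table",
    "relationType": "n-1",
    "joinCondition": [ { "leftAttribute": "fk_column", "rightAttribute": "pk_column" } ]
  } ]
}'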
I defined a starSchema like this:
CREATE TABLE cl_events_test_base (
  event string,
  targetId string,
  targetName string,
  customerId bigint,
  source string,
  tenantId string,
  timestamp string
)
USING com.databricks.spark.csv
OPTIONS (path "/opt/events.csv", header "false", delimiter ",")

CREATE TABLE cl_events_test_1
USING org.sparklinedata.druid
OPTIONS (
  sourceDataframe "cl_events_test_base",
  timeDimensionColumn "timestamp",
  druidDatasource "cl_events_test",
  druidHost "10.25.2.91",
  zkQualifyDiscoveryNames "false",
  queryHistoricalServers "true",
  numSegmentsPerHistoricalQuery "1",
  columnMapping '{ }',
  functionalDependencies '[ ]',
  starSchema '{
    "factTable": "cl_events_test_1",
    "relations": [ {
      "leftTable": "cl_events_test_1",
      "rightTable": "click_cached",
      "relationType": "n-1",
      "joinCondition": [ { "leftAttribute": "customerId", "rightAttribute": "c_customerId" } ]
    } ]
  }'
)
And the SQL query I used is:
select count(1), cast(cl_events_test_1.timestamp as date) as theday
from cl_events_test_1, click_cached
where click_cached.c_customerId = cl_events_test_1.customerId
group by cast(cl_events_test_1.timestamp as date)
The content of the sourceDataframe cl_events_test_base is empty; I assume the raw data set would not be used, because this is an aggregate query.
explain select count(1),cast(cl_events_test_1.timestamp as date) as theday from cl_events_test_1, click_cached where click_cached.c_customerId=cl_events_test_1.customerId group by cast(cl_events_test_1.timestamp as date);
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
| plan |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
| == Physical Plan == |
| TungstenAggregate(key=[cast(timestamp#591 as date)#632], functions=[(count(1),mode=Final,isDistinct=false)], output=[_c0#615L,theday#613]) |
| +- TungstenExchange hashpartitioning(cast(timestamp#591 as date)#632,200), None |
| +- TungstenAggregate(key=[cast(timestamp#591 as date) AS cast(timestamp#591 as date)#632], functions=[(count(1),mode=Partial,isDistinct=false)], output=[cast(timestamp#591 as date)#632,count#634L]) |
| +- Project [timestamp#591] |
| +- BroadcastHashJoin [customerId#588L], [c_customerId#546L], BuildRight |
| :- Project [timestamp#591,customerId#588L] |
| : +- Scan DruidRelationInfo(fullName = DruidRelationName(cl_events_test_1,10.25.2.91,cl_events_test), sourceDFName = cl_events_test_base, |
| timeDimensionCol = timestamp, |
| options = DruidRelationOptions(1000000,100000,true,true,true,30000,true,/druid,true,false,1,true,None))[event#585,targetId#586,targetName#587,customerId#588L,source#589,tenantId#590,timestamp#591] |
| +- InMemoryColumnarTableScan [c_customerId#546L], InMemoryRelation [_c0#547L,theday#545,c_customerId#546L], true, 10000, StorageLevel(true, true, false, true, 1), Project [alias-2#549L AS _c0#547L,cast(alias-1#548 as date) AS theday#545,cast(customerId#486 as bigint) AS customerId#486L AS c_customerId#546L], Some(click_cached) |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
Can you do the following:
- What is the explain for
explain select count(1),cast(cl_events_test_1.timestamp as date) as theday from cl_events_test_1 group by cast(cl_events_test_1.timestamp as date);
I want you to verify that this query is being pushed to Druid. If not, can you send us your indexing spec?
- Before running explain on your original query, issue
set spark.sparklinedata.debug.transformations=true
and send us what is logged when you run explain (see the example session below).
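For example, in the same beeline session (a sketch reusing the join query from above):

set spark.sparklinedata.debug.transformations=true;
explain select count(1), cast(cl_events_test_1.timestamp as date) as theday
from cl_events_test_1, click_cached
where click_cached.c_customerId = cl_events_test_1.customerId
group by cast(cl_events_test_1.timestamp as date);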
Here are the explains, first for the original join query again and then for the non-join query:
explain select count(1),cast(cl_events_test_1.timestamp as date) as theday from cl_events_test_1, click_cached where click_cached.c_customerId=cl_events_test_1.customerId group by cast(cl_events_test_1.timestamp as date);
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
| plan |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
| == Physical Plan == |
| TungstenAggregate(key=[cast(timestamp#705 as date)#723], functions=[(count(1),mode=Final,isDistinct=false)], output=[_c0#706L,theday#697]) |
| +- TungstenExchange hashpartitioning(cast(timestamp#705 as date)#723,200), None |
| +- TungstenAggregate(key=[cast(timestamp#705 as date) AS cast(timestamp#705 as date)#723], functions=[(count(1),mode=Partial,isDistinct=false)], output=[cast(timestamp#705 as date)#723,count#725L]) |
| +- Project [timestamp#705] |
| +- BroadcastHashJoin [customerId#702L], [c_customerId#647L], BuildRight |
| :- Project [timestamp#705,customerId#702L] |
| : +- Scan DruidRelationInfo(fullName = DruidRelationName(cl_events_test_1,10.25.2.91,cl_events_test), sourceDFName = cl_events_test_base, |
| timeDimensionCol = timestamp, |
| options = DruidRelationOptions(1000000,100000,true,true,true,30000,true,/druid,true,false,1,true,None))[event#699,targetId#700,targetName#701,customerId#702L,source#703,tenantId#704,timestamp#705] |
| +- InMemoryColumnarTableScan [c_customerId#647L], InMemoryRelation [_c0#660L,theday#646,c_customerId#647L], true, 10000, StorageLevel(true, true, false, true, 1), Project [alias-2#662L AS _c0#660L,cast(alias-1#661 as date) AS theday#646,cast(customerId#657 as bigint) AS customerId#657L AS c_customerId#647L], Some(click_cached) |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
11 rows selected (0.23 seconds)
0: jdbc:hive2://localhost:10000/> explain select count(1),cast(cl_events_test_1.timestamp as date) as theday from cl_events_test_1 group by cast(cl_events_test_1.timestamp as date);
+-----------------------------------------------------------------------------+--+
| plan |
+-----------------------------------------------------------------------------+--+
| == Physical Plan == |
| Project [alias-2#730L AS _c0#728L,cast(alias-1#729 as date) AS theday#726] |
| +- Scan DruidQuery(1805079266): { |
| "q" : { |
| "jsonClass" : "GroupByQuerySpec", |
| "queryType" : "groupBy", |
| "dataSource" : "cl_events_test", |
| "dimensions" : [ { |
| "jsonClass" : "ExtractionDimensionSpec", |
| "type" : "extraction", |
| "dimension" : "__time", |
| "outputName" : "alias-1", |
| "extractionFn" : { |
| "jsonClass" : "TimeFormatExtractionFunctionSpec", |
| "type" : "timeFormat", |
| "format" : "YYYY-MM-dd", |
| "timeZone" : "Asia/Shanghai", |
| "locale" : "en_US" |
| } |
| } ], |
| "granularity" : "all", |
| "aggregations" : [ { |
| "jsonClass" : "FunctionAggregationSpec", |
| "type" : "longSum", |
| "name" : "alias-2", |
| "fieldName" : "count" |
| } ], |
| "intervals" : [ "2016-06-06T02:00:00.000Z/2016-08-22T22:00:01.000Z" ] |
| }, |
| "useSmile" : true, |
| "queryHistoricalServer" : false, |
| "numSegmentsPerQuery" : -1, |
| "intervalSplits" : [ { |
| "start" : 1465178400000, |
| "end" : 1471903201000 |
| } ], |
| "outputAttrSpec" : [ { |
| "exprId" : { |
| "id" : 729, |
| "jvmId" : { } |
| }, |
| "name" : "alias-1", |
| "dataType" : { }, |
| "tf" : "toString" |
| }, { |
| "exprId" : { |
| "id" : 730, |
| "jvmId" : { } |
| }, |
| "name" : "alias-2", |
| "dataType" : { }, |
| "tf" : "toLong" |
| } ] |
| }[alias-1#729,alias-2#730L] |
+-----------------------------------------------------------------------------+--+
Spark log when running the original explain:
16/08/29 16:38:11 INFO thriftserver.SparkExecuteStatementOperation: Running query 'explain select count(1),cast(cl_events_test_1.timestamp as date) as theday from cl_events_test_1, click_cached where click_cached.c_customerId=cl_events_test_1.customerId group by cast(cl_events_test_1.timestamp as date)' with fb1ea514-3983-40f7-8233-8f3db8c414d7
16/08/29 16:38:11 INFO parse.ParseDriver: Parsing command: explain select count(1),cast(cl_events_test_1.timestamp as date) as theday from cl_events_test_1, click_cached where click_cached.c_customerId=cl_events_test_1.customerId group by cast(cl_events_test_1.timestamp as date)
16/08/29 16:38:12 INFO parse.ParseDriver: Parse Completed
16/08/29 16:38:12 INFO metastore.HiveMetaStore: 4: get_table : db=default tbl=cl_events_test_1
16/08/29 16:38:12 INFO HiveMetaStore.audit: ugi=anonymous ip=unknown-ip-addr cmd=get_table : db=default tbl=cl_events_test_1
16/08/29 16:38:12 INFO metastore.HiveMetaStore: 4: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
16/08/29 16:38:12 INFO metastore.ObjectStore: ObjectStore, initialize called
16/08/29 16:38:12 INFO DataNucleus.Query: Reading in results for query "org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used is closing
16/08/29 16:38:12 INFO metastore.MetaStoreDirectSql: Using direct SQL, underlying DB is DERBY
16/08/29 16:38:12 INFO metastore.ObjectStore: Initialized ObjectStore
16/08/29 16:38:12 INFO metastore.HiveMetaStore: 4: get_table : db=default tbl=cl_events_test_base
16/08/29 16:38:12 INFO HiveMetaStore.audit: ugi=anonymous ip=unknown-ip-addr cmd=get_table : db=default tbl=cl_events_test_base
16/08/29 16:38:12 INFO storage.MemoryStore: Block broadcast_4 stored as values in memory (estimated size 210.0 KB, free 211.9 KB)
16/08/29 16:38:12 INFO storage.MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 19.6 KB, free 231.5 KB)
16/08/29 16:38:12 INFO storage.BlockManagerInfo: Added broadcast_4_piece0 in memory on localhost:53388 (size: 19.6 KB, free: 511.1 MB)
16/08/29 16:38:12 INFO spark.SparkContext: Created broadcast 4 from textFile at CsvRelation.scala:66
16/08/29 16:38:12 INFO thriftserver.SparkExecuteStatementOperation: Result Schema: List(plan#110)
Thanks. The non-join query is being pushed to Druid; I don't see why the join is not.
When you set spark.sparklinedata.debug.transformations=true, you should see DruidPlanner log lines, for example see below. Can you send me the DruidPlanner log lines?
Also, which version are you using?
16/08/29 15:53:07 INFO DruidPlanner: aggregate transform invoked:
Input DruidQueryBuilders : null
Input LogicalPlan : Project [l_returnflag#337,l_extendedprice#334,ps_supplycost#349,ps_availqty#348]
+- Relation[o_orderkey#321,o_custkey#322,o_orderstatus#323,o_totalprice#324,o_orderdate#325,o_orderpriority#326,o_clerk#327,o_shippriority#328,o_comment#329,l_partkey#330,l_suppkey#331,l_linenumber#332,l_quantity#333,l_extendedprice#334,l_discount#335,l_tax#336,l_returnflag#337,l_linestatus#338,l_shipdate#339,l_commitdate#340,l_receiptdate#341,l_shipinstruct#342,l_shipmode#343,l_comment#344,order_year#345,ps_partkey#346,ps_suppkey#347,ps_availqty#348,ps_supplycost#349,ps_comment#350,s_name#351,s_address#352,s_phone#353,s_acctbal#354,s_comment#355,s_nation#356,s_region#357,p_name#358,p_mfgr#359,p_brand#360,p_type#361,p_size#362,p_container#363,p_retailprice#364,p_comment#365,c_name#366,c_address#367,c_phone#368,c_acctbal#369,c_mktsegment#370,c_comment#371,c_nation#372,c_region#373] DruidRelationInfo(fullName = DruidRelationName(orderLineItemPartSupplier,localhost,tpch), sourceDFName = orderLineItemPartSupplierBase,
timeDimensionCol = l_shipdate,
options = DruidRelationOptions(1000000,100000,true,true,true,30000,true,/druid,false,true,2147483647,true,push_none,org.sparklinedata.druid.NoneGranularity$@14fe2c49,true,100000,Some(1)))
Output DruidQueryBuilders : List()
I'm using version 0.0.2. I tried to change the configuration by editing spark-defaults.conf, but I did not find the DruidPlanner log you mentioned. Could you tell me how to make this setting work?
cat conf/spark-defaults.conf
spark.sparklinedata.debug.transformations true
root@spark-server:/opt/spark-1.6.0#
Yes, in 0.2.x this setting doesn't take effect when set in the properties file.
Explicitly call set in your session:
set spark.sparklinedata.debug.transformations=true
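You can read the setting back in the same session to confirm it took effect (standard Spark SQL behavior):

set spark.sparklinedata.debug.transformations;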
Any update on this?
We have released 0.3.0; you should upgrade to it.