hbutani / spark-druid-olap

Sparkline BI Accelerator provides fast ad-hoc query capability over Logical Cubes. This has been folded into our SNAP Platform (http://bit.ly/2oBJSpP), an integrated BI platform on Apache Spark.

Home Page: http://sparklinedata.com/

Why do we need sourceDataframe?

exitNA opened this issue · comments

commented

Hi,
As I understand it, Druid has already indexed the data, so why do we still need a sourceDataframe such as orderLineItemPartSupplierBase that provides a data source path? Creating a table schema that maps to the Druid internal schema should be enough. Could somebody explain this?

With support for the Druid Select Query, the Spark Source DataFrame is for metadata purposes only; you can define a table in Spark with no data and then set up a Druid DataSource on top of it.

If you set the nonAggregateQueryHandling flag to 'push_project_and_filters', all query plans will read data from the Druid Index.

We will be simplifying the DDL for the case when you are starting from a Druid Index; in that case you will not need to define the Source DataFrame, and we will take care of it behind the scenes.

For the case when you have table(s) in Spark that you already query, the current way of writing the DDL is more natural. We will be introducing a create index DDL statement for these use cases.

If you set the nonAggregateQueryHandling flag to 'push_project_and_filters', all query plans will read data from the Druid Index.

I followed this and created a table to read data from the Druid index.

CREATE TABLE orderLineItemPartSupplierBase(
      o_orderkey integer, o_custkey integer,
      o_orderstatus string, o_totalprice double, o_orderdate string, o_orderpriority string,
      o_clerk string,
      o_shippriority integer, o_comment string, l_partkey integer, l_suppkey integer,
      l_linenumber integer,
      l_quantity double, l_extendedprice double, l_discount double, l_tax double,
      l_returnflag string,
      l_linestatus string, l_shipdate string, l_commitdate string, l_receiptdate string,
      l_shipinstruct string,
      l_shipmode string, l_comment string, order_year string, ps_partkey integer,
      ps_suppkey integer,
      ps_availqty integer, ps_supplycost double, ps_comment string, s_name string, s_address string,
      s_phone string, s_acctbal double, s_comment string, s_nation string,
      s_region string, p_name string,
      p_mfgr string, p_brand string, p_type string, p_size integer, p_container string,
      p_retailprice double,
      p_comment string, c_name string , c_address string , c_phone string , c_acctbal double ,
      c_mktsegment string , c_comment string , c_nation string , c_region string);

CREATE TABLE if not exists orderLineItemPartSupplier
      USING org.sparklinedata.druid
      OPTIONS (sourceDataframe "orderLineItemPartSupplierBase",
      timeDimensionColumn "l_shipdate",
      druidDatasource "tpch",
      druidHost "localhost",
      nonAggQueryHandling "push_project_and_filters",
      zkQualifyDiscoveryNames "true",
      starSchema '{"factTable" : "orderLineItemPartSupplier","relations" : []}');

But I couldn't read the data from Druid.
This only happens when using spark-sql; when I tested with the sparkline thriftserver it was OK.
So what is the difference between using spark-sql and the sparkline thriftserver?

I am using Spark 2.0.1, Druid 0.9.1.1 in local mode (non-HDFS), and spark-druid-olap built from the master branch.

Yes, we only support SQL running via the sparkline thriftserver. This is because we ensure that the Sparkline components (Planner, SQL extensions, Catalog hooks, etc.) are initialized and configured correctly when you start the sparkline thriftserver. It is a very thin wrapper on Spark's Thriftserver; see org.apache.spark.sql.hive.thriftserver.sparklinedata.HiveThriftServer2.
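For reference, any SQL client can talk to the sparkline thriftserver over the standard HiveServer2 JDBC protocol. The following is only a sketch: it assumes the default port 10000 on localhost, no authentication, and the Hive JDBC driver on the classpath; the table name and query are just taken from the DDL above.

import java.sql.DriverManager

object ThriftServerQuery {
  def main(args: Array[String]): Unit = {
    // Register the HiveServer2 JDBC driver.
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    // Assumed connection details: localhost, default port 10000, no credentials.
    val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "")
    try {
      val stmt = conn.createStatement()
      // An aggregate query that the Sparkline planner should rewrite into a Druid query
      // when the session is configured correctly.
      val rs = stmt.executeQuery(
        "SELECT l_returnflag, count(*) FROM orderLineItemPartSupplier GROUP BY l_returnflag")
      while (rs.next()) {
        println(s"${rs.getString(1)}\t${rs.getLong(2)}")
      }
    } finally {
      conn.close()
    }
  }
}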

You could set up a similar wrapper for the spark-sql (or spark-shell) client along the lines of the sparkline thriftserver. Take a look at how we set up SPLSessionState in SPLScalaReflection. SPLSessionState is where we ensure all Sparkline components are set up correctly for SQL processing.
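As an illustration only, such a wrapper around the spark-sql CLI might be structured as below. SparkSQLCLIDriver is Spark's own spark-sql entry point; the setupSparklineSession helper is a hypothetical placeholder for the actual reflection-based setup done via SPLScalaReflection / SPLSessionState.

import org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver

object SparklineSQLCLIDriver {
  def main(args: Array[String]): Unit = {
    // Hypothetical placeholder: install the Sparkline planner, SQL extensions and
    // catalog hooks before any SQL is parsed, the way the sparkline thriftserver
    // wrapper does via SPLScalaReflection / SPLSessionState.
    setupSparklineSession()

    // Then delegate to Spark's standard spark-sql CLI entry point.
    SparkSQLCLIDriver.main(args)
  }

  private def setupSparklineSession(): Unit = {
    // See SPLScalaReflection in the spark-druid-olap sources for the actual mechanism;
    // the method name here is an assumption, not part of the library's API.
  }
}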

We use spark-sql to read data from Druid, as in the TPCH benchmark cases. It works, but we found that operations like project, filter, and group by cannot be pushed down to Druid. Is some component not initialized correctly? The method we use is really the same as the TPCH benchmark.

See my comments in issue 66.