hbutani / spark-druid-olap

Sparkline BI Accelerator provides fast ad-hoc query capability over Logical Cubes. This has been folded into our SNAP Platform (http://bit.ly/2oBJSpP), an integrated BI platform on Apache Spark.

Home Page: http://sparklinedata.com/

Why do we need sourceDataframe?

exitNA opened this issue · comments

commented

Hi,
As I understand it, Druid has already indexed the data, so why do we still need a sourceDataframe such as orderLineItemPartSupplierBase that provides a data source path? Creating a table schema that maps to the Druid internal schema should be enough. Could somebody explain this?

With support for the Druid Select Query, the Spark Source DataFrame is for metadata purposes only; you can define a table in Spark with no data and then set up a Druid DataSource on top of it.

If you set the nonAggregateQueryHandling flag to 'push_project_and_filters', all query plans will read data from the Druid Index.

We will be simplifying the DDL for the case when you are starting from a Druid Index; in that case you will not need to define the Source DataFrame, and we will take care of it behind the scenes.

For the case when you have table(s) in Spark that you already query, the current way of writing the DDL is more natural. We will be introducing a create index DDL statement for these use cases.

If you set the nonAggregateQueryHandling flag to 'push_project_and_filters', all query plans will read data from the Druid Index.

I followed this and created a table to read data from the Druid index.

CREATE TABLE orderLineItemPartSupplierBase(
      o_orderkey integer, o_custkey integer,
      o_orderstatus string, o_totalprice double, o_orderdate string, o_orderpriority string,
      o_clerk string,
      o_shippriority integer, o_comment string, l_partkey integer, l_suppkey integer,
      l_linenumber integer,
      l_quantity double, l_extendedprice double, l_discount double, l_tax double,
      l_returnflag string,
      l_linestatus string, l_shipdate string, l_commitdate string, l_receiptdate string,
      l_shipinstruct string,
      l_shipmode string, l_comment string, order_year string, ps_partkey integer,
      ps_suppkey integer,
      ps_availqty integer, ps_supplycost double, ps_comment string, s_name string, s_address string,
      s_phone string, s_acctbal double, s_comment string, s_nation string,
      s_region string, p_name string,
      p_mfgr string, p_brand string, p_type string, p_size integer, p_container string,
      p_retailprice double,
      p_comment string, c_name string , c_address string , c_phone string , c_acctbal double ,
      c_mktsegment string , c_comment string , c_nation string , c_region string);

CREATE TABLE if not exists orderLineItemPartSupplier
      USING org.sparklinedata.druid
      OPTIONS (sourceDataframe "orderLineItemPartSupplierBase",
      timeDimensionColumn "l_shipdate",
      druidDatasource "tpch",
      druidHost "localhost",
      nonAggQueryHandling "push_project_and_filters",
      zkQualifyDiscoveryNames "true",
      starSchema '{"factTable" : "orderLineItemPartSupplier","relations" : []}');

But I couldn't read the data from Druid.
This only happens when using spark-sql; when I tested with the sparkline thriftserver it was OK.
So what is the difference between using spark-sql and the sparkline thriftserver?

I am using Spark 2.0.1, Druid 0.9.1.1 in local mode (non-HDFS), and spark-druid-olap built from the master branch.

Yes, we only support SQL running via the sparkline thriftserver. This is because we ensure that the Sparkline components (Planner, SQL extensions, Catalog hooks, etc.) are initialized and configured correctly when you start the sparkline thriftserver. It is a very thin wrapper on Spark's Thriftserver; see org.apache.spark.sql.hive.thriftserver.sparklinedata.HiveThriftServer2.
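For reference, any SQL client can talk to the sparkline thriftserver over the standard HiveServer2 JDBC protocol. The following is only a sketch: it assumes the default port 10000 on localhost, no authentication, and the Hive JDBC driver on the classpath; the table name and query are just taken from the DDL above.

import java.sql.DriverManager

object ThriftServerQuery {
  def main(args: Array[String]): Unit = {
    // Register the HiveServer2 JDBC driver.
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    // Assumed connection details: localhost, default port 10000, no credentials.
    val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "")
    try {
      val stmt = conn.createStatement()
      // An aggregate query that the Sparkline planner should rewrite into a Druid query
      // when the session is configured correctly.
      val rs = stmt.executeQuery(
        "SELECT l_returnflag, count(*) FROM orderLineItemPartSupplier GROUP BY l_returnflag")
      while (rs.next()) {
        println(s"${rs.getString(1)}\t${rs.getLong(2)}")
      }
    } finally {
      conn.close()
    }
  }
}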

You could set up a similar wrapper for the spark-sql (or spark-shell) client along the lines of the sparkline thriftserver. Take a look at how we set up SPLSessionState in SPLScalaReflection. SPLSessionState is where we ensure all Sparkline components are set up correctly for SQL processing.
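As an illustration only, such a wrapper around the spark-sql CLI might be structured as below. SparkSQLCLIDriver is Spark's own spark-sql entry point; the setupSparklineSession helper is a hypothetical placeholder for the actual reflection-based setup done via SPLScalaReflection / SPLSessionState.

import org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver

object SparklineSQLCLIDriver {
  def main(args: Array[String]): Unit = {
    // Hypothetical placeholder: install the Sparkline planner, SQL extensions and
    // catalog hooks before any SQL is parsed, the way the sparkline thriftserver
    // wrapper does via SPLScalaReflection / SPLSessionState.
    setupSparklineSession()

    // Then delegate to Spark's standard spark-sql CLI entry point.
    SparkSQLCLIDriver.main(args)
  }

  private def setupSparklineSession(): Unit = {
    // See SPLScalaReflection in the spark-druid-olap sources for the actual mechanism;
    // the method name here is an assumption, not part of the library's API.
  }
}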

We use spark-sql to read data from Druid, as in the TPCH benchmark cases. It works, but we found that operations like project, filter, and group by cannot be pushed down to Druid. Is some component not initialized correctly? The method we use is really the same as the TPCH benchmark.

See my comments in issue 66.