druid-io / druid-io.github.io

Druid Project Website

Home Page:http://druid.io

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

quantile (99%) aggregation with druid and hive

datametrics opened this issue · comments

Hi all together,

i've got a question concerning data ingestion from inside hive.
I use a query similar to this one for indexing a result view in Druid.

use LONGFIELD_DWH;
DROP TABLE IF EXISTS TestCube;
EXPLAIN CREATE TABLE TestCube
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES (
"druid.segment.granularity" = "MONTH",
"druid.query.granularity" = "DAY")
AS
SELECT
cast(DimDate.Name as timestamp with local time zone) as __time,
cast(DimInstrument.Name as string) InsName,
FactTable.TestMeasure * p.Quantity as TestMeasure,
cast(DimInstrument.InstrumentType as string) InstrumentType,
cast(FactTable.IdScenario as string) Scenario,
cast(p.Portfolio as string) Portfolio,
p.Quantity as Quantity
FROM FactTable
JOIN DimDate on (FactTable.IdDate = DimDate.Id)
JOIN DimInstrument on (FactTable.IdInstrument = DimInstrument.Id)
JOIN DimScenario on (FactTable.IdScenario = DimScenario.Id)
JOIN (select pos.Name, pos.IdInstrument, pos.IdPortfolio,pf.Name as Portfolio,
pos.Quantity from FactPosition pos
join DimInstrument ins on ins.Id = pos.IdInstrument
join DimPortfolio pf on pf.Id = pos.IdPortfolio) p
on DimInstrument.Id = p.IdInstrument;

That works without any issues. The datasource gets published in druid and is visible
in superset after querying datasource metadata. A segment is created for every day
in the facttable.

As part of my analysis i would like to build a postaggregated quantile measure like
described on the project documentation page:

At first i create a new metric in superset "HISTOGRAM" of type "approxHistogramFold" with this JSON:

{
"type" : "approxHistogramFold",
"name" : "HISTOGRAM",
"fieldName" : "sum__testmeasure",
"resolution" : 500,
"numBuckets" : 500,
"lowerLimit" : 0.0
}

Next, i create another metric in superset "Response_99" (type = "postagg")that should display the
fifth-worst (i.e. the 99% quantile) item from the list of 500 ordered testmeasure items
from the approxhistogram measure (JSON):

{
"type": "quantile",
"name": "HISTOGRAM",
"fieldName": "HISTOGRAM",
"probability": "0.99"
}

That measure combination runs but the output is "na". As far as I understand the documentation the field
of the first aggregator for "testmeasure" has to be ingested with a specific metric-spec / ingestion
aggregator similar to this:

http://druid.io/docs/latest/development/extensions-core/datasketches-aggregators.html

Does anyone know how to incorporate this into the ingestion query issued in HIVE?
If not, does anyone see how to mimic such quantile measures maybe without "approxHistogramFold" and
probably in a simple javascript postaggregation?

Thanks in advance!