druid-io / pydruid

A Python connector for Druid

Evaluating query runtime without output

AKheli opened this issue

Hello,

I am using PyDruid to measure the runtime of a query in Druid, without taking into account the time needed to output the results obtained through the API.

from pydruid.db import connect
import time

conn = connect(host='localhost', port=8082, path='/druid/v2/sql/', scheme='http')
curs = conn.cursor()
start = time.time()
curs.execute("""
    SELECT id_station, count(*) FROM bafu_comma where id_station IN (32, 54, 8, 25, 95, 13, 80, 16, 83, 27) group by id_station
""")
end1 = time.time()
print('execution runtime:', (end1 - start) * 1000, 'ms')
print('number of rows:', sum(1 for _ in curs))
end2 = time.time()
# for row in curs:
#      print(row)
print('total time: ',(end2 - start) * 1000, 'ms')

Is this a correct way of measuring the runtime? My execution time is always around 200 ms or 50 ms, which is a bit suspicious. Also, the total runtime that I obtain is much higher than the results that I obtain in the API.

Any ideas on how to properly evaluate a query execution time in Druid?

Thanks!

I'm not sure that's correct. The DB API connector streams the results from Druid, so unless you have iterated over the entire result set, I don't think you can assume that the query has finished executing.

pydruid/pydruid/db/api.py

Lines 365 to 380 in bd7b741

# Druid will stream the data in chunks of 8k bytes, splitting the JSON
# between them; setting `chunk_size` to `None` makes it use the server
# size
chunks = r.iter_content(chunk_size=None, decode_unicode=True)
Row = None
for row in rows_from_chunks(chunks):
    # update description
    if self.description is None:
        self.description = (
            list(row.items()) if self.header else get_description_from_row(row)
        )
    # return row in namedtuple
    if Row is None:
        Row = namedtuple("Row", row.keys(), rename=True)
    yield Row(*row.values())

The correct time is probably closer to end2 - start in this case, I think.
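
To illustrate the point about streaming, here is a minimal sketch of a timing helper that records both the moment the call returns and the moment the last row has been read. The names `time_streaming_query` and `fake_streaming_query` are hypothetical and not part of pydruid; the fake query only simulates a lazily streamed result set, so the numbers it prints say nothing about Druid itself.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def time_streaming_query(run_query: Callable[[], Iterable]) -> Tuple[float, float, int]:
    """Return (ms until run_query returned, ms until all rows were read, row count)."""
    start = time.perf_counter()
    rows = run_query()                 # may return before all rows have arrived
    returned = time.perf_counter()
    row_count = sum(1 for _ in rows)   # drain the stream so every row is fetched
    drained = time.perf_counter()
    return (returned - start) * 1000, (drained - start) * 1000, row_count


def fake_streaming_query() -> Iterator[tuple]:
    """Hypothetical stand-in for a streaming cursor: rows are produced lazily."""
    for station in (32, 54, 8):
        time.sleep(0.01)               # simulated latency per streamed chunk
        yield (station, 100)


if __name__ == "__main__":
    returned_ms, total_ms, n = time_streaming_query(fake_streaming_query)
    print(f"returned after {returned_ms:.1f} ms, drained after {total_ms:.1f} ms, {n} rows")
```

With the cursor from the question, run_query could be a small function that calls curs.execute(...) and then returns curs. The second number then corresponds to end2 - start. It also includes network transfer and client-side JSON parsing, which may be part of why it is higher than the time reported on the Druid side.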