
Hello,

I am using PyDruid to measure the runtime of a query in Druid, without counting the time it takes to output the results that come back from the API.

from pydruid.db import connect
import time

conn = connect(host='localhost', port=8082, path='/druid/v2/sql/', scheme='http')
curs = conn.cursor()
start = time.time()
curs.execute("""
    SELECT id_station, count(*) FROM bafu_comma where id_station IN (32, 54, 8, 25, 95, 13, 80, 16, 83, 27) group by id_station
""")
end1 = time.time()
print('execution runtime:', (end1 - start) * 1000, 'ms')
print('number of rows:', sum(1 for _ in curs))
end2 = time.time()
# for row in curs:
#      print(row)
print('total time:', (end2 - start) * 1000, 'ms')

Is this a correct way of measuring the runtime? My execution time is always around 200 ms or 50 ms, which is a bit suspicious. Also, the total runtime that I measure is much higher than the time reported when I run the same query through the API.

Any ideas on how to properly evaluate a query execution time in Druid?

Thanks!

I'm not sure if that's correct. The DB API connector streams the results from Druid, so unless you have iterated over the entire result set I don't think you can assume that the query execution has finished.

pydruid/pydruid/db/api.py

Lines 365 to 380 in bd7b741

    
    # Druid will stream the data in chunks of 8k bytes, splitting the JSON
    # between them; setting `chunk_size` to `None` makes it use the server
    # size
    chunks = r.iter_content(chunk_size=None, decode_unicode=True)
    Row = None
    for row in rows_from_chunks(chunks):
        # update description
        if self.description is None:
            self.description = (
                list(row.items()) if self.header else get_description_from_row(row)
            )

        # return row in namedtuple
        if Row is None:
            Row = namedtuple("Row", row.keys(), rename=True)
        yield Row(*row.values())
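
Because the excerpt above ends in a `yield`, the enclosing function is a generator: its body only runs as the caller iterates. Here is a small sketch of that general Python behaviour. It does not touch pydruid at all; `slow_rows` is a hypothetical stand-in for a streamed result set, and the 10 ms delay per row is an arbitrary assumption.

```python
import time


def slow_rows(n, delay=0.01):
    """Hypothetical stand-in for a streamed result set.

    Each row costs `delay` seconds to produce, as if it arrived over the network.
    """
    for i in range(n):
        time.sleep(delay)
        yield (i,)


start = time.perf_counter()
rows = slow_rows(5)  # creating the generator runs none of its body
created = time.perf_counter() - start

count = sum(1 for _ in rows)  # iterating is what actually does the work
drained = time.perf_counter() - start

print(f"created in {created * 1000:.2f} ms")
print(f"drained {count} rows in {drained * 1000:.2f} ms")
```

If the connector behaves like this, a timer stopped before the cursor has been iterated would miss some or all of the time spent producing rows.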

The correct time is probably closer to end2 - start in this case, I think.
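
A minimal sketch of how the two timings could be taken together, assuming only that the cursor follows the Python DB API (an `execute` method, and iteration over rows). `time_query` is a hypothetical helper name, not part of pydruid. It uses `time.perf_counter()` rather than `time.time()` because that clock is intended for measuring short intervals.

```python
import time


def time_query(cursor, sql):
    """Run `sql` on a DB API cursor and return (execute_ms, total_ms, row_count).

    execute_ms covers only the call to cursor.execute().
    total_ms also covers iterating over every row, which matters when
    the connector streams results lazily.
    """
    start = time.perf_counter()
    cursor.execute(sql)
    after_execute = time.perf_counter()

    # Drain the cursor without keeping the rows in memory.
    row_count = sum(1 for _ in cursor)
    after_drain = time.perf_counter()

    execute_ms = (after_execute - start) * 1000
    total_ms = (after_drain - start) * 1000
    return execute_ms, total_ms, row_count
```

With the cursor from the original snippet this would be called as `time_query(curs, "SELECT ...")`. Note that `total_ms` is a client-side figure: it includes network transfer and JSON parsing as well as the time Druid itself spends on the query, so it can be expected to be higher than a server-side measurement.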


Evaluating query runtime without output