Issue writing Panda Dataframe back to Impala

Question

waldenrt opened this issue 5 years ago

I have the following code:

from impala.dbapi import connect
from impala.util import as_pandas
import sys
import sqlalchemy
conn = connect(host='impalaprd.tenethealth.net',
               port=21050,
               use_ssl=True,
               auth_mechanism='GSSAPI',
               kerberos_service_name='impala')
cursor = conn.cursor()
engine = sqlalchemy.create_engine('impala://', creator=conn)
sql = 'SELECT DBID from ACEDTA.DBINFO LIMIT 1000'
cursor.execute(sql)
mypanda = as_pandas(cursor)
print(mypanda)
mypanda.to_sql('default.dbinfo_dataframe_bulkload',engine)

When I call mypanda.to_sql, I get the following error:

'HiveServer2Connection' object is not callable

Am I missing a package in my environment, or is something wrong with the code?

Thanks,

Richard

Rafael Reuber · Answer 1 · Sat Mar 14 2020 09:52:41 GMT+0800 (China Standard Time)

from impala.dbapi import connect
from impala.util import as_pandas
import sqlalchemy

conn = connect(host='impalaprd.tenethealth.net',
    port=21050,
    use_ssl=True,
    auth_mechanism='GSSAPI',
    kerberos_service_name='impala')

cursor = conn.cursor()
engine = sqlalchemy.create_engine('impala://', creator=conn)
sql = 'SELECT DBID from ACEDTA.DBINFO LIMIT 1000'
cursor.execute(sql)

mypanda = as_pandas(cursor)
print(mypanda)

mypanda.to_sql('default.dbinfo_dataframe_bulkload', engine)

It seems your issue is with the connection. On which line does the error occur?

Luciano Issoe · Answer 2 · Tue Jul 07 2020 09:28:04 GMT+0800 (China Standard Time)

I am having the same problem with these versions:

python 3.6.10
sqlalchemy 1.3.18
impyla 0.16.2

It seems that the creator argument expects a connection factory (a callable that returns a new connection) rather than a single, already-open Hive connection:

/home/cdsw/.local/lib/python3.6/site-packages/sqlalchemy/pool/base.py in __connect(self, first_connect_check)
654 try:
655 self.starttime = time.time()
--> 656 connection = pool._invoke_creator(self)
657 pool.logger.debug("Created new connection %r", connection)
658 self.connection = connection

/home/cdsw/.local/lib/python3.6/site-packages/sqlalchemy/pool/base.py in <lambda>(crec)
247 argspec = util.get_callable_argspec(self._creator, no_self=True)
248 except TypeError:
--> 249 return lambda crec: creator()
250
251 defaulted = argspec[3] is not None and len(argspec[3]) or 0

TypeError: 'HiveServer2Connection' object is not callable

Luciano Issoe · Answer 3 · Tue Jul 07 2020 09:29:12 GMT+0800 (China Standard Time)

@rafaelreuber, can you tell me which SQLAlchemy version you are using?

Luciano Issoe · Answer 4 · Tue Jul 07 2020 10:01:02 GMT+0800 (China Standard Time)

I managed to make it work by using a connection URL instead:

engine = sqlalchemy.create_engine('impala://host:port/database?use_ssl=True&auth_mechanism=GSSAPI')

I also had to assign the String type to the TEXT field through the dtype argument of to_sql:

from sqlalchemy.types import String

mypanda.to_sql('poc_dashboard_test', engine, if_exists='append', index=False, dtype={'name':String()})
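
An alternative that keeps the keyword arguments from the question is to pass create_engine a function that opens the connection, instead of the connection itself. What follows is a minimal sketch, not a tested recipe: it assumes impyla, SQLAlchemy and pandas are installed and that impyla registers the impala:// dialect. The names make_connection, build_engine and write_frame are illustrative, and mapping every object-typed column to String is an assumption about which columns need the override.

```python
def make_connection():
    """Open a new impyla connection; called by the engine when needed."""
    # Imported here so the module loads even where impyla is absent.
    from impala.dbapi import connect
    return connect(
        host='impalaprd.tenethealth.net',
        port=21050,
        use_ssl=True,
        auth_mechanism='GSSAPI',
        kerberos_service_name='impala',
    )


def build_engine():
    """Create an engine that obtains connections from make_connection."""
    import sqlalchemy
    # Pass the function object, not the result of calling it.
    return sqlalchemy.create_engine('impala://', creator=make_connection)


def write_frame(frame, engine, table, schema='default'):
    """Append a DataFrame to schema.table, forcing String for text columns."""
    from sqlalchemy.types import String
    # Assumption: object-typed columns are the ones that would otherwise
    # be created with a TEXT type.
    text_columns = {
        col: String() for col in frame.columns if frame[col].dtype == object
    }
    frame.to_sql(table, engine, schema=schema, if_exists='append',
                 index=False, dtype=text_columns)
```

With these helpers, the last line of the question would become write_frame(mypanda, build_engine(), 'dbinfo_dataframe_bulkload'). The schema is passed separately here instead of being embedded in the table name.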