acroz / pylivy

A Python client for Apache Livy, enabling use of remote Apache Spark clusters.

pylivy sessions with hive

iliakliuchnikov opened this issue · comments


Hi!
I'm trying to start a Livy session with Hive:

from livy import LivySession
import datetime

LIVY_URL = "http://mylivy:80"

with LivySession.create(
    LIVY_URL,
    jars=[
        "gs://mybacket/hotfix/jars/iceberg-spark3-runtime-0.9.0.jar",
        "gs://mybacket/hotfix/jars/spark_etl-1.0-SNAPSHOT.jar",
        "gs://mybacket/hotfix/jars/spark-bigquery-with-dependencies_2.12-0.17.3.jar",
    ],
    py_files=["gs://mybacket/hotfix/dags/package.zip"],
    num_executors=1,
    name=f"add-attribution-window-hours-{datetime.datetime.now()}",
    spark_conf={
        "spark.kubernetes.container.image.pullPolicy": "Always",
        "spark.kubernetes.driverEnv.ETL_ENV": "prod",
        "spark.executorEnv.ETL_ENV": "prod",
        "spark.kubernetes.driverEnv.HIVE_CONF_DIR": "/opt/spark/conf/hive-site",
        "spark.sql.warehouse.dir": "gs://mybacket/hive/",
        "spark.sql.catalogImplementation": "hive",
        "spark.kubernetes.driver.secrets.hive-site": "/opt/spark/conf/hive-site",
        "spark.executor.memory": "16g",
        "spark.executor.cores": "6",
        "spark.eventLog.enabled": "true",
        "spark.kubernetes.namespace": "default",
    },
) as session:
    # Run some code on the remote cluster
    session.run("spark.sql('show databases;').show(20, False)")
    # Retrieve the result
    # local_df = session.download("df")
    # local_df.show()
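As an aside, interpolating `datetime.datetime.now()` directly into `name` produces a string with spaces, colons, and dots; Kubernetes resource names only allow lowercase alphanumerics and dashes, so Spark may have to mangle the name (or reject it). A small sketch of a sanitizer, assuming nothing beyond the stdlib (`session_name` is a hypothetical helper, not part of pylivy):

```python
import datetime
import re


def session_name(prefix: str) -> str:
    # Use a compact timestamp instead of str(datetime.now()), which
    # contains spaces and colons that are invalid in Kubernetes names.
    ts = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
    name = f"{prefix}-{ts}".lower()
    # Replace anything outside lowercase alphanumerics and dashes.
    return re.sub(r"[^a-z0-9-]", "-", name)


print(session_name("add-attribution-window-hours"))
```

The result can be passed as `name=` to `LivySession.create` so session names stay predictable in logs and in the Kubernetes API.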

But 'show databases' always comes back empty, as if the session were using a local, empty Hive metastore. In the log I can see it trying to start Hive:

21/12/01 12:22:24 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('gs://mybucket/hive/').
21/12/01 12:22:24 INFO SharedState: Warehouse path is 'gs://mybucket/hive/'.
21/12/01 12:22:26 INFO CodeGenerator: Code generated in 267.63079 ms
21/12/01 12:22:26 INFO CodeGenerator: Code generated in 11.710265 ms
21/12/01 12:22:26 INFO CodeGenerator: Code generated in 17.16884 ms

When I start a Livy batch with the same spark_conf, it always works fine, I have access to all tables, and the log looks like this:

21/12/01 09:29:03 INFO HiveConf: Found configuration file file:/opt/spark/conf/hive-site/hive-site.xml
21/12/01 09:29:03 INFO HiveUtils: Initializing HiveMetastoreConnection version 2.3.7 using Spark classes.
21/12/01 09:29:03 INFO HiveConf: Found configuration file file:/opt/spark/conf/hive-site/hive-site.xml
21/12/01 09:29:03 INFO SessionState: Created HDFS directory: /tmp/hive/root
21/12/01 09:29:03 INFO SessionState: Created local directory: /tmp/root
21/12/01 09:29:03 INFO SessionState: Created HDFS directory: /tmp/hive/root/c42cf693-a56b-44d4-8b4f-5b67ed85c721
21/12/01 09:29:03 INFO SessionState: Created local directory: /tmp/root/c42cf693-a56b-44d4-8b4f-5b67ed85c721
21/12/01 09:29:03 INFO SessionState: Created HDFS directory: /tmp/hive/root/c42cf693-a56b-44d4-8b4f-5b67ed85c721/_tmp_space.db
21/12/01 09:29:03 INFO HiveClientImpl: Warehouse location for Hive client (version 2.3.7) is gs://mybucket/hive/
21/12/01 09:29:04 INFO metastore: Trying to connect to metastore with URI thrift://myhive-metastore.us-north1-a.c.myproject.internal:9083
21/12/01 09:29:04 INFO metastore: Opened a connection to metastore, current connections: 1
21/12/01 09:29:04 INFO metastore: Connected to metastore.

How do I correctly set the Hive config for Livy sessions?

commented

Answer:
in the Livy server config, just set

livy.repl.enable-hive-context = true

and Hive is enabled for sessions.
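For context, this is a server-side Livy setting rather than something passed from pylivy per session. A sketch of where it goes, assuming a default Livy install layout (the path is an assumption; restart the Livy server after changing it):

```properties
# conf/livy.conf (location depends on your deployment)
# Start interactive REPL sessions with Hive support, so the
# HIVE_CONF_DIR / hive-site.xml configuration is picked up the
# same way it is for batches.
livy.repl.enable-hive-context = true
```

With this enabled, the session log should show the same `HiveConf: Found configuration file` and metastore-connection lines that appear in the batch log above.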