databricks / koalas

Koalas: pandas API on Apache Spark


Does Koalas support reading Hive tables by default?

amznero opened this issue · comments

Hi,

I'm trying to use Koalas to load a Hive table on a remote cluster. According to https://koalas.readthedocs.io/en/latest/reference/io.html#spark-metastore-table, the ks.read_table API can read a Spark metastore table, but it fails when I use ks.read_table to read my table.

import pandas as pd
import numpy as np
import databricks.koalas as ks
from pyspark.sql import SparkSession

koalas_df = ks.read_table("xxx.yyy")

Error log:

AnalysisException: "Table or view not found: `xxx`.`yyy`;;\n'UnresolvedRelation `xxx`.`yyy`\n"
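For reference, a small diagnostic I would run (just a sketch; it assumes the session Koalas uses is simply whatever SparkSession.builder.getOrCreate() returns, and that the static conf can be read back this way):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# "hive" means the Hive metastore is wired up; "in-memory" means only the
# local Spark catalog (spark-warehouse) is visible, so a remote Hive table
# would not be found.
print(spark.conf.get("spark.sql.catalogImplementation"))
print([db.name for db in spark.catalog.listDatabases()])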

However, I can load the table successfully by using PySpark + pandas + PyArrow directly.

Some snippets:

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

spark_df = spark.read.table("xxx")
pandas_df = spark_df.toPandas()
...
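Based on that, a possible workaround (just a sketch, not verified against Koalas internals; it assumes that SparkSession.builder.getOrCreate() inside Koalas returns the Hive-enabled session created beforehand) would be to create that session before the first Koalas call:

import databricks.koalas as ks
from pyspark.sql import SparkSession

# Create the Hive-enabled session first, before any Koalas operation, so that
# Koalas' internal getOrCreate() reuses it instead of building a plain session
# without Hive support.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

koalas_df = ks.read_table("xxx.yyy")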

And I checked the source code of read_table:

def read_table(name: str, index_col: Optional[Union[str, List[str]]] = None) -> DataFrame:

It uses default_session (without passing any extra options) to load the table, but default_session does not set the enableHiveSupport option:

def default_session(conf=None):
    if conf is None:
        conf = dict()
    should_use_legacy_ipc = False
    if LooseVersion(pyarrow.__version__) >= LooseVersion("0.15") and LooseVersion(
        pyspark.__version__
    ) < LooseVersion("3.0"):
        conf["spark.executorEnv.ARROW_PRE_0_15_IPC_FORMAT"] = "1"
        conf["spark.yarn.appMasterEnv.ARROW_PRE_0_15_IPC_FORMAT"] = "1"
        conf["spark.mesos.driverEnv.ARROW_PRE_0_15_IPC_FORMAT"] = "1"
        conf["spark.kubernetes.driverEnv.ARROW_PRE_0_15_IPC_FORMAT"] = "1"
        should_use_legacy_ipc = True
    builder = spark.SparkSession.builder.appName("Koalas")
    for key, value in conf.items():
        builder = builder.config(key, value)
    # Currently, Koalas is dependent on such join due to 'compute.ops_on_diff_frames'
    # configuration. This is needed with Spark 3.0+.
    builder.config("spark.sql.analyzer.failAmbiguousSelfJoin", False)
    if LooseVersion(pyspark.__version__) >= LooseVersion("3.0.1") and is_testing():
        builder.config("spark.executor.allowSparkContext", False)
    session = builder.getOrCreate()
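If I read this correctly, enableHiveSupport() is essentially a shortcut for setting spark.sql.catalogImplementation to "hive", so another possible workaround (a sketch; it assumes default_session from databricks.koalas.utils can be called directly, and that no SparkSession has been created yet, since this is a static conf) would be to pass that conf explicitly:

from databricks.koalas.utils import default_session

# Assumption: no SparkSession exists yet. spark.sql.catalogImplementation is a
# static conf, so it is only honored when the session is first created.
session = default_session({"spark.sql.catalogImplementation": "hive"})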

So I'm a little confused about ks.read_table: where does it load tables from?
Maybe it only points to the local spark-warehouse?