delta-io / delta-sharing

An open protocol for secure data sharing

Home Page: https://delta.io/sharing


Read Change Data Feed from a Table returns "change data was not recorded for version *" error

LazyRen opened this issue

Hello, I am currently working on a standalone Delta Sharing server backed by S3 storage.

Below is the PySpark script I used to create and update the table.

from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

s3_url = "<my_url>"

# Note: spark.jars must be a single comma-separated list (a second
# .config("spark.jars", ...) overwrites the first), and aws-java-sdk-bundle
# already includes the AWS SDK, so one setting covers both jars.
builder = (
    SparkSession.builder
    .appName("quickstart")
    .master("local[*]")
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .config("spark.hadoop.fs.s3a.access.key", "<access_key>")
    .config("spark.hadoop.fs.s3a.secret.key", "<secret_key>")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .config("spark.jars", "/hadoop-aws-3.3.2.jar,/aws-java-sdk-bundle-1.12.425.jar")
    .config("spark.sql.warehouse.dir", s3_url)
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

spark.sql('''CREATE OR REPLACE TABLE `courses`(
    cid                 Integer,
    department          String,
    cap                 Integer,
    instructor          String,
    easy                Double,
    useful              Double
)
USING DELTA
TBLPROPERTIES (delta.enableChangeDataFeed = true)
''')

# Version 0 - insert
spark.sql('''INSERT INTO `courses` VALUES
(341, 'CS', 80, 'Armin Jamshidpey', 0.54, 0.99),
(347, 'PMATH', 65, 'David McKinnon', 0.43, 0.90),
(488, 'CS', 45, 'Toshiya Hachisuka', 0.58, 0.46),
(104, 'PHY', 1200, 'Joseph Smith', 0.92, 0.25),
(130, 'PHY', 250, 'Maria Elrena', 0.79, 0.63),
(140, 'MUSIC', 200, 'Simon Wood', 0.87, 0.58),
(246, 'CS', 90, 'Mark Petrick', 0.59, 0.66),
(250, 'AMATH', 200, 'John Wick', 0.75, 0.86),
(251, 'AMATH', 200, 'OPT 347', 0.75, 0.86),
(370, 'CS', 105, 'Jeff Orchard', 0.61, 0.86)
''')

# Version 1 - delete
spark.sql('''DELETE FROM `courses` WHERE
department = 'AMATH'
''')

# Version 2 - update
spark.sql('''UPDATE `courses`
SET instructor = 'Richard Feynman' WHERE department = 'PHY'
''')
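Note that the CREATE OR REPLACE TABLE statement above is itself a commit, so it is worth double-checking which table version maps to which operation; DESCRIBE HISTORY lists one row per commit:

spark.sql("DESCRIBE HISTORY `courses`").show(truncate=False)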

I have enabled CDF via TBLPROPERTIES (delta.enableChangeDataFeed = true) at table creation, so change data should be available for every version of the table.
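That the property actually took effect can be verified with:

spark.sql("SHOW TBLPROPERTIES `courses`").show(truncate=False)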

And executing the following returns the expected records:

spark.read.format("delta") \
  .option("readChangeFeed", "true") \
  .option("startingVersion", 1) \
  .option("endingVersion", 1) \
  .table("`courses`") \
  .show(100)
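For reference, the same range can also be read through Delta's table_changes table-valued function:

spark.sql("SELECT * FROM table_changes('courses', 1, 1)").show(100)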

Calling the Read Change Data Feed from a Table REST endpoint, however, fails:

ec2-user:~> curl http://localhost:8080/delta-sharing/shares/demos3select/schemas/school/tables/courses/changes?startingVersion=1&endingVersion=1
[1] 24330
ec2-user:~> {"errorCode":"INVALID_PARAMETER_VALUE","message":"Error getting change data for range [1, 2] as change data was not recorded for version [1]"}
[1]+  Done                    curl http://localhost:8080/delta-sharing/shares/demos3select/schemas/school/tables/courses/changes?startingVersion=1
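One note on the transcript: the unquoted & in the URL is interpreted by the shell as a background operator (hence the [1] 24330 job line, and the Done line that shows only startingVersion=1), so the server never receives endingVersion and defaults the end of the range to the latest table version, which is why the error reports the range [1, 2]. Quoting the URL passes both parameters through:

curl "http://localhost:8080/delta-sharing/shares/demos3select/schemas/school/tables/courses/changes?startingVersion=1&endingVersion=1"

The requested range still includes version 1, the version the error message singles out, so the missing quoting does not seem to be the root cause.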

Both DS 0.7.4 and DS 1.0.2 return the same error.
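The same endpoint can also be exercised through the delta-sharing Python connector. A minimal sketch, assuming a hypothetical profile file /path/to/config.share (the usual shareCredentialsVersion/endpoint/bearerToken JSON) pointing at this server:

import delta_sharing

# "<profile-file>#<share>.<schema>.<table>" addresses the shared table
table_url = "/path/to/config.share#demos3select.school.courses"

# This issues the same /changes request as the curl call above
changes = delta_sharing.load_table_changes_as_pandas(
    table_url, starting_version=1, ending_version=1
)
print(changes)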

I have attached the Delta table data below. Could you please check what went wrong here?
Thank you.

(By the way, where can I find the documentation for the YAML config file? It took me a good 5 minutes to figure out what went wrong after updating from DS 0.7.4 to 1.0.2; it turns out the config name cdfEnabled was changed to historyShared.)

DS server config:

# The format version of this config file
version: 1
# Config shares/schemas/tables to share
shares:
- name: "demos3select"
  schemas:
  - name: "school"
    tables:
    - name: "courses"
      location: "s3a://<s3_url>/delta_lake/poc/courses"
      id: "00000000-0000-0000-0000-000000000000"
      historyShared: true
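As a sanity check that the share itself is wired up (assuming, as with the curl calls above, that no bearer token is required), the table's metadata endpoint can be queried; for a CDF-enabled table the returned metadata should carry the change data feed setting in its configuration map:

curl "http://localhost:8080/delta-sharing/shares/demos3select/schemas/school/tables/courses/metadata"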

courses.zip