Read Change Data Feed from a Table returns "change data was not recorded for version *" error
LazyRen opened this issue
Hello, I am currently working on a standalone Delta Sharing server with S3 storage.
Below is the PySpark script I used to create and update the table.
from pyspark.sql import SparkSession
from delta import *
s3_url = "<my_url>"
# Note: the original script set "spark.jars" twice (the second value silently
# overwrites the first) and duplicated "spark.hadoop.fs.s3a.impl"; both are
# consolidated below.
builder = SparkSession.builder \
    .appName("quickstart") \
    .master("local[*]") \
    .config("spark.hadoop.fs.s3a.path.style.access", "true") \
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
    .config("spark.hadoop.fs.s3a.access.key", "<access_key>") \
    .config("spark.hadoop.fs.s3a.secret.key", "<secret_key>") \
    .config("spark.jars",
            "/aws-java-sdk-1.12.424/lib/aws-java-sdk-1.12.424.jar,"
            "/hadoop-aws-3.3.2.jar,"
            "/aws-java-sdk-bundle-1.12.425.jar") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .config("spark.sql.catalog.spark_catalog.type", "hadoop") \
    .config("spark.sql.warehouse.dir", s3_url)
spark = configure_spark_with_delta_pip(builder).getOrCreate()
spark.sql('''CREATE OR REPLACE TABLE `courses`(
cid Integer,
department String,
cap Integer,
instructor String,
easy Double,
useful Double
)
USING DELTA
TBLPROPERTIES (delta.enableChangeDataFeed = true)
''')
# Version 0 - insert
spark.sql('''INSERT INTO `courses` VALUES
(341, 'CS', 80, 'Armin Jamshidpey', 0.54, 0.99),
(347, 'PMATH', 65, 'David McKinnon', 0.43, 0.90),
(488, 'CS', 45, 'Toshiya Hachisuka', 0.58, 0.46),
(104, 'PHY', 1200, 'Joseph Smith', 0.92, 0.25),
(130, 'PHY', 250, 'Maria Elrena', 0.79, 0.63),
(140, 'MUSIC', 200, 'Simon Wood', 0.87, 0.58),
(246, 'CS', 90, 'Mark Petrick', 0.59, 0.66),
(250, 'AMATH', 200, 'John Wick', 0.75, 0.86),
(251, 'AMATH', 200, 'OPT 347', 0.75, 0.86),
(370, 'CS', 105, 'Jeff Orchard', 0.61, 0.86)
''')
# Version 1 - delete
spark.sql('''DELETE FROM `courses` WHERE
department = 'AMATH'
''')
# Version 2 - update
spark.sql('''UPDATE `courses`
SET instructor = 'Richard Feynman' WHERE department = 'PHY'
''')
I have enabled CDF with TBLPROPERTIES (delta.enableChangeDataFeed = true).
And executing
spark.read.format("delta") \
.option("readChangeFeed", "true") \
.option("startingVersion", 1) \
.option("endingVersion", 1) \
.table("`courses`") \
.show(100)
returns the expected records, while calling the Read Change Data Feed from a Table REST endpoint
returns:
ec2-user:~> curl http://localhost:8080/delta-sharing/shares/demos3select/schemas/school/tables/courses/changes?startingVersion=1&endingVersion=1
[1] 24330
ec2-user:~> {"errorCode":"INVALID_PARAMETER_VALUE","message":"Error getting change data for range [1, 2] as change data was not recorded for version [1]"}
[1]+ Done curl http://localhost:8080/delta-sharing/shares/demos3select/schemas/school/tables/courses/changes?startingVersion=1
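(A side note on the transcript above, which may or may not be related to the underlying error: the URL is unquoted, so the shell treats `&` as the background operator, which is why the `[1] 24330` job line appears and why `endingVersion=1` never reaches the server. That would also explain the server reporting the open-ended range `[1, 2]` instead of `[1, 1]`. A quoted invocation of the same endpoint sends both parameters; the `|| true` is only there so the snippet is safe to run when no server is listening locally.)

```shell
# Quote the URL so '&' is passed to the server instead of backgrounding curl.
url="http://localhost:8080/delta-sharing/shares/demos3select/schemas/school/tables/courses/changes?startingVersion=1&endingVersion=1"
curl "$url" || true
```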
Both DS 0.7.4 and DS 1.0.2 return the same error.
I have attached the Delta table data below. Could you please check what went wrong here?
Thank you.
(By the way, where can I find documentation for the YAML config file? It took me a good 5 minutes to figure out what went wrong after upgrading from DS 0.7.4 to 1.0.2; it turns out the config name cdfEnabled was changed to historyShared.)
DS server config:
# The format version of this config file
version: 1
# Config shares/schemas/tables to share
shares:
- name: "demos3select"
  schemas:
  - name: "school"
    tables:
    - name: "courses"
      location: "s3a://<s3_url>/delta_lake/poc/courses"
      id: "00000000-0000-0000-0000-000000000000"
      historyShared: true
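For whoever picks this up: since the error says change data was not recorded for version 1 (the DELETE), one diagnostic sketch is to look for `cdc` actions in that version's commit file in the Delta log. The local path below is an assumption for illustration; adjust it to wherever the table directory is accessible.

```shell
# Hypothetical local path to the table directory; adjust as needed.
logfile="courses/_delta_log/00000000000000000001.json"
if grep -q '"cdc"' "$logfile" 2>/dev/null; then
  echo "version 1 committed cdc actions (change data was recorded)"
else
  echo "no cdc actions found for version 1 (or log file not present)"
fi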