Append mode doesn't replace entire row if key collision happens
tsekityam opened this issue
What did I do
df = (
    spark
    .sql("SELECT 'test' AS key, 123 AS col_a, 223 AS col_b")
)
(
    df
    .write
    .format("org.apache.spark.sql.redis")
    .option("host", redis_host)
    .option("port", redis_port)
    .option("ssl", "true")
    .option("table", "test_append_behavour")
    .option("key.column", "key")
    .mode("overwrite")
    .save()
)
import redis  # redis-py client, used only to inspect the stored hash

r = redis.Redis(host=redis_host, port=redis_port, db=0, ssl=True)
print(r.hgetall("test_append_behavour:test"))
# {b'col_b': b'223', b'col_a': b'123'}
df2 = (
    spark
    .sql("SELECT 'test' AS key, 324 AS col_a, 423 AS col_c")
)
(
    df2
    .write
    .format("org.apache.spark.sql.redis")
    .option("host", redis_host)
    .option("port", redis_port)
    .option("ssl", "true")
    .option("table", "test_append_behavour")
    .option("key.column", "key")
    .mode("append")
    .save()
)
r = redis.Redis(host=redis_host, port=redis_port, db=0, ssl=True)
print(r.hgetall("test_append_behavour:test"))
# {b'col_b': b'223', b'col_a': b'324', b'col_c': b'423'}
What did I see
test_append_behavour:test now has 3 fields:
{b'col_b': b'223', b'col_a': b'324', b'col_c': b'423'}
What did I expect
test_append_behavour:test should only have the 2 fields from df2:
{b'col_a': b'324', b'col_c': b'423'}
According to the docs:
"Please note, when key collision happens and SaveMode.Append is set, the former row is replaced with a new one."
So the row written from df should be replaced by the row from df2 in append mode, because they share the same key.
However, col_b from df is still there after the append, which means the row is not replaced as a whole. Only the fields that also appear in the new row are overwritten.
Hi @tsekityam,
SaveMode.Append uses the hmset command internally, so it may not completely overwrite the row if the schema of the new DataFrame is different. You are right, the documentation is not accurate.
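Until the behaviour or the docs change, one possible workaround is to delete the existing hash before writing the new fields. The sketch below is an assumption-laden illustration, not something spark-redis provides: replace_hash is a hypothetical helper that takes a redis-py client (redis.Redis) and does the delete and the write in one transactional pipeline. It works on a single key from the driver side, so it may not suit large DataFrames.

```python
def replace_hash(client, key, row):
    """Replace the whole hash at `key` with `row` (a dict of field -> value).

    `client` is assumed to be a redis-py client. DEL and HSET are queued on a
    transactional pipeline so readers should not see a half-written row.
    """
    pipe = client.pipeline(transaction=True)
    pipe.delete(key)  # drop stale fields such as col_b
    if row:  # HSET needs at least one field; an empty row just deletes the key
        pipe.hset(key, mapping=row)
    pipe.execute()


# Example call (requires a reachable Redis server, so it is left commented out):
# import redis
# r = redis.Redis(host=redis_host, port=redis_port, db=0, ssl=True)
# replace_hash(r, "test_append_behavour:test", {"col_a": 324, "col_c": 423})
```

Passing the client in as an argument keeps the helper independent of how the connection is configured.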