redis / riot

🧨 Get data in & out of Redis with RIOT

Home Page: http://redis.github.io/riot

riot-redis replicate is not robust to transient errors

bbayles opened this issue

I've been using riot-redis to aid in some migrations. I've hit this problem a few times: when I get to the second phase (the "live only" phase), and let it run for a while, I eventually hit some hiccup that causes the whole thing to die.

In my case, I'd like to get as many of the transactions from the source Redis to the destination Redis as possible, but if I miss one I'd like to keep going - not have to start over.

So here's the request: in "live only" mode, catch most-likely-transient errors and continue listening for key changes.
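
To make the request concrete, here is a minimal Java sketch of "skip and continue" handling. It is not RIOT's code: `LiveCopySketch`, `KeyCopier`, `isLikelyTransient`, and `copyAll` are hypothetical names, and treating `TimeoutException` and `IOException` as transient is an assumption made only for illustration.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch; none of these names exist in RIOT.
final class LiveCopySketch {

    /** Copies one changed key from source to target; may fail transiently. */
    interface KeyCopier {
        void copy(String key) throws Exception;
    }

    /** Assumption: timeouts and I/O errors are the "most-likely-transient" failures. */
    static boolean isLikelyTransient(Throwable error) {
        return error instanceof TimeoutException || error instanceof IOException;
    }

    /**
     * Copies every key, skipping those that fail with a likely-transient error.
     * Returns the skipped keys so they can be logged or retried later.
     */
    static List<String> copyAll(Iterable<String> changedKeys, KeyCopier copier) {
        List<String> skipped = new ArrayList<>();
        for (String key : changedKeys) {
            try {
                copier.copy(key);
            } catch (Exception e) {
                if (!isLikelyTransient(e)) {
                    // Unknown failure: stop rather than silently hide it.
                    throw new IllegalStateException("non-transient failure copying " + key, e);
                }
                skipped.add(key); // transient: record the miss and keep listening
            }
        }
        return skipped;
    }
}
```

The point of the sketch is that one failed key costs one missed update instead of ending the whole live phase.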

Log output with stack trace
01:58:08.338 SEVERE org.springframework.batch.core.step.AbstractStep	: Encountered an error executing step LiveKeyValueItemReader-step in job LiveKeyValueItemReader-job
java.util.concurrent.TimeoutException
	at java.base/java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1886)
	at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2021)
	at org.springframework.batch.item.redis.support.KeyDumpValueReader.read(KeyDumpValueReader.java:38)
	at org.springframework.batch.item.redis.support.AbstractKeyValueReader.process(AbstractKeyValueReader.java:61)
	at org.springframework.batch.item.redis.support.AbstractKeyValueReader.process(AbstractKeyValueReader.java:23)
	at org.springframework.batch.item.redis.support.KeyValueItemReader$ValueWriter.write(KeyValueItemReader.java:185)
	at org.springframework.batch.core.step.item.SimpleChunkProcessor.writeItems(SimpleChunkProcessor.java:193)
	at org.springframework.batch.core.step.item.SimpleChunkProcessor.doWrite(SimpleChunkProcessor.java:159)
	at org.springframework.batch.core.step.item.SimpleChunkProcessor.write(SimpleChunkProcessor.java:294)
	at org.springframework.batch.core.step.item.SimpleChunkProcessor.process(SimpleChunkProcessor.java:217)
	at org.springframework.batch.core.step.item.ChunkOrientedTasklet.execute(ChunkOrientedTasklet.java:77)
	at org.springframework.batch.core.step.tasklet.TaskletStep$ChunkTransactionCallback.doInTransaction(TaskletStep.java:407)
	at org.springframework.batch.core.step.tasklet.TaskletStep$ChunkTransactionCallback.doInTransaction(TaskletStep.java:331)
	at org.springframework.transaction.support.TransactionTemplate.execute(TransactionTemplate.java:140)
	at org.springframework.batch.core.step.tasklet.TaskletStep$2.doInChunkContext(TaskletStep.java:273)
	at org.springframework.batch.core.scope.context.StepContextRepeatCallback.doInIteration(StepContextRepeatCallback.java:82)
	at org.springframework.batch.repeat.support.TaskExecutorRepeatTemplate$ExecutingRunnable.run(TaskExecutorRepeatTemplate.java:262)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)

01:59:08.345 SEVERE org.springframework.batch.core.step.AbstractStep	: Exception while closing step execution resources in step LiveKeyValueItemReader-step in job LiveKeyValueItemReader-job
io.lettuce.core.RedisCommandTimeoutException: Command timed out after 1 minute(s)
	at io.lettuce.core.internal.ExceptionFactory.createTimeoutException(ExceptionFactory.java:53)
	at io.lettuce.core.internal.Futures.awaitOrCancel(Futures.java:246)
	at io.lettuce.core.FutureSyncInvocationHandler.handleInvocation(FutureSyncInvocationHandler.java:75)
	at io.lettuce.core.internal.AbstractInvocationHandler.invoke(AbstractInvocationHandler.java:80)
	at com.sun.proxy.$Proxy13.punsubscribe(Unknown Source)
	at org.springframework.batch.item.redis.support.RedisKeyspaceNotificationItemReader.unsubscribe(RedisKeyspaceNotificationItemReader.java:46)
	at org.springframework.batch.item.redis.support.AbstractKeyspaceNotificationItemReader.close(AbstractKeyspaceNotificationItemReader.java:65)
	at org.springframework.batch.item.support.CompositeItemStream.close(CompositeItemStream.java:90)
	at org.springframework.batch.core.step.tasklet.TaskletStep.close(TaskletStep.java:306)
	at org.springframework.batch.core.step.AbstractStep.execute(AbstractStep.java:287)
	at org.springframework.batch.core.job.SimpleStepHandler.handleStep(SimpleStepHandler.java:152)
	at org.springframework.batch.core.job.AbstractJob.handleStep(AbstractJob.java:413)
	at org.springframework.batch.core.job.SimpleJob.doExecute(SimpleJob.java:136)
	at org.springframework.batch.core.job.AbstractJob.execute(AbstractJob.java:320)
	at org.springframework.batch.core.launch.support.SimpleJobLauncher$1.run(SimpleJobLauncher.java:149)
	at java.base/java.lang.Thread.run(Thread.java:829)

Many thanks for this very useful tool.

Thanks for reporting this issue. I pushed an early-access release with some changes around exception handling. Could you give it a try and see if it works with the default skip policy (ALWAYS)?

I'll give it a go - thanks!

Alas, it failed in the same way:

19:24:37.407 SEVERE org.springframework.batch.core.step.AbstractStep	: Encountered an error executing step LiveKeyValueItemReader-step in job LiveKeyValueItemReader-job
java.util.concurrent.TimeoutException
	at java.base/java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1886)
	at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2021)
	at org.springframework.batch.item.redis.support.KeyDumpValueReader.read(KeyDumpValueReader.java:38)
	at org.springframework.batch.item.redis.support.AbstractKeyValueReader.process(AbstractKeyValueReader.java:61)
	at org.springframework.batch.item.redis.support.AbstractKeyValueReader.process(AbstractKeyValueReader.java:23)
	at org.springframework.batch.item.redis.support.KeyValueItemReader$ValueWriter.write(KeyValueItemReader.java:185)
	at org.springframework.batch.core.step.item.SimpleChunkProcessor.writeItems(SimpleChunkProcessor.java:193)
	at org.springframework.batch.core.step.item.SimpleChunkProcessor.doWrite(SimpleChunkProcessor.java:159)
	at org.springframework.batch.core.step.item.SimpleChunkProcessor.write(SimpleChunkProcessor.java:294)
	at org.springframework.batch.core.step.item.SimpleChunkProcessor.process(SimpleChunkProcessor.java:217)
	at org.springframework.batch.core.step.item.ChunkOrientedTasklet.execute(ChunkOrientedTasklet.java:77)
	at org.springframework.batch.core.step.tasklet.TaskletStep$ChunkTransactionCallback.doInTransaction(TaskletStep.java:407)
	at org.springframework.batch.core.step.tasklet.TaskletStep$ChunkTransactionCallback.doInTransaction(TaskletStep.java:331)
	at org.springframework.transaction.support.TransactionTemplate.execute(TransactionTemplate.java:140)
	at org.springframework.batch.core.step.tasklet.TaskletStep$2.doInChunkContext(TaskletStep.java:273)
	at org.springframework.batch.core.scope.context.StepContextRepeatCallback.doInIteration(StepContextRepeatCallback.java:82)
	at org.springframework.batch.repeat.support.TaskExecutorRepeatTemplate$ExecutingRunnable.run(TaskExecutorRepeatTemplate.java:262)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)

19:25:37.414 SEVERE org.springframework.batch.core.step.AbstractStep	: Exception while closing step execution resources in step LiveKeyValueItemReader-step in job LiveKeyValueItemReader-job
io.lettuce.core.RedisCommandTimeoutException: Command timed out after 1 minute(s)
	at io.lettuce.core.internal.ExceptionFactory.createTimeoutException(ExceptionFactory.java:53)
	at io.lettuce.core.internal.Futures.awaitOrCancel(Futures.java:246)
	at io.lettuce.core.FutureSyncInvocationHandler.handleInvocation(FutureSyncInvocationHandler.java:75)
	at io.lettuce.core.internal.AbstractInvocationHandler.invoke(AbstractInvocationHandler.java:80)
	at com.sun.proxy.$Proxy13.punsubscribe(Unknown Source)
	at org.springframework.batch.item.redis.support.RedisKeyspaceNotificationItemReader.unsubscribe(RedisKeyspaceNotificationItemReader.java:46)
	at org.springframework.batch.item.redis.support.AbstractKeyspaceNotificationItemReader.close(AbstractKeyspaceNotificationItemReader.java:65)
	at org.springframework.batch.item.support.CompositeItemStream.close(CompositeItemStream.java:90)
	at org.springframework.batch.core.step.tasklet.TaskletStep.close(TaskletStep.java:306)
	at org.springframework.batch.core.step.AbstractStep.execute(AbstractStep.java:287)
	at org.springframework.batch.core.job.SimpleStepHandler.handleStep(SimpleStepHandler.java:152)
	at org.springframework.batch.core.job.AbstractJob.handleStep(AbstractJob.java:413)
	at org.springframework.batch.core.job.SimpleJob.doExecute(SimpleJob.java:136)
	at org.springframework.batch.core.job.AbstractJob.execute(AbstractJob.java:320)
	at org.springframework.batch.core.launch.support.SimpleJobLauncher$1.run(SimpleJobLauncher.java:149)
	at java.base/java.lang.Thread.run(Thread.java:829)

Sorry, I forgot to add TimeoutException to the list of skippable exceptions. Could you try again with the new early-access release? https://github.com/redis-developer/riot/releases/tag/early-access
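
For readers unfamiliar with the idea of a list of skippable exceptions, the following is a rough, JDK-only Java sketch of how such a check could work. It is an illustration under assumptions, not the Spring Batch or RIOT implementation: the class `SkippableExceptions` and its method `shouldSkip` are hypothetical. The sketch also walks the cause chain, since a timeout can arrive wrapped in another exception.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of a "skippable exceptions" check; not a library API.
final class SkippableExceptions {

    private final List<Class<? extends Throwable>> skippable;

    @SafeVarargs
    SkippableExceptions(Class<? extends Throwable>... types) {
        this.skippable = Arrays.asList(types);
    }

    /** Returns true if the error, or any of its causes, is of a listed type. */
    boolean shouldSkip(Throwable error) {
        Throwable current = error;
        // The depth bound guards against cyclic cause chains.
        for (int depth = 0; current != null && depth < 16; depth++) {
            for (Class<? extends Throwable> type : skippable) {
                if (type.isInstance(current)) {
                    return true;
                }
            }
            current = current.getCause();
        }
        return false;
    }
}
```

With `new SkippableExceptions(TimeoutException.class)`, a bare `TimeoutException` and one wrapped in a `CompletionException` would both be skipped, while an unrelated `IllegalStateException` would not.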

Alas, this one bombed out for me too:

12:47:30.881 SEVERE org.springframework.batch.core.step.AbstractStep	: Encountered an error executing step LiveKeyValueItemReader-step in job LiveKeyValueItemReader-job
java.util.concurrent.TimeoutException
	at java.base/java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1886)
	at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2021)
	at org.springframework.batch.item.redis.support.KeyDumpValueReader.read(KeyDumpValueReader.java:38)
	at org.springframework.batch.item.redis.support.AbstractKeyValueReader.process(AbstractKeyValueReader.java:61)
	at org.springframework.batch.item.redis.support.AbstractKeyValueReader.process(AbstractKeyValueReader.java:23)
	at org.springframework.batch.item.redis.support.KeyValueItemReader$ValueWriter.write(KeyValueItemReader.java:185)
	at org.springframework.batch.core.step.item.SimpleChunkProcessor.writeItems(SimpleChunkProcessor.java:193)
	at org.springframework.batch.core.step.item.SimpleChunkProcessor.doWrite(SimpleChunkProcessor.java:159)
	at org.springframework.batch.core.step.item.SimpleChunkProcessor.write(SimpleChunkProcessor.java:294)
	at org.springframework.batch.core.step.item.SimpleChunkProcessor.process(SimpleChunkProcessor.java:217)
	at org.springframework.batch.core.step.item.ChunkOrientedTasklet.execute(ChunkOrientedTasklet.java:77)
	at org.springframework.batch.core.step.tasklet.TaskletStep$ChunkTransactionCallback.doInTransaction(TaskletStep.java:407)
	at org.springframework.batch.core.step.tasklet.TaskletStep$ChunkTransactionCallback.doInTransaction(TaskletStep.java:331)
	at org.springframework.transaction.support.TransactionTemplate.execute(TransactionTemplate.java:140)
	at org.springframework.batch.core.step.tasklet.TaskletStep$2.doInChunkContext(TaskletStep.java:273)
	at org.springframework.batch.core.scope.context.StepContextRepeatCallback.doInIteration(StepContextRepeatCallback.java:82)
	at org.springframework.batch.repeat.support.TaskExecutorRepeatTemplate$ExecutingRunnable.run(TaskExecutorRepeatTemplate.java:262)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)

I double-checked to make sure I had pulled the version from the new early-access release, FYI.

Also, this is probably a separate issue, but it prevents me from working around this one: when this exception gets hit, the process doesn't die. That is, I still have java ... com.redislabs.riot.redis.RiotRedis in my process list. So I can't use systemd to restart the liveonly copy process when it stops working.
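
As a general illustration of the restart workaround (not a diagnosis of RIOT): a JVM keeps running while any non-daemon thread is alive, so a launcher that wants a supervisor to notice a failure usually has to exit explicitly with a non-zero status. The sketch below assumes a hypothetical `ExitOnFailure` helper and a hypothetical `runLiveReplication()` job; neither exists in RIOT.

```java
import java.util.concurrent.Callable;

// Hypothetical launcher helper; not part of RIOT.
final class ExitOnFailure {

    /** Runs the job and maps its outcome to a process exit code. */
    static int exitCodeFor(Callable<?> job) {
        try {
            job.call();
            return 0; // job finished normally
        } catch (Exception e) {
            System.err.println("job failed: " + e);
            return 1; // non-zero status signals failure to the supervisor
        }
    }

    // A main method would end with something like:
    //   System.exit(exitCodeFor(() -> runLiveReplication()));
    // System.exit terminates the JVM even if non-daemon threads are still running.
}
```

With an exit code like this, a systemd unit using `Restart=on-failure` could restart the copy process when it stops working.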

Alright, I think I found the root cause. The fault-tolerant step was not actually used for the live part of the replication process. Could you try the latest early-access release?

Thanks for your attention to this! I'll give it a shot.

On the new release I got these warnings at startup:

data being stored.
13:43:34.771 WARNING org.springframework.batch.core.step.item.ChunkMonitor	: No ItemReader set (must be concurrent step), so ignoring offset data.
13:43:35.087 WARNING org.springframework.batch.core.step.item.ChunkMonitor	: ItemStream was opened in a different thread.  Restart data could be compromised.

Later I got my usual TimeoutException. But I also got this handshake problem:

14:00:23.432 WARNING io.lettuce.core.protocol.ConnectionWatchdog	: Cannot reconnect to [master.anz-prod1-rq.owqy4a.apse2.cache.amazonaws.com:6379]: io.netty.handler.ssl.SslHandshakeTimeoutException: handshake timed out after 10000ms
java.util.concurrent.CompletionException: io.netty.handler.ssl.SslHandshakeTimeoutException: handshake timed out after 10000ms
	at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:331)
	at java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:346)
	at java.base/java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1063)
	at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
	at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088)
	at io.lettuce.core.RedisHandshake.lambda$tryHandshakeResp3$1(RedisHandshake.java:105)
	at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859)
	at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837)
	at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
	at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088)
	at io.lettuce.core.protocol.AsyncCommand.doCompleteExceptionally(AsyncCommand.java:139)
	at io.lettuce.core.protocol.AsyncCommand.completeExceptionally(AsyncCommand.java:132)
	at io.lettuce.core.RedisHandshake.lambda$dispatch$5(RedisHandshake.java:224)
	at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:578)
	at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:571)
	at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:550)
	at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:491)
	at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:616)
	at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:609)
	at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117)
	at io.netty.util.internal.PromiseNotificationUtil.tryFailure(PromiseNotificationUtil.java:64)
	at io.netty.channel.DelegatingChannelPromiseNotifier.operationComplete(DelegatingChannelPromiseNotifier.java:57)
	at io.netty.channel.DelegatingChannelPromiseNotifier.operationComplete(DelegatingChannelPromiseNotifier.java:31)
	at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:578)
	at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:552)
	at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:491)
	at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:616)
	at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:609)
	at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117)
	at io.netty.util.internal.PromiseNotificationUtil.tryFailure(PromiseNotificationUtil.java:64)
	at io.netty.channel.DelegatingChannelPromiseNotifier.operationComplete(DelegatingChannelPromiseNotifier.java:57)
	at io.netty.channel.DelegatingChannelPromiseNotifier.operationComplete(DelegatingChannelPromiseNotifier.java:31)
	at io.netty.channel.AbstractCoalescingBufferQueue.releaseAndCompleteAll(AbstractCoalescingBufferQueue.java:350)
	at io.netty.channel.AbstractCoalescingBufferQueue.releaseAndFailAll(AbstractCoalescingBufferQueue.java:208)
	at io.netty.handler.ssl.SslHandler.releaseAndFailAll(SslHandler.java:1823)
	at io.netty.handler.ssl.SslHandler.access$2300(SslHandler.java:171)
	at io.netty.handler.ssl.SslHandler$7.run(SslHandler.java:2036)
	at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
	at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170)
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: io.netty.handler.ssl.SslHandshakeTimeoutException: handshake timed out after 10000ms
	at io.netty.handler.ssl.SslHandler$7.run(SslHandler.java:2029)
	... 9 more

It looks like the live replication died after that.

Not sure what causes that handshake issue, but I think I found another place where exceptions were not handled properly: redis/spring-batch-redis#45

Could you give it another try? Hopefully the third time is the charm.