openebs / mayastor

Dynamically provision Stateful Persistent Replicated Cluster-wide Fabric Volumes & Filesystems for Kubernetes, provisioned from an optimized NVMe SPDK backend data storage stack.


Question about the non-disruptive upgrade

happycoderincloud opened this issue

Are you reporting an issue with existing content?

I'm new to Mayastor. I just read about upgrades in the official documentation, which says the upgrade is non-disruptive.

Some questions about these awesome features:

  1. What does "non-disruptive upgrade" mean? Does it mean that IO won't be interrupted during the upgrade?
  2. If an application workload and its volume are on the same node, will the workload and the volume be moved to another node during the upgrade? If so, shouldn't the IO be interrupted, making the upgrade disruptive?
  3. Do volumes need to be HA for a non-disruptive upgrade? And does HA here mean multiple replicas, or the feature described at https://mayastor.gitbook.io/introduction/advanced-operations/ha?

Thanks for the clarification in advance.


Hi @happycoderincloud,

  1. IO might be temporarily "stalled", but there will be no IO failures at any time.
  2. No, the application is not moved and it doesn't need to be restarted.
  3. Tbh, the name we've gone with for what is technically called failover isn't the best, as it could be interpreted in different ways. What we mean by HA on that page is basically on-demand switchover: when the application's NVMe initiator has connection issues with a target, we move the target to another node and let the initiator connect to the new target, without ever failing IO back to the application.
    For upgrades, volumes should ideally have more than one replica. Consider what happens when we want to restart Mayastor dataplane pod A while a volume's single replica lives in pod A: while the pod is restarting, that replica is unavailable, so the volume target may see IO errors, which may be propagated to the application.
    We worked around this in v2.4 (IIRC) by moving the volume target to the same node as the replica, so that when the pod is restarted the target is restarted along with it and the application, again, sees no IO errors (there's a toy sketch of this below).
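
To make the switchover idea in (3) concrete, here is a minimal toy model in Python. It is not Mayastor code: the names (`Node`, `Volume`, `switchover`) and the "stalled"/"completed" IO states are illustrative assumptions. It just shows IO stalling while the target's node is down, then completing once the target is republished on a healthy node, preferring a node that already holds a replica:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    name: str
    healthy: bool = True


@dataclass
class Volume:
    target_node: Node             # node currently serving the NVMe-oF target
    replica_nodes: List[Node]     # nodes holding the data replicas
    pending_io: List[str] = field(default_factory=list)

    def submit_io(self, io: str) -> str:
        # While the target's node is down, the initiator holds/retries the IO
        # (it "stalls") instead of failing it back to the application.
        if not self.target_node.healthy:
            self.pending_io.append(io)
            return "stalled"
        return "completed"

    def switchover(self, nodes: List[Node]) -> None:
        # Hypothetical control-plane step: republish the target on a healthy
        # node, preferring one that also holds a replica (mirroring the
        # co-location idea above), then let the initiator reconnect.
        candidates = [n for n in nodes if n.healthy]
        co_located = [n for n in candidates if n in self.replica_nodes]
        self.target_node = (co_located or candidates)[0]
        # The initiator reconnects; stalled IO now completes without errors.
        drained, self.pending_io = self.pending_io, []
        for io in drained:
            assert self.submit_io(io) == "completed"


# Example: an upgrade restarts the dataplane on node "a"; the target moves to
# node "b" (which also holds a replica), and the stalled write then completes.
a, b = Node("a"), Node("b")
vol = Volume(target_node=a, replica_nodes=[a, b])
a.healthy = False                      # dataplane pod on "a" is restarting
assert vol.submit_io("write-1") == "stalled"
vol.switchover([a, b])
assert vol.target_node is b and not vol.pending_io
```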

@tiagolobocastro
Thank you. One more question: do you use NVMe multipath in the switchover for upgrades? If so, can you elaborate more on the mechanism? Thanks.