Devinterview-io / availability-and-reliability-interview-questions

🟣 Availability and Reliability interview questions and answers to help you prepare for your next software architecture and design patterns interview in 2024.

Top 14 Availability & Reliability interview questions and answers in 2021.

You can check all 14 Availability & Reliability interview questions here 👉 https://devinterview.io/design/availabilityAndReliability-interview-questions


🔹 1. What is Availability?

Answer:

Availability refers to the probability that a system performs correctly at a specific time instance (not duration). Interruptions may occur before or after the time instance for which the system's availability is calculated. The service must be operational and adequately satisfy the defined specifications at the time of its usage.

Availability is often quantified by uptime (or downtime) as a percentage of time the service is available. Availability is generally measured in number of 9s--a service with 99.99% availability is described as having four 9s.
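
The relationship between the number of 9s and permitted downtime can be sketched in a few lines of Python. This is an illustrative calculation only; the function names and the 365-day year are assumptions made for the example, not part of the source answer.

```python
def availability_from_nines(nines: int) -> float:
    """Convert a count of 9s into an availability fraction, e.g. 4 -> 0.9999."""
    return 1 - 10 ** (-nines)


def allowed_downtime_minutes(availability: float, period_days: float = 365.0) -> float:
    """Downtime budget (in minutes) implied by an availability target.

    Assumes a 365-day year unless another period is given.
    """
    return (1 - availability) * period_days * 24 * 60


def measured_availability(uptime_hours: float, downtime_hours: float) -> float:
    """Availability as the fraction of total time the service was up."""
    return uptime_hours / (uptime_hours + downtime_hours)


four_nines = availability_from_nines(4)
budget = allowed_downtime_minutes(four_nines)
```

Under these assumptions, four 9s leave a budget of roughly 52.6 minutes of downtime per year, while three 9s leave roughly 8.8 hours.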

Source: www.bmc.com   


🔹 2. What is Reliability?

Answer:

Reliability is the probability that a system performs correctly during a specific time duration. During this correct operation, no repair is required or performed, and the system adequately follows the defined performance specifications.

Reliability is commonly modeled with an exponential failure law, which means that it decreases as the time duration considered for the reliability calculation grows. In other words, the reliability of a system is high at the start of operation and gradually falls toward its lowest value over time.
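
A minimal sketch of the exponential failure law, assuming a constant failure rate equal to 1 / MTBF (mean time between failures). The constant-rate assumption and the names `reliability` and `mtbf_hours` are introduced here for illustration; they do not come from the source answer.

```python
import math


def reliability(duration_hours: float, mtbf_hours: float) -> float:
    """Probability of running for `duration_hours` without a failure.

    Assumes a constant failure rate lambda = 1 / MTBF, which gives
    R(t) = exp(-lambda * t).
    """
    failure_rate = 1.0 / mtbf_hours
    return math.exp(-failure_rate * duration_hours)


# Reliability over growing durations for a hypothetical MTBF of 1000 hours.
curve = [reliability(hours, 1000.0) for hours in (0, 100, 500, 1000)]
```

The values in `curve` start at 1.0 and shrink as the duration grows, which is the behaviour the paragraph above describes.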

Source: www.bmc.com   


🔹 3. What is Back-Pressure?

Answer:

When one component is struggling to keep up, the system as a whole needs to respond in a sensible way. It is unacceptable for the component under stress to fail catastrophically or to drop messages in an uncontrolled fashion. Since it can't cope and it can't fail, it should communicate the fact that it is under stress to upstream components and so get them to reduce the load.

This back-pressure is an important feedback mechanism that allows systems to gracefully respond to load rather than collapse under it. The back-pressure may cascade all the way up to the user, at which point responsiveness may degrade, but this mechanism will ensure that the system is resilient under load, and will provide information that may allow the system itself to apply other resources to help distribute the load.
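
One simple way to picture back-pressure is a bounded inbox that refuses new work when it is full, so the upstream sender keeps the unsent messages and slows down instead of having them silently dropped. The sketch below uses Python's standard `queue.Queue`; the `BoundedInbox` class and `send_batch` function are hypothetical names for this example, and real systems usually signal back-pressure through protocol-level mechanisms rather than a boolean.

```python
import queue


class BoundedInbox:
    """Inbox of a downstream component with a fixed capacity (hypothetical)."""

    def __init__(self, capacity: int) -> None:
        self._messages: queue.Queue = queue.Queue(maxsize=capacity)

    def offer(self, message: str) -> bool:
        """Try to enqueue; return False instead of crashing or dropping when full."""
        try:
            self._messages.put_nowait(message)
            return True
        except queue.Full:
            return False  # back-pressure signal to the upstream caller

    def take(self) -> str:
        """Consume one message, freeing capacity for the producer."""
        return self._messages.get_nowait()


def send_batch(inbox: BoundedInbox, batch: list[str]) -> list[str]:
    """Send messages until the inbox pushes back; return what is still pending."""
    for index, message in enumerate(batch):
        if not inbox.offer(message):
            return batch[index:]  # upstream keeps these and retries later
    return []


inbox = BoundedInbox(capacity=3)
pending = send_batch(inbox, ["a", "b", "c", "d", "e"])  # inbox fills after "c"
first = inbox.take()                                    # consumer frees one slot
still_pending = send_batch(inbox, pending)              # "d" fits, "e" must wait
```

The producer never loses a message: whatever the inbox cannot accept is handed back, and the sender decides how to slow down or retry.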

Source: reactivemanifesto.org   


🔹 4. How Do You Update A Live Heavy Traffic Site With Minimum Or Zero Down Time?

Answer:

Deploying a newer version of a live website can be a challenging task, especially when the website has high traffic. Any downtime is going to affect the users. There are a few best practices that we can follow:

Before deploying to production:

  • Thoroughly test the new changes and ensure they work in a test environment that is almost identical to the production system.
  • Automate as many test cases as possible.
  • Create an automated sanity testing script (also called a smoke test) that can be run on production without affecting real data. These are typically read-only test cases, though you can add more depending on your application's needs. Keep the script short so it runs quickly.
  • Create scripts for all manual tasks (if possible) to avoid typing mistakes on the day of deployment.
  • Test the scripts in a non-production environment to make sure they work.
  • Keep the build artifacts ready, e.g. application deployment files, database scripts, and config files.
  • Create a checklist of things to do on the day of deployment.
  • Rehearse. Deploy to a non-production environment that is almost identical to production. Try this with production data volumes (if possible). Note the time each task takes so you can plan accordingly.
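
A read-only smoke test of the kind described in the list above might look like the following sketch. It only issues GET requests and checks for HTTP 200; the function names are hypothetical, and the `fetch` parameter is an assumption added so the check can be exercised without touching a real site.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status of a GET request, or 0 if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        return exc.code  # server answered, but with an error status
    except OSError:
        return 0  # DNS failure, refused connection, timeout, ...


def run_smoke_tests(
    urls: Iterable[str],
    fetch: Callable[[str], int] = http_status,
) -> list[str]:
    """Check each URL with a read-only request and return the ones that failed."""
    return [url for url in urls if fetch(url) != 200]
```

In practice the list of URLs would cover a handful of key pages, and a non-empty result would stop the rollout, e.g. `run_smoke_tests(["https://example.com/health"])` with a placeholder URL.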

When deploying to a production environment:

  • Use the Blue-Green deployment technique to reduce downtime risk.
  • Keep a backup of the current site and data so you can roll back.
  • Run the sanity test cases before doing more in-depth testing.

Source: fromdev.com   
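
The production steps above can be sketched as a toy Blue-Green switch: the new version goes to the idle environment, traffic moves only if the sanity check passes, and switching back acts as the rollback. Everything here (`BlueGreenRouter`, `deploy`, the callback parameters) is a hypothetical illustration; a real setup would flip a load balancer or DNS entry rather than a field in memory.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class BlueGreenRouter:
    """Tracks which of two identical environments receives live traffic."""

    live: str = "blue"
    idle: str = "green"

    def switch(self) -> None:
        """Swap live and idle; calling it again rolls back."""
        self.live, self.idle = self.idle, self.live


def deploy(
    router: BlueGreenRouter,
    deploy_to: Callable[[str], None],
    smoke_test: Callable[[str], bool],
) -> bool:
    """Deploy to the idle environment and switch traffic only if it is healthy."""
    target = router.idle
    deploy_to(target)            # the live environment is never touched
    if not smoke_test(target):
        return False             # users keep hitting the old version
    router.switch()
    return True


router = BlueGreenRouter()
released = deploy(router, deploy_to=lambda env: None, smoke_test=lambda env: True)
```

Because the previous version stays running in the now-idle environment, rolling back is a second call to `router.switch()`.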


🔹 5. What Do You Mean By High Availability (HA)?

👉🏼 Check all 14 answers


🔹 6. What does it mean "System Shall Be Resilient"?

👉🏼 Check all 14 answers


🔹 7. What is Fail-over?

👉🏼 Check all 14 answers


🔹 8. Explain Failure in Contrast to Error

👉🏼 Check all 14 answers


🔹 9. How to choose between CP (consistency) and AP (availability)?

👉🏼 Check all 14 answers


🔹 10. Explain how Active-Passive Fail-over works

👉🏼 Check all 14 answers


🔹 11. What is Active-Active Fail-over?

👉🏼 Check all 14 answers


🔹 12. Compare "Fail Fast" vs "Robust" approaches of building software

👉🏼 Check all 14 answers


🔹 13. Explain how to calculate Availability of multiple system components

👉🏼 Check all 14 answers


🔹 14. What is a crashloop?

👉🏼 Check all 14 answers



Thanks 🙌 for reading and good luck on your next tech interview!
Explore 3800+ dev interview questions here 👉 Devinterview.io