I’ll always remember the shiny gold star stickers I received for getting 100% scores on math tests in class. The stars were proudly displayed on our family’s refrigerator door alongside my siblings’ achievements. Something in me still wants to earn more of those stars.
But in our tech jobs today, striving for 100% service availability and earning that metaphorical gold star isn’t, as it turns out, as good a goal as it seems. These days, the gold star standard set by service uptime expectations is deliberately short of 100%. Meeting 99.99% or 99.999%, but nothing more, earns us the gold star at the end of the day.
Recent outages at Facebook and AWS remind us of the wide-ranging impact service disruptions have on end customers. One could be forgiven for thinking outages are inevitable, since modern applications are built with modularity in mind and, as a result, increasingly rely on externally hosted services for tasks such as authentication, messaging, computing infrastructure, and many more. It’s clearly hard to guarantee everything works perfectly, all the time. Yet end users may rely on your service’s availability and may be so unforgiving that they do not return if they feel the service is unreliable.
So why not strive for 100% reliability, if anything less can cost you business? It may seem like a worthy goal, but it has shortcomings.
Over-reliance on service uptime
One pitfall with services that deliberately set out for, or even achieve, 100% reliability is that other applications become overly reliant on the service. Let’s be real: services eventually fail, and there’s a cascading effect on applications that aren’t built with the logic to withstand failures of external services. For example, services built with a single AWS AZ (Availability Zone) in mind could end up depending on 100% availability of that AZ. As I’m writing this blog post, I recall the power outage in Northern Virginia that affected an AZ in the us-east-1 region and numerous global Web services. While AWS has proven to be extraordinarily reliable over time, assuming it would be up 100% of the time proved to be unreasonable.
Maybe some of the applications that failed were built to survive the failure of an AZ, but were built within a single region. In recent memory, AWS has also suffered regional service outages. This illustrates the need to design for region-level failures, or to have other contingency plans.
When it comes to building applications with uptime in mind, it’s responsible to assume your service uptimes will fall short of 100%. It’s up to SREs and application developers to use server uptime monitoring tools and other products to automate infrastructure, and to develop applications within the boundaries of realistic SLOs (Service Level Objectives).
System resiliency suffers
Services built on the assumption of 100% uptime from the services they rely on implicitly lack resiliency as a countermeasure to service interruptions. By contrast, a service that expects compute infrastructure failures has the logic to fail over to other available resources while minimizing or eliminating customer-facing interruptions. That failure mitigation design could take the form of failing over to a different Availability Zone (for AWS developers) or distributing an application’s infrastructure across different Availability Zones. Regardless of the approach, the overall goal is to build resiliency into application services on the assumption that no service is up 100% of the time.
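To make the failover idea concrete, here is a minimal Python sketch, not a production pattern. The function `call_with_failover`, the exception `AllEndpointsFailed`, the stand-in `fake_request`, and the `zone-a`/`zone-b` hostnames are all hypothetical names invented for illustration; a real service would use its own client library and catch narrower error types.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


class AllEndpointsFailed(Exception):
    """Raised when every candidate endpoint has failed."""


def call_with_failover(endpoints: Sequence[str], request: Callable[[str], T]) -> T:
    """Try each endpoint in order and return the first successful result."""
    errors = []
    for endpoint in endpoints:
        try:
            return request(endpoint)
        except Exception as exc:  # assumption: a real service catches narrower errors
            errors.append((endpoint, exc))
    # Nothing succeeded: surface every failure to the caller.
    raise AllEndpointsFailed(errors)


def fake_request(endpoint: str) -> str:
    """Hypothetical stand-in for a network call; the first zone is 'down'."""
    if endpoint == "zone-a.example.internal":
        raise ConnectionError("zone-a unavailable")
    return f"served by {endpoint}"


result = call_with_failover(
    ["zone-a.example.internal", "zone-b.example.internal"], fake_request
)
print(result)
```

Here the caller never sees the failure of the first zone, because the request is answered by the second. The design choice worth noting is that the application, not the dependency, owns the assumption that any single endpoint can fail.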
Downtime table for different Service Level Agreements
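The downtime an availability target permits follows from simple arithmetic, so it can be sketched in a few lines of Python. This is an illustrative calculation rather than the original table: the function name `allowed_downtime_minutes` and the list of targets are assumptions, and the year is taken as 365 days.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; assumption: ignores leap years


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Minutes of downtime per year permitted by an availability target."""
    return MINUTES_PER_YEAR * (100.0 - availability_percent) / 100.0


# Illustrative targets; these are not taken from any specific SLA.
for target in (99.0, 99.9, 99.99, 99.999, 100.0):
    minutes = allowed_downtime_minutes(target)
    print(f"{target:>7}% -> {minutes:10.2f} minutes/year")
```

Under these assumptions, 99.99% allows roughly 52.6 minutes of downtime per year and 99.999% roughly 5.3 minutes, while 100% allows none at all.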
There are, of course, some systems that have achieved 100% uptime. But that’s not always good. A perfectly reliable system leads to complacency, especially among the users of the product. It’s best for SLOs to include maintenance windows, which keep users of the service practiced at keeping their overall system running even when a reliable component suffers an outage.
Services are slow to evolve
Another shortcoming of striving for 100% uptime is that there is no opportunity for major application maintenance. An SLO with 100% uptime means there is 0% downtime. That means zero minutes per year to perform large-scale updates like migrating to a more performant database or modernizing a front end when a complete overhaul is called for. Thus, services are constrained from easily evolving to the next best version of themselves.
Consequently, robust SLOs with adequate built-in downtime provide the breathing room services need to recover from unplanned downtime and to use planned downtime for maintenance and improvements.
Developers building today’s applications can take advantage of many different services and bolt them together like building blocks, and they can achieve remarkable reliability. However, striving for 100% uptime is an unreasonable expectation for the application and for the other services the application counts on. It’s far more responsible to develop applications with built-in resiliency.
We’d love to hear what you think. Ask a question or leave a comment below.
And stay connected with Cisco DevNet on social!