I was reading a blog post from Speculative Branches that touched on cloud computing topics relating to vertical vs horizontal scaling, performance, and pricing.
One of the more interesting points raised was the concept of correlated hardware failures, which is something I didn’t realize could arise from shared supply chains when operating at distributed scale.
From the article:
Hard drives (and now SSDs) have been known to occasionally have correlated failures: if you see one disk fail, you are a lot more likely to see a second failure before getting back up if your disks are from the same manufacturing batch. Services like Backblaze overcome this by using many different models of disks from multiple manufacturers.
If you are using a hosting provider which rents pre-built servers, it is prudent to rent two different types of servers in each of your primary and backup datacenters. This should avoid almost every failure mode present in modern systems.
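To build intuition for why same-batch disks are dangerous, here is a minimal sketch with entirely made-up numbers (the rates and multiplier are assumptions for illustration, not real drive statistics): if failures are independent, a first failure tells you nothing about the second, but a shared manufacturing batch makes a second failure conditionally much more likely during the rebuild window.

```python
# Toy numbers only -- assumed rates, not real drive statistics.
base_rate = 0.01          # assumed chance any one disk fails during a rebuild window
same_batch_multiplier = 10  # assumed boost to that chance after a same-batch disk fails

# Mixed batches: the second disk's failure chance stays at the base rate.
p_second_mixed = base_rate

# Same batch: a first failure raises the conditional chance of a second.
p_second_same_batch = base_rate * same_batch_multiplier

print(f"mixed batches: {p_second_mixed:.2f}")
print(f"same batch:    {p_second_same_batch:.2f}")
```

Under these toy assumptions the same-batch fleet is an order of magnitude more likely to lose a second disk before the rebuild completes, which is exactly the window Backblaze's model diversity is protecting.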
I never considered this before, and it would then make sense to try to cater for differences across providers such as:
Forgive the code; sometimes it’s easier to express logic using programming syntax than with paragraphs.
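A minimal sketch of the idea, using hypothetical provider and model names rather than any real API: spread each fleet across every (provider, model) pair in round-robin order, so that no two consecutive servers share the same hardware source and a single bad batch can't take out both a primary and its backup.

```python
from itertools import cycle

# Hypothetical providers and server models -- placeholders, not real offerings.
providers = {
    "provider_a": ["model_x", "model_y"],
    "provider_b": ["model_z"],
}

def diversified_fleet(providers, count):
    """Assign `count` servers by cycling through every (provider, model)
    pair, so adjacent servers never come from the same hardware batch."""
    pairs = [(p, m) for p, models in providers.items() for m in models]
    assignments = cycle(pairs)
    return [next(assignments) for _ in range(count)]

fleet = diversified_fleet(providers, 4)
for provider, model in fleet:
    print(provider, model)
```

With three distinct (provider, model) pairs, a fleet of four wraps around after the third server, but no two neighbouring servers ever share a model.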
One thing the blog post made very clear was the level of redundancy and reproducibility offered by cloud providers themselves, which usually offsets edge cases involving correlated hardware failures:
[…] your cloud provider has so much experience building servers that you don’t even see most failures, and for the other failures, you can get back up and running really quickly by renting a new machine in their nearly-limitless pool of compute. It is their job to make sure that you don’t experience downtime, and while they don’t always do it perfectly, they are pretty good at it.
Learning about this concept (even if cloud providers control for it) is really useful when thinking about the many ways that otherwise-distributed systems can fail in a centralized way.