07 Dec 2022  notes

Correlated Hardware Failures

I was reading a blog post from Speculative Branches that touched on cloud computing topics relating to vertical vs horizontal scaling, performance, and pricing.

One of the more interesting points raised was the concept of correlated hardware failures: hardware that shares a supply chain (the same manufacturing batch, for example) tends to fail together. I hadn’t realized this was a concern when operating at distributed scale.

From the article:

Hard drives (and now SSDs) have been known to occasionally have correlated failures: if you see one disk fail, you are a lot more likely to see a second failure before getting back up if your disks are from the same manufacturing batch. Services like Backblaze overcome this by using many different models of disks from multiple manufacturers.


If you are using a hosting provider which rents pre-built servers, it is prudent to rent two different types of servers in each of your primary and backup datacenters. This should avoid almost every failure mode present in modern systems.
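To make the batch problem concrete, here is a minimal sketch of how one might audit an inventory for redundancy groups that have no hardware diversity. The inventory shape, the field names (`group`, `vendor`, `model`, `batch`), and the function name `findCorrelatedGroups` are all assumptions made up for illustration; they are not from the article or from any provider's API.

```javascript
// Hypothetical inventory: each disk records which redundancy group it serves
// and where it came from. All names and values are illustrative placeholders.
const disks = [
  { id: 'd1', group: 'raid-a', vendor: 'VendorX', model: 'X100', batch: '2022-03' },
  { id: 'd2', group: 'raid-a', vendor: 'VendorX', model: 'X100', batch: '2022-03' },
  { id: 'd3', group: 'raid-b', vendor: 'VendorX', model: 'X100', batch: '2022-03' },
  { id: 'd4', group: 'raid-b', vendor: 'VendorY', model: 'Y200', batch: '2021-11' },
];

// Returns the names of redundancy groups whose disks all share a single
// vendor/model/batch fingerprint, i.e. groups with no hardware diversity.
// Note that a group containing only one disk is also flagged.
function findCorrelatedGroups(inventory) {
  const fingerprints = new Map(); // group name -> Set of fingerprints seen
  for (const disk of inventory) {
    const fingerprint = `${disk.vendor}/${disk.model}/${disk.batch}`;
    if (!fingerprints.has(disk.group)) {
      fingerprints.set(disk.group, new Set());
    }
    fingerprints.get(disk.group).add(fingerprint);
  }
  return [...fingerprints.entries()]
    .filter(([, seen]) => seen.size === 1)
    .map(([group]) => group);
}

console.log(findCorrelatedGroups(disks)); // [ 'raid-a' ]
```

In this made-up inventory, `raid-a` is flagged because both of its disks come from the same batch, while `raid-b` mixes two vendors and passes.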

I had never considered this before. It makes sense, then, to deliberately mix hardware within whichever provider you use, for example:

function chooseWisely(provider) {
  // Returns ways to diversify hardware for the given provider.
  switch (provider) {
    case 'digitalocean':
      return [
        'Intel vs. AMD CPUs (hardware may be different)',
        'Different Datacenter Regions (supply chain may differentiate hardware)',
        'Different Datacenter Same Region (newer datacenters may have different hardware)'
      ];
    case 'aws':
      return [
        'Different EC2 Types (hardware may vary for compute/memory optimized instances, etc.)',
        'Different Regions (supply chain may differentiate hardware)'
      ];
    case 'generic-dedicated':
      return [
        'Different Dedicated Server Setup (CPU families, SSD vs. NVMe, etc.)',
        'Different Networking Locations (IPMI subnets, ASNs, etc.)'
      ];
    case 'generic-vps':
      return [
        'Validating Hypervisor Differences (checking the VM cores, neighbors, etc.)',
        'Different Datacenter Locations (supply chain may differentiate hardware)'
      ];
    // Any other provider: fall back to spreading across providers.
    default:
      return [
        'Using Different Cloud Providers Entirely (at least for non-meshed services)'
      ];
  }
}
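In practice, each of the options above amounts to placing replicas in different hardware pools. A minimal sketch of that idea follows, assuming a hypothetical list of pool names and a made-up helper called `spreadAcrossPools`; real providers expose this differently (instance types, regions, and so on), so treat it only as an illustration.

```javascript
// Hypothetical hardware pools; these names are placeholders, not real offerings.
const hardwarePools = ['intel-region-a', 'amd-region-a', 'intel-region-b'];

// Assigns each replica to a pool in round-robin order, so two replicas
// share a pool only when there are more replicas than pools.
function spreadAcrossPools(replicas, availablePools) {
  if (availablePools.length === 0) {
    throw new Error('at least one hardware pool is required');
  }
  return replicas.map((replica, i) => ({
    replica,
    pool: availablePools[i % availablePools.length],
  }));
}

const placement = spreadAcrossPools(['db-primary', 'db-backup'], hardwarePools);
console.log(placement);
```

Round-robin is the simplest possible choice here. A real placement policy would also need to weigh cost, latency, and capacity in each pool.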

One thing the blog post made very clear was the level of redundancy and reproducibility offered by cloud providers themselves, which usually offsets edge cases involving correlated hardware failures:

[…] your cloud provider has so much experience building servers that you don’t even see most failures, and for the other failures, you can get back up and running really quickly by renting a new machine in their nearly-limitless pool of compute. It is their job to make sure that you don’t experience downtime, and while they don’t always do it perfectly, they are pretty good at it.

Learning about this concept (even if cloud providers control for it) is really useful when it comes to thinking about the many ways that otherwise distributed systems can centrally fail.

Copyright © Paramdeo Singh · All Rights Reserved · Built with Jekyll

This node last updated November 7, 2023 and is permanently morphing...
