Correlated Hardware Failures

I was reading a blog post from Speculative Branches that touched on cloud computing topics relating to vertical vs horizontal scaling, performance, and pricing.

One of the more interesting points raised was the concept of correlated hardware failures. I hadn’t realized that shared supply chains can cause them when operating at distributed scale.

From the article:

Hard drives (and now SSDs) have been known to occasionally have correlated failures: if you see one disk fail, you are a lot more likely to see a second failure before getting back up if your disks are from the same manufacturing batch. Services like Backblaze overcome this by using many different models of disks from multiple manufacturers.

Further:

If you are using a hosting provider which rents pre-built servers, it is prudent to rent two different types of servers in each of your primary and backup datacenters. This should avoid almost every failure mode present in modern systems.
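To make that advice concrete, here is a minimal sketch of how I might check an inventory for it. The `inventory` data, the field names, and the `findUniformDatacenters` helper are all made up for illustration; nothing here comes from the article or from any provider's API.

```javascript
// Hypothetical inventory: datacenter names and server types are placeholders.
const inventory = [
  { datacenter: 'primary', serverType: 'type-a' },
  { datacenter: 'primary', serverType: 'type-b' },
  { datacenter: 'backup', serverType: 'type-a' },
  { datacenter: 'backup', serverType: 'type-a' },
];

// Return the datacenters that run fewer than `minTypes` distinct server types.
function findUniformDatacenters(servers, minTypes = 2) {
  const typesByDatacenter = new Map();
  for (const { datacenter, serverType } of servers) {
    if (!typesByDatacenter.has(datacenter)) {
      typesByDatacenter.set(datacenter, new Set());
    }
    typesByDatacenter.get(datacenter).add(serverType);
  }
  return [...typesByDatacenter]
    .filter(([, types]) => types.size < minTypes)
    .map(([datacenter]) => datacenter);
}

console.log(findUniformDatacenters(inventory)); // [ 'backup' ]
```

In this made-up inventory, the backup datacenter runs a single server type, so it is the one that gets flagged.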

I had never considered this before. It makes sense to account for the ways hardware can differ within each kind of provider, such as:

// Returns ways to diversify hardware for a given kind of hosting provider.
// Unknown providers fall through to the most general suggestion.
function chooseWisely(provider) {
  switch (provider) {
    case 'digitalocean':
      return [
        'Intel vs. AMD CPUs (hardware may be different)',
        'Different Datacenter Regions (supply chain may differentiate hardware)',
        'Different Datacenters in the Same Region (newer datacenters may have different hardware)',
      ];
    case 'aws':
      return [
        'Different EC2 Types (hardware may vary for compute/memory optimized instances, etc.)',
        'Different Regions (supply chain may differentiate hardware)',
      ];
    case 'generic-dedicated':
      return [
        'Different Dedicated Server Setup (CPU families, SATA SSD vs. NVMe, etc.)',
        'Different Networking Locations (IPMI subnets, ASNs, etc.)',
      ];
    case 'generic-vps':
      return [
        'Validating Hypervisor Differences (checking the VM cores, neighbors, etc.)',
        'Different Datacenter Locations (supply chain may differentiate hardware)',
      ];
    default:
      return [
        'Using Different Cloud Providers Entirely (at least for non-meshed services)',
      ];
  }
}
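The lists above are only suggestions, so one way to act on them is to audit what a service's replicas still have in common. The sketch below is a rough idea of that audit, not a definitive tool: the `replicas` data, its field names, and the `sharedAttributes` helper are hypothetical and chosen purely for illustration.

```javascript
// Hypothetical replicas of one service; every field value is illustrative.
const replicas = [
  { provider: 'aws', region: 'us-east-1', cpuVendor: 'intel', instanceType: 'm5.large' },
  { provider: 'aws', region: 'eu-west-1', cpuVendor: 'amd', instanceType: 'm5a.large' },
  { provider: 'aws', region: 'us-west-2', cpuVendor: 'intel', instanceType: 'c5.large' },
];

// List the attributes whose value is identical on every replica.
// Each one is a place where a single fault could hit all replicas at once.
function sharedAttributes(hosts) {
  if (hosts.length === 0) return [];
  const [first, ...rest] = hosts;
  return Object.keys(first).filter((key) =>
    rest.every((host) => host[key] === first[key])
  );
}

console.log(sharedAttributes(replicas)); // [ 'provider' ]
```

Here the replicas differ by region, CPU vendor, and instance type, but they all share a provider, which matches the default suggestion in `chooseWisely`.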

One thing the blog post made very clear is the level of redundancy and reproducibility offered by cloud providers themselves, which usually offsets edge cases involving correlated hardware failures:

[…] your cloud provider has so much experience building servers that you don’t even see most failures, and for the other failures, you can get back up and running really quickly by renting a new machine in their nearly-limitless pool of compute. It is their job to make sure that you don’t experience downtime, and while they don’t always do it perfectly, they are pretty good at it.

Learning about this concept (even if cloud providers control for it) is really useful when thinking about the many ways that otherwise distributed systems can still fail at a single shared point.

07 Dec 2022 notes
