A switch failed and what did we do? Nothing

The switch in question is definitely dead. No lights. No fans. Unresponsive.

Each Brightbox Cloud rack switch carries public and private network connectivity for around 20 physical hosts. Each host may run a hundred or more cloud servers. But none of our customers were affected.

The reason we didn’t have to panic is that sitting right next to the dead switch was another identical switch that was alive and well. All hosts connected to the dead switch were also simultaneously connected to the healthy switch.

Each physical host at Brightbox has at least two network interfaces, which are connected to two different switch devices in parallel, and use link aggregation to behave like a single interface. Should an individual interface lose its link, traffic is rebalanced across the others within milliseconds.

This means that when a network interface, a switch or even a cable fails, our customers aren’t affected.
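
To make the failover behaviour above a little more concrete, here is a minimal Python sketch of how an aggregated link might spread traffic and rebalance it. It is an illustration only, not our actual configuration: the interface names `eth0` and `eth1`, the flow identifiers and the `pick_link` helper are all hypothetical, and the CRC32 hash is a simplified stand-in for whatever hashing a real bonding driver uses.

```python
import zlib


def pick_link(flow_id: str, links: dict) -> str:
    """Pick an active link for a flow by hashing its identifier.

    `links` maps an interface name to True (link up) or False (link down).
    Hypothetical helper: a real bonding driver does this in the kernel.
    """
    active = sorted(name for name, up in links.items() if up)
    if not active:
        raise RuntimeError("no active links left in the bond")
    # Hash the flow so the same flow always lands on the same active link.
    return active[zlib.crc32(flow_id.encode("utf-8")) % len(active)]


# Two interfaces, each cabled to a different switch (assumed names).
links = {"eth0": True, "eth1": True}
flows = [f"10.0.0.{i}:443" for i in range(1, 9)]

before = {flow: pick_link(flow, links) for flow in flows}

# The switch behind eth0 dies, so that interface loses its link.
links["eth0"] = False
after = {flow: pick_link(flow, links) for flow in flows}

for flow in flows:
    print(f"{flow}: {before[flow]} -> {after[flow]}")
```

Every flow that was on the failed interface ends up on the surviving one, which is roughly what happened when our switch died, only without the kernel and the hardware doing the work.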

We received an alert from our monitoring system and were able to replace the switch with a spare within a few hours without any interruption to service!

N+1 hardware

Of course, switch failures like this are pretty rare, but with enough switches and network cards you’ll see them often enough. From day one of Brightbox (back in 2007!) we’ve had an “N+1” policy for hardware: N is the minimum number of devices needed at any point to provide a service, and +1 means we run (at least) one more than that.

We do the same for hard disks: we use RAID10 and RAID6 to ensure that a hard disk failure doesn’t cause outages. Our hosts have dual power supplies, and each power supply is connected to a separate power circuit. Each circuit is in turn connected to a separate battery-backed power supply (and finally to diesel generators).

Our internet connectivity isn’t any different: multiple points of connection from different upstream suppliers.
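
The N+1 rule described above boils down to a one-line check. The sketch below is only an illustration under stated assumptions: `has_n_plus_one` and the `inventory` table are hypothetical names, and the counts are the ones mentioned in this post (two switches per rack and two power supplies per host, where one of each would be the bare minimum).

```python
def has_n_plus_one(running: int, needed: int) -> bool:
    """Return True if at least one spare device runs beyond the minimum."""
    return running >= needed + 1


# (running, needed) per component; illustrative figures, not a real inventory.
inventory = {
    "rack switches": (2, 1),
    "host power supplies": (2, 1),
}

report = {
    name: has_n_plus_one(running, needed)
    for name, (running, needed) in inventory.items()
}

for name, ok in report.items():
    print(f"{name}: {'N+1 satisfied' if ok else 'single point of failure'}")
```

A component running only the bare minimum, such as `has_n_plus_one(1, 1)`, fails the check, and that is exactly the situation the policy exists to avoid.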

Many of these techniques are common industry standards for serious cloud infrastructure. Budget cloud providers can save some money by cutting corners, but if uptime is important to you, this is what you need.

But at Brightbox, we actually go even further…

N+1 datacentres

We have two entirely separate UK datacentres (or Zones). Our zones are connected by multiple high-speed, diversely routed links, which provide both load balancing and redundancy.

So, the reason we don’t have to care so much when something like a switch fails is that we designed our system from the ground up to withstand it.

Serious about uptime?

If you’re serious about your uptime, then give Brightbox Cloud a whirl. You can sign up in 2 minutes and build a cloud server in 30 seconds. We’ll also give you an automatic £20 free credit to start you off!

Photo credit: Andrewfhart / Flickr
