Disaster Recovery in the Cloud, Pragmatically
Disaster recovery no longer needs a second data centre. How to build a cloud DR plan around real recovery targets, not wishful thinking.
Recovery is a decision, not a product
Disaster recovery in the cloud is often sold as a checkbox — replicate everything, somewhere, and feel safe. But disaster recovery is fundamentally a set of decisions about how much downtime and data loss your business can actually tolerate, and what you are willing to pay to reduce them. Buy the tooling before you make those decisions and you will overspend on things that do not matter and underspend on the few that do.
The cloud genuinely changed the economics here. You no longer need a mirrored physical site sitting idle. But cheaper infrastructure does not remove the need for a real plan.
Two numbers anchor everything: RTO and RPO
Every DR conversation should start with two targets, set per system rather than for the company as a whole:
- Recovery Time Objective (RTO) — how long you can be down before it genuinely hurts. Minutes for a payment system, perhaps a day for an internal wiki.
- Recovery Point Objective (RPO) — how much recent data you can afford to lose. Near-zero for transactions, maybe a few hours for analytics.
These numbers drive cost. Tight RTO and RPO mean hot standby and continuous replication, which is expensive. Looser targets allow cheaper backup-and-restore. Setting them honestly, system by system, is the single most valuable step in the whole exercise.
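To make the link between RPO and backup frequency concrete, here is a minimal sketch that checks whether a periodic backup schedule can meet a given RPO. It assumes each backup captures data as of the moment it starts, so the worst-case loss is the interval between backups plus the time a backup takes to finish. The function names and example figures are illustrative, not taken from any particular tool.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta,
                         backup_duration: timedelta = timedelta(0)) -> timedelta:
    """Oldest the newest usable backup can be at the moment of failure.

    Assumption: a backup reflects the data as of its start time, and a
    failure can strike just before the in-progress backup completes.
    """
    return backup_interval + backup_duration


def meets_rpo(rpo: timedelta,
              backup_interval: timedelta,
              backup_duration: timedelta = timedelta(0)) -> bool:
    """True if the schedule keeps worst-case data loss within the RPO."""
    return worst_case_data_loss(backup_interval, backup_duration) <= rpo


if __name__ == "__main__":
    analytics_rpo = timedelta(hours=4)  # hypothetical target

    # Nightly backups that take an hour cannot satisfy a four-hour RPO.
    print(meets_rpo(analytics_rpo, timedelta(hours=24), timedelta(hours=1)))

    # Hourly backups that take ten minutes can.
    print(meets_rpo(analytics_rpo, timedelta(hours=1), timedelta(minutes=10)))
```

A near-zero RPO fails this check for any practical backup interval, which is why transactional systems tend to need continuous replication rather than scheduled backups.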
Match the strategy to the tier
Not every system deserves the same protection, and pretending otherwise wastes money. We usually sort workloads into three tiers:
- Backup and restore: the cheapest tier. Regular backups go to durable cloud storage and are restored on demand. Fine for systems that can tolerate hours of downtime.
- Warm standby: a scaled-down copy stays running and is scaled up during a failover. A sensible middle ground.
- Hot standby or multi-region active-active: for the handful of systems where even minutes of downtime are unacceptable. It is powerful and genuinely costly, so reserve it for what truly warrants it.
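The tiering described above can be sketched as a small classification function. The cut-off values below are hypothetical placeholders, and the rule of using whichever target is tighter is a simplifying assumption; real thresholds come out of a conversation with the business about cost and impact.

```python
from dataclasses import dataclass
from datetime import timedelta
from enum import Enum


class Strategy(Enum):
    BACKUP_AND_RESTORE = "backup and restore"
    WARM_STANDBY = "warm standby"
    HOT_STANDBY = "hot standby or active-active"


@dataclass(frozen=True)
class RecoveryTargets:
    system: str
    rto: timedelta  # how long the system may be down
    rpo: timedelta  # how much recent data may be lost


# Hypothetical cut-offs, chosen only for illustration.
HOT_LIMIT = timedelta(minutes=15)
WARM_LIMIT = timedelta(hours=4)


def choose_strategy(targets: RecoveryTargets) -> Strategy:
    """Pick the cheapest tier that can plausibly meet both targets.

    Simplification: the tighter of RTO and RPO decides the tier.
    """
    tightest = min(targets.rto, targets.rpo)
    if tightest <= HOT_LIMIT:
        return Strategy.HOT_STANDBY
    if tightest <= WARM_LIMIT:
        return Strategy.WARM_STANDBY
    return Strategy.BACKUP_AND_RESTORE


if __name__ == "__main__":
    inventory = [
        RecoveryTargets("payments", timedelta(minutes=5), timedelta(0)),
        RecoveryTargets("reporting", timedelta(hours=2), timedelta(hours=1)),
        RecoveryTargets("internal wiki", timedelta(days=1), timedelta(hours=12)),
    ]
    for item in inventory:
        print(f"{item.system}: {choose_strategy(item).value}")
```

Keeping the targets in a per-system inventory like this also makes it obvious when a system has no agreed targets at all.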
A backup you have never restored is a hope, not a plan. The restore is the product; the backup is just raw material.
Test, because untested DR usually fails
The uncomfortable truth is that DR plans often fail their first real test. Backups turn out to be incomplete, restore procedures are undocumented, or the one person who knew the steps has left. The fix is dull but decisive: schedule regular restore drills. Actually bring a system back from backup into an isolated environment, time it against your RTO, and check how old the restored data is against your RPO. The first drill is usually humbling, and that is exactly why it is worth doing before a real incident forces the issue.
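A restore drill like the one described above can be wrapped in a small timing harness. This is a sketch under assumptions: `restore` and `verify` are hypothetical callables standing in for whatever actually rebuilds the system in an isolated environment and confirms it works, and verification time is counted as part of recovery.

```python
import time
from dataclasses import dataclass
from datetime import timedelta
from typing import Callable


@dataclass(frozen=True)
class DrillResult:
    system: str
    elapsed: timedelta
    rto: timedelta
    verified: bool

    @property
    def passed(self) -> bool:
        # A drill passes only if the restored system works AND it came
        # back within the recovery time objective.
        return self.verified and self.elapsed <= self.rto


def run_restore_drill(system: str,
                      rto: timedelta,
                      restore: Callable[[], None],
                      verify: Callable[[], bool]) -> DrillResult:
    """Time a restore plus its verification against the RTO."""
    start = time.monotonic()
    restore()            # hypothetical: rebuild from backup in isolation
    verified = verify()  # hypothetical: smoke-test the restored system
    elapsed = timedelta(seconds=time.monotonic() - start)
    return DrillResult(system, elapsed, rto, verified)


if __name__ == "__main__":
    # Stand-ins for a real restore and smoke test.
    result = run_restore_drill(
        "internal wiki",
        rto=timedelta(hours=24),
        restore=lambda: time.sleep(0.1),
        verify=lambda: True,
    )
    print(result.system, result.elapsed, "PASS" if result.passed else "FAIL")
```

Recording each result over time shows whether restores are getting slower as data grows, which is easy to miss when drills are run ad hoc.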
Common cloud DR mistakes
- Replicating data but not the configuration and infrastructure needed to use it.
- Keeping backups in the same region or account as production, so one failure takes out both.
- Forgetting that a DR plan also needs people, runbooks, and access that work when the main systems are down.
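- Assuming the cloud provider's built-in durability is the same thing as your own recovery plan. Durability protects stored data against hardware failure; it does not protect against accidental deletion, corruption, or a compromised account.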
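The second mistake in the list above is easy to check mechanically. The sketch below flags backups that share a failure domain with production. The `Location` type and the account and region strings are hypothetical placeholders; in practice the inventory would come from your own infrastructure definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    account: str
    region: str


def shared_failure_domains(production: Location, backup: Location) -> list[str]:
    """Return the failure domains that production and backup have in common.

    An empty list means neither a single-account nor a single-region
    failure can take out both copies.
    """
    shared = []
    if production.account == backup.account:
        shared.append("account")
    if production.region == backup.region:
        shared.append("region")
    return shared


if __name__ == "__main__":
    production = Location(account="prod-account", region="region-a")

    candidates = {
        "same account, same region": Location("prod-account", "region-a"),
        "same account, other region": Location("prod-account", "region-b"),
        "separate account and region": Location("backup-account", "region-b"),
    }
    for label, backup in candidates.items():
        overlap = shared_failure_domains(production, backup)
        status = "isolated" if not overlap else "shares " + ", ".join(overlap)
        print(f"{label}: {status}")
```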
Keep it documented and current
Infrastructure as code makes cloud DR far more reliable, because your recovery environment can be rebuilt from definitions rather than memory. Pair that with a short, current runbook — who declares an incident, where backups live, the restore steps, how to verify success — and store it somewhere reachable when production is unreachable.
Disaster recovery is not the same as high availability
These two terms get blurred, and the confusion leads to gaps. High availability is about surviving small, expected failures without anyone noticing — a server dies, traffic shifts, the service stays up. Disaster recovery is about surviving large, rare events that take out whole regions, accounts, or datasets, where some interruption is accepted in exchange for getting the business back. You need both, and they are not substitutes. A beautifully redundant single-region setup still goes dark if that region or your account is compromised, which is exactly the scenario DR exists to cover.
Account for the human side of recovery
Technology aside, real incidents are run by tired people under pressure, often at an inconvenient hour. Plans that assume calm, fully informed operators tend to fail. So decide in advance who has the authority to declare a disaster and trigger failover, make sure more than one person can perform the critical steps, and confirm that the access and credentials needed for recovery do not themselves depend on the systems that are down. A runbook that lives only in the head of someone on leave is not a plan.
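The requirement that more than one person can perform each critical step can be audited with a few lines of code. This is a minimal sketch, assuming you keep a simple mapping from recovery steps to the people trained to perform them; the step names, people, and the minimum of two are illustrative.

```python
from typing import Iterable, Mapping


def under_covered_steps(coverage: Mapping[str, set[str]],
                        unavailable: Iterable[str] = (),
                        minimum: int = 2) -> list[str]:
    """List recovery steps that fewer than `minimum` available people can do.

    `coverage` maps each critical step to the set of people trained on it.
    `unavailable` models people on leave or otherwise unreachable.
    """
    away = set(unavailable)
    return sorted(
        step for step, people in coverage.items()
        if len(people - away) < minimum
    )


if __name__ == "__main__":
    # Hypothetical team and steps.
    coverage = {
        "declare incident": {"alice", "bob"},
        "restore database": {"carol"},
        "switch DNS": {"alice", "bob", "carol"},
    }

    print(under_covered_steps(coverage))
    # ['restore database']

    print(under_covered_steps(coverage, unavailable={"alice"}))
    # ['declare incident', 'restore database']
```

Running the check with each person marked unavailable in turn shows how exposed the plan is to a single absence.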
How BSH can help
BSH Technologies designs pragmatic cloud disaster recovery on GCP and AWS, starting from honest RTO and RPO targets and ending with restore drills that prove it works. If your current plan has never actually been tested, that is the first thing we would help you fix.