Case study 03 · February 2026 – March 2026 · Featured

Keeping a Core Business Platform Online During Regional Cloud Instability

Stabilised a core business platform during regional cloud instability, moved critical infrastructure to Frankfurt, migrated large asset storage with AWS DataSync, and brought response times on key operations down from minutes to milliseconds.

By Rusty Lopez

4 min read · 1,654 views · Abu Dhabi, UAE

In plain English

Kept a live CRM/ERP platform usable through regional cloud instability — moved critical infrastructure, validated the full production workflow, and reduced downtime risk without pausing daily operations.

Business value

Kept daily operations running during regional instability.
Reduced risk by validating the full production path.
Improved platform reliability without forcing teams to stop work.
Protected access to critical CRM/ERP workflows used by internal teams.
Created a safer foundation for disaster recovery.

Project snapshot

Role: Production Recovery / Cloud Resilience; Design · Build · Ship
Timeframe: February 2026 – March 2026
Domain: Real Estate System / Cloud Infrastructure
Category: Cloud Infrastructure

TL;DR

Moved critical infrastructure to a safer AWS region while keeping production workflows usable.
Validated the full production path across EC2, RDS, S3, DNS, Cloudflare, Nginx, and app workflows.
Migrated large platform assets with AWS DataSync instead of risky manual file movement.
Reduced response times from minutes to milliseconds after stabilising the recovery path.
Kept business operations running without forcing teams into downtime or manual workarounds.

Regional Cloud Instability

The core business platform had to keep running through a period of regional cloud instability that put system availability, storage access, and daily operations at risk.

Because the platform supports listings, sales, leads, call center activity, operational workflows, and internal business processes, even short periods of slowness or broken asset access could affect multiple teams. The priority was to keep the system usable while moving critical services away from the affected region.

This was not a cosmetic change. The platform is part of the company’s daily operating layer, so latency, broken file access, or unstable routing could directly affect listings, lead handling, agent productivity, and management visibility.

The work focused on stabilising production access, validating AWS infrastructure, restoring reliable access to uploaded assets, and making sure teams could continue using the system with minimal disruption.

Production Risk During Regional Instability

The challenge wasn't only performance. The platform had to stay usable while the regional infrastructure was unstable — and the migration had to happen without breaking production workflows.

The system depended on multiple AWS services working together: EC2 for the application layer, RDS MySQL for the production database, S3 for uploaded files and listing images, AWS DataSync for large asset movement, VPC networking for secure connectivity, Route 53 for DNS routing, Cloudflare for external access protection, and Nginx for application traffic handling.

A single broken dependency could take down agents' work, listings, images, and records all at once. The goal was to recover performance, preserve access to uploaded assets, reduce exposure to the affected region, and validate the full production path before teams resumed normal work.
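To make the "single broken dependency" point concrete, here is a minimal Python sketch of a dependency map. The workflow names and the mapping itself are hypothetical, chosen only to illustrate the idea; they are not taken from the real platform configuration.

```python
from typing import Dict, List, Set

# Hypothetical mapping of user-facing workflows to the services they rely on.
# The real platform's dependency graph is not documented here.
WORKFLOW_DEPENDENCIES: Dict[str, Set[str]] = {
    "edit record": {"route53", "cloudflare", "nginx", "ec2", "rds"},
    "view listing images": {"route53", "cloudflare", "nginx", "ec2", "s3"},
    "browse listings": {"route53", "cloudflare", "nginx", "ec2", "rds", "s3"},
}


def affected_workflows(failed_dependency: str,
                       mapping: Dict[str, Set[str]] = WORKFLOW_DEPENDENCIES) -> List[str]:
    """Return, in alphabetical order, every workflow that needs the failed service."""
    return sorted(name for name, deps in mapping.items() if failed_dependency in deps)


if __name__ == "__main__":
    for service in ("s3", "nginx"):
        print(service, "->", affected_workflows(service))
```

Under these assumed mappings, a storage failure touches only the image-related workflows, while a failure in a shared layer such as the web server touches all of them.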

Ownership

Everything I designed, built, and was accountable for.

Operations

Recovery documentation

Additional scope

Infrastructure recovery planning
Production migration coordination
EC2, RDS, and S3 validation
AWS DataSync transfer verification
Route 53, Cloudflare, and Nginx checks
Application smoke testing
Latency and availability validation

Key decisions

The calls I made, what I rejected, and why: these are the tradeoffs that shaped the system.

Chose

Move critical production services away from the affected region first.

Rejected

Waiting for the regional instability to settle before making infrastructure changes.

Rationale

The platform supported daily operations across listings, leads, sales, call center activity, and internal workflows, so waiting introduced more risk than acting. The priority was to keep the system usable by shifting the most important production path to a more stable region while validating each dependency carefully.

Chose

Validate the full production path, not just whether the server was online.

Rejected

Treating the migration as complete once EC2 and RDS were reachable.

Rationale

A working server does not mean a working business platform. The system depended on application traffic, database access, uploaded assets, DNS routing, Cloudflare, Nginx, and user workflows all working together. The migration only counted as successful once real platform actions could run reliably again.
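One way to express "the migration only counts once every layer works" is a small check runner. The sketch below is illustrative, not the tooling used in this recovery; the function name and the idea of passing checks as callables are assumptions, and the real checks (HTTP requests, database queries, asset fetches) would replace the stubs.

```python
from typing import Callable, Dict, List, Tuple

Check = Callable[[], bool]


def run_production_path_checks(checks: Dict[str, Check]) -> Tuple[bool, List[str]]:
    """Run every named check and return (all_passed, names_of_failed_checks)."""
    failed: List[str] = []
    for name, check in checks.items():
        try:
            ok = bool(check())
        except Exception:
            # A check that raises (timeout, refused connection) counts as failed.
            ok = False
        if not ok:
            failed.append(name)
    return (not failed, failed)


if __name__ == "__main__":
    # Stub checks standing in for real probes of each layer.
    passed, failures = run_production_path_checks({
        "app responds": lambda: True,
        "database query": lambda: True,
        "asset download": lambda: False,
    })
    print("cutover accepted" if passed else f"not done yet: {failures}")
```

The design choice is that a reachable server is just one entry in the dictionary: the overall result stays negative until the workflow-level checks pass too.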

Chose

Use AWS DataSync to move large platform assets safely.

Rejected

Manually copying uploaded files or ignoring asset migration until later.

Rationale

Uploaded files and listing images were part of the live business workflow, not optional media. Broken asset access could affect listings, records, and user trust. Using DataSync created a safer path for moving large existing storage while preserving access to the files teams needed.
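A transfer like this can be double-checked independently by comparing object inventories from both sides. The Python sketch below assumes each inventory is a plain mapping of object key to size in bytes, produced by listing the source and destination buckets; the function and type names are hypothetical, and this is an extra sanity check rather than a description of how DataSync verifies data internally.

```python
from typing import Dict, List, NamedTuple


class InventoryDiff(NamedTuple):
    missing: List[str]        # in source, absent from destination
    size_mismatch: List[str]  # present in both, different byte size
    extra: List[str]          # in destination only


def diff_inventories(source: Dict[str, int], destination: Dict[str, int]) -> InventoryDiff:
    """Compare two key -> size inventories and report the differences."""
    missing = sorted(k for k in source if k not in destination)
    size_mismatch = sorted(
        k for k in source if k in destination and source[k] != destination[k]
    )
    extra = sorted(k for k in destination if k not in source)
    return InventoryDiff(missing, size_mismatch, extra)


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    src = {"listings/1.jpg": 100, "listings/2.jpg": 200, "docs/a.pdf": 50}
    dst = {"listings/1.jpg": 100, "listings/2.jpg": 150}
    print(diff_inventories(src, dst))
```

Comparing sizes only is deliberately conservative: it catches missing and truncated objects cheaply, but it does not prove the contents are identical.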

A platform is not stable just because the server is online. It is stable when the business can keep working without thinking about the infrastructure underneath it.

The Recovery Approach

I led the recovery of the core business platform: moved key services to Frankfurt and validated the full production path across application, database, storage, networking, and routing layers.

The setup involved EC2 for the application/API layer, RDS MySQL for the production database, S3 buckets for uploaded platform assets, AWS DataSync for moving large existing asset data, VPC configuration for secure service connectivity, Route 53 for DNS routing, Cloudflare for external access protection, and Nginx for application traffic handling.

After the migration, I validated platform workflows, verified asset paths, measured latency, confirmed production access, and resolved issues around storage migration. This helped ensure the system was not only moved, but actually usable for real business operations after the cutover.
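Latency measurement of this kind can be sketched in a few lines of Python. The helper names below are assumptions, the operation being timed is a stand-in for a real platform action, and the 95th percentile uses the simple nearest-rank method; this is not the measurement setup used during the recovery.

```python
import math
import statistics
import time
from typing import Callable, Dict, List


def time_operation(operation: Callable[[], object], runs: int = 5) -> List[float]:
    """Call the operation `runs` times and return each duration in milliseconds."""
    samples: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        operation()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def summarise(samples_ms: List[float]) -> Dict[str, float]:
    """Return median, nearest-rank 95th percentile, and maximum of the samples."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_ms)
    rank = math.ceil(0.95 * len(ordered))  # nearest-rank percentile, 1-based
    return {
        "median_ms": statistics.median(ordered),
        "p95_ms": ordered[rank - 1],
        "max_ms": ordered[-1],
    }


if __name__ == "__main__":
    # Stand-in workload; a real check would time an actual platform request.
    print(summarise(time_operation(lambda: sum(range(10_000)), runs=10)))
```

Reporting the median alongside a high percentile matters here because a platform can look fast on average while a few key operations still take far too long.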

Fig. 01 · Cross-region CRM recovery pipeline: AWS migration + DataSync asset transfer

01 · Risk: Regional Instability, 3m Latency, S3 Asset Access, Prod CRM Impact
02 · Recovery: EC2 Frankfurt, RDS Validation, AWS DataSync, VPC Review
03 · Routing: Route 53, DNS Cutover, Nginx Config, Cloudflare
04 · Output: 100ms Response, Asset Recovery, Stable CRM, Business Continuity

Monitoring · validation · recovery loop

Full stack

AWS EC2
AWS RDS
Amazon S3
S3 Cross-Region Replication
VPC
Apache
AWS DataSync
MySQL
Cloudflare
Laravel

Backend: Laravel
Database: AWS RDS, MySQL
Infrastructure: AWS EC2, Amazon S3, S3 Cross-Region Replication, VPC, AWS DataSync, Cloudflare
Also used: Apache

Core stack

AWS
EC2
RDS
S3
Route 53
VPC
DataSync

The goal was not to redesign the cloud architecture overnight. The goal was to keep production alive, validate every dependency, and reduce risk one layer at a time.

Gallery

Cloud infrastructure recovery visual showing regional cloud movement and stability planning. — Critical services were moved toward a more stable recovery path instead of waiting for the affected region to settle.

Production validation checklist covering app, database, storage, DNS, CDN, and workflow checks. — The recovery was validated across the full production path: application, database, assets, DNS, CDN, and real workflows.

Before and after comparison showing platform response time improvement from minutes to milliseconds. — Response times moved from minutes to milliseconds after the recovery path was stabilised and validated.

AWS DataSync asset migration visual for moving large platform files safely. — Large platform assets were moved through AWS DataSync to reduce the risk of manual file migration during recovery.

What Stayed Online

Outcome summary

Regional failover completed
Large S3 assets migrated with DataSync
Production CRM/ERP stabilised
Latency reduced from 3m to 100ms

The recovery restored platform usability and cut response time on key operations from roughly 3 minutes to around 300ms.

Production users were able to continue normal system work, including editing records, accessing listings, and using core modules. Critical infrastructure was stabilised in Frankfurt, while AWS DataSync helped migrate large platform asset storage and preserve access to uploaded files.

The work improved resilience by reducing dependency on the affected region and creating a clearer path for future disaster recovery planning.

Response time: ~3 minutes → ~300 milliseconds.

Response time: Minutes → Milliseconds.
Infrastructure region: Higher-risk regional path → Frankfurt recovery path. Critical services moved.
Asset storage: Large assets at risk during migration → Migrated with AWS DataSync. Safer asset transfer.
Production validation: Server-level checks → Full workflow validation. App, DB, storage, DNS, CDN checked.
Business continuity: Potential operational pause → Daily workflows kept usable. No planned business pause.
Recovery confidence: Region-dependent setup → Validated multi-layer recovery path. Lower production risk.

The work kept daily CRM/ERP operations running during regional cloud instability. Critical infrastructure, storage, routing, and application paths were moved and validated without forcing teams into a business pause.

“Rusty understands the difference between adding features and making software actually usable. He looks at how people work, finds the friction, and improves the system in a way that makes daily operations feel smoother.”

Operations Stakeholder

Internal Platform Team — name under NDA

What I’d Tell Anyone Responsible for a Live Platform

Final note

Next time I'd prepare a formal disaster recovery runbook much earlier. It would cover checks across multiple availability zones, backup validation across regions, DataSync task planning, DNS cutover steps, rollback plans, and a full round of testing once the migration is done.

The point isn't documentation for its own sake. A runbook turns the next regional incident into a process you follow instead of a problem you solve under pressure. It shortens response time, lowers the risk of missing a dependency, and means recovery doesn't rest on one person remembering every step.
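A runbook can also be expressed as data, so that the order of steps and their rollbacks are explicit. The Python sketch below is a hypothetical shape for that idea: the `Step` class, the step names, and the stub actions are all assumptions, and a real runbook would call actual tooling in place of the lambdas.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    apply: Callable[[], bool]     # returns True when the step succeeded
    rollback: Callable[[], None]  # undoes the step if a later one fails


def execute_runbook(steps: List[Step]) -> List[str]:
    """Run steps in order; on the first failure, roll back completed steps in reverse."""
    log: List[str] = []
    completed: List[Step] = []
    for step in steps:
        if step.apply():
            log.append(f"ok: {step.name}")
            completed.append(step)
        else:
            log.append(f"failed: {step.name}")
            for done in reversed(completed):
                done.rollback()
                log.append(f"rolled back: {done.name}")
            break
    return log


if __name__ == "__main__":
    # Stub steps for illustration; the second one fails on purpose.
    steps = [
        Step("sync assets", lambda: True, lambda: None),
        Step("cut over DNS", lambda: False, lambda: None),
        Step("smoke test", lambda: True, lambda: None),
    ]
    for line in execute_runbook(steps):
        print(line)
```

Writing the rollback next to each step forces the question "how do we undo this?" to be answered before the incident, not during it.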

Stable infrastructure isn't only about how you build it. It's about how quickly and calmly you can bring it back when something outside your control goes wrong.

Uptime is not just whether the server responds. It is whether the business can keep working without noticing the infrastructure underneath it.

Stable first. Optimised later. Production kept moving.
