Keeping a Core Business Platform Online During Regional Cloud Instability
Stabilised a core business platform during regional cloud instability, moved critical infrastructure to Frankfurt, migrated large asset storage with AWS DataSync, and brought response times on key operations down from minutes to milliseconds.

LIVE SIGNAL · CLOUD RECOVERY · V1.0 · PROD
In plain English
Kept a live CRM/ERP platform usable through regional cloud instability — moved critical infrastructure, validated the full production workflow, and reduced downtime risk without pausing daily operations.
Business value
- Kept daily operations running during regional instability.
- Reduced risk by validating the full production path.
- Improved platform reliability without forcing teams to stop work.
- Protected access to critical CRM/ERP workflows used by internal teams.
- Created a safer foundation for disaster recovery.
Role: Production Recovery / Cloud Resilience (Design · Build · Ship)
Timeframe: February 2026 – March 2026
Domain: Real Estate System / Cloud Infrastructure
Category: Cloud Infrastructure
- Moved critical infrastructure to a safer AWS region while keeping production workflows usable.
- Validated the full production path across EC2, RDS, S3, DNS, Cloudflare, Nginx, and app workflows.
- Migrated large platform assets with AWS DataSync instead of risky manual file movement.
- Reduced response times from minutes to milliseconds after stabilising the recovery path.
- Kept business operations running without forcing teams into downtime or manual workarounds.
Regional Cloud Instability
The core business platform had to keep running through a period of regional cloud instability that put system availability, storage access, and daily operations at risk.
Because the platform supports listings, sales, leads, call center activity, operational workflows, and internal business processes, even short periods of slowness or broken asset access could affect multiple teams. The priority was to keep the system usable while moving critical services away from the affected region.
This was not a cosmetic change. The platform is part of the company’s daily operating layer, so latency, broken file access, or unstable routing could directly affect listings, lead handling, agent productivity, and management visibility.
The work focused on stabilising production access, validating AWS infrastructure, restoring reliable access to uploaded assets, and making sure teams could continue using the system with minimal disruption.
Production Risk During Regional Instability
The challenge wasn't only performance. The platform had to stay usable while the regional infrastructure was unstable — and the migration had to happen without breaking production workflows.
The system depended on multiple AWS services working together: EC2 for the application layer, RDS MySQL for the production database, S3 for uploaded files and listing images, AWS DataSync for large asset movement, VPC networking for secure connectivity, Route 53 for DNS routing, Cloudflare for external access protection, and Nginx for application traffic handling.
A single broken dependency could take down agents' work, listings, images, and records all at once. The goal was to recover performance, preserve access to uploaded assets, reduce exposure to the affected region, and validate the full production path before teams resumed normal work.
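The idea of checking every dependency before teams resume work can be sketched as a simple layer-by-layer checklist. This is an illustrative Python sketch, not the tooling used in this recovery: `LayerCheck`, `validate_path`, and the placeholder probes are hypothetical names, and real probes would need to resolve DNS, query the database, fetch an asset, and so on.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class LayerCheck:
    layer: str                 # human-readable layer name, e.g. "RDS MySQL"
    probe: Callable[[], bool]  # returns True when the layer looks healthy


def validate_path(checks: List[LayerCheck]) -> List[str]:
    """Run every probe and return the names of the layers that failed."""
    failed = []
    for check in checks:
        try:
            healthy = check.probe()
        except Exception:
            healthy = False  # a probe that raises counts as a failed layer
        if not healthy:
            failed.append(check.layer)
    return failed


# Placeholder probes (assumptions): each lambda stands in for a real check.
example_checks = [
    LayerCheck("Route 53 / DNS", lambda: True),
    LayerCheck("Cloudflare", lambda: True),
    LayerCheck("Nginx", lambda: True),
    LayerCheck("EC2 application", lambda: True),
    LayerCheck("RDS MySQL", lambda: False),  # simulated failure
    LayerCheck("S3 assets", lambda: True),
]

failed_layers = validate_path(example_checks)
print(failed_layers)
```

Running every probe rather than stopping at the first failure is a deliberate choice in this sketch: during a migration it is useful to see all broken layers at once.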
Ownership
Everything I designed, built, and was accountable for.
Operations
- Recovery documentation
Additional scope
- Infrastructure recovery planning
- Production migration coordination
- EC2, RDS, and S3 validation
- AWS DataSync transfer verification
- Route 53, Cloudflare, and Nginx checks
- Application smoke testing
- Latency and availability validation
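One way to approach the transfer verification item above is an independent comparison of object listings from the source and destination buckets. The sketch below is a hedged illustration, not the procedure used here: `compare_manifests`, `TransferReport`, and the sample keys are hypothetical, and it assumes each listing has already been collected into a mapping of object key to size in bytes.

```python
from typing import Dict, List, NamedTuple


class TransferReport(NamedTuple):
    missing: List[str]        # keys in the source but absent from the destination
    size_mismatch: List[str]  # keys in both, but with different byte sizes


def compare_manifests(source: Dict[str, int], dest: Dict[str, int]) -> TransferReport:
    """Compare two {object_key: size_in_bytes} listings."""
    missing = sorted(key for key in source if key not in dest)
    size_mismatch = sorted(
        key for key in source if key in dest and source[key] != dest[key]
    )
    return TransferReport(missing, size_mismatch)


# Hypothetical sample data standing in for real bucket listings.
source_manifest = {
    "listings/101/front.jpg": 204800,
    "listings/101/plan.pdf": 51200,
    "listings/102/front.jpg": 307200,
}
dest_manifest = {
    "listings/101/front.jpg": 204800,
    "listings/101/plan.pdf": 4096,  # simulated truncated copy
}

report = compare_manifests(source_manifest, dest_manifest)
print(report)
```

A size comparison only catches missing and truncated objects. Checking content integrity would need checksums, which this sketch leaves out.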
Key decisions
The calls I made, what I rejected, and why: these are the tradeoffs that shaped the system.
Decision: Move critical production services away from the affected region first.
Rejected: Waiting for the regional instability to settle before making infrastructure changes.
Why: The platform supported daily operations across listings, leads, sales, call center activity, and internal workflows, so waiting introduced more risk than acting. The priority was to keep the system usable by shifting the most important production path to a more stable region while validating each dependency carefully.
A platform is not stable just because the server is online. It is stable when the business can keep working without thinking about the infrastructure underneath it.
The Recovery Approach
I led the recovery of the core business platform: moved key services to Frankfurt and validated the full production path across application, database, storage, networking, and routing layers.
The setup involved EC2 for the application/API layer, RDS MySQL for the production database, S3 buckets for uploaded platform assets, AWS DataSync for moving large existing asset data, VPC configuration for secure service connectivity, Route 53 for DNS routing, Cloudflare for external access protection, and Nginx for application traffic handling.
After the migration, I validated platform workflows, verified asset paths, measured latency, confirmed production access, and resolved issues around storage migration. This helped ensure the system was not only moved, but actually usable for real business operations after the cutover.
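Measuring latency after a cutover can be as simple as timing a key operation several times and comparing the median against a budget. The following is a minimal Python sketch under stated assumptions: `measure_latency` and `within_budget` are hypothetical helpers, and `operation` stands in for whatever request or workflow is being checked (for example, loading a listing record).

```python
import statistics
import time
from typing import Callable, Dict


def measure_latency(operation: Callable[[], object], runs: int = 5) -> Dict[str, float]:
    """Time `operation` `runs` times and summarise the results in milliseconds."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        operation()
        samples.append((time.perf_counter() - start) * 1000.0)
    return {
        "min_ms": min(samples),
        "median_ms": statistics.median(samples),
        "max_ms": max(samples),
    }


def within_budget(stats: Dict[str, float], budget_ms: float) -> bool:
    """True when the median latency is at or below the budget."""
    return stats["median_ms"] <= budget_ms


# Stand-in operation (assumption): replace with a real request in practice.
example_stats = measure_latency(lambda: sum(range(1000)), runs=5)
print(example_stats, within_budget(example_stats, budget_ms=500.0))
```

The median is used rather than the mean so that a single slow outlier does not hide an otherwise healthy path.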
monitoring · validation · recovery loop
The goal was not to redesign the cloud architecture overnight. The goal was to keep production alive, validate every dependency, and reduce risk one layer at a time.
What Stayed Online
- Regional failover completed
- Large S3 assets migrated with DataSync
- Production CRM/ERP stabilised
- Latency reduced from ~3 minutes to ~300ms
The recovery restored platform usability and cut response time on key operations from roughly 3 minutes to around 300ms.
Production users were able to continue normal system work, including editing records, accessing listings, and using core modules. Critical infrastructure was stabilised in Frankfurt, while AWS DataSync helped migrate large platform asset storage and preserve access to uploaded files.
The work improved resilience by reducing dependency on the affected region and creating a clearer path for future disaster recovery planning.
Response time: ~3 minutes → ~300 milliseconds.
- Minutes → milliseconds
- Critical services moved
- Safer asset transfer
- App, DB, storage, DNS, CDN checked
- No planned business pause
- Lower production risk
The work kept daily CRM/ERP operations running during regional cloud instability. Critical infrastructure, storage, routing, and application paths were moved and validated without forcing teams into a business pause.
“Rusty understands the difference between adding features and making software actually usable. He looks at how people work, finds the friction, and improves the system in a way that makes daily operations feel smoother.”
Operations Stakeholder
Internal Platform Team — name under NDA
What I’d Tell Anyone Responsible for a Live Platform
Next time I'd prepare a formal disaster recovery runbook much earlier. It would cover checks across multiple availability zones, backup validation across regions, DataSync task planning, DNS cutover steps, rollback plans, and a full round of testing once the migration is done.
The point isn't documentation for its own sake. A runbook turns the next regional incident into a process you follow instead of a problem you solve under pressure. It shortens response time, lowers the risk of missing a dependency, and means recovery doesn't rest on one person remembering every step.
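A runbook of this kind can also be expressed as ordered steps, each paired with its own rollback. The sketch below is an illustration in Python under stated assumptions, not an existing runbook: `Step`, `run_runbook`, and the example step names are hypothetical, and the actions are placeholders for real operations.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    apply: Callable[[], None]     # performs the step; raises on failure
    rollback: Callable[[], None]  # undoes the step


def run_runbook(steps: List[Step]) -> List[str]:
    """Apply steps in order; on failure, roll back completed steps in reverse."""
    log: List[str] = []
    done: List[Step] = []
    for step in steps:
        try:
            step.apply()
        except Exception as exc:
            log.append(f"FAILED {step.name}: {exc}")
            for prev in reversed(done):
                prev.rollback()
                log.append(f"ROLLED BACK {prev.name}")
            return log
        done.append(step)
        log.append(f"OK {step.name}")
    return log


def _simulated_failure() -> None:
    raise RuntimeError("health check failed")


# Placeholder steps (assumptions) with a simulated failure at the end.
example_steps = [
    Step("validate backups", lambda: None, lambda: None),
    Step("DNS cutover", lambda: None, lambda: None),
    Step("post-cutover smoke test", _simulated_failure, lambda: None),
]

example_log = run_runbook(example_steps)
for line in example_log:
    print(line)
```

Writing the rollback next to each step means the undo path is defined when the step is designed, not during the incident.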
Stable infrastructure isn't only about how you build it. It's about how quickly and calmly you can bring it back when something outside your control goes wrong.
Uptime is not just whether the server responds. It is whether the business can keep working without noticing the infrastructure underneath it.
Stable first. Optimised later. Production kept moving.
Need a safer platform before the next incident?
I help teams stabilise live systems, validate infrastructure, and reduce production risk without forcing the business to stop.