Manufacturing plants pay six-figure costs for every hour of unplanned downtime — and an on-prem AI brain that goes dark at the wrong moment can cascade losses across maintenance, quality, materials, and production planning all at once. This guide is the working SRE plan for an on-prem AI system inside a plant. The failover patterns. The disaster recovery architecture. The backup strategy. The test cadence. Together they deliver 99.9% uptime as a real measurable number, not a marketing one. Production-tested across cement, steel, pharma, and FMCG plants running on NVIDIA DGX hardware, with end-to-end run-time observability and an explicit MTTR budget for every failure class.
On-Prem AI Server Reliability — Failover, DR and Backup Complete Guide
The failover patterns, disaster recovery architecture, and backup strategy that turn 99.9% uptime into a real number rather than a marketing one. Built around NVIDIA DGX, tested across plants, and shipped with explicit MTTR budgets per failure class.
What You Get — Turnkey Reliability Package
Redundant hardware, HA software, DR runbook, and 24×7 monitoring — pre-configured and shipped. Plug in power and Ethernet. The reliability stack is live alongside the AI brain from day one.
Redundant Hardware
NVIDIA DGX HA pair, dual PSU, ECC RAM, NVMe RAID — pre-racked, pre-cabled.
HA Software
Active-passive cluster manager, health checks, automatic failover, leader election.
DR Runbook
Failover scripts, drill procedures, RTO/RPO targets per service, recovery checklists.
24×7 Monitoring
Health probes, alerts, on-call rotation, SLA-backed response times, monthly reports.
What Each "Nine" Actually Costs You in Downtime
Uptime SLAs sound like marketing. They are math. The four tiers below show exactly how much downtime each one allows per year — and which use cases genuinely need each tier. iFactory's standard package targets 99.9% (8.76 hours per year) and routinely measures above it.
99% allows roughly 3.65 days of downtime per year. Acceptable only for non-production, lab, or training environments. Far too loose for any plant-floor AI service.
99.9% allows roughly 8.76 hours per year. iFactory's standard target. Sufficient for predictive maintenance, copilot Q&A, vision QM, and most operator workflows.
99.99% allows roughly 52.6 minutes per year. Upgrade tier. Required for line-blocking decisions — usage decisions on continuous lines, real-time process control loops.
99.999% allows roughly 5.3 minutes per year. Gold standard. Achievable with active-active geo-distributed clusters. Reserved for safety-critical control loop integrations.
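As a quick sanity check on the math, here is a minimal sketch (plain Python, not part of the iFactory stack) that converts an availability target into its annual downtime budget:

```python
# Convert an availability target into allowed downtime per year.
HOURS_PER_YEAR = 365 * 24  # 8,760 hours

def allowed_downtime(availability: float) -> str:
    """Return the annual downtime budget for a given availability target."""
    hours = HOURS_PER_YEAR * (1 - availability)
    if hours >= 1:
        return f"{hours:.2f} hours/year"
    return f"{hours * 60:.1f} minutes/year"

for target in (0.99, 0.999, 0.9999, 0.99999):
    print(f"{target:.3%} -> {allowed_downtime(target)}")
# 99.000% -> 87.60 hours/year
# 99.900% -> 8.76 hours/year
# 99.990% -> 52.6 minutes/year
# 99.999% -> 5.3 minutes/year
```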
Five Tiers of Redundancy — Pick the One That Matches Your Tolerance
Reliability is not a single dial — it is a stack of decisions. The tiers below progress from "no redundancy" to "geographically distributed active-active". Each tier raises both cost and uptime. iFactory ships at Tier 3 (2N) as standard and offers all five.
N — Single Node
One server. Single point of failure. Any hardware fault causes downtime. Acceptable only for development environments or non-critical workloads.
N+1 — One Spare
Production node plus one passive spare. Manual cutover. Survives one hardware failure, but not a second failure while the failed node is being repaired.
2N — Full Duplicate
Two production nodes, automatic failover, leader election. iFactory's standard for plant AI brain. Survives one node loss with no manual intervention.
2N+1 — Duplicate Plus Spare
Active HA pair plus a third spare. Survives a node failure even when one node is in maintenance. Recommended for line-blocking AI loops.
Geo-DR — Across Sites
Primary plus DR site at separate physical location, replicated in near-real-time. Survives site-level outages — power, network, building damage, region-level events.
What Happens in the First 30 Seconds — Second by Second
The sequence below is a real failover from a Tier-3 deployment. The numbers are the iFactory defaults. They are tunable per workload — for line-blocking services we tighten the health-check window to 1.5 seconds.
Primary fault occurs
GPU error, NVMe failure, kernel panic, or network partition. Primary stops responding to traffic.
Health check fails
Three consecutive missed heartbeats (1s interval). Cluster manager flags primary as unhealthy.
Leader election
Secondary promoted via consensus. New leader signals load balancer. Traffic begins rerouting.
Secondary fully active
All AI services (PdM, vision QM, copilot, agents) running on secondary. Pending writes drained from queue.
Operations resumed
Operator queries respond. Sensor ingestion caught up. SRE on-call paged automatically with failure forensics attached.
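To make the detection step concrete, here is a minimal sketch of heartbeat-based failure detection. The 1-second interval and three-missed-beat threshold match the defaults above; the `promote_secondary()` call and the `primary_is_alive` probe are hypothetical placeholders, not the actual iFactory cluster manager API.

```python
import time

HEARTBEAT_INTERVAL_S = 1.0   # default probe interval from the sequence above
MISSED_BEATS_TO_FAIL = 3     # three consecutive misses flag the primary unhealthy

def promote_secondary() -> None:
    # Placeholder for the real promotion path: consensus-based leader election,
    # then signalling the load balancer to reroute traffic to the new leader.
    print("secondary promoted, traffic rerouting")

def monitor(primary_is_alive) -> None:
    """Poll the primary; trigger failover after N consecutive missed heartbeats."""
    missed = 0
    while True:
        if primary_is_alive():
            missed = 0
        else:
            missed += 1
            if missed >= MISSED_BEATS_TO_FAIL:
                promote_secondary()
                return
        time.sleep(HEARTBEAT_INTERVAL_S)
```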
The 3-2-1 Backup Rule, Applied to Plant AI
Three copies of every critical artifact — model weights, vector databases, configuration, SAP cache. Two different storage media types. One copy off-site. The 3-2-1 rule has held up for decades and remains iFactory's default backup posture.
Copies of Data
The original on primary, a second on secondary, and a third in the DR site. Snapshots every 15 minutes, full backups daily.
- Model weights and adapters
- Vector database and indexes
- SAP cache and historian replicas
- Configuration and authorization data
Different Media Types
Hot NVMe storage for active workload and immediate recovery. Cold object storage for long-term retention. Different failure modes per medium.
- NVMe RAID 10 — primary working set
- S3-compatible object store — cold tier
- Separate hardware controllers per tier
- Independent power supplies
Copy Off-Site
One copy at a separate physical location — DR site, neighboring plant, or air-gapped vault. Survives site-level events that take out both primary and secondary.
- Encrypted in transit and at rest
- Geographically separated from primary
- Quarterly recovery test from off-site
- Air-gap variant for regulated workloads
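A minimal sketch of the 3-2-1 layout as data, useful as a compliance check in backup tooling. The target names and the `BackupTarget` type are illustrative, not the actual iFactory backup configuration.

```python
from dataclasses import dataclass

@dataclass
class BackupTarget:
    name: str        # e.g. "primary-nvme", "secondary-nvme", "dr-object-store"
    medium: str      # "nvme" or "object-store"
    offsite: bool

# Hypothetical layout mirroring the 3-2-1 posture described above.
TARGETS = [
    BackupTarget("primary-nvme",    medium="nvme",         offsite=False),
    BackupTarget("secondary-nvme",  medium="nvme",         offsite=False),
    BackupTarget("dr-object-store", medium="object-store", offsite=True),
]

def satisfies_3_2_1(targets: list[BackupTarget]) -> bool:
    """At least 3 copies, on at least 2 media types, with at least 1 off-site."""
    return (
        len(targets) >= 3
        and len({t.medium for t in targets}) >= 2
        and any(t.offsite for t in targets)
    )

assert satisfies_3_2_1(TARGETS)
```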
Primary Plus DR — Continuous Replication, Drill-Tested Quarterly
The DR site is not a "we'll figure it out" plan. It is a fully provisioned mirror of the primary, replicated in near-real-time, drill-tested every quarter, and exercised end-to-end annually with a full controlled failover.
Plant Site
Inside the manufacturing facility — close to PLCs, cameras, sensors, and the SAP application servers. Sub-50ms latency to every signal source.
- NVIDIA DGX HA pair (2N)
- SAP RFC destination, OData gateway
- PLC and sensor live ingestion
- All operator copilots online here
DR Site
Geographically separate — corporate data center, neighboring facility, or colocation. Identical hardware and software, ready to take production traffic on declared failover.
- NVIDIA DGX mirror cluster
- SAP DR endpoint pre-configured
- Replicated model state and vector DB
- Promotable within RTO budget
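The "near-real-time" claim is guarded by a replication-lag check. A minimal sketch follows; the sub-10-second threshold matches the target quoted in the FAQ below, and how the applied-timestamps are obtained depends on the replication layer and is not shown here.

```python
MAX_REPLICATION_LAG_S = 10.0  # lag target quoted in the FAQ below

def replication_lag_ok(primary_applied_at: float, dr_applied_at: float) -> bool:
    """True if the DR site is within the lag budget of the primary.

    Both arguments are epoch timestamps of the last change applied at each site.
    """
    return (primary_applied_at - dr_applied_at) <= MAX_REPLICATION_LAG_S

# Example: the DR site applied its last change 4 seconds behind the primary.
assert replication_lag_ok(primary_applied_at=1_700_000_010.0,
                          dr_applied_at=1_700_000_006.0)
```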
Recovery Time and Recovery Point — Per Service
RTO is how long you can be down. RPO is how much data you can lose. Both are decided per service — not blanket-set for the whole system. The matrix below shows iFactory's standard targets across the AI services.
| Service | Tier | RTO Target | RPO Target | How It Is Achieved |
|---|---|---|---|---|
| Operator copilot | Critical | 30 sec | 0 sec | HA pair with leader election, sticky session retry |
| Vision QM | Critical | 30 sec | 0 sec | HA pair, edge buffering during transition |
| Predictive maintenance | High | 2 min | 15 min | HA pair, time-series replay from sensors |
| SAP write-back queue | High | 1 min | 0 sec | bgRFC queue persistence, exactly-once semantics |
| Historian ingestion | Medium | 5 min | 5 min | Auto-resume from last checkpoint |
| Model retraining | Low | 4 hours | 24 hours | Batch job, idempotent restart from snapshot |
| Reporting and analytics | Low | 8 hours | 24 hours | Cold-tier restore, regenerated from raw |
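One way to keep these targets honest is to encode them as data and compare every drill result against them. A minimal sketch, with the numbers taken from the matrix above and the drill-result fields purely illustrative:

```python
# RTO/RPO targets from the matrix above, in seconds.
TARGETS = {
    "operator_copilot":       {"rto": 30,       "rpo": 0},
    "vision_qm":              {"rto": 30,       "rpo": 0},
    "predictive_maintenance": {"rto": 2 * 60,   "rpo": 15 * 60},
    "sap_writeback_queue":    {"rto": 60,       "rpo": 0},
    "historian_ingestion":    {"rto": 5 * 60,   "rpo": 5 * 60},
    "model_retraining":       {"rto": 4 * 3600, "rpo": 24 * 3600},
    "reporting_analytics":    {"rto": 8 * 3600, "rpo": 24 * 3600},
}

def drill_passed(service: str, measured_rto_s: float, measured_rpo_s: float) -> bool:
    """A drill passes only if both measured values stay within the service's targets."""
    target = TARGETS[service]
    return measured_rto_s <= target["rto"] and measured_rpo_s <= target["rpo"]

# Example: a copilot failover measured at 28 s with no data loss is within budget.
assert drill_passed("operator_copilot", measured_rto_s=28, measured_rpo_s=0)
```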
Every Failure Class, Catalogued With Its Mitigation
An honest SRE plan starts with naming every plausible failure. The catalog below is the iFactory cheat sheet — color-coded by severity, each failure paired with the mitigation that ships enabled by default.
GPU failure
Mitigation — HA pair failover within 30 sec. Failed GPU replaced under DGX service contract; cluster continues on partner node.
Node loss — full
Mitigation — Cluster manager promotes secondary. All AI services resume on the surviving node within MTTR budget.
NVMe / disk failure
Mitigation — RAID 10 absorbs single-disk loss without service impact. Hot-swap during maintenance window.
Power supply failure
Mitigation — Dual PSU configuration. Surviving PSU carries full load; failed unit replaced without downtime.
Network partition
Mitigation — Quorum-based leader election prevents split-brain. Edge buffering on sensor side preserves data during partition.
OOM kill / process crash
Mitigation — Service supervisor restarts a crashed process within seconds. Sustained memory pressure escalates to a scaling event.
Data corruption
Mitigation — Checksums on every snapshot. Last-good restore from cold tier; integrity verified before promotion.
Site outage
Mitigation — DR site promoted via runbook. RTO 5–30 min depending on workload. Quarterly drill-tested.
Reliability That Is Never Tested Is Not Reliability
iFactory ships a drill schedule that exercises every layer of redundancy at increasing scope. Drills happen on actual production hardware against the actual model and data state — not a parallel environment.
Single-Service Failover Drill
Pick one service, force failover, measure MTTR. Rotates through all critical services every quarter. Runbook executed by the on-call SRE.
Full HA Failover
Whole primary node taken offline. Secondary takes complete production traffic. Soak for 24 hours under real shift load before failback.
Site-Level DR Test
Full plant cutover to DR site. Production-floor users opt-in for a planned weekend window. Confirms end-to-end recovery beyond hardware.
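What the single-service drill actually measures is MTTR: the time from forced failure to the service answering health checks again. A minimal sketch, where `force_failover` and `service_healthy` are hypothetical placeholders for the runbook steps:

```python
import time

def measure_mttr(force_failover, service_healthy, poll_interval_s: float = 0.5) -> float:
    """Force a failover, then time how long the service takes to pass health checks again."""
    started = time.monotonic()
    force_failover()                      # e.g. stop the service on the primary node
    while not service_healthy():          # e.g. probe the service's health endpoint
        time.sleep(poll_interval_s)
    return time.monotonic() - started     # MTTR in seconds, compared against the RTO target
```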
Failover As the Operator Experiences It — Almost Nothing
When a primary node fails on a Tier-3 cluster, the operator sees a one-second pause and then their query completes. The dialogue below is what that looks like in the operator chat.
SRE note — primary node fault detected at 14:22:04. Failover completed in 28 seconds. Operator response time was 1.2 seconds — within the sticky-retry budget. No data lost.
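On the client side, that one-second pause comes from a retry wrapper around the copilot request. A minimal sketch, assuming a generic `send_query()` call and an illustrative retry budget rather than the actual operator chat client or its tuned values:

```python
import time

RETRY_BUDGET_S = 2.0   # illustrative; the sticky-retry budget is tuned per deployment
BACKOFF_S = 0.25

def query_with_retry(send_query, prompt: str):
    """Retry the same query until the failover completes or the budget is exhausted.

    The operator sees a short pause instead of an error while traffic reroutes.
    """
    deadline = time.monotonic() + RETRY_BUDGET_S
    while True:
        try:
            return send_query(prompt)
        except ConnectionError:           # stand-in for whatever the client raises
            if time.monotonic() >= deadline:
                raise
            time.sleep(BACKOFF_S)
```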
The 12-Week Reliability Rollout
Reliability is configured during the same 8–12 week build as the AI brain itself. The phases below add the redundancy, the drills, and the runbook on top of the standard deployment timeline.
- Redundant DGX nodes installed, dual PSU, RAID 10 NVMe
- Monitoring stack deployed — Prometheus, alerts, on-call rota
- Backup target provisioned, retention policy approved
- Single-node baseline metrics collected
- HA pair configured, leader election validated
- First failover drill, MTTR measured and tuned
- 3-2-1 backup pipeline live, restore tested
- Replication to DR site started, lag tuned
- Full DR site live, end-to-end failover drilled
- Runbooks signed off by plant IT and SRE
- SLA active, monthly drill cadence begins
- 90-day uptime measurement window opens
What Plants Measure on Reliability After Go-Live
Numbers below are aggregated across iFactory deployments running the standard Tier-3 (2N) reliability package on NVIDIA DGX.
See a live failover demo at SAP Sapphire 2026
Watch a Tier-3 cluster failover end-to-end against live SAP traffic — operator chat continues, sensors keep ingesting, write-back queue drains automatically. Book a 20-minute walkthrough.
Frequently Asked Questions
Do I need to buy NVIDIA servers separately?
No. Fully-loaded NVIDIA DGX AI servers are supplied and installed as part of the iFactory package — including the HA pair, dual PSUs, ECC RAM, and NVMe RAID. They ship pre-racked, pre-cabled, with all NeMo, RAPIDS, NIM, and Agent Toolkit components pre-installed alongside the reliability and monitoring stack. You provide power and Ethernet. We provide the rest.
What does "99.9% uptime" actually mean in practice?
It means up to 8.76 hours per year of cumulative downtime is within the SLA. In practice, iFactory deployments running the standard Tier-3 package measure 99.94% across rolling 90-day windows — comfortably above the 99.9% target. Higher targets are available with Tier-4 and Tier-5 packages.
Is failover automatic or does someone press a button?
Automatic for node-level failures within a cluster. Three missed heartbeats trigger leader election, secondary is promoted, traffic reroutes — no human in the loop. Site-level DR cutover is operator-initiated by design, with a runbook the on-call SRE follows. This prevents accidental DR activation during transient network issues.
How long do backups live before they are deleted?
Tunable per artifact class. Defaults — model snapshots kept 90 days, configuration backups kept 1 year, audit logs kept 7 years for regulated industries. Cold-tier object storage compresses long-term retention cost without hurting recovery times.
Can I run active-active across two sites?
Yes, at Tier-5. Both sites serve production traffic; one is the leader for write paths. Used for plants that need geo-resilience without an RTO penalty during failover. Adds latency budget management at the application layer; recommended only when 99.999% is the genuine target.
What happens if SAP goes down — does the AI go offline too?
No. The AI continues operating against the local cache of SAP master and transactional data. Read-only copilot queries remain live. Write-back queues hold pending transactions until SAP returns and drain in original order with exactly-once semantics. Nothing is lost; nothing is duplicated.
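A minimal sketch of that hold-and-drain behaviour: the queue here is in-memory for illustration only, the production path persists via bgRFC, and `post_to_sap()` is a hypothetical stand-in for the real write-back call.

```python
from collections import deque

class WriteBackQueue:
    """Holds pending SAP transactions while SAP is down; drains them in original order.

    The de-duplication set gives exactly-once drain semantics for this sketch.
    """
    def __init__(self):
        self._pending = deque()
        self._posted_ids = set()

    def enqueue(self, txn_id: str, payload: dict) -> None:
        self._pending.append((txn_id, payload))

    def drain(self, post_to_sap) -> int:
        """Post pending transactions FIFO once SAP is reachable again."""
        posted = 0
        while self._pending:
            txn_id, payload = self._pending[0]
            if txn_id not in self._posted_ids:
                post_to_sap(txn_id, payload)   # raises if SAP is still unreachable
                self._posted_ids.add(txn_id)
                posted += 1
            self._pending.popleft()
        return posted
```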
How are firmware and OS patches applied without downtime?
Rolling upgrade. One node drained, patched, re-validated, then re-joined to the cluster; partner node takes the full load during the window. iFactory schedules patch windows monthly, coordinates with plant maintenance calendars, and rolls back automatically on any post-patch health regression.
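A minimal sketch of that drain-patch-rejoin loop. The node names and the `drain` / `patch` / `rejoin` / `healthy` / `rollback` helpers are hypothetical placeholders for the actual cluster tooling:

```python
def rolling_upgrade(nodes, drain, patch, rejoin, healthy, rollback) -> None:
    """Patch one node at a time; the partner carries full load during each window."""
    for node in nodes:                 # e.g. ["dgx-a", "dgx-b"]
        drain(node)                    # move services and traffic to the partner node
        patch(node)                    # apply firmware / OS updates
        rejoin(node)                   # bring the node back into the cluster
        if not healthy(node):          # post-patch health regression check
            rollback(node)             # revert and stop the rollout
            raise RuntimeError(f"post-patch health check failed on {node}")
```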
What is the recommended distance for the DR site?
Far enough to avoid common-cause site failures — power grid, regional weather, building damage. In practice, 50 km or more is sufficient for most plants. The constraint is replication lag — sub-10-second lag is the iFactory target, which holds comfortably below 200 km on a reasonable network link.
Build Reliability In on Day One. Not on the Day of the First Outage.
Redundant hardware. Drilled-quarterly DR. 3-2-1 backup. 99.9% uptime as a real measured number, not a marketing one. Shipped turnkey on NVIDIA DGX in 8–12 weeks.






