Acoustic & Vibration CNN-Autoencoder — Hear a Failure Before It Happens

By Larry Eilson on May 8, 2026


A bearing rarely fails silently. By the time the temperature rises, the vibration alarm trips, or the lube oil shows iron wear particles, the failure is already weeks deep. The earliest sign that something inside the machine is wrong is usually acoustic. A 2 kHz harmonic that wasn't there last month creeps into the mel-spectrogram of a feed-pump motor. A faint 47 Hz modulation appears on a fan that's been running for ten years. A technician walking the floor would notice some of these on a good day. A continuous CNN-LSTM autoencoder on every asset notices all of them, every shift, using a model trained on what your specific machine sounds like when it's healthy.

iFactory's Acoustic + Vibration Anomaly stack puts an industrial microphone (or taps the existing accelerometer) on each critical asset, streams audio at 16 kHz to an NVIDIA Jetson AGX Orin edge gateway, computes mel-spectrograms over a 2-second sliding window, and runs a CNN-LSTM autoencoder that flags every spectral region the model can't reconstruct from learned-normal. The reconstruction-error map appears on the dashboard, and a CMMS work order is auto-drafted on threshold breach. The model trains on the on-prem NVIDIA RTX PRO 6000 Blackwell + GB300 stack; inference runs at the edge in milliseconds. Live in 6 weeks from PO. See the full pipeline running on a real bearing rig at the iFactory booth, SAP Sapphire Orlando, May 11–13, 2026 (register here).

SAP SAPPHIRE ORLANDO · MAY 11–13, 2026 · LIVE BEARING RIG DEMO
CNN-LSTM AUTOENCODER · MEL-SPECTROGRAM SLIDING WINDOW · ON JETSON AGX ORIN

Acoustic + Vibration CNN-Autoencoder
Hear A Bearing Fail Weeks Before It Trips

Industrial microphones and IEPE accelerometers stream audio and vibration at 16–25.6 kHz to NVIDIA Jetson AGX Orin gateways. Mel-spectrograms computed in 2-second sliding windows. CNN-LSTM autoencoder reconstructs each window from a model of your machine's healthy sound. Reconstruction error above the learned threshold flags an anomaly within 100 ms. Heatmap on the dashboard. Work order drafted in your CMMS. Engineer reviews. Operator commits. The AI never writes to the PLC.

97%+
Bearing-fault detection accuracy reported in peer-reviewed CNN-LSTM-AE studies
2 sec
Sliding-window length per spectrogram inference
Less than 100 ms
Inference latency per window on Jetson AGX Orin DLA
6 weeks
PO to live anomaly score on your floor
Why Audio Matters

The Failure Was Audible Before It Was Measurable — Vibration Trips Are Late Signals

Conventional condition monitoring waits for the RMS vibration value to cross a threshold. By that point the bearing race is spalled, the fan blade is cracked, or the gear tooth is chipped. Acoustic monitoring catches the same fault weeks earlier because the spectral fingerprint changes long before the energy magnitude does. A new harmonic at a bearing's ball-pass frequency, a faint cavitation hiss in a pump volute, a subtle modulation from a loose stator winding — all of these show up on the mel-spectrogram while the overall vibration RMS is still inside its alarm band. Talk to our acoustic AI lead about which assets on your floor would benefit most.
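To make the RMS-versus-spectrum point concrete, here is a minimal sketch in plain Python. It assumes a synthetic signal (a 50 Hz tone standing in for the healthy machine, plus a small 2 kHz harmonic standing in for an early fault); the amplitudes and the helper names `rms` and `goertzel_amplitude` are illustrative choices, not part of the iFactory stack.

```python
import math

def rms(samples):
    """Root-mean-square magnitude: the single number a threshold alarm watches."""
    return math.sqrt(sum(v * v for v in samples) / len(samples))

def goertzel_amplitude(samples, freq_hz, sample_rate):
    """Amplitude of one frequency component, via the Goertzel algorithm."""
    n = len(samples)
    k = round(freq_hz * n / sample_rate)  # nearest DFT bin
    coeff = 2.0 * math.cos(2.0 * math.pi * k / n)
    s1 = s2 = 0.0
    for v in samples:
        s1, s2 = v + coeff * s1 - s2, s1
    power = s1 * s1 + s2 * s2 - coeff * s1 * s2
    return 2.0 * math.sqrt(max(power, 0.0)) / n

SAMPLE_RATE = 16_000  # Hz, matching the audio rate quoted in the article
t = [i / SAMPLE_RATE for i in range(SAMPLE_RATE)]  # one second of samples

# Assumed signals: a clean 50 Hz tone, then the same tone plus a weak 2 kHz harmonic.
healthy = [math.sin(2 * math.pi * 50 * ti) for ti in t]
faulty = [h + 0.05 * math.sin(2 * math.pi * 2000 * ti) for h, ti in zip(healthy, t)]

rms_change_pct = 100.0 * (rms(faulty) / rms(healthy) - 1.0)
tone_healthy = goertzel_amplitude(healthy, 2000, SAMPLE_RATE)
tone_faulty = goertzel_amplitude(faulty, 2000, SAMPLE_RATE)

print(f"RMS change: {rms_change_pct:.3f}%")
print(f"2 kHz amplitude: {tone_healthy:.4f} -> {tone_faulty:.4f}")
```

In this toy case the overall RMS moves by only about a tenth of a percent, while the 2 kHz component goes from essentially zero to a clearly measurable level. Real machines are noisier, so treat this as an illustration of the principle rather than a detection method.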

RMS-THRESHOLD MONITORING
Single-number alarm fires when energy crosses a line

Vibration trip at 7.1 mm/s. Bearing already at end-of-life. Trip is the warning, not the early-warning. The acoustic content has been screaming for weeks but nobody was listening because the only metric tracked was the integrated magnitude. Most failures arrive as surprises this way.

CNN-LSTM AUTOENCODER ON EDGE
Reconstructs each spectrogram window from learned-normal — flags what it cannot reconstruct

The model knows what your specific motor sounds like at this load, this speed, this ambient. New harmonic appears, reconstruction error spikes. The anomaly score climbs days before vibration RMS moves. Engineer sees the spectral region that drove the alert. Time-to-failure projected. CMMS work order drafted.

AI WRITES TO MACHINE CONTROL
The line we don't cross

An audio AI that trips a motor or e-stops a line without a human gate is not an anomaly engine — it's an unvalidated controller bolted onto safety logic. The Acoustic + Vibration stack has no write path to PLC, VFD, or BMS. It scores. It alerts. It drafts. The maintenance lead reviews. Operations decides.

The Pipeline

Six Stages From The Microphone To The Reconstruction-Error Heatmap

Acoustic AI is not just "throw audio at a model". The pipeline matters. Each stage runs at a defined sample rate, has a fixed memory budget, and produces an output the next stage consumes. The non-technical version: sound becomes a picture, the picture goes through a model, the model says how surprising the picture is, surprising pictures become alerts. The technical version is below.

01
CAPTURE · ON THE MACHINE
Microphone or Accelerometer
Continuous 16–25.6 kHz

IP65 industrial microphone in a stainless housing, mounted 0.3–1 m from the asset. Or an existing IEPE accelerometer tapped from the vibration analyser's output. Audio sampled at 16 kHz; vibration at 25.6 kHz. Bandpass filter rejects ambient HVAC and forklift noise.

Output: Raw waveform · 24-bit PCM · ring-buffered 60 s
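A minimal sketch of the capture-side buffering, assuming little-endian 24-bit PCM frames arrive as raw bytes. The byte order, the class name `RingBuffer`, and the helper `decode_pcm24` are illustrative assumptions; the actual gateway driver is not described in this article.

```python
from collections import deque
from itertools import islice

def decode_pcm24(data, byteorder="little"):
    """Decode packed signed 24-bit PCM bytes into a list of ints.

    Byte order is an assumption; any trailing partial sample is ignored.
    """
    usable = len(data) - len(data) % 3
    return [
        int.from_bytes(data[i:i + 3], byteorder, signed=True)
        for i in range(0, usable, 3)
    ]

class RingBuffer:
    """Keeps only the most recent `seconds` of samples (60 s in the article)."""

    def __init__(self, sample_rate=16_000, seconds=60):
        self.sample_rate = sample_rate
        self._buf = deque(maxlen=sample_rate * seconds)

    def extend(self, samples):
        # Oldest samples fall off automatically once maxlen is reached.
        self._buf.extend(samples)

    def __len__(self):
        return len(self._buf)

    def latest(self, seconds):
        """Return the newest `seconds` of audio (or less if not yet filled)."""
        n = min(int(seconds * self.sample_rate), len(self._buf))
        return list(islice(self._buf, len(self._buf) - n, None))

# Tiny demonstration with a 10 Hz "sample rate" so the numbers stay readable.
demo = RingBuffer(sample_rate=10, seconds=2)
demo.extend(range(25))
```

A bounded `deque` gives constant-time appends and drops old audio without any explicit housekeeping, which is why it is a common choice for this kind of buffer.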
02
SEGMENT · IN THE GATEWAY
Sliding Window Extractor
2 s window · 0.5 s hop

Stream chopped into 2-second segments with a 0.5-second hop. This window length is the sweet spot: long enough to span many shaft rotations at typical speeds (60 revolutions at 1,800 rpm), short enough to fit Jetson memory and let the model run faster than real time.

Output: 2 s segments, one every 0.5 s (2 per second) · 32,000 samples each at 16 kHz
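The segmentation step can be sketched as a small generator. This is a minimal illustration assuming the stream is available as an indexable sequence of samples; the function name `sliding_windows` is hypothetical. At 16 kHz, a 2 s window holds 32,000 samples and a 0.5 s hop advances 8,000 samples.

```python
def sliding_windows(samples, sample_rate=16_000, window_s=2.0, hop_s=0.5):
    """Yield overlapping fixed-length windows from a sample sequence.

    Only full windows are produced; a trailing partial window is dropped.
    """
    win = int(window_s * sample_rate)
    hop = int(hop_s * sample_rate)
    for start in range(0, len(samples) - win + 1, hop):
        yield samples[start:start + win]

# Demonstration at a 100 Hz "sample rate": 5 s of data gives 7 windows,
# each 200 samples long, starting 50 samples apart.
demo_windows = list(sliding_windows(list(range(500)), sample_rate=100))
```

With a 0.5 s hop, consecutive windows share 75% of their samples, so a short transient appears in several windows and is harder to miss at a window boundary.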
03
TRANSFORM · ON ORIN GPU
Mel-Spectrogram
Tens of ms per window

Short-time Fourier transform with a 1024-point FFT and 50% overlap, then a mel filterbank with 128 mel bands. Output is a 128 × 128 image-like representation. Mel scaling is the standard preprocessing in peer-reviewed acoustic-anomaly work because it concentrates discriminative energy in the bands where bearing and gear faults live.

Output: 128 × 128 mel-spectrogram tensor per window
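The mel transform rests on two small pieces of arithmetic: the mel scale itself and the number of STFT frames a window produces. The sketch below uses the HTK-style mel formula, which is one common convention and an assumption here; the article does not say which variant, padding, or resizing the product uses to arrive at 128 time frames, so the hop size is left as a parameter.

```python
import math

def hz_to_mel(f_hz):
    """HTK-style mel scale (assumed convention)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_band_edges(n_mels, f_min, f_max):
    """Edge frequencies (Hz) for n_mels triangular filters.

    Returns n_mels + 2 points spaced evenly on the mel scale, so bands are
    narrow at low frequency and wide at high frequency.
    """
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + i * step) for i in range(n_mels + 2)]

def frame_count(n_samples, n_fft, hop):
    """Number of full STFT frames when no padding is applied."""
    return 1 + (n_samples - n_fft) // hop

# 128 bands from 0 Hz to the 8 kHz Nyquist limit of 16 kHz audio.
edges = mel_band_edges(128, 0.0, 8000.0)

# A 2 s window at 16 kHz with a 1024-point FFT and 50% overlap (hop = 512).
frames_50pct_overlap = frame_count(32_000, 1024, 512)  # 61 unpadded frames
```

Libraries differ in padding and mel normalisation, so frame counts and band edges from a production pipeline may not match this sketch exactly.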
04
INFERENCE · ON ORIN DLA
CNN-LSTM Autoencoder
Less than 100 ms per window

Convolutional encoder extracts spatial features from the spectrogram. LSTM stack captures temporal dependency across the 0.5 s hop sequence. Decoder reconstructs the spectrogram. The reconstruction error per pixel is the anomaly score — model trained only on normal data, so anything it can't reconstruct is "not normal".

Output: Anomaly score 0–100 + per-mel-band error map
05
DASHBOARD · ANY SCREEN
Heatmap + Score
Less than 1 s

Anomaly score rendered on the asset tile. Drill-down shows the mel-spectrogram with the suspect band highlighted, the 7-day score trend, and the projected failure mode (bearing inner-race, outer-race, cage, gear-mesh, cavitation, electrical fault). Engineer sees the picture and the curve, not just a number.

Output: Live tile · drill-down · alert hooks (e-mail / SMS / Teams)
06
CMMS · AUTO-DRAFT
Work Order, Reviewed, Released
Seconds to draft

Score above the learned threshold drafts a work order in OxMaint, SAP PM, IBM Maximo, or Infor EAM — pre-filled with the asset, the suspected mode, the dominant mel band, and the projected time-to-failure. Maintenance lead reviews and releases. The AI never auto-releases.

Output: Draft CMMS work order · audit trail of every decision
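The draft-then-review gate in the CMMS stage can be sketched as a small data model. Everything here is hypothetical: the field names, the `WorkOrderDraft` class, and the `release` helper are illustrative and do not represent the OxMaint, SAP PM, IBM Maximo, or Infor EAM APIs. The point is only that the scoring path can create drafts, while release requires a named human.

```python
from dataclasses import dataclass, replace, asdict
from typing import Optional, Tuple
import json

@dataclass(frozen=True)
class WorkOrderDraft:
    asset_id: str
    suspected_mode: str
    dominant_band_hz: Tuple[float, float]
    weeks_to_failure: float
    anomaly_score: float
    status: str = "DRAFT"            # drafts are never created as released
    released_by: Optional[str] = None

def draft_work_order(asset_id, score, threshold, suspected_mode, band_hz, weeks):
    """Return a draft when the score breaches the threshold, else None."""
    if score <= threshold:
        return None
    return WorkOrderDraft(asset_id, suspected_mode, band_hz, weeks, score)

def release(draft, reviewer):
    """Only a named human reviewer can move a draft to RELEASED."""
    if not reviewer:
        raise PermissionError("a human reviewer must release the work order")
    return replace(draft, status="RELEASED", released_by=reviewer)

# Example values are made up for illustration.
draft = draft_work_order("P-12", 82.0, 70.0, "bearing outer-race", (1000.0, 4000.0), 4.0)
payload = json.dumps(asdict(draft))  # what an integration layer might send
```

Making the dataclass frozen means a release produces a new record rather than mutating the draft, which fits the audit-trail requirement described above.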
Model Architecture

CNN-LSTM Autoencoder — How It Sees A Spectrogram And Decides It's Wrong

The model is a hybrid because the signal has two structures. The convolutional layers handle the spatial structure of the spectrogram — patterns across frequency and short time. The LSTM layers handle the temporal structure — how those patterns evolve across the sliding-window sequence. The autoencoder bottleneck forces the network to compress what's important and discard what isn't. At inference time, anything the model can't compress and reconstruct is, by definition, outside the learned-normal distribution. That's the anomaly.

INPUT
Mel-spectrogram
128 × 128

2-second window, 128 mel bands, 128 time frames. One per 0.5 s hop.

CNN ENCODER
3 × Conv2D + ReLU + MaxPool
128 → 64 → 32

Extracts spatial fault signatures across frequency. Filters learn bearing-defect harmonics, gear-mesh sidebands, cavitation noise.

LSTM ENCODER
2 stacked LSTM layers
256 → 128 units

Captures temporal dependency across the sequence of CNN feature maps. Learns how the spectrogram evolves over the 0.5 s hop.

BOTTLENECK
Latent vector
64-d

The compressed essence of "what your machine sounds like normally". Anything that can't be expressed here is novel.

LSTM DECODER
2 stacked LSTM layers
128 → 256 units

Mirror of the encoder. Reconstructs the temporal evolution from the latent vector.

CNN DECODER
3 × ConvTranspose2D + ReLU
32 → 64 → 128

Reconstructs the spectrogram. The closer the reconstruction matches the input, the more "normal" the input was.

OUTPUT
Reconstruction error map
128 × 128

Per-pixel difference between input and reconstruction. Sum is the anomaly score; map shows the suspect band.
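The output stage can be sketched without any deep-learning framework: given an input spectrogram and its reconstruction, compute the per-pixel error, sum it into a score, and rank mel bands by error. Squared error is an assumption (the text above only says "difference"), and the function names are illustrative. Rows are treated as mel bands and columns as time frames.

```python
def error_map(spec, recon):
    """Per-pixel squared error between input and reconstruction."""
    return [
        [(a - b) ** 2 for a, b in zip(row, recon_row)]
        for row, recon_row in zip(spec, recon)
    ]

def anomaly_sum(err):
    """Raw anomaly score: total error over the whole map."""
    return sum(sum(row) for row in err)

def band_errors(err):
    """Mean error per mel band (one value per row)."""
    return [sum(row) / len(row) for row in err]

def top_bands(err, k=3):
    """Indices of the k mel bands with the highest mean error."""
    per_band = band_errors(err)
    return sorted(range(len(per_band)), key=lambda i: per_band[i], reverse=True)[:k]

# Toy 4 x 4 example: the reconstruction misses energy in band 2.
spec = [[0.0] * 4 for _ in range(4)]
spec[2][1] = 3.0
spec[0][3] = 1.0
recon = [[0.0] * 4 for _ in range(4)]

err = error_map(spec, recon)
score = anomaly_sum(err)
suspects = top_bands(err, k=2)
```

In this toy case band 2 carries most of the error, so it would be the band highlighted in a drill-down view. Mapping the raw sum onto a 0–100 scale would need a per-asset calibration that is not specified here.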

Why this beats a single CNN or single LSTM: a CNN alone treats each spectrogram independently, missing slow degradation patterns over many windows. An LSTM alone struggles with the high-dimensional spatial structure of the spectrogram. The hybrid handles both — and that's why peer-reviewed bearing-fault studies on this architecture report detection accuracy above 97%. See the model architecture rendered live in Orlando.

Memory + Latency Profile

Why It Runs On A Jetson AGX Orin And Not A Cloud GPU

For acoustic anomaly to be useful, inference has to be continuous and local. Sending 16 kHz raw audio to a cloud inference endpoint is a non-starter on bandwidth and latency. The whole pipeline is engineered to fit inside the AGX Orin's resource envelope: low memory footprint per model, low compute per window, and the heavy operations pushed onto the dedicated DLA accelerator so the GPU stays free for the rest of the workload. The on-prem RTX PRO 6000 Blackwell + GB300 stack handles model training and retraining; inference stays at the edge.

Resource | Per Asset | Per AGX Orin Gateway | Headroom
Model size | About 12 MB | One model per asset · 30+ assets per Orin | Plenty of LPDDR5 left for buffers
Inference latency | Less than 100 ms per 2 s window | Runs on DLA · GPU stays free | Inference is faster than real time by 20×
Memory at runtime | About 90 MB working set | Less than 4 GB across 30 models | 64 GB LPDDR5 unified · most unused
CPU usage | Less than 5% of one Cortex-A78AE core | Less than 30% across all assets | Plenty for OPC-UA / EtherNet-IP / RTSP work
Audio bandwidth | About 768 kbps per asset (16 kHz × 24-bit × 2 channels) | About 23 Mbps for 30 assets | Local · never leaves the gateway VLAN
Cloud egress | Zero | Zero | Audio stays on-prem · regulated facilities approve
Power per gateway | 15–60 W configurable | One DIN-rail enclosure | Fits inside the existing control panel
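The figures in the resource table follow from simple arithmetic, sketched below so they can be re-run for a different fleet size or sample rate. The constants are the ones quoted in the table; the function names are illustrative.

```python
def audio_bitrate_bps(sample_rate_hz, bit_depth, channels):
    """Uncompressed PCM bitrate in bits per second."""
    return sample_rate_hz * bit_depth * channels

def fleet_bandwidth_mbps(per_asset_bps, assets):
    """Aggregate bandwidth for a fleet, in megabits per second."""
    return per_asset_bps * assets / 1_000_000

def fleet_memory_mb(per_model_mb, assets):
    """Total runtime working set if every asset has its own model."""
    return per_model_mb * assets

def realtime_factor(window_s, latency_s):
    """How many times faster than the window length one inference runs."""
    return window_s / latency_s

per_asset = audio_bitrate_bps(16_000, 24, 2)   # 768,000 bps = 768 kbps
fleet_mbps = fleet_bandwidth_mbps(per_asset, 30)
fleet_mb = fleet_memory_mb(90, 30)
speedup = realtime_factor(2.0, 0.1)            # 100 ms is the quoted upper bound
```

The 20× figure compares inference time with the 2 s window length. Because a new window arrives every 0.5 s per asset, the per-stream margin against the hop interval is smaller, which is worth keeping in mind when sizing assets per gateway.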
The On-Prem Stack

Two Tiers — Edge Inference On Jetson, Training & Retraining On RTX PRO 6000 + GB300

Acoustic anomaly is a two-tier problem. Inference runs on every asset, every second, forever — that lives at the edge on the Jetson AGX Orin. Training and monthly retraining is a one-day-per-month batch job that needs heavyweight GPU compute — that lives on the on-prem RTX PRO 6000 Blackwell digital-twin server, paired with the NVIDIA GB300 Grace Blackwell Ultra for the heavy retraining sweeps. All three nodes ship pre-configured. Walk the rack at the iFactory booth in Orlando.


NVIDIA Jetson AGX Orin
Edge AI gateway · runs CNN-LSTM AE inference on every asset, continuously
Module: NVIDIA Jetson AGX Orin
CPU: 12-core ARM Cortex-A78AE
GPU: 2048-core Ampere + 2x DLA accelerators
Memory: 64 GB unified LPDDR5
PLC: OPC-UA · EtherNet/IP · Modbus TCP client native
Tag sync: Real-time, less than 10 ms PLC latency
Audio I/O: Multi-channel USB / Ethernet audio · 16–48 kHz
Form factor: Industrial DIN-rail enclosure · IP-rated

RTX PRO 6000 Blackwell
Digital twin server · holds the dashboard, the model registry, monthly training
GPU: NVIDIA RTX PRO 6000 Blackwell, 96 GB
CPU: AMD Ryzen 9 9900X · 12-core
RAM: 128 GB DDR5 6000 MHz
Storage: 2 TB NVMe M.2 SSD
Pre-loaded: Dashboard · model registry · audit-log writer
OS: Ubuntu 25
Network: 2.5 Gb Ethernet · IEC 62443 zoned
Form factor: Mid Tower ATX · racked on-site

NVIDIA GB300 Grace Blackwell Ultra
Heavy retraining node · monthly hyperparameter sweeps across all assets
Chip: NVIDIA GB300 Grace Blackwell Ultra Superchip
Memory: 288 GB HBM3e high-bandwidth memory
CPU: 72-core ARM Grace, 2x energy efficiency vs. leading server CPUs
GPU class: Blackwell Ultra · 1.5x dense FP4 over GB200
Cooling: Liquid-cooled · sized to 110% of rated TDP
Network: NVIDIA Spectrum-X · ConnectX-8 SuperNIC
Workload: Multi-asset retraining · hyperparameter sweeps · large batch
Air gap: No public internet path · on-prem only

Why split inference and training: inference must be deterministic and local — the AGX Orin handles that on its DLA accelerator without competing for GPU. Training is bursty and heavy — that goes to the RTX PRO 6000 + GB300 once a month, on data the AGX Orin has already streamed back. Splitting them is how you get sub-100 ms anomaly detection on every asset without sending audio to the cloud.

Failure Modes The Model Catches

Six Real Patterns The CNN-LSTM Autoencoder Surfaces Early

The model doesn't classify by name — it flags spectral regions where reconstruction breaks down. But each common rotating-equipment failure mode produces a recognisable signature on the mel-spectrogram, and the engineer's drill-down view names the suspect mode based on which mel band lights up. Six examples your maintenance lead will recognise.

PATTERN 01
Bearing Outer-Race Defect

What you'd hear: faint repetitive click at the BPFO frequency, modulated at shaft speed.

Where it shows up: a new harmonic series in the 1–4 kHz mel bands, sidebands at 1× shaft speed.

Lead time: typically 3–6 weeks before vibration RMS climbs into alarm.

PATTERN 02
Bearing Inner-Race Defect

What you'd hear: sharper, faster click at BPFI, modulated more strongly by load.

Where it shows up: spectral peak around 2–6 kHz with sidebands at 1× shaft speed.

Lead time: typically 2–5 weeks before RMS alarm.

PATTERN 03
Gear-Mesh Tooth Damage

What you'd hear: a "growl" once per revolution layered over the steady mesh tone.

Where it shows up: sidebands at shaft speed around the gear-mesh frequency in the 0.5–3 kHz range.

Lead time: typically 4–8 weeks; fault grows fast once visible.

PATTERN 04
Pump Cavitation

What you'd hear: a high-frequency hissing or "gravel" sound when suction conditions drop.

Where it shows up: broadband energy lift in the 5–10 kHz mel bands.

Lead time: immediate — model flags cavitation events shift-by-shift, before NPSH alarms.

PATTERN 05
Loose Stator / Electrical Fault

What you'd hear: a 2× line-frequency hum that wasn't there before, often with sidebands.

Where it shows up: tone at 100 Hz (50 Hz mains) or 120 Hz (60 Hz mains) with slip-frequency modulation.

Lead time: 1–4 weeks before motor current signature shows it.

PATTERN 06
Belt & Coupling Misalignment

What you'd hear: low rumble at belt-pass frequency or 2× shaft for misalignment.

Where it shows up: low-frequency mel bands (50–500 Hz) with growing energy.

Lead time: 2–6 weeks; gives you time to schedule a planned realignment.
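The six patterns above can be read as a lookup from frequency range to candidate failure mode. The sketch below encodes the ranges quoted in this section and returns every mode whose range overlaps the band with the highest reconstruction error. Because several ranges overlap, the result is a list of candidates for the engineer to weigh, not a classification; the table structure and function name are illustrative.

```python
# Frequency ranges (Hz) taken from the six patterns in this section.
# The electrical entry spans the two quoted tones (100 Hz and 120 Hz).
MODE_RANGES_HZ = {
    "bearing outer-race defect": (1000.0, 4000.0),
    "bearing inner-race defect": (2000.0, 6000.0),
    "gear-mesh tooth damage": (500.0, 3000.0),
    "pump cavitation": (5000.0, 10000.0),
    "loose stator / electrical fault": (100.0, 120.0),
    "belt & coupling misalignment": (50.0, 500.0),
}

def candidate_modes(band_lo_hz, band_hi_hz):
    """Return all modes whose quoted range overlaps the suspect band."""
    return [
        mode
        for mode, (lo, hi) in MODE_RANGES_HZ.items()
        if band_lo_hz <= hi and band_hi_hz >= lo
    ]

high_band = candidate_modes(7000.0, 8000.0)
mid_band = candidate_modes(2500.0, 2800.0)
hum_band = candidate_modes(100.0, 120.0)
```

A band around 2.5 kHz matches three modes at once, which is why the drill-down view also needs sideband spacing, trend, and operating context before a mode is named.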

Engineer + Operations View

Same Alert, Two Levels — Maintenance Lead & Reliability Engineer

A maintenance lead wants to know: which asset, how urgent, what work order to release. A reliability engineer wants to see the spectrogram, the suspect band, the trend, the model confidence, and decide whether the alert is a real fault or a known transient. Same alert, two levels of detail, served from the same data.

MAINTENANCE LEAD · NON-TECHNICAL
Asset tile, urgency, draft work order
What you see: Asset coloured by anomaly score · 7-day trend arrow · projected mode
What it tells you: "Pump P-12: amber, outer-race bearing suspected, plan replacement in next 4 weeks"
Action: Approve drafted CMMS work order or escalate to reliability engineer
Cost framing: Estimated emergency-vs-planned cost delta if work is deferred
RELIABILITY ENGINEER · TECHNICAL
Spectrogram, error map, suspect band, model confidence
What you see: Live mel-spectrogram · per-band reconstruction error map · top-3 suspect bands
Trend: Score over 7 / 30 / 90 days · per-mode tracking
Confidence: Model variance · steady-state quality · training coverage at this RPM/load
Override: Flag false positive · feeds the next monthly retrain
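One plausible way to derive the 7-day trend arrow on the maintenance-lead tile is a least-squares slope over daily mean scores. This is a sketch under that assumption; the dead-band value and the function names are illustrative, and the product may compute its trend differently.

```python
def slope_per_day(daily_scores):
    """Least-squares slope of score against day index."""
    n = len(daily_scores)
    if n < 2:
        raise ValueError("need at least two daily scores")
    x_mean = (n - 1) / 2
    y_mean = sum(daily_scores) / n
    num = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(daily_scores))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den

def trend_arrow(daily_scores, flat_band=0.5):
    """Classify the trend as 'up', 'down', or 'flat'.

    flat_band is an assumed dead-band (score points per day) that keeps
    small day-to-day noise from flipping the arrow.
    """
    s = slope_per_day(daily_scores)
    if s > flat_band:
        return "up"
    if s < -flat_band:
        return "down"
    return "flat"

rising = trend_arrow([10, 12, 14, 16, 18, 20, 22])
steady = trend_arrow([30, 30, 31, 30, 29, 30, 30])
```

A fitted slope is less sensitive to a single noisy day than comparing only the first and last values, which matters when the arrow drives scheduling decisions.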
6-Week Rollout

From PO To Live Anomaly Score In Six Weeks Flat

Acoustic AI deploys faster than most plant AI work because it doesn't touch your control system. Microphones or accelerometers attach to the asset; the AGX Orin gateway sits on its own VLAN; the RTX PRO 6000 + GB300 server lives in your IT room. Nothing writes to the PLC. Six-week rollout for a 30-asset fleet is the standard timeline. Expand to additional fleets afterward on a schedule operations controls.

PHASE 1 · WEEKS 1–2
Mount · Wire · Stream
Microphones, accelerometers, gateway online
2 weeks

Industrial microphones mounted near critical assets; existing accelerometers tapped where available. AGX Orin gateway racked, configured, on its own VLAN. Audio + vibration streaming. RTX PRO 6000 + GB300 racked in IT room. Asset list signed off with reliability lead.

Deliverable: streaming audio + vibration
PHASE 2 · WEEKS 3–4
Train · Shadow
Per-asset CNN-LSTM-AE trained, scoring shadow
2 weeks

One CNN-LSTM-AE per asset trained on healthy data. Reconstruction-error thresholds calibrated. Shadow-mode scoring runs — visible to reliability engineer, not surfaced to maintenance lead. Known transients (start-up, shutdown, load shifts) characterised and excluded.

Deliverable: shadow scores + per-asset model card
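Threshold calibration in Phase 2 can be illustrated with a simple statistical rule: take the reconstruction errors the model produces on healthy recordings and set the alert line a few standard deviations above their mean. The mean-plus-k-sigma rule and the value k = 3 are assumptions for this sketch; the article does not state which calibration rule is used.

```python
import statistics

def calibrate_threshold(healthy_scores, k=3.0):
    """Threshold = mean + k * sample standard deviation of healthy scores."""
    mu = statistics.mean(healthy_scores)
    sigma = statistics.stdev(healthy_scores)
    return mu + k * sigma

def is_anomalous(score, threshold):
    """A window is flagged only when its score exceeds the threshold."""
    return score > threshold

# Made-up reconstruction errors from healthy shadow-mode windows.
healthy_scores = [1.0, 1.1, 0.9, 1.0, 1.0]
threshold = calibrate_threshold(healthy_scores)
```

With these example values the threshold lands near 1.21, so a window scoring 1.1 passes and one scoring 1.5 is flagged. A percentile-based rule is a common alternative when healthy scores are far from normally distributed.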
PHASE 3 · WEEKS 5–6
Go-Live · Train
Reliability-reviewed alerts, CMMS hook live
2 weeks

Scores promoted from shadow to alert queue. CMMS work-order auto-draft enabled. 2-day on-site training for reliability engineers and maintenance leads. 24x7 remote monitoring active. False-positive override workflow live and feeds the next retrain.

Deliverable: production alerts + trained team
YEAR 1 · ONGOING
Run · Recalibrate
Monthly retrain, quarterly review
12 months

Per-asset models retrained monthly on the GB300 with fresh recordings. Quarterly review with our acoustic AI lead — accepted alert rate, prevented failures, false-positive rate, model drift per asset. Optional after year one. Stack keeps running either way.

Deliverable: quarterly performance pack
What You Get

Microphones, Gateway, Server, Models, Training — One PO

The Acoustic + Vibration stack ships as a turnkey kit: industrial microphones for the assets that need them, the AGX Orin edge gateway, the RTX PRO 6000 + GB300 training pair, the model scaffolding, the dashboard, the CMMS hook, plus our acoustic AI engineers on the floor for sensor placement, training, and reliability handover. 6 weeks from PO. Owned by you outright. No recurring license.

01
AGX Orin Edge Gateway

Pre-configured, DIN-rail mount, OPC-UA / EtherNet-IP / Modbus TCP client. Runs CNN-LSTM-AE inference on every asset on its DLA — GPU stays free for parallel work. Less than 100 ms per 2 s window. 30+ assets per gateway.

02
RTX PRO 6000 + GB300 Training Pair

Pre-racked, burn-in tested, IEC 62443 zoned. RTX PRO 6000 holds the dashboard and model registry; GB300 handles monthly retraining sweeps across all assets. Air-gapped. One-time CapEx. Global shipping included.

03
Microphones & Sensor Kit

IP65 industrial microphones in stainless housings for the assets that need them. IEPE accelerometer taps where existing vibration analysers can be shared. Cabling, mounting, and commissioning by our field engineers.

04
CNN-LSTM-AE Software

Per-asset model scaffolding, mel-spectrogram pipeline, sliding-window extractor, reconstruction-error scorer, anomaly dashboard, drill-down view, e-mail / SMS / Teams alert hooks, audit log. Calibrated to your assets during weeks 3–4.

05
CMMS Auto-Draft Hook

Pre-built integration to OxMaint, SAP PM, IBM Maximo, Infor EAM. Drafts a work order on each alert with asset, suspected mode, dominant mel band, time-to-failure. Maintenance lead reviews and releases. AI never auto-releases.

06
Training, Support & Recalibration

2-day on-site training for reliability engineers and maintenance leads. 24x7 remote monitoring of all stack nodes. Monthly per-asset model retrain. Quarterly performance review with our acoustic AI lead. Optional after year one.

FAQ

What Reliability Engineers & Maintenance Leads Ask First

Does it work in a noisy plant — compressors, forklifts, ambient HVAC?

Yes — the model is trained on your specific machine in your specific ambient. The "normal" it learns includes your background noise. The bandpass filter on the gateway and the mel-scale preprocessing concentrate energy in the frequency bands where rotating-machinery faults actually live, and steady ambient noise becomes part of the learned-normal envelope. Transients (forklift passing, air-cannon firing) are characterised in shadow mode during weeks 3–4 and excluded from alerts.

Do we need to install microphones, or can it use our existing vibration sensors?

Both options work. Where you have an IEPE accelerometer already in service, we tap it via the existing analyser's analog out — no new hardware. Where there's no sensor, we install an IP65 industrial microphone in a stainless housing 0.3–1 m from the asset. Most fleets are a mix. The CNN-LSTM-AE works on either signal because both reduce to the same mel-spectrogram representation.

Why not just use a single CNN, or a single LSTM?

A single CNN treats each spectrogram as an independent image and misses slow degradation patterns that emerge over many windows. A single LSTM struggles with the high-dimensional spatial structure of a 128 × 128 mel-spectrogram. The hybrid CNN-LSTM autoencoder uses convolutions for the spatial pattern and LSTMs for the temporal evolution — peer-reviewed bearing-fault studies on this architecture report detection accuracy above 97% with significantly fewer false positives than rule-based alarms.

How does it know what's a real fault vs. a known event?

During Phase 2 shadow mode, the reliability engineer reviews every shadow alert and tags known events (start-up, shutdown, load step, planned cleaning) so they go into the learned-normal envelope. After go-live, the override workflow lets engineers flag false positives with a reason, and that feedback feeds the next monthly retrain. The model gets sharper over time.
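A minimal sketch of how tagged known events could keep shadow alerts out of the queue: each tag is a labelled time interval, and alerts whose timestamps fall inside any interval are held back. The `KnownEvent` structure and the `(timestamp, score)` alert format are hypothetical, chosen only to illustrate the workflow.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class KnownEvent:
    label: str       # e.g. "start-up", "load step", "planned cleaning"
    start_s: float   # interval start, seconds since some reference time
    end_s: float     # interval end, inclusive

def in_known_event(timestamp_s, events):
    """True when the timestamp falls inside any tagged interval."""
    return any(e.start_s <= timestamp_s <= e.end_s for e in events)

def filter_alerts(alerts, events):
    """Drop alerts that coincide with a tagged known event.

    `alerts` is a list of (timestamp_s, score) tuples.
    """
    return [a for a in alerts if not in_known_event(a[0], events)]

events = [KnownEvent("start-up", 0.0, 120.0), KnownEvent("load step", 900.0, 930.0)]
alerts = [(60.0, 88.0), (400.0, 75.0), (915.0, 91.0), (2000.0, 80.0)]
kept = filter_alerts(alerts, events)
```

Here the alerts at 60 s and 915 s coincide with tagged events and are dropped, leaving the two that need engineer review. Keeping the labels alongside the intervals also gives the monthly retrain a record of why each window was excluded.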

Where does our audio go?

Stays inside your perimeter. Audio is captured by the microphone and processed at the AGX Orin edge gateway; only the mel-spectrogram and anomaly score are forwarded to the on-prem server, not the raw audio. If you choose to retain raw audio, it is stored on the on-prem RTX PRO 6000. The full stack is air-gapped from the public internet by default. No data leaves your zone. Your model is trained on your assets only; we don't share weights between customers.

What if we don't renew support after year one?

The stack keeps running. You own the AGX Orin gateways, the RTX PRO 6000 + GB300 pair, the trained models, the audit logs, and the dashboards. Renew support and monthly retraining annually, run it in-house with our handover docs, or do a mix. No kill switch, no recurring license.


Walk The Pipeline Live At Orlando — Microphone, Spectrogram, Model, Alert

A real bearing rig with a clean bearing on one side and a deliberately damaged bearing on the other. The microphone. The AGX Orin gateway computing the mel-spectrogram. The CNN-LSTM-AE model rendering reconstruction error pixel by pixel. The alert lighting up the dashboard. Bring your asset list and a sample recording — our acoustic AI lead will walk through what the model would surface for your fleet. Can't make Orlando? Schedule a remote walk-through with the same stack.

3 nodes
AGX Orin + RTX PRO 6000 + GB300

6 weeks
PO to live alerts

$0
Recurring license fees

100%
On-prem · you own it
