A modern factory generates staggering volumes of sensor data. 5,000 sensors reporting at 1-second intervals produce 432 million data points per day — 157 billion per year. Add vibration waveforms sampled at 25 kHz and the numbers explode further.

Traditional relational databases — MySQL, PostgreSQL without time-series extensions, SQL Server — collapse under this load. Queries on 100 million rows take minutes. Inserts queue, backpressure builds, and data is lost. Analytics teams wait hours for dashboards to render, and AI models train on yesterday's data instead of today's.

The technology to handle this scale exists: purpose-built time-series databases like InfluxDB (now at version 3.0, with Apache Parquet columnar storage and ingestion rates in the millions of points per second), TimescaleDB (a PostgreSQL extension with hypertables and continuous aggregates), and ClickHouse (a columnar analytics engine for complex queries at scale). Stream processing platforms like Apache Kafka and AWS Kinesis handle the ingestion firehose. But choosing the right combination — and sizing it correctly for your specific data volume, query patterns, and retention requirements — is where most factory data projects fail.

We design complete sensor data pipelines for greenfield factories: ingestion from MQTT brokers, stream processing for real-time analytics, time-series database deployment with optimized storage tiering, query patterns tuned for dashboard performance, and AI-ready data lakes for model training — so your data infrastructure scales from commissioning day through full production without redesign.

Schedule a Demo
Why Relational Databases Fail at Factory Scale
Write Bottleneck
Relational databases maintain B-tree indexes on every insert. At 5,000 writes per second (one per sensor per second), index updates consume more CPU than the data itself. At 50,000 writes per second (vibration + process data), the database falls behind, queues build, and data loss begins. Time-series databases use append-only storage engines (TSM, hypertables) optimized for sequential timestamped writes — handling millions of writes per second.
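The write-path difference can be sketched in miniature. The toy `WriteBatcher` below (a hypothetical name, not any real client API) buffers points and emits them as large sequential batches, which is how TSDB client libraries amortize per-write overhead instead of paying an index update per point.

```python
from collections import deque

class WriteBatcher:
    """Groups individual sensor points into batches so the storage
    engine does one sequential append instead of N index updates."""
    def __init__(self, batch_size=5000):
        self.batch_size = batch_size
        self.buffer = []
        self.batches = deque()  # stands in for the TSDB write API

    def add(self, timestamp, sensor_id, value):
        self.buffer.append((timestamp, sensor_id, value))
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.batches.append(self.buffer)  # one sequential write
            self.buffer = []

batcher = WriteBatcher(batch_size=1000)
for i in range(2500):  # 2,500 points: 2 full batches + a remainder
    batcher.add(i, f"sensor-{i % 50}", 20.0 + i * 0.001)
batcher.flush()
```

Real clients (Telegraf, the InfluxDB line-protocol writers, TimescaleDB `COPY` batching) apply the same principle with durability and retry logic layered on top.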
Query Performance
A dashboard query asking "average temperature for Machine 1 over the last 8 hours" requires scanning millions of rows in a relational table. Without time-based partitioning, the database performs a full table scan. Query time: 30 seconds to 5 minutes. Time-series databases partition data by time automatically (hypertables, shards) and pre-compute aggregates (continuous aggregates, retention policies) — the same query returns in under 100 milliseconds.
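A minimal sketch of why pre-aggregation wins: the rollup below buckets 1-second readings into 1-minute min/max/avg/count at ingest time, so an 8-hour dashboard window reads 480 aggregate rows instead of scanning 28,800 raw points. Function and key names are illustrative, not any database's API.

```python
def rollup_1min(points):
    """points: iterable of (epoch_seconds, value).
    Returns {minute_start: (min, max, avg, count)}, a stand-in
    for a continuous aggregate maintained at ingest time."""
    buckets = {}
    for ts, v in points:
        key = ts - (ts % 60)
        buckets.setdefault(key, []).append(v)
    return {k: (min(vs), max(vs), sum(vs) / len(vs), len(vs))
            for k, vs in buckets.items()}

# 8 hours of 1-second data for one sensor
raw = [(t, 100.0 + (t % 60) * 0.1) for t in range(0, 8 * 3600)]
agg = rollup_1min(raw)
# The dashboard reads 480 pre-computed rows instead of 28,800 raw points.
```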
Storage Bloat
Relational databases store each row with full overhead (row headers, null bitmaps, index entries). A single 8-byte float64 sensor reading ends up consuming 40-80 bytes once that overhead is counted. Time-series databases use columnar compression (Parquet, Gorilla encoding for timestamps, delta encoding for monotonic values), achieving 10-20x compression — the same point stored in 4-8 bytes instead of 40-80.
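Delta encoding, one building block of that compression, is simple to sketch: regularly spaced timestamps collapse into a run of identical small deltas that run-length or Gorilla-style encoders then store in a few bits each. This illustrates the idea, not any engine's actual codec.

```python
def delta_encode(values):
    """Store the first value plus successive differences; regular
    timestamps collapse to a run of identical small deltas."""
    out = [values[0]]
    for prev, cur in zip(values, values[1:]):
        out.append(cur - prev)
    return out

def delta_decode(deltas):
    """Reverse the encoding by cumulative summation."""
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out

timestamps = list(range(1_700_000_000, 1_700_000_000 + 3600))  # 1 hour @ 1 s
deltas = delta_encode(timestamps)
# After the first entry, every delta is 1: trivially compressible.
```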
No Lifecycle Management
Factory data has clear lifecycle requirements: 1-second resolution for 7 days (dashboards), 1-minute averages for 90 days (trending), hourly aggregates for 2 years (reporting), daily summaries forever (compliance). Relational databases don't support automatic downsampling or retention-based deletion. Time-series databases do this natively — data ages through resolution tiers automatically.
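The retention ladder above can be expressed as a small lookup. The cut-offs below mirror the text (7 days, 90 days, 2 years, forever) but are assumptions to tune per deployment.

```python
from datetime import timedelta

# Cut-offs mirror the retention ladder in the text; adjust per site.
TIERS = [
    (timedelta(days=7), "hot", "1-second"),
    (timedelta(days=90), "warm", "1-minute"),
    (timedelta(days=730), "cold", "hourly"),
    (timedelta.max, "archive", "daily"),
]

def tier_for(age: timedelta):
    """Map a data point's age to its storage tier and resolution."""
    for cutoff, tier, resolution in TIERS:
        if age < cutoff:
            return tier, resolution

tier = tier_for(timedelta(days=30))  # ("warm", "1-minute")
```

A TSDB's retention policies and continuous-aggregate jobs apply exactly this mapping, except the data moves automatically instead of being classified on read.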
Still running factory sensors on MySQL or SQL Server? Schedule a demo to see how purpose-built time-series databases handle 432 million points per day with sub-100ms query response — at 10-20x less storage cost.
Pipeline Architecture
Ingestion
Sensor data arrives via MQTT from edge gateways (Sparkplug B payloads). An MQTT broker (HiveMQ, EMQX, Mosquitto) handles fan-in from all machines. For high-throughput factories (>100K messages/sec), Apache Kafka or AWS Kinesis Data Streams serves as the ingestion backbone — providing durable, replayable message queuing with exactly-once semantics. Kafka partitions by machine ID for parallel processing. Retention: 7 days in Kafka for replay capability.
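Partitioning by machine ID comes down to a deterministic key-to-partition mapping, so all of one machine's readings stay ordered on a single partition. The sketch below uses md5 to stay dependency-free; Kafka's default partitioner actually hashes the record key with murmur2, and the partition count is whatever the topic was created with.

```python
import hashlib

NUM_PARTITIONS = 32  # assumption; size to your broker configuration

def partition_for(machine_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Deterministically map a machine ID to a partition number.
    md5 here is a stand-in for Kafka's murmur2 key hashing."""
    digest = hashlib.md5(machine_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

p1 = partition_for("machine-005")
p2 = partition_for("machine-005")  # same key, same partition, same ordering
```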
Stream Processing
A stream processor (Kafka Streams, Apache Flink, or Telegraf for simpler pipelines) consumes from Kafka topics and performs data validation (range checks, type enforcement), timestamp normalization (PTP-synchronized), unit conversion, deadband filtering (only write changed values), and real-time aggregation (1-minute rollups computed on the fly). Output: clean, validated, timestamped data points ready for storage. Processing latency: <100ms end-to-end.
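Deadband filtering plus range validation is the simplest of those stages to show. The closure below (illustrative, not a Flink or Telegraf API) keeps per-sensor state and drops readings that are out of range or that moved less than the deadband since the last accepted write.

```python
def make_pipeline(lo, hi, deadband):
    """Returns a stateful filter: rejects out-of-range readings and
    suppresses values that moved less than `deadband` since the
    last accepted write for that sensor."""
    last = {}
    def process(sensor_id, value):
        if not (lo <= value <= hi):
            return None  # validation failure: drop
        prev = last.get(sensor_id)
        if prev is not None and abs(value - prev) < deadband:
            return None  # inside deadband: skip the write
        last[sensor_id] = value
        return value
    return process

proc = make_pipeline(lo=-40.0, hi=150.0, deadband=0.5)
# 999.0 fails the range check; 20.1 and 20.4 fall inside the deadband.
accepted = [v for v in (20.0, 20.1, 20.4, 21.0, 999.0, 19.0)
            if proc("t1", v) is not None]
```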
Time-Series Storage
Validated data is written to the time-series database: InfluxDB for pure time-series workloads with maximum ingestion speed, TimescaleDB when SQL compatibility and relational joins are required, or ClickHouse for complex analytical queries at massive scale. Hot tier stores 7-30 days at full resolution. Continuous aggregates pre-compute common dashboard queries (1-min, 5-min, 1-hour rollups). Query response: <100ms for dashboard time ranges.
Serving & Consumption
Grafana dashboards query the TSDB directly for real-time and recent historical data. REST/gRPC APIs serve data to MES, CMMS, and custom applications. AI/ML training pipelines read from the data lake (cold tier) via Apache Arrow or Parquet files — batch reads of months or years of historical data for model training. Anomaly detection models query the hot tier for real-time inference.
Time-Series Database Comparison
| Feature | InfluxDB 3.0 | TimescaleDB | ClickHouse |
|---|---|---|---|
| Architecture | Purpose-built TSDB; columnar (Apache Parquet) | PostgreSQL extension with hypertables | Columnar OLAP engine |
| Query Language | SQL (native in v3), InfluxQL (Flux deprecated) | Full PostgreSQL SQL | ClickHouse SQL dialect |
| Write Speed | Millions of points/sec; best raw ingestion | Hundreds of thousands/sec; excellent with batching | Millions of rows/sec; batch-optimized |
| Query Speed | Fast for simple time-range queries | Best for complex joins + aggregations (3.5-71x faster) | Fastest for analytical/OLAP queries |
| High Cardinality | Improved in v3 but historically weak | Excellent — handles millions of unique tags | Excellent — columnar design scales naturally |
| Compression | 10-20x (Gorilla + delta + Parquet) | 10-15x (PostgreSQL TOAST + TimescaleDB compression) | 10-40x (LZ4/ZSTD columnar compression) |
| Ecosystem | Telegraf, Grafana, Kapacitor | Full PostgreSQL ecosystem (PostGIS, pg_partman, BI tools) | Grafana, Metabase, dbt, Kafka Connect |
| Best For | Pure IoT/sensor ingestion; monitoring pipelines | IoT + relational analytics; SQL-native teams | Large-scale analytical queries; data warehousing |
| Deployment | Cloud, on-prem, edge (open source core) | Cloud (Tiger), on-prem, self-managed PostgreSQL | Cloud (ClickHouse Cloud), on-prem, self-managed |
Storage Tiering: Hot / Warm / Cold / Archive
Hot Tier (NVMe SSD, 7-30 Days, Full Resolution)
Every data point at its original sample rate (1-second, 100ms, or whatever the sensor produces). Stored on NVMe SSD for sub-100ms query response. Pre-computed continuous aggregates for dashboard performance. Storage: ~50 GB per 1,000 sensors per month at 1-second intervals (after compression). This is your real-time operations tier — dashboards, alerts, anomaly detection, and shift reports all query here.
Warm Tier (SSD/HDD, ~90 Days, 1-Minute Aggregates)
1-minute aggregates (min/max/avg/count per minute) on SSD or fast HDD. Full-resolution data is downsampled automatically by continuous aggregate jobs, cutting the point count 60x (60 one-second samples collapse into one aggregate row per minute). Supports trending analysis, monthly reporting, and shift-by-shift comparisons. Query response: <1 second for typical time ranges. Automated migration from hot to warm via retention policies — zero manual intervention.
Cold Tier (Object Storage, 2+ Years)
Hourly or daily aggregates on object storage (S3, Azure Blob, MinIO). Parquet files in a data lake for long-term retention at minimal cost. Supports annual reports, regulatory compliance, equipment lifecycle analysis, and AI model training on historical data. Query response: seconds to minutes (acceptable for historical analysis). Storage cost: $0.02-$0.05/GB/month on object storage vs $0.10-$0.30/GB/month on SSD.
Archive Tier (Glacier-Class Storage, Permanent)
Daily aggregates (min/max/avg per day per sensor) in compressed Parquet on cold storage or S3 Glacier. Permanent retention for equipment lifecycle records, warranty claims, and regulatory audit trails. Storage cost: $0.004/GB/month (S3 Glacier). A 10-year archive for 5,000 sensors occupies less than 50 GB — effectively free to store.
Query Optimization for Dashboards
Pre-Computed Aggregates
Dashboards showing "last 8 hours at 1-minute resolution" should never query raw 1-second data and aggregate on the fly. Continuous aggregates (TimescaleDB) or retention policy downsampling (InfluxDB) pre-compute 1-minute min/max/avg during ingestion. The dashboard reads pre-computed results — query time drops from 5 seconds to 50 milliseconds. Design rule: every dashboard panel should query a pre-computed aggregate, never raw data.
Materialized Views
Frequently accessed cross-machine comparisons (OEE by line, downtime Pareto, quality trends) are materialized as views that refresh every 5-60 seconds. The view computation runs once; all concurrent dashboard users read the cached result. Without materialization, 20 concurrent dashboard users execute 20 identical queries — each scanning the same data. With materialization: one query, 20 cache hits.
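The refresh-and-cache behavior can be sketched as a tiny TTL cache: the expensive query runs once per refresh interval, and every concurrent reader inside that interval gets the cached result. Class and field names are illustrative, not a database feature.

```python
import time

class MaterializedView:
    """Caches an expensive query result; concurrent dashboard
    readers share one computation per refresh interval."""
    def __init__(self, compute, refresh_seconds=30):
        self.compute = compute
        self.refresh_seconds = refresh_seconds
        self._value = None
        self._computed_at = None
        self.compute_count = 0  # how many times the query actually ran

    def read(self, now=None):
        now = time.monotonic() if now is None else now
        stale = (self._computed_at is None
                 or now - self._computed_at >= self.refresh_seconds)
        if stale:
            self._value = self.compute()
            self._computed_at = now
            self.compute_count += 1
        return self._value

# Three reads inside the 30 s window hit the cache; the fourth refreshes.
view = MaterializedView(lambda: {"oee_line_1": 0.87}, refresh_seconds=30)
results = [view.read(now=t) for t in (0, 1, 2, 31)]
```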
Partitioning by Machine
Time-series databases partition by time automatically. Adding a secondary partition by machine_id (TimescaleDB space partitioning, InfluxDB tag-based sharding) ensures that queries filtered by machine scan only relevant partitions. A query for "Machine 5 vibration last hour" skips all data from Machines 1-4 and 6-1000 — reducing I/O by 99% in a 1,000-machine factory.
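Partition pruning is just set filtering over chunk metadata. In the sketch below a partition is a (machine_id, hour-chunk) pair; a query for one machine over a two-hour range touches 2 of 240 chunks, and everything else is never read from disk. The 3600-second chunk length is an assumption.

```python
def prune(partitions, machine_id, start, end, chunk_seconds=3600):
    """partitions: list of (machine_id, chunk_start) hourly chunks.
    Returns only the chunks a query for one machine and one time
    range must scan; all other chunks are skipped entirely."""
    return [p for p in partitions
            if p[0] == machine_id
            and start < p[1] + chunk_seconds
            and p[1] < end]

# 10 machines x 24 hourly chunks = 240 partitions of metadata
parts = [(f"m{m}", h * 3600) for m in range(10) for h in range(24)]
scanned = prune(parts, "m5", start=3 * 3600, end=5 * 3600)  # 2 of 240
```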
Edge Pre-Processing
Not every data point needs to reach the central database. Edge gateways compute local statistics (RMS, peak, average) and publish aggregated results — reducing data volume 10-100x before it hits the pipeline. Example: a vibration sensor sampling at 25 kHz generates 25,000 samples per second. The edge gateway computes RMS, peak frequency, and kurtosis every second and publishes 3 values instead of 25,000. The central TSDB stores the metrics; raw waveforms are retained locally at the edge and pulled on demand.
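The per-second reduction is easy to sketch with the standard library. Peak amplitude stands in here for the FFT-derived peak frequency to keep the example dependency-free; a real gateway would add a spectral step.

```python
import math

def edge_features(samples):
    """Reduce one second of waveform to three scalars:
    RMS, peak amplitude, and (population) kurtosis."""
    n = len(samples)
    mean = sum(samples) / n
    rms = math.sqrt(sum(x * x for x in samples) / n)
    peak = max(abs(x) for x in samples)
    var = sum((x - mean) ** 2 for x in samples) / n
    m4 = sum((x - mean) ** 4 for x in samples) / n
    kurt = m4 / (var * var) if var else 0.0
    return {"rms": rms, "peak": peak, "kurtosis": kurt}

# A synthetic 25,000-sample 50 Hz sine stands in for one second
# of vibration; 25,000 samples reduce to 3 published values.
wave = [math.sin(2 * math.pi * 50 * t / 25_000) for t in range(25_000)]
feats = edge_features(wave)
```

For a pure sine the expected values are RMS = 1/√2 ≈ 0.707, peak = 1.0, kurtosis = 1.5 — a handy sanity check for the gateway code.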
AI Training Data Lake
Parquet Export Pipeline
Historical sensor data exported from the time-series database to Parquet files in the data lake (S3, Azure Data Lake, MinIO) on a daily or weekly schedule. Parquet's columnar format is optimal for ML training — read only the columns you need (timestamp, temperature, vibration_rms) without loading the rest. Apache Arrow provides zero-copy reads for Python/PyTorch/TensorFlow. A year of 5,000-sensor data at 1-minute resolution: approximately 2.6 billion rows, compressed to ~100 GB in Parquet.
Feature Store Integration
Pre-computed features (rolling averages, lag values, FFT components, statistical moments) stored alongside raw data in the feature store. ML training jobs read features directly — no recomputation per training run. Feature definitions version-controlled and reproducible. In greenfield: feature computation pipelines designed during data architecture phase, running on the stream processor (Flink/Kafka Streams) in real-time and batch-exported to the feature store daily.
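Two of the most common features, a trailing rolling mean and a lag value, can be sketched in a few lines. In production these run on the stream processor, but the arithmetic is identical; names are illustrative.

```python
def rolling_mean(values, window):
    """Trailing-window mean; the first window-1 slots have no
    complete window and are left as None."""
    out, s = [], 0.0
    for i, v in enumerate(values):
        s += v
        if i >= window:
            s -= values[i - window]
        out.append(s / window if i >= window - 1 else None)
    return out

def lag(values, k):
    """Shift the series forward by k steps (earliest k slots empty)."""
    return [None] * k + values[:-k]

readings = [10.0, 12.0, 11.0, 13.0, 14.0]
features = {
    "mean_3": rolling_mean(readings, 3),
    "lag_1": lag(readings, 1),
}
```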
Labeled Data Management
Maintenance events (from CMMS), quality defects (from QMS), and downtime records (from MES) joined with sensor data by timestamp to create labeled training datasets. The join happens in the data lake — not in the operational TSDB. Label quality directly determines AI model quality: every CMMS work order must include precise start/end timestamps and failure mode classification for the labels to be useful for supervised learning.
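The timestamp join itself is straightforward; as noted above, the hard part is work-order timestamp quality. A hypothetical sketch using half-open [start, end) windows:

```python
def label_windows(sensor_rows, work_orders):
    """sensor_rows: (timestamp, value) pairs.
    work_orders: (start, end, failure_mode) triples from the CMMS.
    Labels every reading inside a work order's [start, end) window
    with its failure mode, 'healthy' otherwise."""
    labeled = []
    for ts, v in sensor_rows:
        label = "healthy"
        for start, end, mode in work_orders:
            if start <= ts < end:
                label = mode
                break
        labeled.append((ts, v, label))
    return labeled

rows = [(t, 0.5) for t in range(0, 100, 10)]
orders = [(30, 60, "bearing_wear")]  # one CMMS work order
dataset = label_windows(rows, orders)
```

At scale this join runs as a batch job over Parquet in the data lake (e.g. Spark or DuckDB), never against the operational TSDB.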
Model Training Infrastructure
GPU-accelerated training servers (on-premise NVIDIA A100/H100 or cloud instances) read from the data lake via high-throughput connections. Data pipeline provides train/validation/test splits with temporal awareness — no data leakage from future to past. Model versioning (MLflow, Weights & Biases) tracks which dataset version trained which model. Inference models deployed back to the edge, reading from the hot TSDB tier for real-time prediction.
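The temporal split is worth pinning down in code, because a shuffled split silently leaks future information into training. A minimal sketch (the split fractions are illustrative):

```python
def temporal_split(rows, train_frac=0.7, val_frac=0.15):
    """Chronological split: train on the oldest data, validate on
    the middle, test on the newest. Never shuffle a time series
    before splitting, or the model trains on the future."""
    rows = sorted(rows, key=lambda r: r[0])
    n = len(rows)
    i = round(n * train_frac)
    j = i + round(n * val_frac)
    return rows[:i], rows[i:j], rows[j:]

data = [(t, t * 0.1) for t in range(1000)]  # (timestamp, value)
train, val, test = temporal_split(data)
```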
Key Benefits & ROI
432 Million Points Per Day. Sub-100ms Queries. Zero Data Loss.
iFactory designs sensor data pipelines and time-series database architectures for greenfield factories — MQTT ingestion, stream processing, TSDB deployment, storage tiering, query optimization, and AI data lakes — scaled to your sensor count and operational from commissioning day.
Your Sensors Are Talking at 432 Million Points Per Day. Is Anyone Listening?
Purpose-built time-series databases with optimized pipelines handle the volume, velocity, and variety of factory sensor data — at 10-20x less storage cost and 100x faster queries than relational databases.