Sensor Data Pipeline and Time-Series Database Design in 2026

By Lamine Yamal on April 1, 2026


A modern factory generates staggering volumes of sensor data: 5,000 sensors reporting at 1-second intervals produce 432 million data points per day — 157 billion per year. Add vibration waveforms sampled at 25 kHz and the numbers explode further.

Traditional relational databases — MySQL, PostgreSQL without time-series extensions, SQL Server — collapse under this load. Queries on 100 million rows take minutes. Inserts queue, and backpressure causes data loss. Analytics teams wait hours for dashboards to render, and AI models train on yesterday's data instead of today's.

The technology to handle this scale exists: purpose-built time-series databases like InfluxDB (now at version 3.0, with Apache Parquet columnar storage and ingestion of millions of data points per second), TimescaleDB (a PostgreSQL extension with hypertables and continuous aggregates), and ClickHouse (a columnar analytics engine for complex queries at scale). Stream processing platforms like Apache Kafka and AWS Kinesis handle the ingestion firehose. But choosing the right combination — and sizing it correctly for your specific data volume, query patterns, and retention requirements — is where most factory data projects fail.

We design complete sensor data pipelines for greenfield factories: ingestion from MQTT brokers, stream processing for real-time analytics, time-series database deployment with optimized storage tiering, query patterns tuned for dashboard performance, and AI-ready data lakes for model training — so your data infrastructure scales from commissioning day through full production without redesign.

The Numbers Don't Lie: Factory Data at Scale
5,000 sensors × 1 sample/sec = 432 million points/day = 157 billion points/year.
Add vibration waveforms at 25 kHz across 50 sensors, and that is 108 billion additional samples per day.
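The volume figures above follow from simple arithmetic. A minimal Python sketch (the function name is my own, the figures are from the text):

```python
# Back-of-envelope sensor data volume calculator.
SECONDS_PER_DAY = 86_400

def points_per_day(sensors: int, hz: float) -> int:
    """Data points generated per day by `sensors` sampling at `hz` Hz."""
    return int(sensors * hz * SECONDS_PER_DAY)

process = points_per_day(5_000, 1)        # 432,000,000 points/day
vibration = points_per_day(50, 25_000)    # 108,000,000,000 samples/day
yearly = process * 365                    # ~157.7 billion points/year

print(f"process: {process:,}/day  vibration: {vibration:,}/day  yearly: {yearly:,}")
```

Running this reproduces the headline numbers: 432 million process points per day and 108 billion vibration samples per day.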

Why Relational Databases Fail at Factory Scale


Write Bottleneck

Relational databases maintain B-tree indexes on every insert. At 5,000 writes per second (one per sensor per second), index updates consume more CPU than the data itself. At 50,000 writes per second (vibration + process data), the database falls behind, queues build, and data loss begins. Time-series databases use append-only storage engines (TSM, hypertables) optimized for sequential timestamped writes — handling millions of writes per second.


Query Performance

A dashboard query asking "average temperature for Machine 1 over the last 8 hours" requires scanning millions of rows in a relational table. Without time-based partitioning, the database performs a full table scan. Query time: 30 seconds to 5 minutes. Time-series databases partition data by time automatically (hypertables, shards) and pre-compute aggregates (continuous aggregates, retention policies) — the same query returns in under 100 milliseconds.


Storage Bloat

Relational databases store each row with full overhead (row headers, null bitmaps, index entries). A single float64 sensor reading of 8 bytes requires 40-80 bytes of storage overhead. Time-series databases use columnar compression (Parquet, Gorilla encoding for timestamps, delta encoding for monotonic values) achieving 10-20x compression — the same data stored in 4-8 bytes per point instead of 40-80.
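To illustrate why time-series encodings compress so well, here is a minimal delta-encoding sketch (not any database's actual codec): regularly spaced timestamps reduce to a run of identical small deltas, which a downstream varint or run-length stage then shrinks to a fraction of a byte per value.

```python
def delta_encode(timestamps: list[int]) -> list[int]:
    """Store the first timestamp, then only the successive differences.
    1-second samples encode to [first, 1, 1, 1, ...], which run-length
    or varint coding compresses far below 8 bytes per value."""
    if not timestamps:
        return []
    return [timestamps[0]] + [b - a for a, b in zip(timestamps, timestamps[1:])]

def delta_decode(deltas: list[int]) -> list[int]:
    """Invert delta_encode by accumulating the differences."""
    out, acc = [], 0
    for i, d in enumerate(deltas):
        acc = d if i == 0 else acc + d
        out.append(acc)
    return out

ts = [1_735_689_600 + i for i in range(5)]   # five 1-second samples
enc = delta_encode(ts)                        # [1735689600, 1, 1, 1, 1]
assert delta_decode(enc) == ts
```

Gorilla-style codecs go one step further (delta-of-delta for timestamps, XOR for floats), but the principle is the same: exploit the regularity of sequential sensor data.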


No Lifecycle Management

Factory data has clear lifecycle requirements: 1-second resolution for 7 days (dashboards), 1-minute averages for 90 days (trending), hourly aggregates for 2 years (reporting), daily summaries forever (compliance). Relational databases don't support automatic downsampling or retention-based deletion. Time-series databases do this natively — data ages through resolution tiers automatically.
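The downsampling step that moves data between tiers can be sketched in a few lines of Python. This is the idea behind continuous aggregates, not any product's implementation; it rolls 1-second points into the per-minute min/max/avg/count the warm tier retains:

```python
from collections import defaultdict

def downsample(points, bucket_seconds=60):
    """Roll (timestamp, value) points into per-bucket min/max/avg/count,
    keyed by the bucket's start timestamp."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % bucket_seconds].append(value)
    return {
        start: {
            "min": min(vals),
            "max": max(vals),
            "avg": sum(vals) / len(vals),
            "count": len(vals),
        }
        for start, vals in sorted(buckets.items())
    }

raw = [(t, 20.0 + (t % 3)) for t in range(120)]   # two minutes of 1 Hz data
rollup = downsample(raw)
assert len(rollup) == 2 and rollup[0]["count"] == 60
```

Keeping min, max, average, and count (rather than average alone) preserves the statistical shape of the raw signal while cutting the point count 60x.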

Still running factory sensors on MySQL or SQL Server? Schedule a demo to see how purpose-built time-series databases handle 432 million points per day with sub-100ms query response — at 10-20x less storage cost.

Pipeline Architecture

Ingest
MQTT Broker → Stream Processor

Sensor data arrives via MQTT from edge gateways (Sparkplug B payloads). MQTT broker (HiveMQ, EMQX, Mosquitto) handles fan-in from all machines. For high-throughput factories (>100K messages/sec): Apache Kafka or AWS Kinesis Data Streams as the ingestion backbone — providing durable, replayable message queuing with exactly-once semantics. Kafka partitions by machine ID for parallel processing. Retention: 7 days in Kafka for replay capability.
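Partitioning by machine ID means hashing the key to a partition number so every message from one machine lands on the same partition, preserving per-machine ordering while spreading load across consumers. A minimal sketch of that mapping (Kafka's Java client defaults to murmur2 hashing; md5 here is just for illustration):

```python
import hashlib

def partition_for(machine_id: str, num_partitions: int = 12) -> int:
    """Deterministically map a machine ID to a partition index.
    Same key -> same partition, so per-machine message order is preserved."""
    digest = hashlib.md5(machine_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

p1 = partition_for("machine-005")
p2 = partition_for("machine-005")
assert p1 == p2 and 0 <= p1 < 12
```

In practice you simply set the machine ID as the Kafka message key and let the client's partitioner do this for you.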

Process
Real-Time Stream Processing

Stream processor (Kafka Streams, Apache Flink, or Telegraf for simpler pipelines) consumes from Kafka topics and performs: data validation (range checks, type enforcement), timestamp normalization (PTP-synchronized), unit conversion, deadband filtering (only write changed values), and real-time aggregation (1-minute rollups computed on the fly). Output: clean, validated, timestamped data points ready for storage. Processing latency: <100ms end-to-end.
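Two of the steps above, range validation and deadband filtering, are small enough to sketch directly (hypothetical helper names, not a real Flink/Telegraf API). The deadband filter is stateful: it emits a point only when the value has moved more than a threshold from the last emitted value:

```python
def validate(value: float, lo: float, hi: float) -> bool:
    """Range check: reject physically impossible readings."""
    return lo <= value <= hi

def make_deadband_filter(threshold: float):
    """Stateful deadband filter: pass the first point, then only points
    that differ from the last emitted value by more than `threshold`."""
    last = None
    def accept(value: float) -> bool:
        nonlocal last
        if last is None or abs(value - last) > threshold:
            last = value
            return True
        return False
    return accept

accept = make_deadband_filter(0.5)
stream = [20.0, 20.1, 20.2, 21.0, 21.1]
emitted = [v for v in stream if validate(v, -40, 150) and accept(v)]
assert emitted == [20.0, 21.0]
```

On slowly changing process values, deadband filtering alone often cuts write volume by an order of magnitude, since most consecutive readings are near-identical.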

Store
Time-Series Database (Hot Tier)

Validated data written to time-series database: InfluxDB for pure time-series workloads with maximum ingestion speed, TimescaleDB when SQL compatibility and relational joins are required, or ClickHouse for complex analytical queries at massive scale. Hot tier stores 7-30 days at full resolution. Continuous aggregates pre-compute common dashboard queries (1-min, 5-min, 1-hour rollups). Query response: <100ms for dashboard time ranges.
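For InfluxDB, the write format is line protocol: `measurement,tags fields timestamp`. A minimal builder sketch below (real client libraries such as influxdb3-python handle escaping, type suffixes, and batching; this only shows the record shape):

```python
def line_protocol(measurement: str, tags: dict, fields: dict, ts_ns: int) -> str:
    """Build one InfluxDB line-protocol record:
    measurement,tag1=v1,tag2=v2 field1=v1 timestamp_ns
    Tags are sorted, which helps server-side processing."""
    tag_str = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_str = ",".join(f"{k}={v}" for k, v in fields.items())
    return f"{measurement},{tag_str} {field_str} {ts_ns}"

line = line_protocol(
    "temperature",
    {"machine": "m005", "line": "A"},
    {"value": 72.4},
    1_735_689_600_000_000_000,
)
assert line == "temperature,line=A,machine=m005 value=72.4 1735689600000000000"
```

Batching many such lines per HTTP write (thousands at a time) is what makes the millions-of-points-per-second ingestion figures reachable.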

Serve
Dashboards, APIs, AI Models

Grafana dashboards query TSDB directly for real-time and recent historical data. REST/gRPC APIs serve data to MES, CMMS, and custom applications. AI/ML training pipelines read from the data lake (cold tier) via Apache Arrow or Parquet files — batch reads of months/years of historical data for model training. Anomaly detection models query the hot tier for real-time inference.

Time-Series Database Comparison

| Feature | InfluxDB 3.0 | TimescaleDB | ClickHouse |
| --- | --- | --- | --- |
| Architecture | Purpose-built TSDB; columnar (Apache Parquet) | PostgreSQL extension with hypertables | Columnar OLAP engine |
| Query Language | SQL (native in v3), InfluxQL, Flux | Full PostgreSQL SQL | ClickHouse SQL dialect |
| Write Speed | Millions of points/sec; best raw ingestion | Hundreds of thousands/sec; excellent with batching | Millions of rows/sec; batch-optimized |
| Query Speed | Fast for simple time-range queries | Best for complex joins + aggregations (3.5-71x faster) | Fastest for analytical/OLAP queries |
| High Cardinality | Improved in v3 but historically weak | Excellent — handles millions of unique tags | Excellent — columnar design scales naturally |
| Compression | 10-20x (Gorilla + delta + Parquet) | 10-15x (PostgreSQL TOAST + TimescaleDB compression) | 10-40x (LZ4/ZSTD columnar compression) |
| Ecosystem | Telegraf, Grafana, Kapacitor | Full PostgreSQL ecosystem (PostGIS, pg_partman, BI tools) | Grafana, Metabase, dbt, Kafka Connect |
| Best For | Pure IoT/sensor ingestion; monitoring pipelines | IoT + relational analytics; SQL-native teams | Large-scale analytical queries; data warehousing |
| Deployment | Cloud, on-prem, edge (open source core) | Cloud (Tiger), on-prem, self-managed PostgreSQL | Cloud (ClickHouse Cloud), on-prem, self-managed |

Storage Tiering: Hot / Warm / Cold

Hot
0-30 Days: Full Resolution, Instant Queries

Every data point at original sample rate (1-second, 100ms, or whatever the sensor produces). Stored on NVMe SSD for sub-100ms query response. Pre-computed continuous aggregates for dashboard performance. Storage: ~50 GB per 1,000 sensors per month at 1-second intervals (after compression). This is your real-time operations tier — dashboards, alerts, anomaly detection, and shift reports all query here.

Warm
30-365 Days: Downsampled, Fast Trending

1-minute averages (min/max/avg/count per minute) on SSD or fast HDD. Full-resolution data downsampled automatically by continuous aggregate jobs. Reduces storage by 60x (from 1-second to 1-minute resolution). Supports trending analysis, monthly reporting, and shift-by-shift comparisons. Query response: <1 second for typical time ranges. Automated migration from hot to warm via retention policies — zero manual intervention.

Cold
1-5 Years: Hourly Aggregates, Compliance Archive

Hourly or daily aggregates on object storage (S3, Azure Blob, MinIO). Parquet files in a data lake for long-term retention at minimal cost. Supports annual reports, regulatory compliance, equipment lifecycle analysis, and AI model training on historical data. Query response: seconds to minutes (acceptable for historical analysis). Storage cost: $0.02-$0.05/GB/month on object storage vs $0.10-$0.30/GB/month on SSD.

Archive
5+ Years: Daily Summaries, Permanent Record

Daily aggregates (min/max/avg per day per sensor) in compressed Parquet on cold storage or Glacier. Permanent retention for equipment lifecycle records, warranty claims, and regulatory audit trails. Storage cost: $0.004/GB/month (S3 Glacier). A 10-year archive for 5,000 sensors occupies less than 50 GB — effectively free to store.

Query Optimization for Dashboards

Pre-Computed Aggregates

Dashboards showing "last 8 hours at 1-minute resolution" should never query raw 1-second data and aggregate on the fly. Continuous aggregates (TimescaleDB) or retention policy downsampling (InfluxDB) pre-compute 1-minute min/max/avg during ingestion. The dashboard reads pre-computed results — query time drops from 5 seconds to 50 milliseconds. Design rule: every dashboard panel should query a pre-computed aggregate, never raw data.

Materialized Views

Frequently accessed cross-machine comparisons (OEE by line, downtime Pareto, quality trends) are materialized as views that refresh every 5-60 seconds. The view computation runs once; all concurrent dashboard users read the cached result. Without materialization, 20 concurrent dashboard users execute 20 identical queries — each scanning the same data. With materialization: one query, 20 cache hits.
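The "one query, 20 cache hits" effect can be sketched as a refresh-on-expiry cache (the mechanism behind periodically refreshed materialized views, reduced to its essence; function names are my own):

```python
import time

def make_cached_query(compute, ttl_seconds: float = 30.0):
    """Cache a query result for `ttl_seconds`. The first caller after
    expiry recomputes; everyone else reads the cached value."""
    cache = {"value": None, "expires": 0.0, "computes": 0}
    def query():
        now = time.monotonic()
        if now >= cache["expires"]:
            cache["value"] = compute()
            cache["expires"] = now + ttl_seconds
            cache["computes"] += 1
        return cache["value"], cache["computes"]
    return query

q = make_cached_query(lambda: sum(range(1000)))
results = [q() for _ in range(20)]        # 20 "dashboard users"
assert all(value == 499500 for value, _ in results)
assert results[-1][1] == 1                # but only one computation ran
```

A real materialized view refreshes inside the database on a schedule, which has the same effect: compute once, serve many.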

Partitioning by Machine

Time-series databases partition by time automatically. Adding a secondary partition by machine_id (TimescaleDB space partitioning, InfluxDB tag-based sharding) ensures that queries filtered by machine scan only relevant partitions. A query for "Machine 5 vibration last hour" skips all data from Machines 1-4 and 6-1000 — reducing I/O by 99% in a 1,000-machine factory.
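Partition pruning is easy to picture as list filtering over chunk metadata. A toy sketch (TimescaleDB calls these chunks; the tuple layout here is my own): each partition covers one machine for one day, and a query touches only the chunks whose machine and time range overlap it:

```python
# (machine_id, t_start, t_end) chunks: 1,000 machines x 7 days
partitions = [(m, d * 86_400, (d + 1) * 86_400)
              for m in range(1, 1_001) for d in range(7)]

def prune(parts, machine_id, q_start, q_end):
    """Return only chunks that can contain rows for this machine and
    time range; every other chunk is skipped without any I/O."""
    return [p for p in parts
            if p[0] == machine_id and p[1] < q_end and p[2] > q_start]

# "Machine 5, last hour of day 6" touches 1 of 7,000 chunks
hit = prune(partitions, 5, 6 * 86_400 + 82_800, 7 * 86_400)
assert len(hit) == 1 and len(partitions) == 7_000
```

The database does this with chunk metadata before touching disk, which is where the 99% I/O reduction comes from.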

Edge Pre-Processing

Not every data point needs to reach the central database. Edge gateways compute local statistics (RMS, peak, average) and publish aggregated results — reducing data volume 10-100x before it hits the pipeline. Example: a vibration sensor sampling at 25 kHz generates 25,000 samples/second. The edge gateway computes RMS, peak frequency, and kurtosis every second and publishes 3 values instead of 25,000. Central TSDB stores the metrics; raw waveforms stored locally at the edge and pulled on demand.
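The three metrics named above can be computed from a one-second waveform buffer in plain Python (a sketch of the edge computation, not a vendor gateway's firmware):

```python
import math

def edge_features(samples):
    """Collapse one second of raw waveform into summary metrics:
    RMS (energy), peak (largest excursion), kurtosis (impulsiveness,
    an early bearing-fault indicator)."""
    n = len(samples)
    mean = sum(samples) / n
    rms = math.sqrt(sum(x * x for x in samples) / n)
    peak = max(abs(x) for x in samples)
    var = sum((x - mean) ** 2 for x in samples) / n
    kurt = (sum((x - mean) ** 4 for x in samples) / n) / (var ** 2) if var else 0.0
    return {"rms": rms, "peak": peak, "kurtosis": kurt}

# 25,000 samples of a 100 Hz sine -> 3 published values instead of 25,000
wave = [math.sin(2 * math.pi * 100 * i / 25_000) for i in range(25_000)]
feats = edge_features(wave)
assert abs(feats["rms"] - 1 / math.sqrt(2)) < 1e-6   # sine RMS = amplitude/sqrt(2)
```

Production gateways typically add peak frequency via FFT, but the volume reduction is the point: 25,000 samples in, a handful of floats out, every second.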

AI Training Data Lake

Parquet Export Pipeline

Historical sensor data exported from the time-series database to Parquet files in the data lake (S3, Azure Data Lake, MinIO) on a daily or weekly schedule. Parquet's columnar format is optimal for ML training — read only the columns you need (timestamp, temperature, vibration_rms) without loading the rest. Apache Arrow provides zero-copy reads for Python/PyTorch/TensorFlow. A year of 5,000-sensor data at 1-minute resolution: approximately 2.6 billion rows, compressed to ~100 GB in Parquet.
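Daily exports are usually laid out as Hive-style `key=value` partition folders, so query engines (Spark, DuckDB, Athena) can prune whole days without opening files. A sketch of that path convention (the bucket, dataset, and file names are hypothetical):

```python
from datetime import date

def lake_partition_path(bucket: str, dataset: str, d: date) -> str:
    """Hive-style partition path for a daily Parquet export.
    Zero-padded month/day keeps lexical and chronological order aligned."""
    return (f"s3://{bucket}/{dataset}/"
            f"year={d.year}/month={d.month:02d}/day={d.day:02d}/"
            f"part-0000.parquet")

path = lake_partition_path("factory-lake", "sensor_minute", date(2026, 4, 1))
assert path == ("s3://factory-lake/sensor_minute/"
                "year=2026/month=04/day=01/part-0000.parquet")
```

The export job itself is typically a pyarrow or Spark write with `partition_cols=["year", "month", "day"]`, producing exactly this layout.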

Feature Store Integration

Pre-computed features (rolling averages, lag values, FFT components, statistical moments) stored alongside raw data in the feature store. ML training jobs read features directly — no recomputation per training run. Feature definitions version-controlled and reproducible. In greenfield: feature computation pipelines designed during data architecture phase, running on the stream processor (Flink/Kafka Streams) in real-time and batch-exported to the feature store daily.

Labeled Data Management

Maintenance events (from CMMS), quality defects (from QMS), and downtime records (from MES) joined with sensor data by timestamp to create labeled training datasets. The join happens in the data lake — not in the operational TSDB. Label quality directly determines AI model quality: every CMMS work order must include precise start/end timestamps and failure mode classification for the labels to be useful for supervised learning.
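The timestamp join can be sketched as a window lookup (simplified to a linear scan; real data lake jobs use interval or as-of joins): a sensor point falling inside an event's [start, end] window inherits that event's failure mode, everything else is labeled normal.

```python
def label_windows(points, events):
    """Label (timestamp, value) sensor points with the failure mode of
    any event window ([start, end, mode]) containing them, else 'normal'."""
    labeled = []
    for ts, value in points:
        label = "normal"
        for start, end, failure_mode in events:
            if start <= ts <= end:
                label = failure_mode
                break
        labeled.append((ts, value, label))
    return labeled

points = [(t, 0.5 * t) for t in range(10)]
events = [(3, 5, "bearing_wear")]          # a CMMS work order window
out = label_windows(points, events)
assert [lbl for _, _, lbl in out] == ["normal"] * 3 + ["bearing_wear"] * 3 + ["normal"] * 4
```

This is also where imprecise work-order timestamps do their damage: a window that is off by an hour mislabels an hour of training data.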

Model Training Infrastructure

GPU-accelerated training servers (on-premise NVIDIA A100/H100 or cloud instances) read from the data lake via high-throughput connections. Data pipeline provides train/validation/test splits with temporal awareness — no data leakage from future to past. Model versioning (MLflow, Weights & Biases) tracks which dataset version trained which model. Inference models deployed back to the edge, reading from the hot TSDB tier for real-time prediction.
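The temporally aware split mentioned above amounts to sorting by timestamp and cutting, rather than shuffling. A minimal sketch (integer percentages avoid float rounding at the cut points):

```python
def temporal_split(rows, train_pct: int = 70, val_pct: int = 15):
    """Chronological train/validation/test split: sort by timestamp and
    cut, so no future sample leaks into training (unlike random shuffles)."""
    rows = sorted(rows, key=lambda r: r[0])   # r[0] is the timestamp
    n = len(rows)
    i = n * train_pct // 100
    j = n * (train_pct + val_pct) // 100
    return rows[:i], rows[i:j], rows[j:]

rows = [(t, f"sample-{t}") for t in range(100)]
train, val, test = temporal_split(rows)
assert (len(train), len(val), len(test)) == (70, 15, 15)
assert max(t for t, _ in train) < min(t for t, _ in val) < min(t for t, _ in test)
```

For time-series models the ordering guarantee in the final assertion is the whole point: a randomly shuffled split would let the model "see the future" and report inflated validation accuracy.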

Key Benefits & ROI

<100ms — Query response on billions of points: sub-second dashboards at any scale
99.99% — Ingestion reliability: Kafka + TSDB with exactly-once delivery
10-20x — Storage compression: columnar + time-series encoding vs relational
4 Tiers — Automatic lifecycle: hot → warm → cold → archive, zero manual effort
AI-Ready — Parquet data lake: ML training on years of production data

432 Million Points Per Day. Sub-100ms Queries. Zero Data Loss.

iFactory designs sensor data pipelines and time-series database architectures for greenfield factories — MQTT ingestion, stream processing, TSDB deployment, storage tiering, query optimization, and AI data lakes — scaled to your sensor count and operational from commissioning day.

Frequently Asked Questions

Which time-series database for a factory?
It depends on your query patterns and team skills. InfluxDB 3.0 is best for pure sensor ingestion with maximum write throughput — its FDAP stack (Flight, DataFusion, Arrow, Parquet) handles millions of points per second with efficient columnar compression. Choose InfluxDB if your primary workload is ingesting IoT telemetry and serving time-range dashboards. TimescaleDB is best when you need SQL compatibility and relational joins — correlating sensor data with production orders, maintenance records, or quality data in the same query. It's a PostgreSQL extension, so your team's SQL skills apply directly and you get the entire PostgreSQL ecosystem (PostGIS, BI tools, ETL connectors). Choose TimescaleDB if your analytics team is SQL-native. ClickHouse is best for complex analytical queries at massive scale — think ad-hoc analysis across billions of rows with sub-second response. Choose ClickHouse if your primary workload is batch analytics and reporting rather than real-time dashboards. Many factories use two: InfluxDB or TimescaleDB for operational dashboards, ClickHouse for deep analytics.
How much storage for 10,000 sensors?
At 1-second sampling with 10,000 sensors: 864 million data points per day. After time-series compression (10-20x), raw storage at full resolution: approximately 3-5 GB per day, or 100-150 GB per month. With 4-tier storage: Hot (30 days full resolution): 100-150 GB on NVMe SSD. Warm (11 months at 1-minute resolution): 50-80 GB on SSD. Cold (years at hourly resolution): 5-10 GB per year on object storage. Total first-year storage: approximately 200-300 GB — surprisingly modest thanks to time-series compression and downsampling. The real storage driver is vibration waveforms: 50 vibration sensors at 25 kHz generate 100x more data than 10,000 process sensors at 1 Hz. Design the vibration data pipeline separately with edge FFT processing to reduce the raw waveform volume by 1,000x before central storage.
Kafka vs MQTT for the data pipeline?
MQTT and Kafka serve different roles in the pipeline — they're not alternatives. MQTT is the transport protocol between sensors/edge gateways and the factory's message infrastructure. It's lightweight (2-byte header), supports QoS levels, and is designed for constrained devices. Kafka is the stream processing backbone — a durable, replayable, horizontally scalable message queue that sits between MQTT and the TSDB. The architecture: sensors publish to MQTT broker → Kafka Connect (or a bridge) consumes from MQTT and writes to Kafka topics → stream processors (Kafka Streams, Flink) consume from Kafka, validate/transform/aggregate → write to TSDB. For smaller factories (<5,000 sensors), MQTT broker with direct TSDB write (via Telegraf) is simpler and sufficient. Kafka adds value at scale (>50,000 messages/sec) or when you need replayability (reprocessing historical data when a bug is found in the transformation logic).
How long should we retain sensor data?
Retention depends on the use case, not a single policy. Design four retention tiers: Full resolution (1-second): 7-30 days for real-time dashboards, anomaly detection, and shift-level troubleshooting. 1-minute aggregates: 90-365 days for trending, monthly reports, and short-term pattern analysis. Hourly aggregates: 2-5 years for annual reporting, equipment lifecycle tracking, and regulatory compliance. Daily summaries: forever (or 10+ years) for long-term asset records and historical benchmarking. The key principle: never delete data — downsample it. A 1-second reading downsampled to 1-minute retains the min, max, average, and count — preserving the statistical characteristics while reducing storage 60x. Most regulatory requirements (OSHA, FDA, EPA) specify retention periods for process records — design the cold tier to match your compliance requirements. Schedule a demo to see our retention policy calculator for your specific sensor count and compliance requirements.
Cloud vs on-premise for factory TSDB?
Hybrid is the answer for most US manufacturers. On-premise edge TSDB for real-time operations: hot tier data (last 7-30 days) stored on-site for sub-100ms dashboard response and zero dependency on internet connectivity. If your WAN link goes down, your dashboards and alerts keep working. Cloud TSDB for long-term analytics and AI: warm/cold/archive tiers in InfluxDB Cloud, Timescale Cloud, or ClickHouse Cloud for historical analysis, cross-plant comparison, and ML training. Cloud provides elastic compute for burst analytics workloads and eliminates on-premise storage management. Data flows one way: edge → cloud via MQTT or Kafka with TLS encryption. No cloud → edge data flow for operational control. This architecture meets IEC 62443 OT security requirements and NIST cybersecurity framework guidance for US manufacturers. For defense/ITAR facilities: fully on-premise with air-gapped cold storage — no cloud connectivity.

Your Sensors Are Talking at 432 Million Points Per Day. Is Anyone Listening?

Purpose-built time-series databases with optimized pipelines handle the volume, velocity, and variety of factory sensor data — at 10-20x less storage cost and 100x faster queries than relational databases.

