
Key takeaways
Short answer: A data lake stores raw data in any format with schema applied at read time. A data warehouse stores curated structured data with schema applied at write time. Manufacturing typically needs both: a time-series store for operational OEE data, a warehouse for reporting, and a lake for ML training and exploratory analysis. Trying to do everything in one or the other produces either slow reports or expensive storage. See also Manufacturing Data Quality Audit.
A data lake is bulk storage for raw data: sensor streams, logs, images, video, structured tables, JSON documents. Schema is applied at read time. Examples include Amazon S3 and Azure Data Lake Storage (ADLS).
Lakes are cheap per TB and flexible. They are not optimized for SQL queries on structured data.
A data warehouse stores structured data with schema applied at write time. The data is curated, modeled, and indexed for query performance. Examples include Snowflake, BigQuery, and Redshift.
Warehouses are optimized for analytic SQL. They are more expensive per TB and require schema discipline.
| Property | Lake | Warehouse |
|---|---|---|
| Schema | On read | On write |
| Data types | Any | Structured |
| Cost per TB | Low | Higher |
| Query speed (structured) | Slow without help | Fast |
| Best for | ML, exploration | BI, reporting |
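The schema row of the table can be made concrete with a small sketch. It uses only the Python standard library; the sample records, the field names (`machine`, `ts`, `temp_c`), and the use of an in-memory SQLite table as a stand-in for a warehouse are all assumptions made for illustration, not part of any particular product.

```python
import json
import sqlite3

# Hypothetical raw events as they might land in a lake: one JSON document per
# line, with no guarantee that every record has the same fields or types.
RAW_LINES = [
    '{"machine": "press-1", "ts": "2024-01-01T06:00:00", "temp_c": 71.5}',
    '{"machine": "press-1", "ts": "2024-01-01T06:00:01", "vibration": 0.02}',
    '{"machine": "press-2", "ts": "2024-01-01T06:00:01", "temp_c": "n/a"}',
]


def read_temperatures(lines):
    """Schema on read: storage accepted everything; the reader decides what fits."""
    rows = []
    for line in lines:
        doc = json.loads(line)
        temp = doc.get("temp_c")
        if isinstance(temp, (int, float)):
            rows.append((doc["machine"], doc["ts"], float(temp)))
    return rows


def load_warehouse(lines):
    """Schema on write: the table definition rejects rows that do not fit."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE temperature ("
        " machine TEXT NOT NULL,"
        " ts TEXT NOT NULL,"
        " temp_c REAL NOT NULL CHECK (typeof(temp_c) = 'real')"
        ")"
    )
    rejected = 0
    for line in lines:
        doc = json.loads(line)
        try:
            conn.execute(
                "INSERT INTO temperature VALUES (?, ?, ?)",
                (doc.get("machine"), doc.get("ts"), doc.get("temp_c")),
            )
        except sqlite3.IntegrityError:
            # A missing or non-numeric temp_c violates NOT NULL or the CHECK.
            rejected += 1
    return conn, rejected
```

In the lake-style path, the malformed records stay in storage and are skipped only by this particular reader. In the warehouse-style path, they never enter the table, which is what keeps later SQL queries simple and fast.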
Most manufacturing operations need three tiers:
1. Operational tier (time-series database). PLC tags, sensor data, real-time OEE computation. Sub-second latency. InfluxDB, TimescaleDB, AVEVA PI.
2. Reporting tier (data warehouse). Aggregated OEE, MTBF, MTTR by line, by SKU, by shift. BI dashboards. Snowflake, BigQuery, Redshift.
3. Analytics tier (data lake). Raw sensor streams, images, video, contextual data. ML training, exploratory analysis. S3, ADLS.
Data flows from the operational tier to the reporting tier as aggregates, and from the operational tier to the lake as raw data kept for later use.
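The two flows can be sketched as follows. This is a minimal illustration, assuming one-minute samples with hypothetical field names (`running`, `total_parts`, `good_parts`) and the common definition of OEE as availability times performance times quality; a real pipeline would read from the time-series store and write to actual warehouse and object-storage clients.

```python
import json
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical one-minute samples from the operational tier.
SAMPLES = [
    {"line": "A", "shift": "1", "minute": 0, "running": True, "total_parts": 5, "good_parts": 4},
    {"line": "A", "shift": "1", "minute": 1, "running": True, "total_parts": 5, "good_parts": 4},
    {"line": "A", "shift": "1", "minute": 2, "running": False, "total_parts": 0, "good_parts": 0},
    {"line": "A", "shift": "1", "minute": 3, "running": True, "total_parts": 5, "good_parts": 4},
]


@dataclass
class LineShiftSummary:
    line: str
    shift: str
    availability: float
    performance: float
    quality: float

    @property
    def oee(self) -> float:
        return self.availability * self.performance * self.quality


def summarize(samples, ideal_cycle_time_s):
    """Roll one-minute samples up to one row per (line, shift) for the warehouse."""
    totals = defaultdict(lambda: {"planned": 0, "running": 0, "total": 0, "good": 0})
    for s in samples:
        t = totals[(s["line"], s["shift"])]
        t["planned"] += 1  # every sample counts as one planned minute
        t["running"] += 1 if s["running"] else 0
        t["total"] += s["total_parts"]
        t["good"] += s["good_parts"]

    rows = []
    for (line, shift), t in sorted(totals.items()):
        run_s = t["running"] * 60
        availability = t["running"] / t["planned"]
        performance = (ideal_cycle_time_s * t["total"]) / run_s if run_s else 0.0
        quality = t["good"] / t["total"] if t["total"] else 0.0
        rows.append(LineShiftSummary(line, shift, availability, performance, quality))
    return rows


def to_lake_records(samples):
    """Serialize every raw sample unchanged, one JSON document per line."""
    return [json.dumps(s, sort_keys=True) for s in samples]
```

Calling `summarize(SAMPLES, ideal_cycle_time_s=10)` yields a single warehouse row for line A, while `to_lake_records(SAMPLES)` keeps all four samples. The aggregate row cannot be turned back into the per-minute detail, which is why the raw copy goes to the lake separately.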
1. Single tier for everything. Time-series databases struggle as warehouses; warehouses struggle as time-series stores; lakes struggle as both.
2. Lake without governance. The lake becomes a data swamp: nobody knows what is there or how to use it.
3. Warehouse without raw data archive. Once data is aggregated, the raw context is lost, and future ML training cannot reconstruct it.
4. Lake-first strategy. Dumping everything in a lake without an operational tier means OEE reporting cannot work in real time.
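For the governance item in the list above, one lightweight safeguard is a catalog that records what each area of the lake contains and who owns it. The sketch below is an assumption about how such a check might look; the `CatalogEntry` fields, the prefixes, and the object keys are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    prefix: str       # object-store prefix the entry describes
    owner: str        # accountable person or team
    description: str  # what the data is and how it was produced
    fmt: str          # file format, e.g. "parquet", "jsonl", "jpeg"


def uncataloged(object_keys, catalog):
    """Return the keys that no catalog entry accounts for."""
    prefixes = tuple(entry.prefix for entry in catalog)
    return sorted(key for key in object_keys if not key.startswith(prefixes))


# Hypothetical catalog and lake listing.
CATALOG = [
    CatalogEntry(
        prefix="raw/line-a/vibration/",
        owner="reliability-team",
        description="Raw accelerometer samples from line A",
        fmt="parquet",
    ),
]
KEYS = [
    "raw/line-a/vibration/2024-01-01.parquet",
    "tmp/export_final_v2.csv",
]
```

Here `uncataloged(KEYS, CATALOG)` flags `tmp/export_final_v2.csv`, the kind of unowned file that accumulates when nothing enforces registration.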
Lakehouses (Databricks, Snowflake hybrid, Iceberg / Delta Lake on object storage) try to combine lake flexibility with warehouse performance. For mature deployments these are increasingly attractive because they leave one less tier to maintain.
For most plants, a clean three-tier setup is still easier to operate than a single lakehouse trying to do everything.
1. Skipping the analytics tier. No raw archive means no ML training data later.
2. Skipping the reporting tier. Querying the time-series store directly for BI reports is slow and expensive.
3. Tight coupling between tiers. Pipelines should be loosely coupled so each tier can evolve independently.
4. No data steward. Without ownership, both lake and warehouse decay.
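The loose-coupling item in the list above can be illustrated with a small interface. This is a sketch under stated assumptions: `Sink`, `InMemorySink`, and `fan_out` are hypothetical names, and real sinks would wrap a warehouse loader or an object-storage client rather than a Python list.

```python
from typing import Iterable, Protocol


class Sink(Protocol):
    """Anything that can accept a batch of records and report how many it wrote."""

    def write(self, records: Iterable[dict]) -> int: ...


class InMemorySink:
    """Stand-in for a real warehouse or lake writer."""

    def __init__(self):
        self.records = []

    def write(self, records):
        batch = list(records)
        self.records.extend(batch)
        return len(batch)


class FailingSink:
    """Simulates a downstream tier that is temporarily unavailable."""

    def write(self, records):
        raise ConnectionError("sink unavailable")


def fan_out(records, sinks):
    """Send the same batch to every sink; one failure does not block the others."""
    batch = list(records)
    results = {}
    for name, sink in sinks.items():
        try:
            results[name] = sink.write(batch)
        except Exception as exc:
            # Record the failure so the caller can retry this sink later.
            results[name] = exc
    return results
```

Because the producer depends only on the `write` method, either tier can be swapped or upgraded without touching the other, and an outage in one does not stop delivery to the rest.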
A modern OEE platform owns the operational tier and integrates with warehouse and lake at the seam. The platform stores time-series data for real-time OEE, exports aggregates to the warehouse, and archives raw data to the lake.
Fabrico's OEE module owns the operational tier with native time-series storage, exports aggregates to standard warehouses (Snowflake, BigQuery), and archives raw data to object storage for ML and exploratory use.
Most production plants benefit from both. Small operations may get by with just a warehouse and time-series store.
In principle, yes; in practice, the technology is still maturing. Three-tier setups are more battle-tested.
A historian is the operational tier (a time-series database). The lake and warehouse sit above it.
For aggregated data, yes. For raw sensor streams or image data, the lake is more practical.
As much as is affordable. Lakes are cheap; future use cases for old data are unpredictable.