The Story
Follow data on its engineering journey through Microsoft Fabric, from raw ingestion to business-ready gold.
The Foundation
Everything begins with OneLake, the unified data lake that underpins all of Microsoft Fabric. Built on ADLS Gen2, it is where every piece of data automatically lands. Shortcuts extend its reach across clouds without copying a single byte.
The Intake
Data flows in through multiple channels. Pipelines orchestrate bulk movements with Copy Activity and its 170+ connectors. Dataflows Gen2 provide no-code Power Query transformations. For heavy engineering, Spark notebooks run PySpark at scale.
The Mirror
Some data doesn't need pipelines at all. Mirroring creates near-real-time replicas of external databases (Azure SQL, Cosmos DB, Snowflake) as read-only Delta tables in OneLake. Change Data Capture ensures only differences flow, not full copies.
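The change-feed idea can be sketched in plain Python: only insert, update, and delete events flow, and the replica is folded forward from them. This is a conceptual sketch with a hypothetical event shape; Fabric Mirroring handles all of this internally and lands the result as Delta tables.

```python
# Conceptual CDC apply loop. The event shape (op/key/row) is hypothetical;
# Fabric Mirroring manages this internally.

def apply_changes(replica: dict, events: list) -> dict:
    """Fold a stream of change events into a keyed replica table."""
    for event in events:
        if event["op"] in ("insert", "update"):
            replica[event["key"]] = event["row"]   # upsert the changed row
        elif event["op"] == "delete":
            replica.pop(event["key"], None)        # drop the deleted row
    return replica

replica = {1: {"name": "Ada"}}
events = [
    {"op": "update", "key": 1, "row": {"name": "Ada L."}},
    {"op": "insert", "key": 2, "row": {"name": "Grace"}},
    {"op": "delete", "key": 1, "row": None},
]
result = apply_changes(replica, events)
# result == {2: {"name": "Grace"}}
```

Note that the full source table is never re-shipped; only the three events cross the wire.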
The Bronze Landing
Raw data touches down in the Bronze layer, the first stop in the medallion architecture. Data arrives as-is from source systems. Engineers add metadata columns (ingestion time, source, batch ID) but resist the urge to transform. Schema-on-read rules here.
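The Bronze pattern, metadata added but payload untouched, looks roughly like this. A plain-Python sketch with hypothetical column names; in a Fabric notebook these would be PySpark `withColumn` calls over a Delta table.

```python
import uuid
from datetime import datetime, timezone

def land_in_bronze(records: list, source: str) -> list:
    """Tag raw records with lineage metadata; do not transform values."""
    batch_id = str(uuid.uuid4())                        # groups rows loaded together
    ingested_at = datetime.now(timezone.utc).isoformat()
    return [
        {**row,                          # raw payload kept exactly as-is
         "_ingested_at": ingested_at,    # when the batch landed
         "_source": source,              # which system it came from
         "_batch_id": batch_id}
        for row in records
    ]

bronze = land_in_bronze([{"id": "1", "amt": "9.99"}], source="erp")
```

Even the obviously wrong-typed `amt` string stays a string here; fixing it is Silver's job.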
The Silver Refinery
In the Silver layer, data gets scrubbed. Nulls are handled, types are cast, duplicates are removed. Column names are standardised, dates normalised. Related tables join together. This is where data quality rules enforce consistency.
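The same scrubbing steps can be sketched in plain Python (hypothetical columns; in Fabric this would be PySpark `dropDuplicates`, `cast`, and `withColumnRenamed` over the Bronze Delta table):

```python
def refine_to_silver(bronze: list) -> list:
    """Deduplicate, handle nulls, cast types, and standardise names."""
    seen, silver = set(), []
    for row in bronze:
        if row["id"] in seen:                        # remove duplicates by business key
            continue
        seen.add(row["id"])
        silver.append({
            "order_id": int(row["id"]),              # cast to a proper integer
            "amount": float(row["amt"] or 0.0),      # null-safe cast to float
            "order_date": row["Date"].strip()[:10],  # normalise to YYYY-MM-DD
        })
    return silver

silver = refine_to_silver([
    {"id": "1", "amt": "9.99", "Date": "2024-03-01T08:00:00"},
    {"id": "1", "amt": "9.99", "Date": "2024-03-01T08:00:00"},  # duplicate row
])
```

The output column names (`order_id`, `amount`, `order_date`) are where a naming standard gets enforced once, so every downstream consumer sees the same vocabulary.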
The Gold Vault
Gold is where data becomes a product. Aggregations are pre-computed, star schemas are formed, and partitioning strategies align with downstream query patterns. Z-Order optimisation ensures filter columns are read-efficient.
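Pre-computing an aggregate at the grain queries actually ask for can be sketched like this (plain Python, hypothetical schema; in Fabric this would be a PySpark `groupBy` written to a partitioned, Z-Ordered Delta table):

```python
from collections import defaultdict

def build_gold(silver: list) -> dict:
    """Pre-compute daily revenue, keyed the way dashboards filter."""
    daily = defaultdict(float)
    for row in silver:
        daily[row["order_date"]] += row["amount"]   # roll up to one row per day
    return dict(daily)

gold = build_gold([
    {"order_date": "2024-03-01", "amount": 10.0},
    {"order_date": "2024-03-01", "amount": 5.0},
    {"order_date": "2024-03-02", "amount": 2.0},
])
# gold == {"2024-03-01": 15.0, "2024-03-02": 2.0}
```

The design choice is that the expensive scan happens once at write time, not on every dashboard refresh.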
The Spark Engine
Under the hood, Spark does the heavy lifting. V-Order optimisation on writes ensures downstream reads are fast. mssparkutils handles file operations and credential management. High Concurrency mode shares sessions across notebooks for efficiency.
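In a notebook, these engine features surface as session configuration and utility calls. A hedged fragment, not runnable outside a Fabric Spark session, and the property name is worth verifying against current Fabric documentation:

```python
# Fabric notebook session sketch; `spark` and `mssparkutils` are provided
# by the runtime. Property name as documented for Fabric Spark.
spark.conf.set("spark.sql.parquet.vorder.enabled", "true")  # V-Order on writes

mssparkutils.fs.ls("Files/raw")   # list files in the lakehouse's Files area
```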
The Serving Layer
Data reaches consumers through two paths: the Lakehouse (Spark + SQL analytics endpoint) and the Warehouse (full T-SQL). Direct Lake semantic models read Delta tables without copying, while the SQL analytics endpoint provides familiar T-SQL access.
The Monitor's Tower
Engineers watch from the monitoring tower. The Monitoring Hub tracks pipeline runs, Spark jobs, and refresh operations. Capacity metrics reveal compute usage. Apache Spark Advisor suggests performance improvements.
The Governance Gate
Before data leaves, governance ensures it's trustworthy. Fabric domains organise workspaces by business area. Sensitivity labels flow downstream automatically. OneLake RBAC controls who accesses what, while Purview catalogues everything.