The Story
Follow data on its engineering journey through Microsoft Fabric, from raw ingestion to business-ready gold.
The Foundation
Everything begins with OneLake, the unified data lake that underpins all of Microsoft Fabric. Built on ADLS Gen2, it is where every piece of data automatically lands. Shortcuts extend its reach across clouds without copying a single byte.
The Intake
Data flows in through multiple channels. Pipelines orchestrate bulk movements with Copy Activity and its 170+ connectors. Dataflows Gen2 provide no-code Power Query transformations. For heavy engineering, Spark notebooks run PySpark at scale.
The Mirror
Some data doesn't need pipelines at all. Mirroring creates near-real-time replicas of external databases (Azure SQL, Cosmos DB, Snowflake) as read-only Delta tables in OneLake. Change Data Capture ensures only differences flow, not full copies.
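The change-feed idea can be sketched in plain Python: only insert, update, and delete events flow, and the replica is folded forward from them. This is a conceptual sketch with a hypothetical event shape; Fabric Mirroring handles all of this internally and lands the result as Delta tables.

```python
# Conceptual CDC apply loop. The event shape (op/key/row) is hypothetical;
# Fabric Mirroring manages this internally.

def apply_changes(replica: dict, events: list) -> dict:
    """Fold a stream of change events into a keyed replica table."""
    for event in events:
        if event["op"] in ("insert", "update"):
            replica[event["key"]] = event["row"]   # upsert the changed row
        elif event["op"] == "delete":
            replica.pop(event["key"], None)        # drop the deleted row
    return replica

replica = {1: {"name": "Ada"}}
events = [
    {"op": "update", "key": 1, "row": {"name": "Ada L."}},
    {"op": "insert", "key": 2, "row": {"name": "Grace"}},
    {"op": "delete", "key": 1, "row": None},
]
result = apply_changes(replica, events)
# result == {2: {"name": "Grace"}}
```

Note that the full source table is never re-shipped; only the three events cross the wire.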
The Bronze Landing
Raw data touches down in the Bronze layer, the first stop in the medallion architecture. Data arrives as-is from source systems. Engineers add metadata columns (ingestion time, source, batch ID) but resist the urge to transform. Schema-on-read rules here.
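The Bronze pattern, metadata added but payload untouched, looks roughly like this. A plain-Python sketch with hypothetical column names; in a Fabric notebook these would be PySpark `withColumn` calls over a Delta table.

```python
import uuid
from datetime import datetime, timezone

def land_in_bronze(records: list, source: str) -> list:
    """Tag raw records with lineage metadata; do not transform values."""
    batch_id = str(uuid.uuid4())                        # groups rows loaded together
    ingested_at = datetime.now(timezone.utc).isoformat()
    return [
        {**row,                          # raw payload kept exactly as-is
         "_ingested_at": ingested_at,    # when the batch landed
         "_source": source,              # which system it came from
         "_batch_id": batch_id}
        for row in records
    ]

bronze = land_in_bronze([{"id": "1", "amt": "9.99"}], source="erp")
```

Even the obviously wrong-typed `amt` string stays a string here; fixing it is Silver's job.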
The Silver Refinery
In the Silver layer, data gets scrubbed. Nulls are handled, types are cast, duplicates are removed. Column names are standardised, dates normalised. Related tables join together. This is where data quality rules enforce consistency.
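The same scrubbing steps can be sketched in plain Python (hypothetical columns; in Fabric this would be PySpark `dropDuplicates`, `cast`, and `withColumnRenamed` over the Bronze Delta table):

```python
def refine_to_silver(bronze: list) -> list:
    """Deduplicate, handle nulls, cast types, and standardise names."""
    seen, silver = set(), []
    for row in bronze:
        if row["id"] in seen:                        # remove duplicates by business key
            continue
        seen.add(row["id"])
        silver.append({
            "order_id": int(row["id"]),              # cast to a proper integer
            "amount": float(row["amt"] or 0.0),      # null-safe cast to float
            "order_date": row["Date"].strip()[:10],  # normalise to YYYY-MM-DD
        })
    return silver

silver = refine_to_silver([
    {"id": "1", "amt": "9.99", "Date": "2024-03-01T08:00:00"},
    {"id": "1", "amt": "9.99", "Date": "2024-03-01T08:00:00"},  # duplicate row
])
```

The output column names (`order_id`, `amount`, `order_date`) are where a naming standard gets enforced once, so every downstream consumer sees the same vocabulary.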
The Gold Vault
Gold is where data becomes a product. Aggregations are pre-computed, star schemas are formed, and partitioning strategies align with downstream query patterns. Z-Order optimisation ensures filter columns are read-efficient.
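Pre-computing an aggregate at the grain queries actually ask for can be sketched like this (plain Python, hypothetical schema; in Fabric this would be a PySpark `groupBy` written to a partitioned, Z-Ordered Delta table):

```python
from collections import defaultdict

def build_gold(silver: list) -> dict:
    """Pre-compute daily revenue, keyed the way dashboards filter."""
    daily = defaultdict(float)
    for row in silver:
        daily[row["order_date"]] += row["amount"]   # roll up to one row per day
    return dict(daily)

gold = build_gold([
    {"order_date": "2024-03-01", "amount": 10.0},
    {"order_date": "2024-03-01", "amount": 5.0},
    {"order_date": "2024-03-02", "amount": 2.0},
])
# gold == {"2024-03-01": 15.0, "2024-03-02": 2.0}
```

The design choice is that the expensive scan happens once at write time, not on every dashboard refresh.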
The Spark Engine
Under the hood, Spark does the heavy lifting. V-Order optimisation on writes ensures downstream reads are fast. mssparkutils handles file operations and credential management. High Concurrency mode shares sessions across notebooks for efficiency.
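In a notebook, these engine features surface as session configuration and utility calls. A hedged fragment, not runnable outside a Fabric Spark session, and the property name is worth verifying against current Fabric documentation:

```python
# Fabric notebook session sketch; `spark` and `mssparkutils` are provided
# by the runtime. Property name as documented for Fabric Spark.
spark.conf.set("spark.sql.parquet.vorder.enabled", "true")  # V-Order on writes

mssparkutils.fs.ls("Files/raw")   # list files in the lakehouse's Files area
```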
The Serving Layer
Data reaches consumers through two paths: the Lakehouse (Spark + SQL analytics endpoint) and the Warehouse (full T-SQL). Direct Lake semantic models read Delta tables without copying, while the SQL analytics endpoint provides familiar T-SQL access.
The Monitor's Tower
Engineers watch from the monitoring tower. The Monitoring Hub tracks pipeline runs, Spark jobs, and refresh operations. Capacity metrics reveal compute usage. Apache Spark Advisor suggests performance improvements.
The Governance Gate
Before data leaves, governance ensures it's trustworthy. Fabric domains organise workspaces by business area. Sensitivity labels flow downstream automatically. OneLake RBAC controls who accesses what, while Purview catalogues everything.