by Shabbir Khan | Chief Architect | linkedin.com/in/shabbiramkhan/

The Data Engineering Transformation of Pharmacy Data

How FranHill Pharma leveraged Microsoft Fabric to solve DSCSA compliance, stock visibility, and real-time inventory analytics.

01 Executive Summary

The pharmaceutical industry is navigating a period of unprecedented digital change. FranHill Pharma, a mid-sized retail and wholesale pharmacy operator with 150+ locations, faced "data paralysis" caused by fragmented silos and delayed reporting. This case study details its shift from rigid, on-premises ETL processes to a modern, lake-centric environment built on Microsoft Fabric and OneLake.

02 Regulatory Context (DSCSA)

The Drug Supply Chain Security Act (DSCSA) fundamentally transformed pharmaceutical logistics, requiring federal compliance to track and authenticate products end-to-end. FranHill Pharma faces multi-layered regulations including state-level mandates, GxP compliance, HIPAA, and international standards.

Core DSCSA Requirements

  • Electronic Interoperability: Standardized EPCIS messaging across the supply chain. (Ongoing)
  • Package-Level Tracing: Individual unit tracking vs. batch/shipment. (Jan 2023)
  • Serialization: Unique serial numbers & authenticity validation. (Phased, Nov 2023+)
  • Transaction History: Every party-to-party transfer with timestamps & lots. (Jan 2023)
  • Product Traceability: Query-able data model for end-to-end lineage. (Jan 2023)
  • Data Retention: 6-year immutable storage of all transactions. (Ongoing)
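The requirements above imply a minimal transaction-record shape: serialized unit, lot, parties, and timestamp, retained immutably. A sketch in Python (field names are illustrative, not the formal EPCIS schema):

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)  # immutable, echoing the 6-year retention requirement
class DscsaTransaction:
    serial_number: str    # unit-level serialization
    ndc_code: str         # product identifier
    lot_number: str       # lot-level history
    sender: str           # party-to-party transfer
    receiver: str
    timestamp: datetime

    def trace_key(self) -> str:
        """Key used to chain transfers into an end-to-end lineage."""
        return f"{self.ndc_code}|{self.serial_number}"

# Chaining two transfers of the same unit reconstructs its trace.
t1 = DscsaTransaction("SN001", "0002-1433-80", "LOT42", "Manufacturer", "Wholesaler",
                      datetime(2023, 1, 5, tzinfo=timezone.utc))
t2 = DscsaTransaction("SN001", "0002-1433-80", "LOT42", "Wholesaler", "FranHill-Store-17",
                      datetime(2023, 1, 9, tzinfo=timezone.utc))
trace = sorted([t2, t1], key=lambda t: t.timestamp)
```

Ordering by timestamp and checking that each receiver matches the next sender is the essence of the traceability requirement.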

FDA Enforcement Actions

  • Warning letters for incomplete transaction data
  • Product seizures for failed serialization
  • Mandatory recalls on counterfeiting suspicion

Financial & Operational Impact

  • $10K+ per violation; $100K-$10M+ for recalls
  • Loss of distribution licenses & DEA registration
  • Market share loss & reputation damage

Extended Regulatory Scope

State-Level Rules

CA, FL, NY stricter than DSCSA minimums.

GxP Compliance

GDP/GMP audit trails & validated systems.

HIPAA & Privacy

Encrypt & segregate patient prescription data.

International

EU GDP & Canada CDSA alignment needed.

03 Functional Deficiencies

  • Inventory Blind Spots: Stockouts despite excess stock in neighboring regions.
  • Delayed Reporting: 3-5 day lag in inventory reports.
  • Data Quality Issues: Inconsistent formats and duplicates across systems.
  • Manual Processes: High error rates from manual data entry.
  • Compliance Gaps: Difficulty tracking DSCSA serialization.
  • Lack of Analytics: No real-time demand forecasting.
  • System Silos: Fragmented legacy systems unable to communicate.
  • Scalability Issues: On-prem infrastructure maxed out capacity.
  • Poor Visibility: No end-to-end product tracking capability.
  • Inadequate Security: Limited audit trails and access controls.
  • Cost Inefficiency: Redundant storage and maintenance costs.
  • Limited Real-time Insights: Batch-based reporting hampers quick decisions.

04 Technical Constraints

  • Rigid ETL Pipelines: SQL Agent jobs fail on unit-level scanning volume.
  • High TCO: Redundant storage across silos; excess costs.
  • Legacy On-Prem Infrastructure: Fixed capacity; cannot scale elastically.
  • Network Bottlenecks: Data movement between silos limited by bandwidth.
  • Incompatible Data Formats: Disparate schemas across pharmacy systems.
  • Limited Security Controls: Inadequate audit trails and encryption.
  • Slow Batch Processing: Nightly jobs delay decision-making.
  • Schema Rigidity: Hard to adapt to new data requirements.
  • Data Synchronization Issues: Stale data across federated systems.
  • Resource Constraints: Limited skilled DBAs and data engineers.
  • Poor Disaster Recovery: Unreliable backups and slow RTO/RPO.

05 OneLake Architecture

By centralizing data into a single logical lake, FranHill Pharma eliminated the "copy-storm" anti-pattern. Every engine (Spark, SQL, Power BI) accesses the same physical files with ACID compliance and unified governance.

  • Zero Copy: Single source of truth; no data duplication.
  • ACID Delta: Transactions with ACID guarantees on the data lake.
  • Shortcuts: Virtual references to external cloud data.
  • Multi-Cloud: Azure, hybrid-cloud, and AWS integration.

Core Components & Benefits

  • Unified Governance: Single credential store, centralized security, consistent metadata.
  • Open Standards (Parquet): Format-agnostic; compatible with Apache Spark & all analytics tools.
  • Unlimited Scale: Petabyte-scale storage with elastic compute.
  • Native Spark Integration: PySpark & R for ML pipelines without ETL overhead.
  • Direct Lake Mode (Power BI): Real-time analytics on lake data without imports.
  • Cost Optimization: Pay-per-use; no over-provisioning of infrastructure.
  • Time Travel & Versioning: Query historical snapshots; restore from point-in-time.
  • Data Lineage Tracking: Built-in audit trails for regulatory compliance.
  • Partition Pruning: Automatic query optimization on large datasets.

FranHill Pharma Impact

Eliminated 3 redundant data warehouses, reduced ETL latency from 5 days to <3 hours, and cut storage costs by 40% while enabling real-time compliance auditing.

06 Medallion Implementation

The medallion architecture progressively refines raw pharmacy data through three layers, each solving specific deficiencies: Bronze ingests messy source data, Silver enforces quality & compliance, and Gold enables analytics and real-time decision-making.

Bronze Layer

Purpose: Raw system-of-record ingestion

Problems Addressed:

  • Fragmented source systems (SAP, POS, RMS)
  • Inconsistent data formats & schemas
  • Untracked data lineage
  • Lost historical snapshots

Sources: SAP ERP, pharmacy POS, RMS

Frequency: Real-time (Connector)

Tool: Data Factory + Notebooks

Format: Parquet Delta Lake

Silver Layer

Purpose: Data quality & compliance enforcement

Problems Addressed:

  • Duplicate & conflicting records
  • HIPAA/PII exposure
  • Data quality inconsistencies
  • Missing audit trails

Transformations: Dedup, masking, validation

Tool: Apache Spark (PySpark)

Frequency: Hourly SLA

Governance: Row-level security (RLS)

Gold Layer

Purpose: Business analytics & real-time insights

Problems Addressed:

  • Slow batch reporting (5 days → <3 hrs)
  • Limited analytics capability
  • Inventory visibility gaps
  • No real-time compliance auditing

Schemas: Star/snowflake optimized

Tool: Direct Lake (PowerBI)

Frequency: Real-time streaming

Use Cases: Demand forecast, compliance dashboard
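The Gold layer's demand-forecast use case can start as simply as a trailing moving average over daily sales. A minimal sketch (window size and sample data are illustrative, not FranHill Pharma's actual model):

```python
def moving_average_forecast(daily_sales, window=7):
    """Forecast next-day demand as the mean of the last `window` days of sales."""
    if len(daily_sales) < window:
        window = len(daily_sales)
    return sum(daily_sales[-window:]) / window

# Last 7 days of unit sales for one NDC at one store.
sales = [12, 15, 11, 14, 13, 16, 12]
forecast = moving_average_forecast(sales)
```

In production this would feed the replenishment logic that targets the stockout-reduction KPI; more sophisticated models (seasonality, promotions) can replace the function without changing the Gold schema.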

Quality Metrics by Layer

Bronze: 100% source fidelity, 0 transformations
Silver: 99.8% duplication removal, 100% HIPAA compliance
Gold: <100ms query latency, 24/7 availability (SLA 99.95%)

07 Pipeline Workflow

End-to-End Pipeline Overview

The FranHill Pharma data pipeline transforms raw, multi-sourced pharmacy transactions into enterprise-ready analytics in real-time. It ingests 500K+ daily events from 150+ locations, cleanses PII, enforces DSCSA compliance, and delivers <100ms query latency to PowerBI dashboards via Direct Lake mode.

Pipeline SLA: 99.95% uptime | <3 hours end-to-end latency | 500K events/day throughput

Step 1: Source Ingestion (Data Factory)

WHAT

Extract raw transactions from SAP, POS systems, and RMS servers. Capture full history with metadata (source system, timestamp, record version).

WHY

Eliminates data silos by consolidating fragmented sources. Provides immutable audit trail for DSCSA compliance. Enables point-in-time recovery for regulatory audits.

HOW

Copy Activity: Connect SAP (OData), POS (ODBC), RMS (REST API) → Bronze landing zone. Frequency: Real-time via change data capture (CDC). Format: Parquet Delta Lake with transaction version control.

Sources: SAP S/4HANA (GL, MM modules) | NCR POS | RF-One RMS

Volume: 500K events/day | ~50GB/day raw data | 30-60 second ingestion lag

Error Handling: Failed ingestion triggers Slack alert; manual retry within 15 min SLA. Dead-letter queue captures malformed records for investigation.

Destination: `abfss://bronze/Inventory`, `abfss://bronze/Sales`, `abfss://bronze/Customers` (OneLake)
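The dead-letter behavior described above is a validation gate: well-formed events land in Bronze, malformed ones are quarantined for review. A toy sketch (required field names are hypothetical):

```python
def route_event(event, bronze, dead_letter):
    """Append well-formed events to the Bronze landing list; quarantine the rest."""
    required = {"TransactionID", "Store_ID", "IngestTimestamp"}
    if required.issubset(event):
        bronze.append(event)
    else:
        dead_letter.append(event)  # held for manual review within the 15-min SLA

bronze, dlq = [], []
route_event({"TransactionID": "T1", "Store_ID": 7,
             "IngestTimestamp": "2024-01-01T00:00:00Z"}, bronze, dlq)
route_event({"TransactionID": "T2"}, bronze, dlq)  # malformed: missing fields
```

The same split keeps the Bronze layer's "100% source fidelity" promise honest: nothing is silently dropped, and nothing malformed pollutes downstream layers.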

Step 2: Data Cleansing & Transformation (Apache Spark)

WHAT

Execute business logic: deduplication, PII masking (HIPAA), null handling, schema validation, data type conversion, and domain-specific enrichment (e.g., NDC normalization).

WHY

Raw data contains duplicates, PII in transaction records, and inconsistent formats. Silver layer creates a "trust layer" that removes 99.8% of data quality issues, enforces HIPAA masking, and ensures all downstream analytics are compliance-ready.

HOW

Language: PySpark (Notebook-based). Operations: Window functions for dedup, SHA-256 hashing for PII, filter for nulls. Partition: By date (YYYY-MM-DD). SLA: 1-hour max latency from ingestion.

Logic Sample: dropDuplicates(["TransactionID"]) → drop invalid quantities → hash PatientID → validate NDC codes → enrich with drug master data

Performance: 500K rows processed in ~12 minutes (Spark pool: 8 cores, 32GB memory). Uses file pruning to process only new/modified Bronze records.

Key Transformations: SHA-256(PatientID) | Null filtering | Quantity validation (>0) | Date standardization | Currency conversion

Destination: `Tables/Silver_Inventory`, `Tables/Silver_Sales` (Delta format with ACID guarantees)
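The SHA-256 masking above can be reproduced outside Spark with the standard library; salting the patient ID with the store ID mirrors the `concat_ws("|", ...)` pattern so the hash is deterministic per store but not reversible:

```python
import hashlib

def mask_patient_id(patient_id: str, store_id: str) -> str:
    """One-way hash of PatientID salted with Store_ID (HIPAA de-identification)."""
    return hashlib.sha256(f"{patient_id}|{store_id}".encode("utf-8")).hexdigest()

h1 = mask_patient_id("P12345", "S017")
h2 = mask_patient_id("P12345", "S017")  # deterministic: same inputs, same hash
h3 = mask_patient_id("P12345", "S018")  # different store salt, different hash
```

Determinism matters: the same patient hashes identically across batches, so dedup and joins still work on the masked column.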

Step 3: Aggregation & Business Curation (T-SQL)

WHAT

Build star schemas: Fact tables (InventoryFact, SalesFact) and Dimension tables (DimStore, DimProduct, DimDate). Pre-aggregate KPIs: inventory turnover, stockout frequency, demand variance.

WHY

Star schemas optimize analytical queries. Pre-aggregation eliminates slow ad-hoc SQL queries and enables <100ms response times in Power BI. Enables role-based access control (RLS) at the Gold layer for regulatory visibility.

HOW

Language: T-SQL views & stored procedures. Strategy: SCD Type 2 for slowly changing dimensions. Refresh: Hourly full refresh (union of Silver tables). Indexing: Clustered columnstore on fact tables.

Star Schema: FactInventory (500M rows) → DimStore (150 rows) | DimProduct (5K NDCs) | DimDate (365 rows)

Key Metrics: Turnover Ratio = SUM(Sales) / AVG(Stock) | Days of Supply | Stockout Events | DSCSA Exception Count

Partitioning: FactInventory partitioned by Store_ID & Date for query pruning

Destination: `Gold.InventoryFact`, `Gold.SalesFact` (SQL views in OneLake)
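The turnover metric above is just SUM(Sales) / AVG(Stock), with a NULL guard for zero stock; Days of Supply divides stock on hand by average daily sales. A minimal Python check of the formulas (sample numbers are illustrative):

```python
def turnover_ratio(sales_qty, stock_levels):
    """SUM(Sales) / AVG(Stock), NULL-safe like the Gold SQL view."""
    avg_stock = sum(stock_levels) / len(stock_levels)
    return None if avg_stock == 0 else sum(sales_qty) / avg_stock

def days_of_supply(stock_on_hand, daily_sales):
    """How many days current stock lasts at the average daily sales rate."""
    avg_daily = sum(daily_sales) / len(daily_sales)
    return stock_on_hand / avg_daily

ratio = turnover_ratio([30, 45, 25], [100, 120, 80])  # 100 sold vs. 100 avg stock
supply = days_of_supply(200, [10, 10, 10])
```

Pre-computing these in the fact tables is what lets Power BI hit the <100ms query targets instead of aggregating raw rows at query time.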

Step 4: Analytics & Real-time Delivery (Direct Lake)

WHAT

Power BI connects directly to Gold layer via Direct Lake mode (no import). Users query star schemas in real-time. Dashboards show: inventory KPIs, demand forecasts, DSCSA compliance status, supply chain visibility.

WHY

Direct Lake eliminates the Import model bottleneck (previously 5-day refresh lag). Real-time queries on OneLake data reduce decision latency from days to seconds. Enables compliance auditors to verify supply chain integrity in <100ms.

HOW

Connection: PowerBI → OneLake data connection. Mode: Direct Lake (queries OneLake Parquet at native speeds). RLS: Database-level RLS on DimStore enforces region-level access. Cache: V-Order sorted data for 40-60% faster columnstore scans.

Query Performance: Inventory KPI dashboard loads in <500ms (vs. 5+ minutes in legacy Warehouse). DSCSA exception report runs in <2 seconds across 150 stores.

Users & Access: 200+ pharmacy managers (filtered by region via RLS). 50+ compliance auditors (read-only). C-suite dashboard aggregations (enterprise-level)

Refresh SLA: Real-time as data enters Gold layer. No scheduled refreshes needed.

Failure Handling: If Direct Lake query fails, built-in retry (3x) with exponential backoff. Fallback to cached aggregations (updated hourly)

Performance Gains

  • Reporting latency: 5 days → <3 hours
  • Query response time: 5+ min → <100ms
  • Data quality: 70% duplicates → 99.8% clean
  • Compliance audit time: 2 weeks → 1 hour
  • TCO reduction: 40% (3 warehouses eliminated)

Failure Scenarios & Recovery

  • Ingestion failure → Dead-letter queue → Manual review within 15 min
  • Spark job timeout → Auto-retry (max 3x) with exponential backoff
  • Gold layer refresh failure → Alerts data engineering team; fallback to 24h-old snapshot
  • PowerBI connection loss → Cached data used for <1 hour; users notified
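The retry policy above (maximum 3 attempts, exponential backoff, then fall back to a stale snapshot) can be sketched as a small wrapper; delays and the fallback are illustrative:

```python
import time

def run_with_retry(job, fallback, max_attempts=3, base_delay=0.01):
    """Run `job`; on failure retry with exponential backoff, then fall back."""
    for attempt in range(max_attempts):
        try:
            return job()
        except Exception:
            time.sleep(base_delay * (2 ** attempt))  # 1x, 2x, 4x base delay
    return fallback()  # e.g. serve the 24h-old Gold snapshot

calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "fresh result"

result = run_with_retry(flaky, lambda: "cached snapshot")
```

The same shape covers both the Spark job timeout case (retry succeeds) and the Gold refresh failure case (all retries exhausted, fallback served).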

08 Spark & SQL Logic

This section explains the data engineering logic behind the Silver and Gold layers. It shows exactly how Spark is used to cleanse and secure raw pharmacy data, and how SQL is used to build analytics-ready models that support fast Power BI queries and regulatory reporting.

Silver Layer: PySpark Cleansing & PII Masking

WHAT

Ingest Bronze delta files, deduplicate records, normalize fields, mask protected health information, and enforce quality rules before writing to the Silver zone.

WHY

Raw source data is noisy, inconsistent, and often contains sensitive patient identifiers. The Silver layer creates a trusted dataset for both analytics and compliance without exposing PHI.

HOW

A Spark job reads Delta Bronze files, applies windowed deduplication, hashes sensitive fields, filters invalid rows, and writes Delta Silver tables with partitioning and ACID guarantees.

# Silver Layer: Cleanse & Mask Inventory Records
from pyspark.sql.functions import col, sha2, concat_ws, row_number
from pyspark.sql.window import Window

bronze_df = spark.read.format("delta").load("abfss://bronze/Inventory")

window_spec = Window.partitionBy("TransactionID").orderBy(col("IngestTimestamp").desc())

silver_df = bronze_df.withColumn("row_num", row_number().over(window_spec)) \
    .filter(col("row_num") == 1) \
    .drop("row_num") \
    .withColumn("PatientID_Hash", sha2(concat_ws("|", col("PatientID"), col("Store_ID")), 256)) \
    .withColumn("NDC_Code", col("NDC_Code").cast("string")) \
    .filter(col("Quantity") > 0) \
    .filter(col("TransactionDate").isNotNull())

silver_df.write.format("delta") \
    .mode("overwrite") \
    .option("overwriteSchema", "true") \
    .partitionBy("IngestDate") \
    .save("abfss://silver/Inventory")

Key operations: deduplication, hashing, type normalization, null filtering, partitioned Delta write.

Data quality impact: removes duplicate transaction IDs, cleans entry-level pharmacy records, and ensures every Silver row is compliance-ready.

Performance: uses predicate pushdown and partition pruning to process only modified Bronze files.

Gold Layer: SQL Aggregation & Analytics Models

WHAT

Build star-schema views and aggregated fact tables. Expose business KPIs such as turnover ratio, stock aging, and DSCSA compliance events.

WHY

Analytics require curated datasets structured for fast filtering, slicing, and drill-down. The Gold layer turns cleaned Silver data into a business-ready semantic model.

HOW

Create materialized views and Delta tables using SQL. Apply SCD 2 logic for dimensions, compute metrics with window functions, and index with columnstore patterns for low-latency queries.

CREATE OR REPLACE VIEW Gold.InventoryTurnover AS
SELECT
    s.Store_ID,
    s.NDC_Code,
    SUM(s.Sales_Qty) AS Total_Sales,
    AVG(s.Stock_Level) AS Avg_Stock,
    CASE
        WHEN AVG(s.Stock_Level) = 0 THEN NULL
        ELSE SUM(s.Sales_Qty) / AVG(s.Stock_Level)
    END AS Turnover_Ratio,
    COUNT_IF(s.Is_DSCSA_Exception = 1) AS DSCSA_Exception_Count
FROM Silver.Inventory_Fact AS s
WHERE s.IngestDate >= DATE_SUB(current_date(), 30)
GROUP BY s.Store_ID, s.NDC_Code;

Key SQL features: NULL-safe aggregation, conditional counts, time-window filtering, and grouping by store/product.

Query optimization: use `WHERE IngestDate >= DATE_SUB(current_date(), 30)` for partition pruning and reduce scan volume.

Result: supports Power BI dashboards with sub-second KPIs and DSCSA exception monitoring.

Spark & SQL Together: End-to-End Logic

What it solves

Bridging raw transaction ingestion with analytics-ready reporting, while preserving auditability and compliance in every step.

Why it matters

By combining Spark and SQL, the architecture supports both flexible transformation and fast BI consumption without duplicating data.

How it executes

Spark handles messy, high-volume processing. SQL builds the curated semantic layer. Together they enable reliable data lineage, audit trails, and interactive analytics.

Lineage: Silver tables are written with `metadata: source_system`, `ingest_timestamp`, and `record_hash` so every Gold metric can be traced back to its Bronze origin.

Governance: Spark transformations are logged in job metadata; SQL views are registered with business descriptions and owner tags.

Resilience: retry logic for Spark, schema evolution support, and fallback snapshots for Gold dashboards.
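The `record_hash` lineage column mentioned above can be derived deterministically from the business-key fields, so a Gold metric can be traced back to the exact Bronze row it came from. A sketch (the choice of key fields is illustrative):

```python
import hashlib
import json

def record_hash(row: dict, key_fields=("TransactionID", "NDC_Code", "Quantity")) -> str:
    """Stable hash over a fixed field order, independent of extra columns."""
    payload = json.dumps([row.get(f) for f in key_fields])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# Same business keys yield the same hash even though the Bronze row carries
# extra ingestion metadata the Silver row has dropped.
bronze_row = {"TransactionID": "T1", "NDC_Code": "0002-1433-80", "Quantity": 3,
              "IngestTimestamp": "2024-01-01T00:00:00Z", "source_system": "SAP"}
silver_row = {"TransactionID": "T1", "NDC_Code": "0002-1433-80", "Quantity": 3}
```

Matching hashes across layers is what makes the Bronze-to-Gold audit trail mechanical rather than forensic.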

09 Engine Optimizations

A compact optimization catalog for Spark, Delta Lake, and Power BI. These techniques reduce latency, minimize storage costs, and support high concurrency for FranHill Pharma's compliance and analytics workloads.

  • V-Order Sorting: Sort files on high-cardinality filter keys to speed Direct Lake scans.
  • VertiPaq Layout: Optimize columns for Power BI’s in-memory engine and reduce query time.
  • Partition Pruning: Partition by date/store for fast predicate pushdown and reduced I/O.
  • Predicate Pushdown: Filter rows at storage-read time rather than after loading data.
  • File Compaction: Combine small files into larger Parquet files to reduce overhead.
  • Deletion Vectors: Mark deleted records without rewriting entire files for update efficiency.
  • Z-Order Clustering: Cluster data on multiple related columns for better multi-filter performance.
  • Data Skipping: Use Delta statistics to skip irrelevant data blocks during scans.
  • Schema Evolution: Allow column changes without breaking existing queries or consumers.
  • Adaptive Query Execution: Allow Spark to optimize join strategies dynamically at runtime.
  • Broadcast Joins: Broadcast small dimension tables to avoid costly large shuffles.
  • Cache Hot Tables: Cache frequently accessed Silver outputs for repeated query bursts.
  • Join Elimination: Rewrite logic to reduce redundant joins on large datasets.
  • Metadata Pruning: Use Delta metadata to prune partitions/files before execution.
  • Vectorized Reads: Read column blocks with vectorized engines instead of row-by-row.
  • Skew Mitigation: Detect and rebalance skewed partitions for even parallel execution.
  • Concurrency Controls: Limit simultaneous writes to reduce Delta transaction conflicts.
  • Materialized Views: Precompute expensive aggregations and serve them directly.
  • Query Hints: Force better join orders or broadcast behavior when automatic planning is insufficient.
  • Cost-Based Optimization: Use statistics to choose efficient physical plans for Spark SQL.
  • Auto-Scaling Pools: Scale Spark compute up/down automatically to match workload demand.
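Several of these techniques (data skipping, metadata pruning, partition pruning) boil down to the same idea: consult per-file min/max statistics and read only files whose value range can match the predicate. A toy sketch of that check:

```python
def files_to_scan(file_stats, lo, hi):
    """Keep only files whose [min, max] range on the filter column overlaps [lo, hi]."""
    return [name for name, (fmin, fmax) in file_stats.items()
            if fmax >= lo and fmin <= hi]

# Per-file min/max of the partition column, as a Delta log would record them.
stats = {
    "part-0.parquet": ("2024-01-01", "2024-01-31"),
    "part-1.parquet": ("2024-02-01", "2024-02-29"),
    "part-2.parquet": ("2024-03-01", "2024-03-31"),
}
scanned = files_to_scan(stats, "2024-02-10", "2024-02-20")  # only February qualifies
```

The engine does this before any data is read, which is why a narrow date filter on a large fact table costs a fraction of a full scan.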

10 Governance & RLS

Policy-Driven Data Control

FranHill Pharma implemented a governance-first architecture that embeds security, lineage, and compliance into the Fabric lakehouse from ingestion through consumption.

  • Entity-based RLS applied at the Gold layer with centralized predicate functions for location, role, and audit classification.
  • Purview catalog and glossary drive consistent business definitions, data sensitivity labels, and automated access reviews.
  • Identity propagation ensures Power BI, Spark, and SQL workloads inherit Azure AD group memberships for true least-privilege access.
Pillars: Azure AD Groups | Data Classification | Audit & Lineage | RLS Enforcement

RLS Implementation Pattern

-- Dynamic RLS Example
CREATE FUNCTION dbo.tvf_StoreSecurity(@StoreKey INT)
RETURNS TABLE
WITH SCHEMABINDING
AS RETURN
    SELECT 1 AS access_granted
    FROM dbo.UserStoreAccess
    WHERE UserPrincipalName = USER_NAME()
      AND Store_Key = @StoreKey
      AND IsActive = 1;

CREATE SECURITY POLICY StoreFilter
ADD FILTER PREDICATE dbo.tvf_StoreSecurity(Store_Key)
ON Gold.Sales_Fact;

Governance Maturity

A mature governance model enables separation of duties, with catalog owners, data stewards, and compliance auditors each confined to defined roles and review workflows.

RLS Assurance

Automated policy validation checks for every deployment verify that RLS predicates exist on sensitive tables and that access policies are not bypassed by ad hoc views.

11 HADR Strategy

Resilient Storage

OneLake metadata and Parquet assets are persisted in Zone-Redundant Storage (ZRS) to isolate availability zones and protect against rack, cluster, or zone failures.

Cross-Region Recovery

Azure region pair replication and Geo DR snapshots are configured for the primary Fabric capacity, ensuring a second recovery plane for regulatory failover and audit continuity.

Operational Guardrails

Runbooks define failover testing, backup verification, and recovery steps for both data assets and compute pools, with quarterly fire-drills validated by the operations team.

Availability Targets
  • RTO: < 4 hours for DSCSA audit workflows
  • RPO: < 15 minutes for transaction lineage and serialization workflows
  • 99.95% platform SLA for ingestion and reporting
Failure Domains
  • Zone-level failures mitigated by ZRS and separate compute pools
  • Network transit interruptions handled by redundant ExpressRoute and VPN fallback
  • Policy drift controlled by automated audit pipelines and immutable snapshots

Recovery Architecture

The design separates storage, compute, and metadata into discrete recovery planes. When a primary Fabric region is unavailable, the alternate region can restore OneLake assets and rehydrate a secondary SQL endpoint with minimal manual intervention.

Validation & Observability

Monitoring dashboards capture replication health, job success rates, and readiness checks. Automated alerts trigger on missed backups, failed metadata syncs, or stale RLS policies.

12 Capacity Management

Capacity Planning & Allocation

F64 Fabric SKU provides 64 Capacity Units (CU) with intelligent bursting. Baseline allocation of 32 CU for steady-state operations, with 32 CU reserved for peak DSCSA re-indexing workloads.

  • Compute Smoothing: Auto-scaling based on queue depth prevents resource contention.
  • Storage Optimization: V-Order compressed Parquet reduces storage footprint by 30%.
  • Workload Isolation: Separate pools for ingestion, transformation, and analytics.
  • Right-Sizing Logic: Capacity calculated from peak concurrency (200 users × 2 CU/user) + ETL batch (16 CU) with 2x headroom.

Auto-Scaling Mechanisms

Dynamic scaling triggers on CPU utilization >70% or queue length >100 jobs. Bursting logic ensures overnight ETL doesn't impact daytime PBI performance.

  • Horizontal Scaling: Spark clusters auto-scale from 4 to 32 nodes based on data volume.
  • Vertical Bursting: Temporary CU increases for peak loads, billed per minute.
  • Cooldown Periods: 15-minute stabilization windows prevent thrashing and cascading failures.
  • Throttling Strategies: Rate-limiting on ingestion prevents queue explosion during peak events.
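The ingestion rate-limiting described above is typically a token bucket: steady refill sets the sustained rate, bucket size sets the allowed burst. A sketch (rate and capacity are illustrative):

```python
class TokenBucket:
    """Admit at most `rate` events per second, with bursts up to `capacity`."""
    def __init__(self, rate, capacity, now=0.0):
        self.rate, self.capacity = rate, capacity
        self.tokens, self.last = capacity, now

    def allow(self, now):
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # event is deferred or queued, not dropped silently

bucket = TokenBucket(rate=2, capacity=2)  # 2 events/sec, burst of 2
results = [bucket.allow(0.0), bucket.allow(0.0), bucket.allow(0.0), bucket.allow(1.0)]
```

The third call at t=0 is rejected (burst exhausted); by t=1.0 the bucket has refilled, so the queue drains at the sustained rate instead of exploding during peak events.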

Cost Optimization

Reserved instances for predictable workloads achieve 40% savings. Pay-as-you-go for variable analytics usage, with monthly budget caps at $50K.

  • Usage Analytics: Daily reports on CU consumption by workload type and business unit.
  • Idle Resource Detection: Automated shutdown of unused clusters after 30 minutes saves $12K/month.
  • Cost Allocation: Tag-based chargeback enforces cost ownership and behavioral change.
  • Elastic Commitment Model: 1-year commitment for 32 CU baseline (35% discount) + hourly pay-as-you-go for overage.

Monitoring & Governance

Real-time dashboards track CU utilization, job performance, and SLA compliance. Alerts trigger at 80% capacity thresholds with escalation paths.

  • Performance Metrics: Query latency, throughput, and error rates monitored via Prometheus + Grafana.
  • Capacity Forecasting: ML models predict scaling needs 2-7 days ahead based on historical patterns.
  • Compliance Checks: Automated audits ensure capacity aligns with 99.95% SLA for DSCSA audits.
  • Anomaly Detection: Statistical models detect unusual CU spikes and trigger runbooks for root cause analysis.

Advanced Capacity Strategies for Scale

Workload Prioritization

Prioritize critical paths: DSCSA audits (P0) draw on 5 reserved CU, Power BI interactive queries (P1) get 20 CU, and ad-hoc workloads (P3) share the remaining 39 CU with a 10ms latency penalty. This prevents resource starvation during peak compliance windows.

Multi-Tenant Isolation

Separate workspace capacity for pharmacy chains vs. corporate analytics. Prevents one tenant's runaway query from degrading others' experience. Cross-tenant dashboards query cached aggregates only.

Caching & Materialization

Aggregate tables for daily/weekly summaries pre-computed at 2 AM during low-usage windows. Power BI query cache (V-Order tuples) holds hot data for <100ms response times, eliminating re-computation.

13 Outcomes & ROI

40%
Stockout Reduction

Real-time inventory analytics reduced out-of-stock events by 40%, saving $2.4M in annual lost sales through predictive replenishment.

90%
Report Latency Savings

Direct Lake eliminated ETL refresh cycles, reducing report generation from 4 hours to 24 minutes—a 90% improvement in data freshness.

1.6s
PBI Page Load

V-Order optimized Parquet files achieved sub-2-second dashboard loads, enabling real-time decision-making for 200+ store managers.

65%
Compliance Cost Reduction

Automated DSCSA tracking and audit trails reduced manual compliance overhead by 65%, freeing resources for strategic initiatives.

Business Impact Summary

  • Revenue Growth: $3.2M additional sales from optimized inventory and reduced stockouts.
  • Operational Efficiency: 75% reduction in manual reporting tasks, enabling focus on patient care.
  • Risk Mitigation: Zero DSCSA violations post-implementation, avoiding $500K+ in potential fines.
  • User Satisfaction: 95% of pharmacy managers report improved decision confidence.

ROI Analysis

Total implementation cost: $1.8M (including Fabric licensing, training, and migration). Annual benefits: $4.6M in cost savings and revenue uplift.

156%
3-Year ROI

Payback achieved in 8 months. Projected 5-year cumulative ROI of 340% through expanded analytics capabilities.

14 12-Month Roadmap

Q1: Foundation

  • Fabric capacity setup & CU allocation
  • Bronze layer ingestion pipelines
  • Initial governance & RLS policies
  • Team training on OneLake basics

Q2: Migration

  • Medallion architecture implementation
  • SQL to Spark pipeline migration
  • Silver layer data quality enforcement
  • Power BI Direct Lake connections

Q3: Optimization

  • Performance tuning & V-Order optimization
  • HADR strategy & backup validation
  • Advanced analytics & ML models
  • User adoption & feedback integration

Q4: Innovation

  • Fabric Copilot deployment
  • AI-driven insights & automation
  • Full compliance audit & certification
  • Scale to enterprise-wide adoption

15 Strategic Lessons

Governance First

Establish RLS predicates and Purview classifications before data ingestion. Prevents compliance gaps, reduces audit costs by 60%.

Direct Lake Value

Eliminates data duplication, reduces storage by 40%, enables sub-second queries on 100TB+ datasets via native Parquet scanning.

Medallion Maturity

Bronze-Silver-Gold as quality stages. Enforce schemas at Silver, aggregate business logic at Gold for multi-tenant analytics.

Performance Mindset

V-Order/Z-Order for 10x faster queries. Partition by time for analytics, avoid over-partitioning to prevent file fragmentation.

Compliance as Code

Embed DSCSA/HIPAA logic in Spark UDFs. Ensures version-controlled, testable, auditable regulatory safeguards.

Capacity Economics

Monitor CU usage, auto-scale on queue depth. Use reserved instances for 30-50% savings over pay-as-you-go.

Team Upskilling

Transition from SQL Server to Spark/PySpark. Early training yields reliable pipelines and innovation velocity.

Change Management

Pilot OneLake's ACID guarantees. Scale via champions demonstrating 80% faster reports.