Data Platform Engineer

Building reliable data infrastructure for AI, analytics, and scientific workflows.

This site collects field notes from that work: correctness, runtime behavior, data ownership, and system design in practice, with a growing focus on analytics and AI systems.

Read case study · Browse notes

FEATURED CASE STUDY

Stabilizing a Data Pipeline Migration Under Changing Conditions

Long validation cycles, output mismatches, and limited production visibility made the migration less about moving code and more about making each mismatch explainable.

Read →

Engineering Themes

Where these field notes focus.

Runtime Reliability

Debugging production workflows where correctness, reruns, and operational cost interact.

Data Contracts

Turning implicit assumptions in support files, schemas, and runtime artifacts into visible contracts.

Spark Execution Boundaries

Understanding where Python, Spark-native execution, skew, and materialization costs meet.

Migration & Modernization

Moving legacy workflows toward cloud-native, version-controlled, and observable data systems.

Notes

Recent notes on production data movement, Spark migration boundaries, and rewrite validation.

Engineering Note · Data Movement Reliability

From Green Runs to Trustworthy Data Movement

A production data pipeline could run green, but green did not always mean data had moved.

Engineering Note · Spark Rewrite & Validation

When Final Output Diff Is Not Rewrite Diff

Freezing input boundaries to separate rewrite behavior from upstream identity drift.

Engineering Note · Spark Migration Semantics

Shape Parity Is Not Semantic Parity

How null-safe arithmetic exposed the difference between matching row counts and matching meaning.
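
As a rough illustration of that last distinction, here is a minimal sketch in plain Python rather than Spark. The sample rows, the function names, and the choice of which side treats nulls as zero are all hypothetical, picked only to show how two outputs can agree on row count while disagreeing on values.

```python
from typing import List, Optional, Tuple

Row = Tuple[Optional[float], Optional[float]]


def add_null_propagating(a: Optional[float], b: Optional[float]) -> Optional[float]:
    """SQL-style addition: any null operand makes the result null."""
    if a is None or b is None:
        return None
    return a + b


def add_null_safe(a: Optional[float], b: Optional[float]) -> float:
    """Null-safe addition: a missing operand is treated as 0.0."""
    return (0.0 if a is None else a) + (0.0 if b is None else b)


# Hypothetical input rows; None stands in for a SQL NULL.
rows: List[Row] = [(1.0, 2.0), (3.0, None), (None, None)]

# Assumption for illustration: one implementation is null-safe,
# the other lets nulls propagate.
null_safe_output = [add_null_safe(a, b) for a, b in rows]
propagating_output = [add_null_propagating(a, b) for a, b in rows]

# Shape parity: both outputs have the same number of rows.
shape_parity = len(null_safe_output) == len(propagating_output)

# Semantic parity: every row carries the same value.
semantic_parity = null_safe_output == propagating_output

# Row indexes where the two outputs disagree.
mismatched_rows = [
    i
    for i, (left, right) in enumerate(zip(null_safe_output, propagating_output))
    if left != right
]

print(shape_parity, semantic_parity, mismatched_rows)
```

A row-count check passes here even though two of the three rows differ, which is why a value-level comparison is worth running alongside it.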