Engineering Note · Spark Rewrite & Validation
When the Final-Output Diff Is Not the Rewrite Diff
Freezing input boundaries to separate rewrite behavior from upstream identity drift.
Details are generalized and sanitized to preserve confidentiality while keeping the engineering lesson accurate.
A final-output diff tells you that two runs disagree.
It does not tell you where the disagreement entered.
That matters during rewrites. When a pipeline has recently changed, the natural suspect is the new code path. In this case, that meant a Spark-native rewrite of a grouped aggregation.
The final output did not match the expected shape, so the first question was obvious:
Did the rewrite change the result?
The final table alone could not answer that.
Freeze the input boundary
The useful move was to materialize the input right before the suspected aggregation, then run both paths against that same frozen input:
same input snapshot
↓
legacy aggregation path   ·   spark-native aggregation path
↓
materialized outputs
↓
parity checks
Under that boundary, the legacy and Spark-native outputs matched.
That changed the investigation. The final-output diff was real, but the rewrite was not the cause. The difference had entered upstream.
This is the point of frozen-input parity: it turns a broad output mismatch into a narrower question. If two implementations match on the same materialized input, the operator is no longer the main suspect.
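A minimal sketch of that harness in PySpark, assuming the two code paths can be invoked as functions over a DataFrame; every name here is a hypothetical stand-in, not the production code:

```python
from typing import Callable

from pyspark.sql import DataFrame, SparkSession

spark = SparkSession.builder.getOrCreate()

def parity_check(
    input_df: DataFrame,
    legacy_path: Callable[[DataFrame], DataFrame],
    rewrite_path: Callable[[DataFrame], DataFrame],
    snapshot_dir: str,
) -> bool:
    # Freeze the boundary: both paths must read the same materialized
    # bytes, not a lazily re-evaluated lineage that may differ per run.
    input_df.write.mode("overwrite").parquet(snapshot_dir)
    frozen = spark.read.parquet(snapshot_dir)

    legacy_out = legacy_path(frozen)
    rewrite_out = rewrite_path(frozen)

    # Symmetric multiset difference: rows in one output but not the
    # other, duplicates respected. Both must be empty for parity.
    return (
        legacy_out.exceptAll(rewrite_out).count() == 0
        and rewrite_out.exceptAll(legacy_out).count() == 0
    )
```

Materializing to Parquet, rather than caching, is the deliberate choice here: the snapshot outlives the session, so the comparison can be rerun later against the exact same input.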
Literal equality is not logical identity
The upstream issue was an identity problem.
A source identifier could appear in multiple literal forms:
decoded
single-encoded
double-encoded
Logically, those values referred to the same thing. Physically, they were different strings.
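Concretely, with an invented identifier and standard URL encoding:

```python
from urllib.parse import quote, unquote

decoded = "report 2024/q1"                       # one logical identifier
single_encoded = quote(decoded, safe="")         # 'report%202024%2Fq1'
double_encoded = quote(single_encoded, safe="")  # 'report%25202024%252Fq1'

# Three physically distinct strings for the same logical thing.
assert len({decoded, single_encoded, double_encoded}) == 3
assert unquote(unquote(double_encoded)) == decoded
```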
That mattered because a deduplication step used a key that did not include the field whose representation varied. The dedupe was not wrong with respect to its key. But when several representation variants existed for the same logical entity, the surviving variant could differ across runs.
Later, that same field became part of the downstream aggregation grouping set.
So the output changed not because the aggregation rewrite was wrong, but because upstream processing allowed different literal representations of the same logical identifier to survive into grouping.
The instability came from composition:
encoded / decoded variants
+
dedupe key that omitted the varying field
+
downstream grouping that used that field
=
rerun-idempotency drift
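A condensed PySpark reconstruction of that composition; the column names and values are invented for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Same logical entity twice, surviving in two literal encodings.
df = spark.createDataFrame(
    [("acct-1", "report%202024", 10),
     ("acct-1", "report%25202024", 10)],
    ["account", "source", "amount"],
)

# The dedupe key omits `source`, the field whose representation
# varies; dropDuplicates keeps an arbitrary survivor per key.
deduped = df.dropDuplicates(["account"])

# Downstream grouping uses `source`, so the group key inherits
# whichever literal variant happened to survive this run.
deduped.groupBy("account", "source").sum("amount").show()
```

Run twice, the surviving `source` variant can differ, so the grouped output differs even though no input row changed.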
One-pass smoke tests were not enough
The first smoke test checked whether a decoding function existed and handled simple inputs.
Useful, but insufficient.
A single decode pass handles already-decoded and single-encoded values. A double-encoded value comes out of one pass with one layer removed, which leaves it still encoded.
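For example, with the standard-library decoder (values invented):

```python
from urllib.parse import unquote

assert unquote("report%202024") == "report 2024"      # single-encoded: done
assert unquote("report%25202024") == "report%202024"  # double-encoded: still encoded
```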
The production fix needed the composed behavior:
trim
↓
decode once if encoding is present
↓
decode again only if a still-encoded pattern remains
↓
fallback to the original value if decoding fails
That bounded second pass handled the documented decoded, single-encoded, and double-encoded forms without turning the logic into an unbounded decode loop.
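A sketch of that composition in Python; the percent-escape regex is an assumption about what counted as a still-encoded pattern:

```python
import re
from urllib.parse import unquote

# Assumption: "still encoded" means any percent escape such as %20.
ENCODED = re.compile(r"%[0-9A-Fa-f]{2}")

def canonicalize(value: str) -> str:
    original = value
    try:
        result = value.strip()           # trim
        if ENCODED.search(result):
            result = unquote(result)     # decode once if encoding is present
        if ENCODED.search(result):
            result = unquote(result)     # bounded second pass only if needed
        return result
    except Exception:
        return original                  # fall back to the original value
```

Capping at two passes covers the documented decoded, single-encoded, and double-encoded forms; a loop-until-stable decoder would silently absorb new upstream encoding bugs instead of surfacing them.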
Testing a primitive is not the same as testing the production composition.
Fix the right boundary
The fix was not to change the downstream aggregation.
The fix was to canonicalize the identifier before deduplication.
That placement mattered. Once dedupe selected a survivor, downstream grouping had already inherited that representation. Canonicalizing later would be too late.
The safer boundary was earlier:
source rows
↓
canonicalize identifier representation
↓
deduplicate
↓
aggregate
After canonicalization, decoded, single-encoded, and double-encoded variants collapsed into the same representation before dedupe. The downstream aggregation no longer depended on which literal variant happened to survive.
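A hypothetical wiring of the fixed pipeline, reusing the `canonicalize` sketch above; rows and column names are again invented:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

# `canonicalize` is the bounded decoder sketched in the previous
# section, registered here as a UDF.
canonicalize_udf = F.udf(canonicalize, StringType())

source_df = spark.createDataFrame(
    [("acct-1", "report%202024", 10),
     ("acct-1", "report%25202024", 10)],
    ["account", "source", "amount"],
)

fixed = (
    source_df
    .withColumn("source", canonicalize_udf("source"))  # canonicalize first
    .dropDuplicates(["account"])       # same dedupe key as before the fix
    .groupBy("account", "source")      # grouping now sees one literal form
    .sum("amount")
)
fixed.show()
```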
The broader lesson
A final-output diff is a symptom, not a root cause.
The validation pattern I would use again:
- start with the materialized final output
- if it differs, freeze the input before the suspected operator
- compare legacy and new paths on that same input
- if they match, move the boundary upstream
- inspect identity, dedupe, grouping, and canonicalization assumptions
The core question is not:
Which operator is wrong?
It is:
At which boundary did two logically identical things become physically different?
In this case, the rewrite was not the diff. The diff came from identity drift upstream.