Field Notes

Shape Parity Is Not Semantic Parity

Matching group keys and row counts can still hide different derived values after a migration rewrite.

Details are generalized and sanitized to preserve confidentiality while keeping the engineering lesson accurate.

In production, the job completed successfully.

A very large input had already been aggregated into a few hundred group-level rows. That part made sense.

The next step did not.

After the downstream view applied its filters, only about one-tenth of those rows remained.

Nothing had crashed. The filters were working as defined. But the drop-off was too concentrated to treat as ordinary filtering.

I used the filter as a diagnostic boundary, not as the explanation.

The filter was a symptom

One early explanation was simple: the view was applying its filters correctly.

That was true, but incomplete.

The view kept only rows where key measures were non-null and non-zero. That explained where the rows disappeared. It did not explain why so many rows had become ineligible at once.
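A minimal sketch of that filter contract follows. SQLite stands in for the real engine here, and the table, view, and column names (`base_agg`, `filtered_view`, `derived_total`) are hypothetical, as are the sample rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE base_agg (group_key TEXT, derived_total REAL)")
conn.executemany(
    "INSERT INTO base_agg VALUES (?, ?)",
    [("a", 12.5), ("b", None), ("c", 0.0), ("d", None)],
)
# Hypothetical view: keep only rows whose measure is non-null and non-zero.
conn.execute(
    """
    CREATE VIEW filtered_view AS
    SELECT group_key, derived_total
    FROM base_agg
    WHERE derived_total IS NOT NULL AND derived_total <> 0
    """
)
kept_keys = [
    row[0]
    for row in conn.execute("SELECT group_key FROM filtered_view ORDER BY group_key")
]
print(kept_keys)  # ['a']
conn.close()
```

The view behaves exactly as written. Three of the four sample groups are dropped, and the view alone cannot say whether their totals should have been NULL or zero in the first place.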

The real question was upstream:

Which derived value changed in a way that made otherwise valid groups fail the filter?

Shape parity was not enough

The base aggregation produced the same group-row count before and after the fix. The view applied the same filter contract in both cases. The view definition did not change.

But the output did:

Check	Before fix	After fix
Base group rows	unchanged	unchanged
Rows with `NULL` derived total	~90%	0
Filtered view output	~10% of base	100% of base
Rows dropped by other filters	0	0

The recovery mattered because of what it proved: the base shape was stable, but the derived-value semantics had changed.

The mismatch was in missing-value semantics

I did not start by knowing the full business meaning of the nullable component.

I started with the data path:

base aggregation rows were present
↓
derived total was NULL for most affected rows
↓
the view filtered out NULL totals
↓
the filtered output collapsed

In the affected groups, the primary component was populated. The optional component was NULL.

The key question was:

Did this NULL mean “unknown,” or did it mean “no contribution”?

The evidence pointed to the second meaning.

The missing optional component should have contributed zero. It should not have nullified the whole total.

Expected:

total = primary_component + 0

The Spark-native rewrite expressed it as:

total = primary_component + optional_component

Under Spark SQL semantics:

primary_component + NULL = NULL

That behavior is correct. Spark was not broken.
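The propagation rule is easy to reproduce in isolation. The sketch below uses SQLite from the Python standard library as a stand-in, on the assumption that it follows the same standard SQL rule for arithmetic on NULL as Spark SQL; the literal values are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Arithmetic with a NULL operand yields NULL, which sqlite3 returns as None.
(propagated,) = conn.execute("SELECT 10.0 + NULL").fetchone()
print(propagated)  # None
conn.close()
```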

The mismatch came from an implicit migration assumption. The legacy path had effectively treated a missing optional contribution as zero. The new Spark expression allowed the missing component to propagate into the derived total.

The fix was to encode the missing-as-zero rule directly:

total = sum(coalesce(primary_component, 0.0) + coalesce(optional_component, 0.0))

The important part was not the coalesce function itself. The important part was identifying what the missing component meant here: no contribution, not unknown.
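As a hedged illustration of the difference, the sketch below runs both expressions over the same grouped rows. SQLite again stands in for Spark SQL, the table name `component_rows` and the sample values are hypothetical, and the sketch assumes the addition sits inside the aggregate, which the original rewrite may or may not have done.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE component_rows ("
    "group_key TEXT, primary_component REAL, optional_component REAL)"
)
conn.executemany(
    "INSERT INTO component_rows VALUES (?, ?, ?)",
    [
        ("a", 5.0, 1.0),
        ("a", 3.0, 2.0),
        ("b", 4.0, None),  # optional component missing: no contribution
        ("b", 6.0, None),
    ],
)
results = conn.execute(
    """
    SELECT group_key,
           SUM(primary_component + optional_component) AS naive_total,
           SUM(COALESCE(primary_component, 0.0)
               + COALESCE(optional_component, 0.0)) AS fixed_total
    FROM component_rows
    GROUP BY group_key
    ORDER BY group_key
    """
).fetchall()
print(results)  # [('a', 11.0, 11.0), ('b', None, 10.0)]
conn.close()
```

Group `b` has a populated primary component in every row, yet its naive total is NULL because each per-row sum is NULL and the aggregate has nothing left to add. The coalesced form returns the primary total unchanged.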

Validation closed the loop

The decisive comparison was not:

Did the job succeed?

or:

Did the row count match?

It was:

Same group rows, same filters, different derived-total distribution.

Production validation used table history to compare the pre-fix and post-fix states.

Before the fix, roughly 90% of rows had NULL derived totals. After the fix, that share dropped to zero. The filtered output recovered from roughly 10% of the base output to all of it. The view filters stayed unchanged.

That made the closure empirical, not inferred.
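A comparison of this kind can be sketched as a small summary function applied to two snapshots. Everything below is hypothetical: the `summarize` helper, the sample values, and the assumption that each snapshot has already been read out of table history into a list of derived totals.

```python
from typing import Optional, Sequence


def summarize(totals: Sequence[Optional[float]]) -> dict:
    """Row count, NULL share, and share surviving a non-null/non-zero filter."""
    n = len(totals)
    if n == 0:
        raise ValueError("snapshot has no rows")
    nulls = sum(1 for t in totals if t is None)
    kept = sum(1 for t in totals if t is not None and t != 0)
    return {"rows": n, "null_share": nulls / n, "kept_share": kept / n}


# Hypothetical snapshots of the same ten groups, before and after the fix.
before = [4.0] + [None] * 9
after = [4.0, 7.5, 2.0, 9.0, 1.5, 3.0, 8.0, 6.0, 5.5, 2.5]

print(summarize(before))  # {'rows': 10, 'null_share': 0.9, 'kept_share': 0.1}
print(summarize(after))   # {'rows': 10, 'null_share': 0.0, 'kept_share': 1.0}
```

The row count is identical in both summaries. Only the NULL share and the surviving share move, which is the signature of a semantic change rather than a shape change.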

The broader lesson

A migration can preserve schemas, group keys, row counts, and successful job status while still changing the meaning of a derived value.

Shape parity tells you whether the same records exist. Semantic parity tells you whether those records still mean the same thing.

For migrations that rewrite aggregation or arithmetic across execution engines, I would check:

NULL distributions on high-impact derived columns
value distributions before and after the rewrite
filter attribution for dropped rows
pre-fix vs post-fix comparison using stable snapshots or table history
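Filter attribution in particular lends itself to a small script. The sketch below is one possible shape, assuming a view that drops NULL and zero totals; the `drop_reason` helper, its labels, and the sample values are hypothetical.

```python
from collections import Counter
from typing import Optional


def drop_reason(total: Optional[float]) -> str:
    """Label each row with the predicate that removed it, or 'kept'."""
    if total is None:
        return "null_total"
    if total == 0:
        return "zero_total"
    return "kept"


# Hypothetical derived totals for six groups.
totals = [12.0, None, None, 0.0, 3.5, None]
attribution = Counter(drop_reason(t) for t in totals)

print(attribution["null_total"])  # 3
print(attribution["zero_total"])  # 1
print(attribution["kept"])        # 2
```

When one reason accounts for nearly all dropped rows, that points at a single upstream expression rather than at ordinary filtering.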

The goal is not to distrust the new engine. It is to find the implicit contracts that the old path carried silently.

In this case, Spark did what Spark should do. The migration became correct only after the missing-as-zero rule was made explicit.