Ewald Borst · 20 October 2025

Data quality: why validation takes longer than you think

data quality · machine learning · data engineering


You have the data. You have the model. The pipeline runs. Done, right? Not quite. During a recent project for a rehabilitation clinic, we discovered five categories of data errors during the validation phase, each of which would have rendered the model unusable on its own. None of these errors were visible during an initial inspection of the data.

The context

We built a machine learning pipeline that predicts treatment outcomes based on questionnaire data and patient characteristics. The source data came from an electronic health record system. At first glance, everything looked correct: tables populated, fields present, types matched. It was only during systematic validation (row by row, variable by variable) that the problems surfaced.

Five problems you won’t see without validation

1. Ages that looked almost right

For 220 patients, the age was calculated incorrectly. The cause: a subtle difference in the timestamp component of the date-of-birth field. The error was small enough to go unnoticed in a sample check, but large enough to shift ages by one year. In a model where age is a predictor, that matters.
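The pattern can be reproduced in a few lines. This is a sketch, not the clinic's actual pipeline: the assumption is that the date of birth was stored as a datetime whose time component (e.g. a timezone shift at storage) could push it onto the previous calendar day.

```python
from datetime import date, datetime

def age_at(dob: datetime, ref: date) -> int:
    # Normalize to a calendar date first: a stray time component in
    # the date-of-birth field must not influence the calculation.
    d = dob.date() if isinstance(dob, datetime) else dob
    # Subtract one if the birthday has not yet occurred this year.
    return ref.year - d.year - ((ref.month, ref.day) < (d.month, d.day))

# Hypothetical example: a CET date of birth stored as UTC lands one
# hour before midnight on the *previous* day.
dob_clean = datetime(1980, 6, 15, 0, 0)
dob_shifted = datetime(1980, 6, 14, 23, 0)
ref = date(2025, 6, 14)
print(age_at(dob_clean, ref), age_at(dob_shifted, ref))  # 44 45
```

Around the birthday, the shifted record reports an age one year higher; on any other reference date, both records agree, which is exactly why a sample check misses it.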

2. Missing questionnaires

For 122 patients, no completed questionnaire existed. They were technically present in the dataset (a patient record existed), but the associated questionnaire data was entirely absent. These patients had to be explicitly excluded. Without that step, the model would train on empty or default values as if they were real responses.
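A minimal sketch of that exclusion step, with hypothetical identifiers and field names (`patient_id` is an assumption, not the real schema): patients are only included if at least one questionnaire row exists for them.

```python
def split_by_questionnaire(patient_ids, questionnaire_rows):
    # A patient is "included" only if at least one questionnaire row
    # exists; everyone else is explicitly excluded, never defaulted.
    answered = {row["patient_id"] for row in questionnaire_rows}
    included = sorted(p for p in patient_ids if p in answered)
    excluded = sorted(p for p in patient_ids if p not in answered)
    return included, excluded

patients = ["p1", "p2", "p3", "p4"]
rows = [{"patient_id": "p1", "q1": 3}, {"patient_id": "p3", "q1": 5}]
included, excluded = split_by_questionnaire(patients, rows)
print(included, excluded)  # ['p1', 'p3'] ['p2', 'p4']
```

The point of returning the excluded list, rather than silently dropping it, is that the exclusion count (here 122 patients) becomes a reportable, reviewable number.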

3. Wrong score aggregation

For 11 variables, a sum score was calculated where a categorization was actually required. The difference: a score of 3+4+2=9 versus a categorization of “moderate.” The model expects categories, not summations. This is the type of error that only becomes visible when you consult the original scoring manual, not when you just look at the data.
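In code, the difference between the two aggregations is small but decisive. The cutoff values below are hypothetical placeholders; in practice they must come from the instrument's scoring manual.

```python
def categorize(total: int) -> str:
    # Hypothetical score bands for illustration only; the real
    # cutoffs are defined in the original scoring manual.
    if total <= 4:
        return "low"
    if total <= 9:
        return "moderate"
    return "high"

items = [3, 4, 2]
raw_sum = sum(items)           # 9: what the pipeline produced
category = categorize(raw_sum) # "moderate": what the model expects
print(raw_sum, category)
```

Feeding `raw_sum` into a model that was specified on categories silently changes the variable's meaning, without any type error to warn you.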

4. Null is not zero

Three conditional variables (the type that is only filled in when a preceding question is answered positively) were treated as 0 when they were null. But null here means “question not applicable,” while 0 means “question answered with no/none.” That distinction is substantively relevant and directly affects model outcomes.
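The bug typically hides in an innocent-looking default. The field names below are invented for illustration; the pattern is what matters:

```python
rows = [
    {"smokes": True,  "cigs_per_day": 10},    # answered: 10
    {"smokes": True,  "cigs_per_day": 0},     # answered: none
    {"smokes": False, "cigs_per_day": None},  # not applicable
]

# Buggy: `or 0` (like fillna(0)) collapses "not applicable" and an
# explicit zero into the same value.
buggy = [r["cigs_per_day"] or 0 for r in rows]  # [10, 0, 0]

# Correct: keep None as missing, so a later imputation or modeling
# step can treat "not applicable" differently from a real zero.
clean = [r["cigs_per_day"] for r in rows]       # [10, 0, None]
print(buggy, clean)
```

After the buggy version, the distinction is unrecoverable: rows two and three are identical, and no downstream check can tell them apart.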

5. Mixed measurement timepoints

The dataset contained measurements from multiple timepoints. Only the first measurement (T0) was relevant for the model. Without explicit filtering on measurement timepoint, values from later measurements were included in calculations, causing scores to be summed that did not belong together.
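A sketch of the failure mode with invented values: without the timepoint filter, a per-patient sum quietly mixes baseline and follow-up scores.

```python
measurements = [
    {"patient": "p1", "timepoint": "T0", "score": 12},
    {"patient": "p1", "timepoint": "T1", "score": 8},   # follow-up
    {"patient": "p2", "timepoint": "T0", "score": 15},
]

# Wrong: aggregates across all timepoints
total_wrong = sum(m["score"] for m in measurements)  # 35

# Correct: filter to the baseline measurement first
baseline = [m for m in measurements if m["timepoint"] == "T0"]
total_t0 = sum(m["score"] for m in baseline)         # 27
print(total_wrong, total_t0)
```

Both totals are plausible numbers, which is precisely why the error survives a casual inspection: nothing about 35 looks broken until you know only T0 belongs in it.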

Why this matters

Each of these five problems was technically subtle. No crashes, no error messages, no missing tables. The data looked fine. And that is precisely the danger: a model that trains on incorrect but plausible data produces results that also look plausible. You only notice when you review the outcomes against domain knowledge, or when a domain expert says: “this doesn’t add up.”

In any domain where decisions are based on data, that is an unacceptable risk.

The lesson: build a validation protocol

After this project, we apply a fixed validation protocol to every data pipeline:

  1. Field validation: compare every field against the original definition and scoring manual. Do not assume the database column name tells the whole story.
  2. Handle edge cases explicitly: null values, missing records, and conditional logic must have documented rules.
  3. Aggregation checks: verify every calculated variable against the original methodology. Summation versus categorization is a common mistake.
  4. Filter validation: explicitly verify measurement timepoints, time periods, and inclusion criteria. Do not assume the data is pre-filtered.
  5. Sample review with a domain expert: walk through a subset of results with someone who understands the domain context.

Conclusion

Data quality is not a checkbox. It is a discipline. The validation phase of this project took more time than the initial pipeline build, and that is exactly how it should be. Because a model running on bad data is worse than no model at all.

Invest in validation before you go to production. Not afterwards. Not as an afterthought. As a foundation.

Let's talk

Get in touch for an initial consultation.