Rehabilitation clinic

Healthcare analytics and ML platform on Google Cloud

Google Cloud · BigQuery · machine learning · healthcare · dbt

A rehabilitation clinic specializing in chronic pain treatment had a scientifically validated ML model that could predict treatment outcomes. The model existed as a Python script on a laptop, while patient data was locked inside an EHR system with no export capabilities. This is the story of how that became a production platform.

Situation

The clinic treats patients with chronic pain, a complex condition where treatment outcomes vary widely per patient. A domain expert had developed a machine learning model based on scientific research. The model used 51 patient parameters, such as demographics, questionnaire scores, and diagnosis characteristics, to generate a prognosis for the expected treatment outcome.

The model worked on a laptop with manually copied data as a one-off exercise. However, it was not scalable, reproducible, or usable in daily clinical practice.

At the same time, all patient data was locked inside the clinic’s EHR system. With no API or export function, there was no way to get data out in a structured format, and the EHR vendor had no plans to change that.

The clinic came to us with a concrete request: help us unlock that data. What followed was a collaboration that grew from a single connector to a complete data and ML platform.

Challenge

The technical challenges existed on three levels:

Unlocking the data. The EHR system was a black box. With no documented database structure or official export capability, the only way to reach the data was through the clinic’s own network.

Transforming the data. The ML model required 51 specific parameters. These did not exist as ready-made fields in the EHR; they had to be calculated from raw data. Questionnaire scores needed aggregating, ages needed calculating, and conditional variables needed deriving. Every calculation had to match the scientific definition exactly.

Bringing the model to production. A scikit-learn model on a laptop is fundamentally different from a model that runs daily, automatically processes new data, and makes results available to researchers. That requires infrastructure, orchestration, monitoring, and a reporting layer.

Approach

We built the platform in phases, with each phase delivering immediate value.

Phase 1: EHR Connector

The first step was unlocking the patient data. We built an EHR Connector: a secure site-to-site VPN tunnel between the clinic’s network and Google Cloud, with a Cloud Run service that extracts the relevant tables daily and writes them to BigQuery.

The connector runs early every morning when the load on the source system is minimal. Data is available before 7 AM, when the first clinicians start their day.
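In essence, the daily extraction reduces to generating one incremental query per source table and loading the result into BigQuery. The sketch below shows only that query-building step; the table names and watermark columns are illustrative, not the clinic's actual EHR schema:

```python
from datetime import date

def build_extraction_queries(tables, since):
    """Build one incremental SELECT per source table.

    `tables` maps table name -> watermark column used for incremental
    extraction; `since` is the date of the last successful run.
    """
    queries = {}
    for table, watermark in tables.items():
        queries[table] = (
            f"SELECT * FROM {table} "
            f"WHERE {watermark} >= '{since.isoformat()}'"
        )
    return queries

# Illustrative table list -- not the real EHR schema.
tables = {"patients": "updated_at", "questionnaires": "submitted_at"}
print(build_extraction_queries(tables, date(2024, 1, 1))["patients"])
```

In production, the Cloud Run service executes these queries over the VPN tunnel and writes each result set to its own BigQuery table.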

This was the turning point. For the first time, the clinic had its own copy of patient data in an open, queryable format, independent of the EHR vendor.

Phase 2: Data platform on BigQuery

With the EHR data in BigQuery, we expanded the platform to include multiple sources. HR data was connected for workforce analytics, and questionnaire data was integrated for treatment evaluations. BigQuery became the central data store, where all sources converge.

On top of the raw data, we built a dbt transformation layer. This is where the 51 patient parameters are calculated from source data. Every transformation is documented, tested, and version-controlled. When a domain expert modifies a parameter definition, the change is traceable and reviewable.
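The dbt models themselves are SQL, but the kind of derivation involved can be illustrated in Python. Both parameters below are hypothetical stand-ins, not the clinic's actual definitions; the point is that each derivation follows an exact rule (age is floored to completed years, and a missing input stays missing):

```python
from datetime import date

def age_at_intake(birth_date, intake_date):
    """Age in completed years at intake (floored, never rounded up)."""
    years = intake_date.year - birth_date.year
    # Subtract one year if the birthday has not yet occurred this year.
    if (intake_date.month, intake_date.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years

def pain_duration_category(months):
    """Hypothetical conditional variable: a missing input stays missing."""
    if months is None:
        return None
    return "chronic" if months >= 3 else "acute"

print(age_at_intake(date(1980, 6, 15), date(2024, 6, 14)))  # 43
print(pain_duration_category(7))  # chronic
```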

Phase 3: Data quality validation

This was the moment the project shifted from technical to scientific. When we calculated the first parameter set and compared the results against the output from the original script, they did not match.

The analysis that followed was sobering:

  • 220 patients had an incorrectly calculated age. The original script used a different rounding method than the scientific definition prescribed.
  • 11 variables had incorrect score aggregation. Subtotals were calculated differently than the questionnaire manual specified.
  • Multiple conditional variables did not distinguish between null and 0. A patient who had not answered a question (null) was treated the same as a patient who scored “0,” even though those are fundamentally different situations.
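The null-versus-0 issue is easy to reproduce. A naive aggregation that coerces a missing answer to 0 silently merges two clinically different situations (the scores below are made up for illustration):

```python
def naive_total(answers):
    # The buggy pattern: an unanswered item (None) silently becomes 0.
    return sum(a or 0 for a in answers)

def validated_total(answers):
    # The corrected pattern: any missing answer makes the total missing.
    if any(a is None for a in answers):
        return None
    return sum(answers)

unanswered = [4, None, 3]   # patient skipped an item
scored_zero = [4, 0, 3]     # patient explicitly answered "0"

print(naive_total(unanswered), naive_total(scored_zero))          # 7 7
print(validated_total(unanswered), validated_total(scored_zero))  # None 7
```

The naive version returns 7 in both cases, so the distinction is lost before the model ever sees the data.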

These were not bugs in our platform; they were errors that had existed in the original script all along, invisible as long as nobody systematically validated the results.

We developed a formal validation protocol. Each parameter was individually validated against the scientific source definition. Only after approval by the domain expert did a parameter go to production.
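A minimal sketch of such a per-parameter comparison, assuming the platform output and the reference values agreed with the domain expert are both available keyed by patient ID (the IDs and values here are invented):

```python
def validate_parameter(platform, reference, tolerance=0.0):
    """Compare one parameter across all patients; return the mismatches.

    `platform` and `reference` map patient ID -> value (None = missing).
    """
    mismatches = {}
    for pid, expected in reference.items():
        actual = platform.get(pid)
        if expected is None or actual is None:
            if expected is not actual:  # one is missing, the other is not
                mismatches[pid] = (actual, expected)
        elif abs(actual - expected) > tolerance:
            mismatches[pid] = (actual, expected)
    return mismatches

reference = {"p1": 43, "p2": None, "p3": 12}
platform = {"p1": 43, "p2": 0, "p3": 12}
print(validate_parameter(platform, reference))  # {'p2': (0, None)}
```

Note that the missing-versus-0 case surfaces as a mismatch here, which is exactly the class of error the protocol was designed to catch.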

The lesson: an ML model is only as good as the data that goes into it. Without rigorous validation of input parameters, every prediction is unreliable.

Phase 4: ML platform

With validated parameters, we built the ML platform using the following architecture:

  • scikit-learn models packaged as Cloud Run Jobs: serverless, scalable, and without permanent infrastructure costs.
  • GCP Workflows as an orchestration layer to coordinate the daily pipeline from data extraction to model prediction.
  • Results to BigQuery: prognoses are written back as tables, immediately available for analysis.
  • Power BI dashboards: researchers access results via Power BI, reading directly from BigQuery.
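The core of the daily Cloud Run Job can be reduced to a read-score-write loop. In the sketch below the BigQuery I/O and the trained model are stubbed out; the real job loads a serialized scikit-learn model and uses the BigQuery client, but the control flow is the same:

```python
def run_prediction_job(load_parameters, model, write_results):
    """Daily prediction step: read validated parameters, score, write back.

    `load_parameters` yields (patient_id, feature_vector) pairs,
    `model` exposes scikit-learn's predict() interface, and
    `write_results` persists rows (in production: a BigQuery table).
    """
    rows = []
    for patient_id, features in load_parameters():
        prognosis = model.predict([features])[0]
        rows.append({"patient_id": patient_id, "prognosis": prognosis})
    write_results(rows)
    return len(rows)

# Stub wiring for illustration only.
class StubModel:
    def predict(self, X):
        return [sum(x) / len(x) for x in X]  # stand-in for scikit-learn

def load_parameters():
    yield "p1", [1.0, 3.0]
    yield "p2", [2.0, 4.0]

written = []
n = run_prediction_job(load_parameters, StubModel(), written.extend)
print(n)  # 2
```

Keeping the I/O behind plain callables like this also makes the job testable without a live BigQuery connection.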

The entire flow from EHR extraction to available prognosis runs daily and is fully automated, with monitoring and alerting at every step.

Phase 5: Monitoring and operations

A production platform in healthcare must be reliable. We implemented monitoring at multiple levels:

  • Pipeline monitoring: verifying that the extraction runs, data arrives, and volumes make sense.
  • Data quality checks: identifying values outside expected ranges or unexpected nulls.
  • Model monitoring: ensuring models produce output and that results do not deviate significantly from historical patterns.
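The first two classes of checks can be sketched as pure functions; the thresholds below are hypothetical, not the clinic's actual alerting configuration:

```python
def volume_check(row_count, history, tolerance=0.5):
    """Pass if today's extraction volume is within `tolerance`
    (as a fraction) of the historical average row count."""
    if not history:
        return True  # nothing to compare against yet
    avg = sum(history) / len(history)
    return abs(row_count - avg) <= tolerance * avg

def range_check(values, lo, hi):
    """Return the values outside the expected [lo, hi] range,
    ignoring legitimate missing values (None)."""
    return [v for v in values if v is not None and not (lo <= v <= hi)]

print(volume_check(480, [500, 510, 490]))       # True
print(volume_check(50, [500, 510, 490]))        # False -> alert
print(range_check([1, 120, None, 45], 0, 110))  # [120]
```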

When anomalies occur, an alert fires immediately.

Result

Patient data unlocked. An EHR system that had been a black box for years now delivers structured data daily to a central data platform. The clinic is no longer dependent on the EHR vendor for insight into its own data.

ML model in production. A scientific model that lived on a laptop now runs as an automated daily pipeline. Researchers see up-to-date treatment prognoses every morning.

51 parameters validated. The data quality validation revealed structural errors that had existed in the original script for years. Every parameter is now individually tested against the scientific definition.

Data-driven treatment decisions. Researchers use the platform to support treatment decisions. It serves not as a replacement for clinical expertise, but as a supplement: an additional data point in a complex decision process.

Reference case for rollout. The platform serves as a blueprint for similar rehabilitation clinics. The architecture is generic enough to work with other EHR systems and other ML models.

From connector to strategic partnership. What started as a technical request grew into a broad partnership. We manage the platform, advise on data strategy, and collaborate on new applications. The clinic considers us their data department.

What this demonstrates

Healthcare data is difficult. EHR systems are closed, data is sensitive, and the distance between a scientific model and a production platform is larger than expected.

The biggest pitfall is not in the technology; it is in the assumption that the data is correct. Without formal validation of every parameter against the source, verified by a domain expert, you are building a platform on quicksand.

The technology is secondary. BigQuery, dbt, Cloud Run, and scikit-learn are just tools. The difference lies in the discipline: validating data before running models and building a platform that is both functional and trustworthy.

Let's talk

Get in touch for an initial consultation.