Using CatBoost.ai in Healthcare

Because SNOMED and ICD Codes must be treated as categories for gradient boosting

The eight main blood types are categories: A+, A-, B+, B- (rare), O+, O- (universal donor), AB+ and AB-. With label encoding, each category becomes an integer (e.g. A+ = 0, B- = 1, AB+ = 2). This is compact but introduces a false ordering: the model might interpret AB+ as “greater than” A+, which is meaningless, especially if I am trying to predict Accident & Emergency blood demand. This is where CatBoost is extremely useful.
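To make the false-ordering problem concrete, here is a small sketch in plain Python. The ordering of the list and the helper names are my own choices for illustration; the point is only that label encoding makes some blood types look "closer" than others, while an unordered encoding such as one-hot does not.

```python
# Toy illustration: label encoding versus one-hot encoding for blood types.
# The list order is arbitrary, which is exactly the problem with label encoding.
BLOOD_TYPES = ["A+", "A-", "B+", "B-", "O+", "O-", "AB+", "AB-"]

# Label encoding: each category gets an integer based on its position in the list.
label_code = {bt: i for i, bt in enumerate(BLOOD_TYPES)}


def one_hot(blood_type):
    """Return an 8-element indicator vector with a single 1."""
    return [1 if bt == blood_type else 0 for bt in BLOOD_TYPES]


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


# Under label encoding the "distance" between types depends on list order.
label_gap_near = abs(label_code["A+"] - label_code["A-"])
label_gap_far = abs(label_code["A+"] - label_code["AB-"])

# Under one-hot encoding every pair of distinct types is equally far apart.
onehot_gap_near = squared_distance(one_hot("A+"), one_hot("A-"))
onehot_gap_far = squared_distance(one_hot("A+"), one_hot("AB-"))

print(label_gap_near, label_gap_far)    # different gaps: a spurious ordering
print(onehot_gap_near, onehot_gap_far)  # identical gaps: no ordering implied
```

One-hot encoding removes the ordering but adds one column per category, which becomes unwieldy for code sets with thousands of values.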

A&E demand forecasting is about predicting how many patients will walk through the door, what they’ll present with, and what resources they’ll need, hours, days or weeks ahead. The operational value is significant because staffing, bed management and ambulance allocation all depend on anticipated demand, and most trusts still rely on historical averages and clinical intuition.

The input features are overwhelmingly categorical and temporal. Day of week is the single strongest predictor: Monday is consistently the busiest day in most emergency departments (the reasons are covered at the bottom of this article). Month matters because respiratory presentations surge in winter and minor injuries spike in summer. Bank holidays create distinctive patterns, and the day after a bank holiday is often busier than a normal Monday because people defer attendance. On top of these you layer trust-specific features (an urban teaching hospital behaves very differently from a rural one), local events (football matches, festivals, school holidays), weather category (ice increases falls, heatwaves increase cardiac and respiratory presentations), and flu and respiratory surveillance data.
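As a rough sketch of how the calendar features could be derived for a single date, the snippet below uses only the standard library. The `BANK_HOLIDAYS` set and the `calendar_features` helper are hypothetical names for illustration; a real pipeline would read the holiday calendar from a maintained table.

```python
from datetime import date, timedelta

# Hypothetical bank-holiday calendar for illustration only.
BANK_HOLIDAYS = {date(2024, 12, 25), date(2024, 12, 26), date(2025, 1, 1)}

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]
MONTH_NAMES = ["January", "February", "March", "April", "May", "June", "July",
               "August", "September", "October", "November", "December"]


def calendar_features(d):
    """Build the categorical calendar features for one attendance date."""
    return {
        # Names rather than numbers, so nothing downstream mistakes them for quantities
        "day_of_week": DAY_NAMES[d.weekday()],
        "month": MONTH_NAMES[d.month - 1],
        "is_bank_holiday": d in BANK_HOLIDAYS,
        "is_day_after_bank_holiday": (d - timedelta(days=1)) in BANK_HOLIDAYS,
    }


features = calendar_features(date(2024, 12, 27))
print(features)
```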

The target variable can be modelled at different granularities depending on what operational decisions you’re trying to support. Total daily attendance is the simplest. But breaking it down by acuity, the severity of a patient’s illness or condition and the corresponding intensity of care, nursing, or resources required, is more useful. Predicting majors, minors and resus separately lets you plan staffing mix rather than just headcount. Predicting by four-hour time blocks is even more valuable because it catches the afternoon surge pattern that most departments experience between 2pm and 6pm, which is when the GP-referred patients and the “waited all day to see if it got better” walk-ins arrive together.
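One way to build the four-hourly target is to bucket each arrival timestamp into a block and count by block and acuity. This is a minimal sketch with made-up arrivals; the `offset_hours` parameter is my own addition, included because blocks aligned to midnight would split the 2pm to 6pm surge across two buckets.

```python
from collections import Counter
from datetime import datetime


def four_hour_block(ts, offset_hours=0):
    """Label the four-hour block containing ts.

    offset_hours shifts the block boundaries, e.g. 2 gives blocks
    starting at 02:00, 06:00, 10:00, 14:00, 18:00 and 22:00.
    """
    start = (((ts.hour - offset_hours) // 4) * 4 + offset_hours) % 24
    end = (start + 4) % 24
    return f"{start:02d}:00-{end:02d}:00"


# Made-up arrivals: (arrival time, acuity band)
arrivals = [
    (datetime(2025, 2, 4, 14, 5), "majors"),
    (datetime(2025, 2, 4, 17, 40), "majors"),
    (datetime(2025, 2, 4, 15, 10), "minors"),
    (datetime(2025, 2, 4, 9, 30), "minors"),
]

# Count attendances per (block, acuity) so each pair becomes one target row
counts = Counter(
    (four_hour_block(ts, offset_hours=2), acuity) for ts, acuity in arrivals
)
for key, n in sorted(counts.items()):
    print(key, n)
```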

The reason CatBoost works well here is that the relationships are interactive and nonlinear. A wet Tuesday in February at a city centre teaching hospital has a very different profile from a wet Tuesday in February at a coastal DGH. The model needs to learn that “icy conditions + over-65 population + weekend” means a spike in hip fractures that will consume orthopaedic capacity, while “heatwave + urban + weekday” means dehydration and cardiac presentations. Manually engineering those cross-features is laborious and brittle. Gradient boosting discovers them automatically.

What makes this practically deployable in the NHS is that the data already exists. Every A&E attendance is recorded in the Emergency Care Data Set (ECDS) with timestamps, acuity, diagnosis, disposal and wait times. Most trusts have this flowing into their data warehouse daily. You don’t need to build a new data pipeline as you’re scoring against data that’s already being collected for mandatory reporting.

A typical architecture would train on two to three years of historical ECDS data, retrain weekly or monthly to capture seasonal drift, and output a rolling seven-day forecast at trust and acuity level. The forecast can be presented as a dashboard showing predicted versus actual attendance with confidence intervals, overlaid with staffing rotas so operational managers can see where gaps are emerging. For example, the model’s prediction for “next Tuesday’s majors” might show 145 patients with a 90% confidence interval of 130–160. Forecasts like this can then trigger the relevant escalation protocols.
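There are several ways to get an interval around a point forecast. The sketch below uses one simple option, which is my assumption rather than anything prescribed here: take the residuals (actual minus predicted) from a held-out period and read off their empirical quantiles. The residuals, the capacity figure and the function name are all invented for illustration. CatBoost's quantile loss functions would be another route.

```python
def empirical_interval(point_forecast, residuals, coverage=0.90):
    """Interval from held-out residuals (actual minus predicted).

    Uses nearest-rank quantiles of the sorted residuals; assumes future
    errors will look like past errors, which may not hold after drift.
    """
    ordered = sorted(residuals)
    n = len(ordered)
    alpha = (1 - coverage) / 2
    lower = ordered[round(alpha * (n - 1))]
    upper = ordered[round((1 - alpha) * (n - 1))]
    return point_forecast + lower, point_forecast + upper


# Invented residuals from a validation window: errors between -10 and +10
validation_residuals = list(range(-10, 11))

interval = empirical_interval(145, validation_residuals, coverage=0.90)

# Hypothetical escalation rule: flag if the upper bound exceeds planned capacity
planned_capacity = 150
needs_escalation = interval[1] > planned_capacity

print(interval, needs_escalation)
```

Comparing the upper bound against planned capacity, rather than the point forecast, is a deliberately cautious choice for a staffing decision.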

Pull Data from DuckDB

DuckDB acts as the analytical query engine sitting on top of your ECDS extract. The reason it fits this use case well is that A&E forecasting doesn’t need a running database server — you’re working with a few million rows of historical attendances that land as Parquet or CSV files from your trust’s data warehouse, and DuckDB reads those directly without any ingestion step. You point it at the file, query it with standard SQL, and get a pandas DataFrame back in seconds.

import duckdb

con = duckdb.connect("ecds.duckdb")
df = con.sql("""
    SELECT
        a.attendance_date,
        a.trust_code,  -- qualified because weather also has a trust_code column
        a.acuity_category,
        COUNT(*) AS attendance_count,
        DAYOFWEEK(a.attendance_date) AS day_of_week,
        MONTH(a.attendance_date) AS month,
        a.attendance_date IN (SELECT date FROM bank_holidays) AS is_bank_holiday,
        CAST(a.attendance_date - INTERVAL 1 DAY AS DATE)
            IN (SELECT date FROM bank_holidays) AS is_day_after_bank_holiday,
        w.weather_category,
        f.flu_rate,
        -- Rolling means use only earlier days, so the current day's count never leaks in
        AVG(COUNT(*)) OVER (
            PARTITION BY a.trust_code, a.acuity_category
            ORDER BY a.attendance_date
            ROWS BETWEEN 7 PRECEDING AND 1 PRECEDING
        ) AS rolling_7d_avg,
        AVG(COUNT(*)) OVER (
            PARTITION BY a.trust_code, a.acuity_category
            ORDER BY a.attendance_date
            ROWS BETWEEN 28 PRECEDING AND 1 PRECEDING
        ) AS rolling_28d_avg
    FROM ecds_attendances a
    LEFT JOIN weather w
        ON a.attendance_date = w.date AND a.trust_code = w.trust_code
    -- Assumes flu_surveillance is weekly and week_start is a Monday
    LEFT JOIN flu_surveillance f
        ON DATE_TRUNC('week', a.attendance_date) = f.week_start
    GROUP BY a.attendance_date, a.trust_code, a.acuity_category,
        w.weather_category, f.flu_rate
    ORDER BY a.attendance_date
""").df()

CatBoost for model

CatBoost is a Python library that takes a table of data — rows of examples, columns of features, and a target you want to predict — and builds a gradient boosted decision tree model. At its core, it reads your data, builds hundreds of small decision trees sequentially where each tree corrects the mistakes of the ones before it, and gives you back a trained model you can use to make predictions on new data.

The evaluation framework matters. You’d measure on mean absolute error and mean absolute percentage error at daily and four-hourly granularity, but you’d also track directional accuracy — did the model correctly predict whether tomorrow would be busier or quieter than average? Getting the direction right is often more operationally useful than precise counts, because it triggers different management responses.
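The three measures are simple enough to write out by hand. This is a sketch with invented numbers; in particular, what counts as "average" for directional accuracy is a choice you have to make, and here I assume a per-day baseline such as a trailing 28-day mean.

```python
def mean_absolute_error(actual, predicted):
    """Average size of the error, in patients."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mean_absolute_percentage_error(actual, predicted):
    """Average error as a percentage of the actual count (skips zero actuals)."""
    terms = [abs(a - p) / a for a, p in zip(actual, predicted) if a != 0]
    return 100 * sum(terms) / len(terms)


def directional_accuracy(actual, predicted, baseline):
    """Share of days where forecast and actual fall on the same side of the baseline."""
    hits = [
        (p > b) == (a > b)
        for a, p, b in zip(actual, predicted, baseline)
    ]
    return sum(hits) / len(hits)


# Invented daily attendances for four days
actual = [100, 120, 90, 150]
predicted = [110, 115, 95, 140]
baseline = [105, 105, 105, 105]  # assumed "average day" for each date

mae = mean_absolute_error(actual, predicted)
mape = mean_absolute_percentage_error(actual, predicted)
direction_correct = directional_accuracy(actual, predicted, baseline)

print(mae, round(mape, 2), direction_correct)
```

On the first day the forecast says "busier than average" while the actual was quieter, so directional accuracy is three out of four even though the absolute error on that day is only ten patients.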

Telling it which features are categorical is the crucial step that differentiates CatBoost from other libraries. When you pass cat_features, CatBoost doesn’t one-hot encode them or treat them as numbers. Instead it uses ordered target statistics — for each categorical value, it calculates a running average of the target variable using only the rows that came before the current one in a random permutation. This avoids target leakage (where the model peeks at the answer through the encoding) and handles high-cardinality categories like ICD-10 codes gracefully without exploding into thousands of columns.
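The idea of ordered target statistics can be sketched in a few lines of plain Python. This is a simplified illustration, not CatBoost's actual implementation, which differs in detail. The function name, the `prior_weight` smoothing parameter and the toy data are my own assumptions.

```python
import random


def ordered_target_statistics(categories, targets, prior, prior_weight=1.0,
                              order=None, seed=0):
    """Encode each row's category using only rows earlier in a permutation.

    encoded = (sum of earlier targets in the same category + prior_weight * prior)
              / (count of earlier rows in the same category + prior_weight)
    """
    if order is None:
        order = list(range(len(categories)))
        random.Random(seed).shuffle(order)  # random permutation of the rows

    sums, counts = {}, {}
    encoded = [0.0] * len(categories)
    for idx in order:
        cat = categories[idx]
        seen_sum = sums.get(cat, 0.0)
        seen_count = counts.get(cat, 0)
        # The row's own target is NOT used for its own encoding: no leakage
        encoded[idx] = (seen_sum + prior_weight * prior) / (seen_count + prior_weight)
        # Only now does this row's target become visible to later rows
        sums[cat] = seen_sum + targets[idx]
        counts[cat] = seen_count + 1
    return encoded


# Toy data: a categorical code per row and the target to predict
codes = ["A", "B", "A", "A", "B"]
attendances = [10, 20, 30, 50, 40]
prior = sum(attendances) / len(attendances)  # global mean as the prior

# Fixed order here so the result is easy to follow by hand
encoded = ordered_target_statistics(codes, attendances, prior, order=[0, 1, 2, 3, 4])
print(encoded)
```

The first row of each category has no history, so it falls back to the prior. Later rows get a running average that never includes their own target value.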

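If you don’t specify cat_features, CatBoost treats every column as numerical. String-valued columns such as trust codes or ICD-10 codes like “E11.9” cannot be converted to numbers and training fails with an error. Integer-coded columns such as day of week or month are worse, because they are silently accepted as ordered quantities, which brings back the same meaningless arithmetic as label encoding. So this step is non-negotiable for healthcare data.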

from catboost import CatBoostRegressor, Pool

# Define which columns are categorical: this is the critical step
cat_features = ["trust_code", "acuity_category", "day_of_week",
                "month", "weather_category", "is_bank_holiday",
                "is_day_after_bank_holiday"]
# CatBoost expects categorical values to be strings or integers, so cast them.
# This also turns missing weather categories from the LEFT JOIN into the string "nan".
df[cat_features] = df[cat_features].astype(str)

# Time-based split: train on everything before the cutoff, test on everything after
cutoff_date = "2025-01-01"
train = df[df["attendance_date"] < cutoff_date]
test = df[df["attendance_date"] >= cutoff_date]
X_train = train.drop(columns=["attendance_date", "attendance_count"])
y_train = train["attendance_count"]
X_test = test.drop(columns=["attendance_date", "attendance_count"])
y_test = test["attendance_count"]

# CatBoost Pool wraps data + categorical feature names together
train_pool = Pool(X_train, y_train, cat_features=cat_features)
test_pool = Pool(X_test, y_test, cat_features=cat_features)

model = CatBoostRegressor(
    iterations=1000,
    learning_rate=0.05,
    depth=6,
    eval_metric="MAE",
    verbose=100
)
# eval_set here is the test period; keep a separate validation window
# if you rely on it for early stopping or model selection
model.fit(train_pool, eval_set=test_pool)

MLflow logs the data and the model

MLflow sits between training and deployment and solves the problem that every NHS data science team eventually hits: which version of this model is actually running, what was it trained on, and why did we change it? For each CatBoost training run, you log the full context to MLflow: the hyperparameters, the training and test metrics, the cutoff date, the row counts, and the serialised model itself as an artifact. That means when a retrain produces a worse MAE than last month’s model, you can see exactly what changed and roll back by loading the earlier run’s model.

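import mlflow
import mlflow.catboost
import numpy as np

# Compute the evaluation metrics on the held-out test period
preds = model.predict(test_pool)
y_true = y_test.to_numpy()
mae = float(np.mean(np.abs(y_true - preds)))
# Grouped counts are always >= 1, so the division is safe
mape = float(np.mean(np.abs(y_true - preds) / y_true) * 100)
# Directional accuracy: is the forecast on the same side of the 28-day average as the actual?
baseline = X_test["rolling_28d_avg"].to_numpy()
direction_correct = float(np.mean((preds > baseline) == (y_true > baseline)))

mlflow.set_experiment("ae_demand_forecast")
with mlflow.start_run():
    mlflow.log_params({
        "iterations": 1000,
        "learning_rate": 0.05,
        "depth": 6,
        "cutoff_date": cutoff_date,
        "train_rows": len(X_train),
        "test_rows": len(X_test)
    })
    mlflow.log_metrics({
        "mae": mae,
        "mape": mape,
        "directional_accuracy": direction_correct
    })
    mlflow.catboost.log_model(model, "ae_forecast_model")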

Evidently AI monitoring

Evidently AI monitors for distribution drift in the input features. Changes in coding practice are a particular risk: when a trust adopts a new triage system or changes its streaming model, the historical patterns break and the model needs retraining.
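To show what drift in a categorical feature looks like numerically, here is a hand-rolled population stability index (PSI) in plain Python. This is not Evidently's implementation, which chooses its own statistical tests. The category names, the counts and the 0.2 alert threshold (a common rule of thumb) are assumptions for illustration.

```python
import math
from collections import Counter


def population_stability_index(reference, current, epsilon=1e-6):
    """PSI between two samples of a categorical feature.

    Sums (q - p) * ln(q / p) over categories, where p and q are the
    category shares in the reference and current samples. epsilon
    stands in for a zero share so unseen categories do not divide by zero.
    """
    ref_counts, cur_counts = Counter(reference), Counter(current)
    psi = 0.0
    for cat in set(reference) | set(current):
        p = max(ref_counts[cat] / len(reference), epsilon)
        q = max(cur_counts[cat] / len(current), epsilon)
        psi += (q - p) * math.log(q / p)
    return psi


# Invented acuity mix in the training period
reference = ["majors"] * 50 + ["minors"] * 40 + ["resus"] * 10

# Same mix: no drift
psi_same = population_stability_index(reference, list(reference))

# After a hypothetical streaming change, a new category appears
current = (["majors"] * 30 + ["minors"] * 40 + ["resus"] * 10
           + ["urgent_treatment"] * 20)
psi_shifted = population_stability_index(reference, current)

DRIFT_THRESHOLD = 0.2  # assumed alert level
print(psi_same, round(psi_shifted, 2), psi_shifted > DRIFT_THRESHOLD)
```

A category that never appeared in training dominates the score, which is the behaviour you want when a new triage code starts showing up in the live data.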

The part that gets clinically interesting is when you move from volume forecasting to acuity prediction. If you can predict not just that 150 patients will arrive but that 12 of them will likely need resus-level care based on the prevailing conditions, you can pre-position senior clinicians and ensure critical care beds are available. That’s where the model starts influencing patient outcomes rather than just operational efficiency, and where the clinical governance and safety guardrails around model deployment become essential.

Bringing it all together

The pipeline from end to end is: ECDS data lands in your warehouse, DuckDB queries it into a training-ready feature table, CatBoost trains a model that natively handles the categorical healthcare codes without manual encoding, MLflow versions and tracks every training run so you can roll back if a retrain degrades performance, and Evidently monitors the incoming feature data for drift that signals when the model’s assumptions have gone stale. Each component does one job and hands off cleanly to the next.

What makes this architecture practical rather than academic is that it respects how NHS trusts actually operate. The data already exists in ECDS. The compute requirements are modest: CatBoost trains on three years of daily data in minutes on a standard laptop, so you don’t need a GPU cluster or a cloud ML platform. The retraining cadence is weekly or monthly, not real-time, which means an analyst can trigger it manually or schedule it as a cron job without building an elaborate orchestration layer. The output is a seven-day rolling forecast at trust and acuity level, which maps directly onto the operational planning horizon that bed managers and clinical leads already work with.

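import duckdb
import mlflow
import mlflow.catboost
import numpy as np
from catboost import CatBoostRegressor, Pool
# These imports are Evidently's older Report API; check the version you have installed
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

# --- Full pipeline: query → train → log → monitor ---
# Reuses cat_features and cutoff_date from the training section above
# 1. Pull features from DuckDB (ae_features is assumed to hold the feature query's output)
con = duckdb.connect("ecds.duckdb")
df = con.sql("SELECT * FROM ae_features").df()
df[cat_features] = df[cat_features].astype(str)
train = df[df["attendance_date"] < cutoff_date]
test = df[df["attendance_date"] >= cutoff_date]
X_train = train.drop(columns=["attendance_date", "attendance_count"])
y_train = train["attendance_count"]
X_test = test.drop(columns=["attendance_date", "attendance_count"])
y_test = test["attendance_count"].to_numpy()
# 2. Train CatBoost
train_pool = Pool(X_train, y_train, cat_features=cat_features)
model = CatBoostRegressor(iterations=1000, learning_rate=0.05, depth=6)
model.fit(train_pool)
# 3. Log to MLflow, with metrics from the held-out period
preds = model.predict(Pool(X_test, cat_features=cat_features))
mae = float(np.mean(np.abs(y_test - preds)))
mape = float(np.mean(np.abs(y_test - preds) / y_test) * 100)
with mlflow.start_run():
    mlflow.catboost.log_model(model, "ae_forecast_model")
    mlflow.log_metrics({"mae": mae, "mape": mape})
# 4. Score next 7 days (only the columns the model was trained on)
next_week = con.sql("SELECT * FROM ae_features_next_7_days").df()
next_week[cat_features] = next_week[cat_features].astype(str)
next_week["predicted_attendance"] = model.predict(
    Pool(next_week[X_train.columns], cat_features=cat_features)
)
# 5. Monitor for drift
drift_report = Report(metrics=[DataDriftPreset()])
drift_report.run(reference_data=X_train, current_data=next_week[X_train.columns])
drift_report.save_html("drift_report.html")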

AI Governance

The harder part isn’t the technology, it’s the governance. Before a model like this influences staffing decisions, it needs clinical sign-off, a clear escalation policy for when predictions diverge significantly from actuals, and a fallback protocol for when the model is unavailable or known to be degraded. You’d want a model card documenting what it was trained on, what it can and can’t predict, its known failure modes (it will underperform during unprecedented events like pandemic surges or industrial action), and who is accountable for its outputs. The Evidently drift monitoring feeds directly into this governance framework — when drift alerts fire, the model’s confidence should be downweighted and operational managers should be told to rely more heavily on clinical judgement until retraining restores accuracy.
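The fallback protocol can be written down as an explicit rule so that it is reviewable at sign-off. The sketch below is one hypothetical way to express it. The function name, the labels and the choice of a historical average as the fallback are my assumptions, not a recommended policy.

```python
def choose_forecast(model_forecast, historical_average, model_available, drift_alert):
    """Pick which number to show operational managers, and label its source.

    Hypothetical rule:
    - model unavailable  -> fall back to the historical average
    - drift alert raised -> still show the model, but flag it for caution
    - otherwise          -> show the model forecast
    """
    if not model_available or model_forecast is None:
        return historical_average, "historical average"
    if drift_alert:
        return model_forecast, "model, drift alert raised"
    return model_forecast, "model"


normal = choose_forecast(145, 132, model_available=True, drift_alert=False)
degraded = choose_forecast(145, 132, model_available=True, drift_alert=True)
offline = choose_forecast(None, 132, model_available=False, drift_alert=False)

print(normal, degraded, offline)
```

Returning the source label alongside the number means the dashboard can always show where a figure came from, which supports the accountability point above.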

The trajectory from here is towards richer prediction targets. Volume and acuity are the starting point, but the same architecture can be extended to predict average length of stay in the department, four-hour breach probability, ambulance handover delays, and even which specialties will face the highest referral load from A&E on a given day. Each extension adds operational value and each one is built on the same categorical-feature-heavy data that CatBoost handles natively. The foundation you build for demand forecasting becomes the platform for a broader suite of operational intelligence tools that help emergency departments move from reactive firefighting to proactive planning.

Why Monday is the peak day in most emergency departments, and why acuity matters more than total numbers

Several factors converge to make Monday the peak day in most emergency departments. The biggest driver is deferred demand from the weekend. GP surgeries are closed Saturday and Sunday, so patients who develop symptoms over the weekend have limited options — they either attend A&E at the weekend (some do, which is why Saturday is often the second busiest day) or they wait to see if things improve and then present on Monday when they haven’t. But crucially, many patients who would normally have gone to their GP on a Friday afternoon or over the weekend now have no primary care route until Monday morning. GP practices are then fully booked on Monday, can’t absorb the backlog, and redirect patients to A&E — either explicitly through triage or implicitly because patients can’t get an appointment and default to the emergency department.

The 111 service amplifies this. NHS 111 call volumes spike over the weekend, and a significant proportion of 111 dispositions result in “attend A&E” or “see a GP urgently within 6 hours.” The urgent GP referrals generated on Sunday evening land on Monday morning when practices are already overloaded, and many convert to A&E attendances.

Care homes contribute meaningfully. Weekend staffing in residential and nursing homes is thinner, with fewer senior carers and less access to GP support. Residents who deteriorate over the weekend are often managed conservatively until Monday when the regular staff return, notice the decline, and call an ambulance. Falls that happened on Saturday might not get assessed until Monday when the care home’s visiting GP does their rounds.

Mental health presentations follow a similar pattern. Community mental health teams operate reduced weekend services, crisis lines are stretched, and patients who’ve been struggling through the weekend present on Monday — either self-referred to A&E or brought in by concerned family who’ve been with them over the weekend and have seen the severity.

There’s a social and behavioural layer too. Alcohol-related presentations from Friday and Saturday nights generate some direct weekend attendances, but the secondary effects — injuries that stiffen up overnight, withdrawal symptoms in dependent drinkers, and domestic incidents that escalate over a weekend spent together — manifest on Monday. Workplace injuries from Monday being the first day back also contribute, though this is a smaller factor.

The pattern isn’t uniform across acuity. Minors (cuts, sprains, minor illness) show the strongest Monday spike because these are the cases most amenable to GP management that got deferred. Majors show a more even distribution across the week because genuinely sick patients tend to come in regardless of the day. Resus is relatively day-independent because cardiac arrests and major trauma don’t wait for Monday.

This is precisely why the forecasting model needs to predict by acuity band rather than just total volume. A Monday surge of 30 extra minors patients needs a different response (more nurse practitioners, a rapid assessment stream) than a surge of 30 extra majors (more senior doctors, more bed capacity). Getting the composition right is more valuable than getting the total right.
