Business Analyst Technical Assessment

PHDC Episodes & TB Classification

Analysis & Algorithm Design

A comprehensive approach to designing phenotype algorithms for episodes in the Provincial Health Data Centre (PHDC), with diabetes mellitus as a worked example, plus analysis of TB outcomes, treatment statuses, and TB Treatment Action Lists (TTAL).

Luqmaan Mohamed


Contents

Comprehensive analysis covering Question 1 (phenotype algorithms) and Question 2 (TB outcomes & patient classification)

1. PHDC & Episodes – Conceptual Overview
2. Phenotype Algorithm – 7-Step Design Process
3. Diabetes Mellitus – Worked Example Algorithm
4. TB Outcomes & Statuses – Definitions & Value
5. TB Treatment Action List – Operational Impact
6. Patient Classification – Cases A–D Analysis

Assessment Coverage

Question 1: Phenotype Algorithms (60%)
  • PHDC context & episodes framework
  • 7-step algorithm design process
  • Diabetes mellitus evidence table & scoring
  • Validation approach & key insights
  • Stakeholder consultation approach
  • Limitations & mitigation strategies
Question 2: TB Analysis (40%)
  • TB outcome categories (WHO standards)
  • TB treatment statuses (real-time)
  • Value proposition for WCG DoH&W
  • TB Treatment Action List (TTAL) purpose
  • Visual flow diagram of TB cascade
  • Patient A–D classification with explicit reasoning

PHDC & Episodes – High-Level Overview

The Provincial Health Data Centre (PHDC) consolidates person-level clinical data from multiple systems (clinic, hospital, lab, pharmacy, registers) to infer health conditions as 'episodes' over time. Episodes are then enriched into cascades to support clinical care, surveillance, and analytics.

  • Episodes represent health conditions inferred from multiple evidence types.
  • Chronic conditions (e.g. HIV, diabetes) usually have a single lifetime episode.
  • Acute conditions (e.g. TB, pneumonia) can have multiple episodes with defined start/end.
  • Evidence includes: lab tests, drug dispensings, admissions, diagnoses, procedures, and clinical activities.
  • Evidence is weighted by confidence to generate an overall score per patient per condition.

Multiple Data Sources (Lab, Pharmacy, Hospital, Clinic) → PHDC Consolidation (Harmonise & Link Data) → Episode Inference (Multiple Evidence → Confidence Score) → Cascades & Outcomes (Enriched Clinical Insights) → Clinical & Analytics Outputs (Reports, Alerts, Tools)

Designing a Phenotype Algorithm for Episodes

A condition-agnostic 7-step process that can be applied to any health condition: HIV, diabetes, TB, and beyond.

Step 1: Define the Condition & Episode
  • Clarify clinical definition
  • Decide if chronic vs acute
  • Define episode start/end

Step 2: Map Available Data Sources
  • Identify relevant systems
  • Document coverage & frequency
  • Assess data quality

Step 3: Co-Design Evidence with Stakeholders
  • Collaborate with clinicians & programme managers
  • Internal: data scientists, engineers
  • List candidate evidence items

Step 4: Define Confidence Levels & Scoring
  • Categorise evidence by confidence
  • Assign numerical scores
  • Define high-certainty thresholds

Step 5: Define Episode Logic
  • How to start an episode
  • How evidence maintains an episode
  • When/how an episode ends

Step 6: Prototype & Validate
  • Implement in a test environment
  • Generate line-lists & statistics
  • Chart reviews with clinicians

Step 7: Governance & Iteration
  • Document algorithm & assumptions
  • Version control changes
  • Periodic clinical review

Data Sources Feeding the Episode Algorithm

Clinic & Hospital Systems

Clinicom, PHCIS, PREHMIS

  • Patient registrations & identifiers
  • Encounters (visits, admissions, discharges)
  • Diagnosis & procedure codes (ICD-10)

Laboratory Data

NHLS

  • Diagnostic & monitoring tests
  • Highly structured results
  • Date-stamped & standardised

Pharmacy Data

JAC, CDU

  • Medicines dispensed with ATC codes
  • Refill patterns & dates
  • Indicates ongoing chronic management

Disease Registers

Specialised Systems

  • Electronic TB/HIV registers
  • Chronic disease club lists
  • Programme-specific tracking

Mortality & Outcomes

Vital Registration

  • Dates of death
  • Cause-of-death information
  • Episode closure indicator

Community & Other

CHW Systems, mHealth

  • Community health worker activities
  • mHealth programmes
  • Outreach & engagement data

Evidence Confidence & Scoring

High Confidence

Strongly implies the condition by itself

Examples:

  • Repeated diabetes drugs (ATC A10) over time
  • 2+ HbA1c results ≥ 48 mmol/mol on different dates

Usage: Can create or maintain episode on its own

Weak–Moderate

Some indication, but may be noisy

Examples:

  • Single raised HbA1c above diagnostic threshold
  • Single ICD-10 diabetes diagnosis

Usage: Needs combination with other evidence

Supporting

Non-specific but increases confidence

Examples:

  • ACE inhibitors + statins with other DM evidence
  • Repeated capillary glucose during admission

Usage: Boosts overall score; not used alone

Negating

Suggests prior inference may be incorrect

Examples:

  • Explicit "no diabetes" note with normal tests
  • Evidence of remission after bariatric surgery

Usage: Subtracts from score; may close episode

Example: Diabetes Mellitus Phenotype Algorithm

Episode Definition

Condition: Diabetes mellitus (Type 1 & Type 2)

Episode Type: Chronic (one lifetime episode per person)

Episode Start: First date when evidence score ≥90 OR first high-confidence evidence

Episode End: Date of death OR explicit remission evidence

Evidence Items & Scoring

Pharmacy Evidence (JAC, CDU)

Evidence Description | Confidence | Score | Notes
≥2 dispensings of diabetes drug (ATC A10*) on separate dates within 12 months | High | 70 | Strong indicator of chronic diabetes; implies diagnosed and engaged in care
≥1 insulin dispensing (A10A*) with repeat within 12 months | High | 80 | Very strong evidence; insulin is highly specific to diabetes
Single dispensing of diabetes drug (A10*) | Weak–Moderate | 35 | Could be a trial or misclassified
Concomitant statin + ACEI/ARB with other DM evidence | Supporting | 10 | Suggests cardiovascular risk management in a diabetic patient

Laboratory Evidence (NHLS)

Evidence Description | Confidence | Score | Notes
≥2 HbA1c results ≥48 mmol/mol (6.5%) on separate dates within 12 months | High | 70 | WHO/ADA diagnostic threshold repeated; meets clinical criteria
≥2 fasting plasma glucose ≥7.0 mmol/L on different days within 6 months | High | 70 | WHO diagnostic criterion for diabetes; repeated confirmation
Single HbA1c ≥48 mmol/mol or random glucose ≥11.1 mmol/L | Weak–Moderate | 35 | Single elevated result; could be screening, stress, or illness
Borderline HbA1c (42–47 mmol/mol) with strong pharmacy evidence | Supporting | 10 | Pre-diabetes range or controlled diabetes; supports existing evidence

Encounters & Diagnoses (Clinicom, PHCIS, PREHMIS)

Evidence Description | Confidence | Score | Notes
≥2 encounters with ICD-10 E10–E14 as primary diagnosis | Weak–Moderate | 35 | Diagnosis coding often incomplete in WC public sector (per PHDC paper)
≥3 clinic visits coded as "diabetes clinic" or chronic club | Weak–Moderate | 35 | Indicates clinical engagement with the DM care pathway
Single ICD-10 E10–E14 code | Supporting | 10 | Weak evidence alone due to incomplete coding practices

Negating Evidence

Evidence Description | Confidence | Score | Notes
≥2 normal HbA1c (<42 mmol/mol) after previous DM inference, with no DM drugs for >12 months | Negating | −40 | May indicate misclassification or rare remission (e.g. post-bariatric surgery)
Explicit "diabetes excluded" or "no diabetes" note in discharge summary with corroborating normal labs | Negating | −50 | Clinical documentation contradicts the DM inference

Scoring & Thresholds

  • High-certainty episode: Score ≥90 in 12-month window OR single high-confidence evidence
  • Possible/low-certainty: Score 35–89 (used for analytics, not clinical tools)
  • No episode: Score < 35 OR strong negating evidence
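The tables and thresholds above can be condensed into a minimal scoring sketch. This is illustrative only, not PHDC's implementation: the `Evidence` type, the −40 "strong negation" cut-off, and the tier labels are assumptions layered on the scores listed above.

```python
from dataclasses import dataclass

@dataclass
class Evidence:
    name: str
    confidence: str  # "High", "Weak–Moderate", "Supporting", or "Negating"
    score: int       # negative for negating evidence

def classify_dm(evidence: list[Evidence]) -> str:
    """Tier a patient's diabetes certainty from evidence in a 12-month window."""
    total = sum(e.score for e in evidence)
    has_high = any(e.confidence == "High" for e in evidence)
    # Assumed cut-off: any single item at -40 or below counts as strong negation.
    if any(e.score <= -40 for e in evidence):
        return "no episode"
    if total >= 90 or has_high:
        return "high-certainty"
    if 35 <= total <= 89:
        return "possible/low-certainty"
    return "no episode"

# Repeated HbA1c (70) plus a single ICD-10 code (10) sums to 80, but the
# repeated HbA1c is high-confidence on its own, so the tier is high-certainty.
patient = [
    Evidence(">=2 HbA1c >=48 mmol/mol within 12 months", "High", 70),
    Evidence("Single ICD-10 E10-E14 code", "Supporting", 10),
]
print(classify_dm(patient))  # high-certainty
```

A single weak–moderate item (score 35) lands in the possible/low-certainty tier, matching the thresholds above.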

Validation & Iteration Approach

The algorithm must be validated to ensure high-certainty episodes are clinically accurate and usable for both clinical tools and analytics.

1. Chart Review (Gold Standard Approximation)
Sample 100 patients flagged as "high-certainty DM" and 50 as "low-certainty" across 3–5 facilities (urban/rural mix).
Metric: Positive Predictive Value (PPV)
Target: >95% for high-certainty episodes; 70–85% for low-certainty

2. Register Comparison
Compare algorithm output to existing chronic disease registers, acknowledging that neither is a perfect gold standard.
Metric: Sensitivity & coverage
Target: Identify gaps: patients in the algorithm but not registers (possible under-registration), and vice versa

3. Clinical Feedback Loop
Pilot algorithm outputs with clinicians at 3–5 facilities; gather feedback on false positives/negatives and actionability.
Metric: Qualitative feedback + face validity
Target: Clinicians confirm >90% of high-certainty cases are true diabetics they recognise

4. Epidemiological Plausibility
Compare prevalence estimates to national surveys (SADHS); assess age/sex distributions against expected patterns.
Metric: Population-level concordance
Target: Prevalence within 10–15% of survey estimates; age distribution matches known epidemiology

Why Validation Matters
  • Clinical trust: Clinicians must trust algorithm outputs to use them in patient care decisions
  • Iterative improvement: Feedback reveals edge cases and data quality issues for refinement
  • Research validity: Documented PPV/sensitivity enables proper interpretation of epidemiological analyses
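The two headline validation metrics are simple proportions over chart-review counts. A quick sketch with illustrative numbers (not real results):

```python
def ppv(true_pos: int, false_pos: int) -> float:
    """Of the patients the algorithm flags, the fraction truly diabetic."""
    return true_pos / (true_pos + false_pos)

def sensitivity(true_pos: int, false_neg: int) -> float:
    """Of truly diabetic patients (per chart review), the fraction flagged."""
    return true_pos / (true_pos + false_neg)

# Illustrative: 96 of 100 sampled high-certainty patients confirmed on review.
print(ppv(96, 4))           # 0.96 -> meets the >95% high-certainty target
print(sensitivity(80, 20))  # 0.8
```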

Key Insights from the Diabetes Algorithm Exercise

Practical learnings from designing a phenotype algorithm for the PHDC context

1. Multi-Source Triangulation Compensates for Data Gaps
PHDC's approach of combining pharmacy, lab, and encounter data overcomes known limitations in diagnosis coding quality (documented in the PHDC paper). No single source is perfect, but multiple weak signals converge to high-confidence inferences.

2. Pharmacy Data = Proxy for Clinical Engagement
Repeated medicine dispensing (especially insulin) scores highly because it reflects not just diagnosis but active patient engagement in care, a key indicator of chronic disease management in the SA public sector context.

3. Tiered Certainty Enables Dual Use
High-certainty episodes (score ≥90) are safe for clinical decision-support tools (e.g. alerts, patient lists). Low-certainty episodes (35–89) support epidemiological research where sensitivity matters more than specificity, enabling bias analysis.

4. Validation Builds Trust and Drives Iteration
Chart review and clinical feedback loops aren't just validation: they reveal edge cases (e.g. gestational diabetes misclassified as Type 2, diet-controlled patients with no pharmacy data) that improve the algorithm and earn clinician buy-in.

5. Context-Specific Weights Reflect Local Reality
Scoring must reflect WC public sector realities: incomplete coding, varying pharmacy digitisation across facilities, and local diagnostic pathways (e.g. random glucose used more than OGTT in primary care). Algorithm weights are not universal; they are calibrated to the data landscape.

6. Negating Evidence Prevents "Once Diabetic, Always Diabetic"
Including negating evidence (normal labs after previous inference, explicit clinical exclusion) allows the algorithm to self-correct and close false-positive episodes, which is critical for maintaining data quality and clinical credibility over time.

Questions for Stakeholders

Phenotype algorithm design is a consultative, collaborative process. Here are key questions I'd ask each stakeholder group to inform evidence selection, scoring, and validation.

Clinicians & Clinical Programme Managers

Q1: Do you trust random glucose ≥11.1 mmol/L as diagnostic for diabetes, or only fasting glucose?
Why ask: Random glucose is easier to collect in PHC settings (no fasting required), but may have lower specificity. Need to understand local diagnostic pathways.

Q2: What proportion of your diabetic patients are diet-controlled only, with no medication?
Why ask: These patients will not appear in pharmacy data. Helps quantify the sensitivity gap and whether we need alternative evidence sources.

Q3: How do you currently identify diabetic patients who are lost to follow-up?
Why ask: Understanding existing workflows helps ensure PHDC outputs integrate with (rather than duplicate) current practices.

Q4: What would make you trust an algorithm-generated patient list enough to use it clinically?
Why ask: Uncovers concerns about false positives/negatives and desired confidence thresholds for actionability.

Data Scientists & Data Engineers

Q1: How reliable and complete is ATC coding in the pharmacy systems (JAC, CDU)?
Why ask: Determines whether we can trust A10* codes or need manual validation of drug lists.

Q2: What's the lag between service delivery and data availability in the PHDC?
Why ask: Impacts whether we can use the algorithm for real-time clinical alerts or only retrospective reporting.

Q3: How should we handle patients with multiple folder numbers (PMI duplicates)?
Why ask: Need a technical approach: probabilistic linkage, manual review thresholds, or accepting some duplication.

Q4: What's the compute cost of scoring all 8M patients daily vs weekly batches?
Why ask: Balances timeliness vs infrastructure costs; informs refresh frequency decisions.

Public Health / Epidemiology Teams

Q1: How does our algorithm-derived diabetes prevalence compare to SADHS survey estimates?
Why ask: Validates population-level plausibility; large discrepancies suggest systematic issues.

Q2: Should we distinguish Type 1 vs Type 2 diabetes, or is conflation acceptable?
Why ask: Affects algorithm complexity. Type 1 is rare and hard to identify from routine data; may not be worth the effort unless critical.

Q3: How do we want to handle gestational diabetes: separate phenotype or exclusion criterion?
Why ask: GDM has different clinical significance; need a clear decision on whether to flag separately or exclude from the general DM algorithm.

Q4: What's the acceptable positive predictive value (PPV) for research use vs clinical use?
Why ask: Research may tolerate 80% PPV for sensitivity; clinical tools need 95%+. Sets different thresholds for high/low-certainty episodes.

Facilities & Operational Managers

Q1: Which facilities have the lowest pharmacy digitisation coverage?
Why ask: Identifies where the algorithm will under-count; can flag these facilities for targeted data quality improvement.

Q2: How often do diagnosis codes get entered retrospectively vs at point-of-care?
Why ask: Affects whether we can rely on ICD-10 codes; retrospective coding is often less complete.

Q3: Would your staff use an algorithm-generated "diabetes patient list" for recall campaigns?
Why ask: Tests operational feasibility and user buy-in; reveals workflow integration barriers.

Q4: What's the current process when a patient transfers between facilities?
Why ask: Understanding transfer documentation helps assess the risk of double-counting or loss-to-follow-up misclassification.

The Value of Asking Questions

Uncovers context: Stakeholders reveal nuances about data collection, clinical workflows, and local practices that desk research misses.
Builds buy-in: Involving stakeholders from the start creates ownership and increases adoption of algorithm outputs.
Identifies blind spots: Each group sees different aspects of the data ecosystem; cross-functional dialogue prevents errors.
Manages expectations: Early conversations about limitations and trade-offs prevent disappointment when algorithm is deployed.

Limitations & Challenges

Designing phenotype algorithms for the PHDC is not without challenges. Recognising limitations upfront enables proactive mitigation and realistic expectations.

1. Data Quality & Completeness
Specific Challenges:
  • Incomplete diagnosis coding: ICD-10 coding is often missing or inaccurate, limiting its utility as primary evidence
  • Pharmacy data gaps: Not all facilities have digitised dispensing; diet-controlled diabetics have no pharmacy footprint
  • Laboratory coverage: Some facilities lack consistent lab ordering; rural areas may have lower testing rates
Mitigation Strategy: Use multi-source triangulation; validate against facility-level data quality metrics; set lower confidence for facilities with known gaps

2. Patient Master Index (PMI) Linkage
Specific Challenges:
  • Duplicates: Same patient with multiple folder numbers due to registration errors
  • Name variations: Nicknames, maiden names, and spelling inconsistencies complicate matching
  • Missing identifiers: Incomplete ID numbers or contact details hinder linkage accuracy
Mitigation Strategy: Probabilistic linkage with fuzzy matching; manual review of high-volume duplicates; blacklist/whitelist for known errors (per PHDC paper)

3. Privacy & Governance
Specific Challenges:
  • Balancing clinical utility vs patient protection: Named data is needed for clinical tools, but strict consent is required for research
  • Risk of re-identification: Even anonymised data with facility/date granularity can potentially identify individuals
  • Consent fatigue: Patients may not understand or consent to secondary use of their data
Mitigation Strategy: Privacy-by-design architecture (separate patient/clinical DBs); tiered access controls; explicit patient information campaign with an opt-out option

4. Algorithm Maintenance & Drift
Specific Challenges:
  • Clinical practice evolution: Guideline changes (e.g. new HbA1c thresholds) require algorithm updates
  • Source system changes: New systems, retired codes, or data structure changes break pipelines
  • Performance decay: Algorithm accuracy may drift as the population or care patterns change over time
Mitigation Strategy: Version control for algorithms; automated data quality monitoring; annual validation with clinician review; feedback loops from users

5. Edge Cases & Misclassification
Specific Challenges:
  • Gestational diabetes: May be misclassified as Type 2 if pharmacy/lab evidence overlaps with the pregnancy window
  • Steroid-induced hyperglycaemia: Transient elevated glucose in hospitalised patients on corticosteroids
  • Type 1 vs Type 2: Difficult to distinguish in routine data; age is a weak proxy
Mitigation Strategy: Explicit rules for gestational DM (link to pregnancy episodes); exclude glucose results during steroid Rx; accept Type 1/2 conflation unless critical for analysis

6. Resource & Capacity Constraints
Specific Challenges:
  • Human capacity: Requires skilled data scientists/analysts not traditionally part of DoH staffing structures
  • Compute resources: Large-scale daily processing of 8M patients requires significant infrastructure
  • Stakeholder time: Clinicians and programme managers have limited availability for algorithm co-design
Mitigation Strategy: Invest in data science hiring/training; leverage cloud/scalable infrastructure; use iterative co-design (short sprints, not long workshops)
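The probabilistic-linkage mitigation for PMI duplicates (limitation 2) can be sketched with simple fuzzy string matching. The names and the 0.85 threshold are purely illustrative; real linkage would combine multiple identifiers (ID number, date of birth, folder numbers), not names alone.

```python
import difflib

def name_similarity(a: str, b: str) -> float:
    """Crude similarity ratio between two names, ignoring case."""
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Hypothetical registration records, one with a likely spelling variant.
pairs = [("Thandi Nkosi", "Tandi Nkosi"), ("Thandi Nkosi", "John Smith")]
for a, b in pairs:
    verdict = "possible duplicate" if name_similarity(a, b) > 0.85 else "distinct"
    print(f"{a} / {b}: {verdict}")
```

High-scoring pairs would be queued for manual review rather than merged automatically, per the mitigation above.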

Why Acknowledge Limitations?

Builds credibility: Transparent about what the algorithm can and cannot do; avoids over-promising to stakeholders.
Guides interpretation: Researchers understand potential biases when using PHDC data for epidemiological studies.
Prioritises investment: Knowing where data quality is weakest helps focus improvement efforts (e.g. coding training, pharmacy digitisation).
Ethical responsibility: Patients and clinicians deserve honesty about data-driven tools' accuracy and limitations.

TB Outcomes & Treatment Statuses

TB Outcome Categories

Retrospective classification of how a TB episode ended (final determination).

Cured: Bacteriologically confirmed TB at treatment start, with smear or culture negative in the last month of treatment and on at least one previous occasion

Treatment Completed: Patient completed treatment per protocol but lacks the bacteriological evidence required for "cured"

Treatment Failed: Smear or culture positive at month 5 or later, or regimen terminated due to lack of response, drug resistance, or adverse effects

Died: Patient died before or during TB treatment (any cause)

Lost to Follow-up: Did not start treatment OR treatment interrupted for ≥2 consecutive months

Not Evaluated: No treatment outcome assigned (includes transfers with unknown final outcome)

TB Treatment Statuses (Real-time)

Current status showing where each patient is in the treatment cascade today.

Never Started: Evidence of TB but no treatment start recorded

On TB Treatment: Treatment started with recent activity (e.g. visit or medicine pickup within 60 days)

LTFU Before Treatment: TB diagnosed but no treatment within 2 months of diagnosis

LTFU on Treatment: Treatment started but no activity for ≥2 months

Completed Treatment: Completed TB regimen as per guidelines

Died: Deceased after TB diagnosis (with or without treatment)

Transferred Out: Left the Western Cape; outcome unknown to the provincial service

TB Treatment Cascade: Patient Flow

TB Treatment Cascade Flow Diagram

Patient flow through TB diagnosis, treatment, and outcome stages (as of assessment date: 30 Aug 2025)

TB Evidence Detected (GeneXpert, Culture, CXR, Clinical Dx)
  → Decision: Treatment started?

    No → Treatment Not Started
      → LTFU Before Treatment (>60 days since evidence, no Rx start)
        Outcome: Lost to Follow-up | Status: LTFU Before Rx

    Yes → Treatment Started (TB drugs dispensed, PHC visits logged)
      → Decision: Recent activity?
        Activity ≤60 days → On Treatment (active engagement) | Status: On TB Rx
        Activity >60 days → LTFU on Treatment (no activity >2 months) | Outcome: LTFU | Status: LTFU from Rx
        Rx success recorded → Treatment Success (Cured/Completed) | Outcome: Cured/Completed | Status: Completed

Other Possible Outcomes (Override Above Logic)
  • Died (DOD present)
  • Treatment Failed (failure flag present)
  • Left Western Cape (transfer out)
  • Not Found (support failed)

Legend & Key Assumptions
  • Episode exists: at least one tb_evidence_date present
  • Treatment started: tb_treatment_date OR phc_treatment_date present
  • Last activity: most recent of treatment date, visit date, activity date
  • LTFU threshold: >60 days since last activity (before assessment date)

Why TB Outcomes & TTAL Matter for WCG DoH&W

TB Outcome Categories

  • Monitor programme performance (success rate, LTFU, death, failure)
  • Identify weakest points in the TB cascade
  • Support equity analysis by facility, district, age, HIV status
  • Enable research & evaluation of interventions

TB Treatment Action List (TTAL)

  • Daily/weekly updated actionable line list of patients needing follow-up
  • Ensures diagnosed patients are linked to and retained in treatment
  • Reduces LTFU through targeted tracing and outreach
  • Gives facility staff concrete workload: who to call, visit, or book
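A TTAL is essentially a filter over the status classification: keep only patients whose current status requires action. A minimal sketch with hypothetical patient IDs and an assumed set of actionable statuses:

```python
# Hypothetical classified patients; statuses follow the list defined earlier.
patients = [
    {"id": "P1", "status": "Completed Treatment"},
    {"id": "P2", "status": "LTFU Before Treatment"},
    {"id": "P3", "status": "LTFU on Treatment"},
    {"id": "P4", "status": "On TB Treatment"},
]

# Statuses that put a patient on the action list for tracing/follow-up.
ACTIONABLE = {"Never Started", "LTFU Before Treatment", "LTFU on Treatment"}

ttal = [p["id"] for p in patients if p["status"] in ACTIONABLE]
print(ttal)  # ['P2', 'P3']
```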

Patient-Level Classification – Cases A–D

Classification as of 30 August 2025 based on TB evidence, treatment dates, and activity timelines.

Patient A

Timeline

  • TB evidence: Multiple dates (May 2023, Jan 2024, Dec 2024, Aug 2025)
  • → Likely GeneXpert MTB detected + follow-up cultures
  • Treatment start: Multiple TB treatment dates across 2023-2025
  • → Standard 6-month regimen with documented dispensing
  • Treatment success: 1 Aug 2025 (rx_success_date)
  • → Bacteriological cure or completion documented
  • Date of death: 26 Aug 2025 (25 days after treatment success)

TB Outcome: Treatment Success (Cured/Completed)

Current Status: Died

Evidence Quality: Strong (multiple lab results + treatment + documented success)

Reasoning: Completed TB treatment successfully with documented cure/completion. Death occurred after TB episode was closed as successful. Per WHO definitions, TB outcome is "cured/completed" even though patient subsequently died (possibly from other causes or advanced HIV/comorbidities).

Patient B

Timeline

  • TB evidence: 21 Apr 2021
  • → Possible GeneXpert, CXR suggestive, or clinical diagnosis
  • No recorded TB treatment start (no tb_treatment_date or phc_treatment_date)
  • No subsequent TB-related activity in 4+ years
  • No death or transfer out recorded (as of 30 Aug 2025)

TB Outcome: Lost to Follow-up (Before Treatment)

Current Status: LTFU Before Treatment

Evidence Quality: Weak (single old evidence point, no corroboration)

Reasoning: TB evidence recorded >4 years ago, but no treatment initiation in available PHDC data. Well beyond the 60-day window for treatment start. Patient either: (1) never linked to care, (2) sought care outside WC public sector, or (3) data capture error. This is a high-priority case for TTAL-style tracing if patient is still contactable.

Patient C

Timeline

  • TB evidence: 23 Sep 2024
  • → Lab test or clinical diagnosis documented
  • PHC visit: 23 Sep 2024 (same day as evidence)
  • → Patient presented to PHC, likely counseled
  • No TB treatment start recorded (no tb_treatment_date or phc_treatment_date)
  • ~11 months elapsed with no treatment initiation (diagnosis to 30 Aug 2025)

TB Outcome: Lost to Follow-up (Before Treatment)

Current Status: LTFU Before Treatment

Evidence Quality: Moderate (recent evidence + visit, but no follow-through)

Reasoning: Patient had initial encounter with TB evidence and same-day PHC visit, suggesting diagnosis/counseling occurred. However, no treatment start documented in the subsequent 11 months. Possible reasons: patient declined treatment, was referred but never attended, or treatment started but not captured in PHDC. This represents a critical gap in the cascade and should trigger active patient tracing.

Patient D

Timeline

  • TB evidence: 12 Oct 2024
  • → Diagnostic test indicating TB (GeneXpert/culture/CXR)
  • PHC treatment start: 23 Jan 2025 (~3 months after evidence)
  • → Delay suggests patient initially LTFU, then re-engaged
  • Last recorded activity: 23 Mar 2025 (visit or dispensing)
  • ~160 days (5+ months) with no activity or treatment success (to 30 Aug 2025)
  • No rx_success_date, no death, no transfer flags

TB Outcome: Lost to Follow-up (On Treatment)

Current Status: LTFU from Treatment

Evidence Quality: Strong early, weak late (treatment started but not maintained)

Reasoning: Patient started TB treatment but last activity was >160 days ago with no documented treatment success. Standard TB regimen is 6 months; by Aug 2025, patient should have completed treatment (started Jan 2025). Absence of activity and no success flag indicates patient defaulted/interrupted treatment. High priority for contact tracing and re-engagement, especially to assess for drug resistance if treatment was incomplete.

Classification Assumptions

  • Episode exists if at least one TB evidence date is present
  • Treatment started if either TB treatment date or PHC treatment date is recorded
  • Last activity = latest date among treatment start, visit, or activity dates
  • LTFU from treatment if no activity for >60 days before assessment date
  • LTFU before treatment if evidence date >60 days ago with no treatment start
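The day-counts driving these classifications can be checked directly against the 60-day rule. Dates are taken from the patient timelines above; Patient A is excluded because the recorded death overrides the activity-gap logic.

```python
from datetime import date

ASSESSMENT = date(2025, 8, 30)

# Last relevant activity per patient, from the timelines above.
gaps = {
    "B (evidence, never treated)": (ASSESSMENT - date(2021, 4, 21)).days,
    "C (evidence + visit, never treated)": (ASSESSMENT - date(2024, 9, 23)).days,
    "D (last activity on treatment)": (ASSESSMENT - date(2025, 3, 23)).days,
}
for patient, days in gaps.items():
    # All three exceed the 60-day threshold, hence the LTFU classifications.
    print(f"{patient}: {days} days since last activity")
```

This reproduces the figures quoted in the reasoning above: roughly 11 months (341 days) for Patient C and ~160 days for Patient D.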

Sources & References

This assessment draws on clinical guidelines, PHDC documentation, and epidemiological literature to inform algorithm design and TB outcome analysis.

PHDC Architecture & Methods

Boulle A, Heekes A, Tiffin N, et al. Data Centre Profile: The Provincial Health Data Centre of the Western Cape Province, South Africa. International Journal of Population Data Science. 2019;4(2):06.

Relevance: Core reference for PHDC architecture, episode inference methodology, data sources, and privacy-by-design approach.

Diabetes Diagnostic Criteria

World Health Organization. Use of Glycated Haemoglobin (HbA1c) in the Diagnosis of Diabetes Mellitus. WHO/NMH/CHP/CPM/11.1. Geneva: WHO; 2011.

Relevance: HbA1c ≥48 mmol/mol (6.5%) diagnostic threshold used in laboratory evidence scoring.

American Diabetes Association. Classification and Diagnosis of Diabetes: Standards of Medical Care in Diabetes—2024. Diabetes Care. 2024;47(Supplement 1):S20-S42.

Relevance: Fasting plasma glucose ≥7.0 mmol/L and random glucose ≥11.1 mmol/L diagnostic criteria.

SEMDSA Type 2 Diabetes Guidelines Expert Committee. SEMDSA 2017 Guidelines for the Management of Type 2 Diabetes Mellitus. Journal of Endocrinology, Metabolism and Diabetes of South Africa. 2017;22(1):S1-S196.

Relevance: South African clinical context for diabetes management pathways and diagnostic approaches in public sector.

ATC Drug Classification

WHO Collaborating Centre for Drug Statistics Methodology. ATC/DDD Index 2024. Oslo, Norway.

Relevance: ATC code A10 (drugs used in diabetes) classification used for pharmacy evidence; A10A = insulins, A10B = oral hypoglycemics.

TB Outcomes & Treatment Standards

World Health Organization. Definitions and Reporting Framework for Tuberculosis – 2013 Revision (Updated December 2014). WHO/HTM/TB/2013.2. Geneva: WHO; 2014.

Relevance: Standard TB outcome categories: cured, treatment completed, treatment failed, died, lost to follow-up, not evaluated.

National Department of Health, South Africa. National Tuberculosis Management Guidelines 2014. Pretoria: NDoH; 2014.

Relevance: South African TB treatment protocols and outcome definitions adapted for local context.

Epidemiological Context (South Africa)

Statistics South Africa. South African Demographic and Health Survey 2016. Pretoria: Stats SA; 2017.

Relevance: Population-level diabetes prevalence estimates for validation and comparison (epidemiological plausibility check).

Pillay-van Wyk V, Msemburi W, Laubscher R, et al. Mortality trends and differentials in South Africa from 1997 to 2012: second National Burden of Disease Study. Lancet Global Health. 2016;4(9):e642-53.

Relevance: Context for Western Cape quadruple burden of disease and diabetes as part of rising NCD burden.

Methodology: Phenotype Algorithms & Inference

von Mering C, Jensen LJ, Snel B, et al. STRING: known and predicted protein-protein associations, integrated and transferred across organisms. Nucleic Acids Research. 2005;33:D433-7.

Relevance: Evidence scoring and roll-up methodology adapted from bioinformatics protein-protein association scoring (cited in PHDC paper).

Richesson RL, Hammond WE, Nahm M, et al. Electronic health records based phenotyping in next-generation clinical trials: a perspective from the NIH Health Care Systems Collaboratory. Journal of the American Medical Informatics Association. 2013;20(e2):e226-31.

Relevance: Conceptual framework for EHR-based phenotyping and validation approaches.

Note on Source Tracking

As per the assessment instructions, I've kept track of sources and reasoning throughout this analysis. The diabetes algorithm scoring reflects WHO/ADA diagnostic criteria, PHDC-documented data quality issues (e.g., incomplete coding), and local context (SA public sector pharmacy patterns). TB outcome categories follow WHO standard definitions as applied in South African national guidelines. All assumptions and rationale are explicitly documented in the evidence tables and patient classification logic.

Closing Summary

1. Structured phenotype algorithms allow PHDC to infer conditions and episodes from imperfect, multi-source data, transforming routine clinical and administrative records into actionable health insights.

2. Confidence-weighted evidence and clear episode rules make outputs useful for both clinical decision-support tools and population-level epidemiological analytics, enabling different levels of certainty.

3. Well-defined TB outcomes and TTAL line lists translate data into real-world actions that can materially improve patient care, programme performance, and population health outcomes.
