Drug Exposure Diagnostics
The omopy.drug_diagnostics module provides comprehensive diagnostic
checks on drug exposure records in an OMOP CDM database. It is the
Python equivalent of the R DrugExposureDiagnostics package.
Overview
The module has four layers:
- Execute: run configurable diagnostic checks on drug_exposure records
- Summarise: convert check results into the standard SummarisedResult format
- Table: format results as publication-ready tables (via omopy.vis)
- Plot: visualise checks as bar charts and box plots
Available Checks
The execute_checks() function supports 12 configurable checks:
| Check | Description |
|---|---|
| "missing" | Missing value counts for 15 drug_exposure columns |
| "exposure_duration" | Quantile distribution of exposure duration (end - start + 1 days) |
| "type" | Frequency of drug_type_concept_id values |
| "route" | Frequency of route_concept_id values |
| "source_concept" | Source concept mapping analysis |
| "days_supply" | Quantile distribution + comparison with date diff |
| "verbatim_end_date" | Comparison of verbatim_end_date vs drug_exposure_end_date |
| "dose" | Daily dose coverage (requires drug_strength data) |
| "sig" | Frequency of sig (verbatim instruction) values |
| "quantity" | Quantile distribution of quantity field |
| "days_between" | Time between consecutive records per patient |
| "diagnostics_summary" | Aggregated summary across all other checks |
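To make the "days_between" row concrete, the sketch below computes the gaps between consecutive records for each patient using only the standard library. It illustrates the idea and is not omopy's implementation: the function name `days_between_records`, the `(person_id, start_date)` tuple layout, and the choice to measure start date to start date are all assumptions.

```python
from collections import defaultdict
from datetime import date


def days_between_records(records):
    """Return {person_id: [gap_in_days, ...]} between consecutive start dates.

    `records` is an iterable of (person_id, start_date) tuples
    (a hypothetical layout chosen for this sketch).
    """
    by_person = defaultdict(list)
    for person_id, start in records:
        by_person[person_id].append(start)

    gaps = {}
    for person_id, starts in by_person.items():
        starts.sort()
        # Pair each start date with the next one for the same person.
        gaps[person_id] = [(b - a).days for a, b in zip(starts, starts[1:])]
    return gaps


records = [
    (1, date(2020, 1, 1)),
    (1, date(2020, 1, 31)),
    (1, date(2020, 3, 1)),
    (2, date(2021, 6, 15)),
]
gaps = days_between_records(records)
print(gaps)
```

A patient with a single record contributes an empty list, since there is no consecutive pair to compare.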
Step 1: Connect to CDM
import ibis
from omopy.connector import cdm_from_con
con = ibis.duckdb.connect("my_database.duckdb", read_only=True)
cdm = cdm_from_con(con, cdm_schema="cdm")
Step 2: Run Diagnostics
Specify one or more ingredient concept IDs and which checks to run:
from omopy.drug_diagnostics import execute_checks
# Run all checks for two ingredients
result = execute_checks(
    cdm,
    ingredient_concept_ids=[1125315, 1503297],
    sample_size=10_000,  # Sample per ingredient (None = all records)
    min_cell_count=5,    # Privacy protection threshold
)
# Run specific checks only
result = execute_checks(
    cdm,
    ingredient_concept_ids=[1125315],
    checks=["missing", "exposure_duration", "type"],
)
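Check names are plain strings, so a typo is easy to make. The sketch below validates a requested list against the twelve names from the Available Checks table before any database work starts. `KNOWN_CHECKS` and `validate_checks` are hypothetical helpers written for this guide, not part of omopy; whether `execute_checks()` performs its own validation is not stated here.

```python
# The twelve check names listed in the Available Checks table.
KNOWN_CHECKS = (
    "missing", "exposure_duration", "type", "route", "source_concept",
    "days_supply", "verbatim_end_date", "dose", "sig", "quantity",
    "days_between", "diagnostics_summary",
)


def validate_checks(checks):
    """Return `checks` de-duplicated in order, or raise on unknown names."""
    unknown = [c for c in checks if c not in KNOWN_CHECKS]
    if unknown:
        raise ValueError(f"Unknown checks: {unknown}")
    # dict.fromkeys keeps the first occurrence of each name, in order.
    return list(dict.fromkeys(checks))


requested = validate_checks(["missing", "exposure_duration", "type"])
print(requested)
```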
Step 3: Explore Results
The result is a DiagnosticsResult object containing a dict of Polars
DataFrames — one per check:
# Dict-like access
result["missing"] # -> pl.DataFrame
result["exposure_duration"] # -> pl.DataFrame
result["type"] # -> pl.DataFrame
# Metadata
result.checks_performed # -> ('missing', 'exposure_duration', ...)
result.ingredient_concepts # -> {1125315: 'Acetaminophen', ...}
result.execution_time_seconds # -> 2.345
# Iterate
for check_name, df in result.items():
    print(f"{check_name}: {df.height} rows")
Understanding Missing Values
import polars as pl
missing = result["missing"]
# Columns: ingredient_concept_id, ingredient, variable,
# n_records, n_sample, n_missing, n_not_missing,
# proportion_missing
print(missing.filter(pl.col("proportion_missing") > 0.5))
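The relationship between the count columns is simple arithmetic, and the standard-library sketch below spells it out on a handful of hand-written rows. It is an illustration under assumptions, not the module's code: `summarise_missing` is a hypothetical helper, and it treats only `None` as missing.

```python
def summarise_missing(rows, variables):
    """Count missing (None) values per variable in a list of dict rows."""
    n = len(rows)
    out = []
    for var in variables:
        n_missing = sum(1 for row in rows if row.get(var) is None)
        out.append({
            "variable": var,
            "n_sample": n,
            "n_missing": n_missing,
            "n_not_missing": n - n_missing,
            # Guard against an empty sample to avoid dividing by zero.
            "proportion_missing": n_missing / n if n else None,
        })
    return out


rows = [
    {"quantity": 30, "sig": None, "days_supply": 30},
    {"quantity": None, "sig": None, "days_supply": 30},
    {"quantity": 60, "sig": "1 tab daily", "days_supply": None},
    {"quantity": 30, "sig": None, "days_supply": 30},
]
summary_rows = summarise_missing(rows, ["quantity", "sig", "days_supply"])

# Same filter as the Polars expression above: keep variables that are
# missing in more than half of the sampled records.
mostly_missing = [r["variable"] for r in summary_rows
                  if r["proportion_missing"] > 0.5]
print(mostly_missing)
```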
Understanding Duration Distribution
duration = result["exposure_duration"]
# Columns include: duration_q05 through duration_q95,
# duration_mean, duration_sd, duration_min, duration_max,
# n_negative_duration, proportion_negative_duration
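As a rough illustration of where these columns come from, the sketch below applies the `end - start + 1` definition from the Available Checks table to four hand-written exposures and derives a few of the statistics. The helper name `exposure_durations` is hypothetical, counting a duration as negative only when it is below zero is an assumption, and `statistics.quantiles` may interpolate differently from whatever quantile method the module uses.

```python
from datetime import date
from statistics import mean, quantiles


def exposure_durations(exposures):
    """Duration in days, counting both the start and the end day."""
    return [(end - start).days + 1 for start, end in exposures]


exposures = [
    (date(2020, 1, 1), date(2020, 1, 30)),   # 30 days
    (date(2020, 2, 1), date(2020, 2, 1)),    # 1 day (same-day exposure)
    (date(2020, 3, 10), date(2020, 3, 1)),   # end before start: negative
    (date(2020, 4, 1), date(2020, 4, 14)),   # 14 days
]
durations = exposure_durations(exposures)

n_negative = sum(1 for d in durations if d < 0)
proportion_negative = n_negative / len(durations)

# Quartile cut points; the middle one is the median.
q25, q50, q75 = quantiles(durations, n=4)

print(durations, mean(durations), q50, n_negative, proportion_negative)
```

A non-zero `n_negative_duration` usually points at records whose end date precedes the start date, which is worth investigating before interpreting the quantiles.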
Step 4: Summarise to SummarisedResult
Convert the results to the standard 13-column SummarisedResult format for interoperability with the table and plot functions and with other OMOPy modules:
from omopy.drug_diagnostics import summarise_drug_diagnostics
summary = summarise_drug_diagnostics(result)
# -> SummarisedResult with result_type per check
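To give a feel for the long format, the sketch below reshapes one wide row from the missing-values check into one row per estimate. The thirteen column names follow the `summarised_result` convention from the R omopgenerics package; that omopy uses exactly these names, and the particular group, strata, and additional values chosen here, are assumptions. `to_long_rows` is a hypothetical helper.

```python
# Column names as used by omopgenerics' summarised_result (assumed to
# carry over to omopy's SummarisedResult).
SUMMARISED_RESULT_COLUMNS = (
    "result_id", "cdm_name", "group_name", "group_level",
    "strata_name", "strata_level", "variable_name", "variable_level",
    "estimate_name", "estimate_type", "estimate_value",
    "additional_name", "additional_level",
)


def to_long_rows(wide_row, estimates, result_id=1, cdm_name="mock"):
    """Emit one long-format row per estimate in `estimates`.

    `estimates` maps an estimate column name to its type label.
    """
    rows = []
    for name, est_type in estimates.items():
        rows.append({
            "result_id": result_id,
            "cdm_name": cdm_name,
            "group_name": "ingredient_name",
            "group_level": wide_row["ingredient"],
            "strata_name": "overall",
            "strata_level": "overall",
            "variable_name": wide_row["variable"],
            "variable_level": None,
            "estimate_name": name,
            "estimate_type": est_type,
            # Values are stored as strings; the type column says how to parse them.
            "estimate_value": str(wide_row[name]),
            "additional_name": "overall",
            "additional_level": "overall",
        })
    return rows


wide = {"ingredient": "Acetaminophen", "variable": "quantity",
        "n_missing": 12, "proportion_missing": 0.12}
long_rows = to_long_rows(
    wide, {"n_missing": "integer", "proportion_missing": "numeric"}
)
for row in long_rows:
    print(row["variable_name"], row["estimate_name"], row["estimate_value"])
```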
Step 5: Visualise
Tables
from omopy.drug_diagnostics import table_drug_diagnostics
# All checks as one table
table = table_drug_diagnostics(summary, type="polars")
# Single check
table = table_drug_diagnostics(summary, check="missing", type="gt")
Plots
from omopy.drug_diagnostics import plot_drug_diagnostics
# Missing values bar chart
fig = plot_drug_diagnostics(summary, check="missing")
fig.show()
# Exposure duration box plot
fig = plot_drug_diagnostics(summary, check="exposure_duration")
# Drug type frequencies
fig = plot_drug_diagnostics(summary, check="type")
# Custom title
fig = plot_drug_diagnostics(
    summary,
    check="route",
    title="Route Frequency for Acetaminophen",
)
Privacy Protection
The min_cell_count parameter replaces counts below the threshold with
None and adds a result_obscured column. Set it to 0 to disable suppression:
# Strict privacy
result = execute_checks(cdm, [1125315], min_cell_count=10)
# No suppression (for internal analysis)
result = execute_checks(cdm, [1125315], min_cell_count=0)
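The suppression rule can be sketched in a few lines of standard-library Python. This follows the description above literally (any count below the threshold becomes `None`) and is not the module's code: `suppress_small_counts` is a hypothetical helper, and a real implementation may treat zero counts or derived proportions differently.

```python
def suppress_small_counts(rows, count_columns, min_cell_count=5):
    """Replace counts below `min_cell_count` with None and flag the row."""
    out = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are left untouched
        obscured = False
        for col in count_columns:
            value = new_row.get(col)
            if value is not None and value < min_cell_count:
                new_row[col] = None
                obscured = True
        new_row["result_obscured"] = obscured
        out.append(new_row)
    return out


rows = [
    {"variable": "sig", "n_missing": 3},
    {"variable": "quantity", "n_missing": 120},
]
strict = suppress_small_counts(rows, ["n_missing"], min_cell_count=5)
disabled = suppress_small_counts(rows, ["n_missing"], min_cell_count=0)
print(strict)
print(disabled)
```

With a threshold of 0 no non-negative count can fall below it, which is why that setting disables suppression.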
Sampling
For large datasets, sample_size limits the number of records analysed
per ingredient (default: 10,000). Set to None for all records:
# Quick analysis
result = execute_checks(cdm, [1125315], sample_size=1000)
# Full analysis
result = execute_checks(cdm, [1125315], sample_size=None)
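The per-ingredient behaviour matters when ingredients differ greatly in volume: a rare ingredient keeps all of its records while a common one is capped. The sketch below shows that logic in memory with the standard library. It is an assumption-laden illustration (`sample_per_ingredient` and its `seed` argument are hypothetical); the module itself may well sample inside the database.

```python
import random
from collections import defaultdict


def sample_per_ingredient(records, sample_size, seed=None):
    """Keep at most `sample_size` records per ingredient (None keeps all)."""
    by_ingredient = defaultdict(list)
    for rec in records:
        by_ingredient[rec["ingredient_concept_id"]].append(rec)

    rng = random.Random(seed)  # local generator, so results are reproducible
    sampled = []
    for recs in by_ingredient.values():
        if sample_size is None or len(recs) <= sample_size:
            sampled.extend(recs)  # nothing to cut: keep everything
        else:
            sampled.extend(rng.sample(recs, sample_size))
    return sampled


records = (
    [{"ingredient_concept_id": 1125315, "drug_exposure_id": i}
     for i in range(50)]
    + [{"ingredient_concept_id": 1503297, "drug_exposure_id": 100 + i}
       for i in range(5)]
)
sampled = sample_per_ingredient(records, sample_size=10, seed=42)
print(len(sampled))
```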
Testing with Mock Data
from omopy.drug_diagnostics import mock_drug_exposure
# Generate mock DiagnosticsResult for testing
mock_result = mock_drug_exposure(
    n_ingredients=3,
    n_records_per_ingredient=200,
    seed=42,
)
mock_result["missing"] # Synthetic missing data
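The `seed` argument is what makes mock output reproducible across test runs. The sketch below shows the same idea for raw, drug_exposure-like records rather than a finished DiagnosticsResult. Everything in it is hypothetical: `make_mock_records`, the placeholder ingredient IDs 1 to n, the date range, and the 10% missingness rate are choices made for this illustration only.

```python
import random
from datetime import date, timedelta


def make_mock_records(n_ingredients=3, n_records_per_ingredient=200, seed=42):
    """Generate reproducible synthetic exposure records."""
    rng = random.Random(seed)  # same seed, same records
    records = []
    for ingredient in range(1, n_ingredients + 1):  # placeholder concept IDs
        for _ in range(n_records_per_ingredient):
            start = date(2020, 1, 1) + timedelta(days=rng.randrange(365))
            length = rng.randrange(1, 91)  # 1 to 90 days, inclusive
            records.append({
                "ingredient_concept_id": ingredient,
                "drug_exposure_start_date": start,
                "drug_exposure_end_date": start + timedelta(days=length - 1),
                # Roughly 10% of quantities are left missing on purpose.
                "quantity": None if rng.random() < 0.1
                else rng.choice([30, 60, 90]),
            })
    return records


mock_records = make_mock_records()
print(len(mock_records))
```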
Benchmarking
from omopy.drug_diagnostics import benchmark_drug_diagnostics
bench = benchmark_drug_diagnostics(
    cdm,
    ingredient_concept_ids=[1125315, 1503297],
    n_runs=3,
)
print(bench)
# -> DataFrame with run, ingredient, execution_time_seconds
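When no CDM connection is at hand, the repeated-run timing pattern can be tried on any callable. The sketch below is a generic stand-in built on `time.perf_counter`, not the module's benchmark: the `benchmark` helper is hypothetical, and the summed range is a placeholder workload.

```python
import time


def benchmark(func, n_runs=3):
    """Call `func` n_runs times and record the wall-clock time of each run."""
    rows = []
    for run in range(1, n_runs + 1):
        start = time.perf_counter()
        func()
        rows.append({
            "run": run,
            "execution_time_seconds": time.perf_counter() - start,
        })
    return rows


# Placeholder workload standing in for a diagnostics run.
timings = benchmark(lambda: sum(range(100_000)), n_runs=3)
for row in timings:
    print(row["run"], round(row["execution_time_seconds"], 6))
```

Several runs are worth the extra time because the first one often includes one-off costs such as cache warm-up.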
Comparison with R
| R (DrugExposureDiagnostics) | Python (omopy.drug_diagnostics) |
|---|---|
| executeChecks() | execute_checks() |
| mockDrugExposure() | mock_drug_exposure() |
| writeResultToDisk() | Use df.write_csv() / df.write_parquet() |
| viewResults() (Shiny) | Not ported; use plotly interactivity |
| Named list of tibbles | DiagnosticsResult (Pydantic model, dict of Polars DataFrames) |
| minCellCount | min_cell_count |
| sampleSize | sample_size |