Test Data Generation
The omopy.testing module provides utilities for creating small,
hand-crafted test patient populations for OMOP CDM studies. It is the
Python equivalent of the R TestGenerator package.
Overview
The module supports two complementary workflows:
- File-based: define patients in Excel/CSV, validate them, export to JSON, then load the result into a CdmReference
- Programmatic: generate synthetic mock CDMs directly in code for quick unit tests
Both workflows produce a standard CdmReference that can be used with
all other OMOPy modules.
Workflow 1: File-Based Test Patients
Step 1: Generate a Template
Create blank Excel templates with the correct CDM column headers:
from omopy.testing import generate_test_tables

# Generate a template for specific tables
path = generate_test_tables(
    ["person", "observation_period", "condition_occurrence", "drug_exposure"],
    cdm_version="5.4",
    output_path="tests/fixtures/",
    filename="my_study_patients.xlsx",
)
print(f"Template created at: {path}")
The template contains one sheet per requested table, with column headers matching the CDM specification.
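To make the idea of a header-only template concrete, the following standard-library sketch builds one header row per requested table. The ILLUSTRATIVE_COLUMNS mapping and the template_headers helper are hypothetical and list only a few columns per table; this is not how generate_test_tables is implemented, and the real templates carry the full CDM column set.

```python
import csv
import io

# Hypothetical, deliberately incomplete column lists (the real CDM 5.4
# tables define more columns than are shown here).
ILLUSTRATIVE_COLUMNS = {
    "person": [
        "person_id", "gender_concept_id", "year_of_birth",
        "race_concept_id", "ethnicity_concept_id",
    ],
    "observation_period": [
        "observation_period_id", "person_id",
        "observation_period_start_date", "observation_period_end_date",
        "period_type_concept_id",
    ],
}


def template_headers(tables):
    """Return {table_name: header-only CSV text} for the requested tables."""
    out = {}
    for name in tables:
        if name not in ILLUSTRATIVE_COLUMNS:
            raise KeyError(f"no column list for table {name!r}")
        buf = io.StringIO()
        csv.writer(buf, lineterminator="\n").writerow(ILLUSTRATIVE_COLUMNS[name])
        out[name] = buf.getvalue()
    return out


templates = template_headers(["person", "observation_period"])
print(templates["person"])
```

Each value plays the role of one blank sheet: a single header line with no data rows yet.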
Step 2: Fill In Patient Data
Open the generated Excel file and enter your test patient data. Each row
represents one record. For example, the person sheet might contain:
| person_id | gender_concept_id | year_of_birth | race_concept_id |
|---|---|---|---|
| 1 | 8507 | 1980 | 8527 |
| 2 | 8532 | 1975 | 8527 |
| 3 | 8507 | 1990 | 8516 |
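The concept IDs in this example are standard OMOP vocabulary concepts. As a reading aid, the sketch below decodes the three rows into plain language; the GENDER and RACE lookups cover only the IDs used here, and the describe helper is hypothetical rather than part of omopy.

```python
# Standard OMOP concept IDs that appear in the example sheet.
GENDER = {8507: "Male", 8532: "Female"}
RACE = {8527: "White", 8516: "Black or African American"}

# (person_id, gender_concept_id, year_of_birth, race_concept_id)
person_rows = [
    (1, 8507, 1980, 8527),
    (2, 8532, 1975, 8527),
    (3, 8507, 1990, 8516),
]


def describe(row):
    """Turn one person row into a human-readable summary."""
    person_id, gender_id, year_of_birth, race_id = row
    return (
        f"person {person_id}: {GENDER[gender_id]}, "
        f"born {year_of_birth}, {RACE[race_id]}"
    )


descriptions = [describe(row) for row in person_rows]
for line in descriptions:
    print(line)
```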
Step 3: Read and Validate
from omopy.testing import read_patients

# Read Excel, validate, and optionally export to JSON
data = read_patients(
    "tests/fixtures/my_study_patients.xlsx",
    cdm_version="5.4",
    test_name="my_study",
    output_path="tests/fixtures/patients.json",  # Optional JSON export
)
# data is a dict[str, pl.DataFrame]: one DataFrame per CDM table
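The JSON export step can be pictured as a round trip of table data through a text file. The sketch below uses only the standard library and an assumed layout (test name, then table name, then a list of rows); the actual file written by read_patients may be organized differently. Dates are kept as ISO strings because JSON has no native date type.

```python
import json

# Plain-Python stand-in for the per-table data (rows as dicts).
tables = {
    "person": [
        {"person_id": 1, "gender_concept_id": 8507, "year_of_birth": 1980},
        {"person_id": 2, "gender_concept_id": 8532, "year_of_birth": 1975},
    ],
    "observation_period": [
        {
            "observation_period_id": 1,
            "person_id": 1,
            "observation_period_start_date": "2020-01-01",
            "observation_period_end_date": "2023-12-31",
        },
    ],
}

# Assumed layout: {test_name: {table_name: [row, ...]}}. This is an
# illustration, not necessarily the layout omopy writes.
payload = json.dumps({"my_study": tables}, indent=2, sort_keys=True)

# Reading the text back yields the same tables.
restored = json.loads(payload)["my_study"]
print(sorted(restored))
```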
Step 4: Load into CdmReference
from omopy.profiles import add_demographics
from omopy.testing import patients_cdm

cdm = patients_cdm(
    "tests/fixtures/patients.json",
    cdm_version="5.4",
    cdm_name="my_test_cdm",
)

# Use with any OMOPy module
result = add_demographics(cdm["person"], cdm)
Workflow 2: Programmatic Mock CDM
For quick unit tests that don't need hand-crafted data:
from omopy.testing import mock_test_cdm

cdm = mock_test_cdm(
    seed=42,
    n_persons=20,
    cdm_version="5.4",
    include_conditions=True,
    include_drugs=True,
    include_measurements=True,
)

# Inspect generated data
print(cdm["person"].collect())
print(cdm["observation_period"].collect())
print(cdm["condition_occurrence"].collect())
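The role of the seed can be illustrated without omopy. The sketch below is a hypothetical stand-in, not the library's generator: mock_persons draws values from a local random.Random instance, so calling it twice with the same seed yields identical rows, which is the property that keeps seeded unit tests stable.

```python
import random


def mock_persons(n_persons=5, seed=42):
    """Hypothetical stand-in: generate reproducible person rows."""
    rng = random.Random(seed)  # local RNG, so global random state is untouched
    return [
        {
            "person_id": person_id,
            "gender_concept_id": rng.choice([8507, 8532]),
            "year_of_birth": rng.randint(1940, 2005),
        }
        for person_id in range(1, n_persons + 1)
    ]


first = mock_persons(n_persons=20, seed=42)
second = mock_persons(n_persons=20, seed=42)

# Same seed and size, so the two populations are identical.
print(first == second)
```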
Mock CDM Parameters
| Parameter | Default | Description |
|---|---|---|
| seed | 42 | Random seed for reproducibility |
| n_persons | 5 | Number of persons to generate |
| cdm_version | "5.4" | CDM version ("5.3" or "5.4") |
| include_conditions | True | Generate condition_occurrence records |
| include_drugs | True | Generate drug_exposure records |
| include_measurements | False | Generate measurement records |
Data Validation
Validate a dict of DataFrames against the CDM specification without reading from files:
import polars as pl

from omopy.testing import validate_patient_data

data = {
    "person": pl.DataFrame({
        "person_id": [1, 2],
        "gender_concept_id": [8507, 8532],
        "year_of_birth": [1980, 1975],
        "race_concept_id": [8527, 8527],
        "ethnicity_concept_id": [0, 0],
    }),
    "observation_period": pl.DataFrame({
        "observation_period_id": [1, 2],
        "person_id": [1, 2],
        "observation_period_start_date": ["2020-01-01", "2020-01-01"],
        "observation_period_end_date": ["2023-12-31", "2023-12-31"],
        "period_type_concept_id": [44814724, 44814724],
    }),
}

issues = validate_patient_data(data, cdm_version="5.4")
# Returns validation issues (empty if all valid)
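As an illustration of the kind of cross-table checks a validator of this sort could run, the sketch below tests two simple rules with the standard library: every observation period must reference a known person, and its end date must not precede its start date. The check_observation_periods helper and both rules are assumptions made for the example; the checks that validate_patient_data actually performs are not described here.

```python
from datetime import date


def check_observation_periods(persons, periods):
    """Return a list of readable issues (empty when none are found)."""
    issues = []
    known_ids = {person["person_id"] for person in persons}
    for row in periods:
        period_id = row["observation_period_id"]
        if row["person_id"] not in known_ids:
            issues.append(
                f"observation_period {period_id}: "
                f"unknown person_id {row['person_id']}"
            )
        start = date.fromisoformat(row["observation_period_start_date"])
        end = date.fromisoformat(row["observation_period_end_date"])
        if end < start:
            issues.append(
                f"observation_period {period_id}: end date precedes start date"
            )
    return issues


persons = [{"person_id": 1}, {"person_id": 2}]
periods = [
    {"observation_period_id": 1, "person_id": 1,
     "observation_period_start_date": "2020-01-01",
     "observation_period_end_date": "2023-12-31"},
    # Deliberately broken row: unknown person and reversed dates.
    {"observation_period_id": 2, "person_id": 3,
     "observation_period_start_date": "2021-06-01",
     "observation_period_end_date": "2021-01-01"},
]

issues = check_observation_periods(persons, periods)
for issue in issues:
    print(issue)
```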
Cohort Timeline Visualization
Visualize cohort membership timelines for individual patients:
import polars as pl

from omopy.testing import graph_cohort

# Define cohort data
target_cohort = pl.DataFrame({
    "cohort_definition_id": [1, 1],
    "subject_id": [1, 2],
    "cohort_start_date": ["2020-01-01", "2020-06-01"],
    "cohort_end_date": ["2020-12-31", "2021-05-31"],
})
outcome_cohort = pl.DataFrame({
    "cohort_definition_id": [2, 2],
    "subject_id": [1, 2],
    "cohort_start_date": ["2020-03-15", "2020-09-10"],
    "cohort_end_date": ["2020-03-15", "2020-09-10"],
})

# Plot timeline for subject 1
fig = graph_cohort(
    subject_id=1,
    cohorts={
        "Target": target_cohort,
        "Outcome": outcome_cohort,
    },
)
fig.show()
The timeline plot shows horizontal bars for each cohort the patient belongs to, with start and end dates on the x-axis.
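The data behind such a plot is a list of labelled date intervals for one subject. The following sketch derives those intervals with the standard library, using the same dates as the example above; timeline_segments is a hypothetical helper and says nothing about how graph_cohort builds its figure.

```python
from datetime import date


def timeline_segments(subject_id, cohorts):
    """Collect (label, start, end) for one subject, sorted by start date."""
    segments = []
    for label, rows in cohorts.items():
        for row in rows:
            if row["subject_id"] == subject_id:
                segments.append((
                    label,
                    date.fromisoformat(row["cohort_start_date"]),
                    date.fromisoformat(row["cohort_end_date"]),
                ))
    return sorted(segments, key=lambda segment: segment[1])


target = [
    {"subject_id": 1, "cohort_start_date": "2020-01-01", "cohort_end_date": "2020-12-31"},
    {"subject_id": 2, "cohort_start_date": "2020-06-01", "cohort_end_date": "2021-05-31"},
]
outcome = [
    {"subject_id": 1, "cohort_start_date": "2020-03-15", "cohort_end_date": "2020-03-15"},
    {"subject_id": 2, "cohort_start_date": "2020-09-10", "cohort_end_date": "2020-09-10"},
]

segments = timeline_segments(1, {"Target": target, "Outcome": outcome})
for label, start, end in segments:
    # Inclusive length: a same-day entry counts as one day.
    print(label, start, end, (end - start).days + 1)
```

For subject 1 the outcome is a single-day entry that falls 74 days after the target cohort start, so it would appear as a short mark inside the longer target bar.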
Integration with Other Modules
The CdmReference objects produced by patients_cdm() and mock_test_cdm()
are Polars-backed (no database). They work with any OMOPy module that
accepts a CdmReference:
import polars as pl

from omopy.characteristics import summarise_characteristics
from omopy.generics import CohortTable
from omopy.testing import mock_test_cdm

cdm = mock_test_cdm(seed=42, n_persons=50)

# Create a cohort from the mock data
# (the mock CDM includes observation_period records that can serve as
# a simple cohort definition)
obs = cdm["observation_period"].collect()
cohort_df = obs.select(
    pl.lit(1).cast(pl.Int64).alias("cohort_definition_id"),
    pl.col("person_id").alias("subject_id"),
    pl.col("observation_period_start_date").alias("cohort_start_date"),
    pl.col("observation_period_end_date").alias("cohort_end_date"),
)
settings = pl.DataFrame({
    "cohort_definition_id": [1],
    "cohort_name": ["all_patients"],
})
cohort = CohortTable(cohort_df, settings=settings)
cohort.cdm = cdm
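When a cohort is derived from other records like this, it can be worth checking a common OMOP convention before using it: a subject's entries within one cohort should not overlap in time. The sketch below is a standard-library illustration of that check on plain row dicts; find_overlaps is a hypothetical helper, and whether CohortTable enforces the rule itself is not stated here.

```python
from datetime import date


def find_overlaps(rows):
    """Return (key, start, end) for entries that overlap an earlier entry
    of the same (cohort_definition_id, subject_id) pair."""
    by_key = {}
    for row in rows:
        key = (row["cohort_definition_id"], row["subject_id"])
        span = (
            date.fromisoformat(row["cohort_start_date"]),
            date.fromisoformat(row["cohort_end_date"]),
        )
        by_key.setdefault(key, []).append(span)

    overlaps = []
    for key, spans in by_key.items():
        latest_end = None
        for start, end in sorted(spans):
            # Overlap: this entry starts on or before the latest end seen so far.
            if latest_end is not None and start <= latest_end:
                overlaps.append((key, start, end))
            latest_end = end if latest_end is None else max(latest_end, end)
    return overlaps


rows = [
    {"cohort_definition_id": 1, "subject_id": 1,
     "cohort_start_date": "2020-01-01", "cohort_end_date": "2020-06-30"},
    {"cohort_definition_id": 1, "subject_id": 1,
     "cohort_start_date": "2020-06-01", "cohort_end_date": "2020-12-31"},
    {"cohort_definition_id": 1, "subject_id": 2,
     "cohort_start_date": "2020-01-01", "cohort_end_date": "2020-12-31"},
]

overlaps = find_overlaps(rows)
print(overlaps)
```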
Comparison with R
| R (TestGenerator) | Python (omopy.testing) |
|---|---|
| readPatients() | read_patients() |
| validatePatientData() | validate_patient_data() |
| patientsCDM() | patients_cdm() |
| mockTestCDM() | mock_test_cdm() |
| generateTestTables() | generate_test_tables() |
| graphCohort() | graph_cohort() |
| tibbles + R list | Polars DataFrames + CdmReference |
| readxl / writexl | openpyxl |
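The function names in the table follow a mechanical camelCase to snake_case mapping, with the acronym CDM lowercased as a single word. For readers porting R test scripts, a small converter along these lines may help with lookups; camel_to_snake is a hypothetical helper written for this example and is not part of omopy.

```python
import re


def camel_to_snake(name):
    """Convert an R-style camelCase name to Python snake_case."""
    # Break before an uppercase letter that follows a lowercase letter or digit.
    step1 = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    # Break between an acronym and a following capitalised word.
    step2 = re.sub(r"(?<=[A-Z])(?=[A-Z][a-z])", "_", step1)
    return step2.lower()


r_names = [
    "readPatients",
    "validatePatientData",
    "patientsCDM",
    "mockTestCDM",
    "generateTestTables",
    "graphCohort",
]
python_names = [camel_to_snake(name) for name in r_names]
for r_name, py_name in zip(r_names, python_names):
    print(f"{r_name}() -> {py_name}()")
```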