Test Data Generation¶

The omopy.testing module provides utilities for creating small, hand-crafted test patient populations for OMOP CDM studies. It is the Python equivalent of the R TestGenerator package.

Overview¶

The module supports two complementary workflows:

File-based — Define patients in Excel/CSV, validate, export to JSON, then load into a CdmReference
Programmatic — Generate synthetic mock CDMs directly in code for quick unit tests

Both workflows produce a standard CdmReference that can be used with all other OMOPy modules.

Workflow 1: File-Based Test Patients¶

Step 1: Generate a Template¶

Create blank Excel templates with the correct CDM column headers:

from omopy.testing import generate_test_tables

# Generate template for specific tables
path = generate_test_tables(
    ["person", "observation_period", "condition_occurrence", "drug_exposure"],
    cdm_version="5.4",
    output_path="tests/fixtures/",
    filename="my_study_patients.xlsx",
)
print(f"Template created at: {path}")

The template contains one sheet per requested table, with column headers matching the CDM specification.

Step 2: Fill In Patient Data¶

Open the generated Excel file and enter your test patient data. Each row represents one record. For example, the person sheet might contain:

person_id	gender_concept_id	year_of_birth	race_concept_id
1	8507	1980	8527
2	8532	1975	8527
3	8507	1990	8516

Step 3: Read and Validate¶

from omopy.testing import read_patients

# Read Excel, validate, and optionally export to JSON
data = read_patients(
    "tests/fixtures/my_study_patients.xlsx",
    cdm_version="5.4",
    test_name="my_study",
    output_path="tests/fixtures/patients.json",  # Optional JSON export
)
# data is a dict[str, pl.DataFrame] — one DataFrame per CDM table

Step 4: Load into CdmReference¶

from omopy.testing import patients_cdm

cdm = patients_cdm(
    "tests/fixtures/patients.json",
    cdm_version="5.4",
    cdm_name="my_test_cdm",
)

# Use with any OMOPy module
from omopy.profiles import add_demographics
result = add_demographics(cdm["person"], cdm)

Workflow 2: Programmatic Mock CDM¶

For quick unit tests that don't need hand-crafted data:

from omopy.testing import mock_test_cdm

cdm = mock_test_cdm(
    seed=42,
    n_persons=20,
    cdm_version="5.4",
    include_conditions=True,
    include_drugs=True,
    include_measurements=True,
)

# Inspect generated data
print(cdm["person"].collect())
print(cdm["observation_period"].collect())
print(cdm["condition_occurrence"].collect())

Mock CDM Parameters¶

Parameter	Default	Description
`seed`	`42`	Random seed for reproducibility
`n_persons`	`5`	Number of persons to generate
`cdm_version`	`"5.4"`	CDM version (`"5.3"` or `"5.4"`)
`include_conditions`	`True`	Generate condition_occurrence records
`include_drugs`	`True`	Generate drug_exposure records
`include_measurements`	`False`	Generate measurement records

Data Validation¶

Validate a dict of DataFrames against the CDM specification without reading from files:

from omopy.testing import validate_patient_data
import polars as pl

data = {
    "person": pl.DataFrame({
        "person_id": [1, 2],
        "gender_concept_id": [8507, 8532],
        "year_of_birth": [1980, 1975],
        "race_concept_id": [8527, 8527],
        "ethnicity_concept_id": [0, 0],
    }),
    "observation_period": pl.DataFrame({
        "observation_period_id": [1, 2],
        "person_id": [1, 2],
        "observation_period_start_date": ["2020-01-01", "2020-01-01"],
        "observation_period_end_date": ["2023-12-31", "2023-12-31"],
        "period_type_concept_id": [44814724, 44814724],
    }),
}

issues = validate_patient_data(data, cdm_version="5.4")
# Returns validation issues (empty if all valid)

Cohort Timeline Visualization¶

Visualize cohort membership timelines for individual patients:

from omopy.testing import graph_cohort
import polars as pl

# Define cohort data
target_cohort = pl.DataFrame({
    "cohort_definition_id": [1, 1],
    "subject_id": [1, 2],
    "cohort_start_date": ["2020-01-01", "2020-06-01"],
    "cohort_end_date": ["2020-12-31", "2021-05-31"],
})

outcome_cohort = pl.DataFrame({
    "cohort_definition_id": [2, 2],
    "subject_id": [1, 2],
    "cohort_start_date": ["2020-03-15", "2020-09-10"],
    "cohort_end_date": ["2020-03-15", "2020-09-10"],
})

# Plot timeline for subject 1
fig = graph_cohort(
    subject_id=1,
    cohorts={
        "Target": target_cohort,
        "Outcome": outcome_cohort,
    },
)
fig.show()

The timeline plot shows horizontal bars for each cohort the patient belongs to, with start and end dates on the x-axis.

Integration with Other Modules¶

The CdmReference objects produced by patients_cdm() and mock_test_cdm() are Polars-backed (no database). They work with any OMOPy module that accepts a CdmReference:

from omopy.testing import mock_test_cdm
from omopy.characteristics import summarise_characteristics

cdm = mock_test_cdm(seed=42, n_persons=50)

# Create a cohort from the mock data
# (the mock CDM includes observation_period records that can serve as
# a simple cohort definition)

from omopy.generics import CohortTable
import polars as pl

obs = cdm["observation_period"].collect()
cohort_df = obs.select(
    pl.lit(1).cast(pl.Int64).alias("cohort_definition_id"),
    pl.col("person_id").alias("subject_id"),
    pl.col("observation_period_start_date").alias("cohort_start_date"),
    pl.col("observation_period_end_date").alias("cohort_end_date"),
)
settings = pl.DataFrame({
    "cohort_definition_id": [1],
    "cohort_name": ["all_patients"],
})
cohort = CohortTable(cohort_df, settings=settings)
cohort.cdm = cdm

Comparison with R¶

R (TestGenerator)	Python (omopy.testing)
`readPatients()`	`read_patients()`
`validatePatientData()`	`validate_patient_data()`
`patientsCDM()`	`patients_cdm()`
`mockTestCDM()`	`mock_test_cdm()`
`generateTestTables()`	`generate_test_tables()`
`graphCohort()`	`graph_cohort()`
tibbles + R list	Polars DataFrames + CdmReference
readxl / writexl	openpyxl