omopy.testing

Test data generation for OMOP CDM studies — read patient data from Excel/CSV, validate against CDM specifications, construct CdmReference objects, and generate mock test databases.

This module is the Python equivalent of the R TestGenerator package. Excel I/O uses openpyxl; cohort timeline plots use plotly.

Read & Validate

Read patient data from files and validate against CDM specifications.

read_patients

read_patients(
    path: str | Path,
    *,
    cdm_version: str = "5.4",
    test_name: str = "test",
    output_path: str | Path | None = None,
) -> dict[str, pl.DataFrame]

Read patient data from an Excel file or CSV directory.

Auto-detects the format based on the path:

  • If path ends with .xlsx, reads each sheet as a CDM table (sheet name = table name).
  • If path is a directory, reads each .csv file as a CDM table (filename stem = table name).

The data is then validated against the CDM specification. If validation fails, a ValueError is raised whose message contains all of the validation errors.

If output_path is provided, writes the data as a JSON file (format: {"table_name": [{col: val, ...}, ...], ...}).
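
The path-based dispatch described above can be illustrated with the standard library alone. This is a minimal sketch, not omopy's implementation: `detect_source` is a hypothetical helper that only reports which format applies and which table names a CSV directory would yield.

```python
from pathlib import Path


def detect_source(path):
    """Return (format, table_names) following the path rules described above.

    Hypothetical helper for illustration; not part of omopy.
    """
    p = Path(path)
    if p.suffix == ".xlsx":
        # Sheet names become table names. Listing them needs an Excel
        # library, so this sketch returns an empty list for workbooks.
        return "xlsx", []
    if p.is_dir():
        # Each CSV file's stem (name without extension) becomes a table name.
        return "csv_dir", sorted(f.stem for f in p.glob("*.csv"))
    raise FileNotFoundError(f"No .xlsx file or CSV directory at {p}")
```

A directory holding person.csv and observation_period.csv would therefore map to the tables person and observation_period.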

Parameters:

Name Type Description Default
path str | Path

Path to an .xlsx file or a directory of .csv files.

required
cdm_version str

CDM version string ("5.3" or "5.4").

'5.4'
test_name str

Name for this test patient set (used in JSON metadata).

'test'
output_path str | Path | None

Optional path to write JSON output.

None

Returns:

Type Description
dict[str, DataFrame]

A dict mapping table names to Polars DataFrames.

Raises:

Type Description
ValueError

If the data fails CDM validation.

FileNotFoundError

If the path does not exist or contains no data.

validate_patient_data

validate_patient_data(
    data: dict[str, DataFrame], *, cdm_version: str = "5.4"
) -> list[str]

Validate patient data against the OMOP CDM specification.

Checks that each table name is a valid CDM table, that column names match the CDM field specs, and that required fields are present.

Parameters:

Name Type Description Default
data dict[str, DataFrame]

Mapping of table name to Polars DataFrame.

required
cdm_version str

CDM version string ("5.3" or "5.4").

'5.4'

Returns:

Type Description
list[str]

A list of error messages. An empty list means the data is valid.
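
The three checks (table name, column names, required fields) and the "collect every error, return a list" contract can be sketched in plain Python. Everything here is an assumption made for illustration: `TOY_SPEC` is a deliberately incomplete stand-in for the CDM field specification, `validate_columns` is a hypothetical name, and the message wording is invented.

```python
# Toy specification for illustration only. The real CDM specification has
# many more tables and fields than these two partial entries.
TOY_SPEC = {
    "person": {
        "required": {"person_id", "gender_concept_id", "year_of_birth"},
        "optional": {"month_of_birth", "day_of_birth"},
    },
    "observation_period": {
        "required": {"observation_period_id", "person_id"},
        "optional": set(),
    },
}


def validate_columns(tables, spec=TOY_SPEC):
    """Collect error messages instead of raising on the first problem.

    `tables` maps table name -> list of column names.
    """
    errors = []
    for name, columns in tables.items():
        if name not in spec:
            errors.append(f"Unknown table: {name}")
            continue
        known = spec[name]["required"] | spec[name]["optional"]
        for col in columns:
            if col not in known:
                errors.append(f"{name}: unknown column {col}")
        for col in sorted(spec[name]["required"] - set(columns)):
            errors.append(f"{name}: missing required column {col}")
    return errors
```

Returning a list rather than raising lets a caller report all problems in one pass, which matches the way read_patients is described as raising a single ValueError with all messages.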

CDM Construction

Build CdmReference objects from JSON test definitions or synthetic data.

patients_cdm

patients_cdm(
    json_path: str | Path,
    *,
    cdm_version: str = "5.4",
    cdm_name: str | None = None,
) -> CdmReference

Load patient data from a JSON file into a CdmReference.

Reads a JSON file in the following format:

{
    "_meta": {"test_name": "...", "cdm_version": "5.4"},
    "person": [{"person_id": 1, ...}, ...],
    "observation_period": [...]
}

Creates Polars DataFrames for each table and wraps them as CdmTable (or CohortTable when appropriate).

Unlike the R equivalent, which downloads an empty Eunomia CDM, this function builds in-memory tables from the JSON data only. Vocabulary tables are not included; use cdm_from_con with a real database if vocabulary tables are needed.

Parameters:

Name Type Description Default
json_path str | Path

Path to the JSON file.

required
cdm_version str

CDM version string ("5.3" or "5.4"). Overridden by the _meta.cdm_version field in the JSON if present.

'5.4'
cdm_name str | None

Human-readable name for this CDM. Defaults to the JSON file stem or _meta.test_name.

None

Returns:

Type Description
CdmReference

A CdmReference backed by in-memory Polars DataFrames.
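
To make the file layout and the version override concrete, the sketch below writes a small JSON file in the documented format and reads it back with the standard library. `load_patient_json` is a hypothetical helper, not omopy's code, and the order in which it picks a default name (explicit argument, then `_meta.test_name`, then file stem) is an assumption; the documentation does not state the precedence.

```python
import json
import tempfile
from pathlib import Path


def load_patient_json(json_path, cdm_version="5.4", cdm_name=None):
    """Split a patient JSON file into metadata and table rows.

    Hypothetical helper; the name precedence below is an assumption.
    """
    path = Path(json_path)
    payload = json.loads(path.read_text(encoding="utf-8"))
    meta = payload.pop("_meta", {})
    # A version recorded in the file wins over the function argument.
    version = meta.get("cdm_version", cdm_version)
    name = cdm_name or meta.get("test_name") or path.stem
    return {"name": name, "cdm_version": version, "tables": payload}


# Write a small file in the documented layout, then read it back.
example = {
    "_meta": {"test_name": "demo", "cdm_version": "5.3"},
    "person": [{"person_id": 1, "year_of_birth": 1980}],
    "observation_period": [{"observation_period_id": 1, "person_id": 1}],
}
with tempfile.TemporaryDirectory() as tmp:
    file_path = Path(tmp) / "patients.json"
    file_path.write_text(json.dumps(example), encoding="utf-8")
    loaded = load_patient_json(file_path, cdm_version="5.4")
```

Here the caller passes "5.4", but the file's `_meta.cdm_version` of "5.3" is the value that ends up in `loaded["cdm_version"]`.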

mock_test_cdm

mock_test_cdm(
    *,
    seed: int = 42,
    n_persons: int = 5,
    cdm_version: str = "5.4",
    include_conditions: bool = True,
    include_drugs: bool = True,
    include_measurements: bool = False,
) -> CdmReference

Create a small mock CDM with synthetic data for testing.

Generates realistic-looking synthetic data for person, observation_period, and optionally condition_occurrence, drug_exposure, and measurement tables.

This requires no database or file I/O — everything is created in-memory as Polars DataFrames.

Parameters:

Name Type Description Default
seed int

Random seed for reproducibility.

42
n_persons int

Number of persons to generate.

5
cdm_version str

CDM version string ("5.3" or "5.4").

'5.4'
include_conditions bool

Whether to generate condition_occurrence.

True
include_drugs bool

Whether to generate drug_exposure.

True
include_measurements bool

Whether to generate measurement.

False

Returns:

Type Description
CdmReference

A CdmReference backed by in-memory Polars DataFrames.
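
The role of seed can be shown with a small stand-in generator. This is a sketch under stated assumptions, not the library's generator: `mock_persons` is a hypothetical function, it returns plain dicts rather than Polars DataFrames, and the value ranges are invented. The gender concept IDs 8507 (male) and 8532 (female) are the standard OMOP ones.

```python
import random
from datetime import date, timedelta


def mock_persons(seed=42, n_persons=5):
    """Generate person and observation_period rows from a private RNG.

    Hypothetical stand-in; value ranges are assumptions for illustration.
    """
    rng = random.Random(seed)  # private generator; global random state is untouched
    persons, periods = [], []
    for pid in range(1, n_persons + 1):
        persons.append({
            "person_id": pid,
            "gender_concept_id": rng.choice([8507, 8532]),
            "year_of_birth": rng.randint(1940, 2000),
        })
        start = date(2010, 1, 1) + timedelta(days=rng.randint(0, 3650))
        end = start + timedelta(days=rng.randint(30, 1825))
        periods.append({
            "observation_period_id": pid,
            "person_id": pid,
            "observation_period_start_date": start.isoformat(),
            "observation_period_end_date": end.isoformat(),
        })
    return {"person": persons, "observation_period": periods}
```

Two calls with the same seed and n_persons return identical rows, which is what makes a seeded mock usable in deterministic tests.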

Template Generation

Generate blank Excel templates with CDM-compliant column headers.

generate_test_tables

generate_test_tables(
    table_names: list[str],
    *,
    cdm_version: str = "5.4",
    output_path: str | Path = ".",
    filename: str | None = None,
) -> Path

Generate an empty Excel file with sheets for specified CDM tables.

Each sheet contains the correct column headers from the CDM specification. Vocabulary tables (concept, concept_ancestor, etc.) are excluded automatically.

Parameters:

Name Type Description Default
table_names list[str]

List of CDM table names to include as sheets.

required
cdm_version str

CDM version string ("5.3" or "5.4").

'5.4'
output_path str | Path

Directory where the file will be created.

'.'
filename str | None

Output filename. Defaults to "test_patients_v{version}.xlsx".

None

Returns:

Type Description
Path

Path to the created Excel file.

Raises:

Type Description
ValueError

If any table name is not a valid CDM table, or if a vocabulary table is requested.
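
How output_path, filename, and the default name combine can be sketched as follows. `template_path` is a hypothetical helper, and it assumes that `{version}` in the default filename is filled with the cdm_version string verbatim.

```python
from pathlib import Path


def template_path(output_path=".", cdm_version="5.4", filename=None):
    """Resolve where a template file would be written.

    Hypothetical helper; assumes {version} is the cdm_version string as given.
    """
    if filename is None:
        filename = f"test_patients_v{cdm_version}.xlsx"
    return Path(output_path) / filename
```

Under that assumption, calling it with no arguments resolves to test_patients_v5.4.xlsx in the current directory.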

Visualization

Plot cohort membership timelines for individual patients.

graph_cohort

graph_cohort(
    subject_id: int,
    cohorts: dict[str, DataFrame],
    *,
    style: Any | None = None,
) -> Any

Plot cohort timelines for a single subject.

Each cohort is a named DataFrame with columns cohort_definition_id, subject_id, cohort_start_date, cohort_end_date. This function draws a horizontal segment for each cohort entry for the given subject_id.

Parameters:

Name Type Description Default
subject_id int

The subject to visualize.

required
cohorts dict[str, DataFrame]

Mapping of cohort name to cohort DataFrame.

required
style Any | None

Optional Plotly layout overrides (dict or None).

None

Returns:

Type Description
Any

A plotly.graph_objects.Figure.

Raises:

Type Description
ValueError

If no cohort records are found for the subject, or if required columns are missing.
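
The data preparation behind such a plot, selecting one subject's rows and turning each into a (cohort, start, end) segment, can be sketched without Plotly. This is an illustration under assumptions: `cohort_segments` is a hypothetical helper, lists of dicts stand in for DataFrames, and the error messages are invented.

```python
from datetime import date

REQUIRED = ("cohort_definition_id", "subject_id", "cohort_start_date", "cohort_end_date")


def cohort_segments(subject_id, cohorts):
    """Return (cohort_name, start, end) tuples for one subject.

    `cohorts` maps cohort name -> list of row dicts (stand-in for DataFrames).
    """
    segments = []
    for name, rows in cohorts.items():
        for row in rows:
            missing = [c for c in REQUIRED if c not in row]
            if missing:
                raise ValueError(f"{name}: missing columns {missing}")
            if row["subject_id"] == subject_id:
                segments.append((name, row["cohort_start_date"], row["cohort_end_date"]))
    if not segments:
        raise ValueError(f"No cohort records found for subject {subject_id}")
    return segments


cohorts = {
    "diabetes": [
        {"cohort_definition_id": 1, "subject_id": 1,
         "cohort_start_date": date(2020, 1, 1), "cohort_end_date": date(2020, 6, 30)},
        {"cohort_definition_id": 1, "subject_id": 2,
         "cohort_start_date": date(2021, 1, 1), "cohort_end_date": date(2021, 3, 1)},
    ],
    "metformin": [
        {"cohort_definition_id": 2, "subject_id": 1,
         "cohort_start_date": date(2020, 2, 1), "cohort_end_date": date(2020, 5, 1)},
    ],
}
segments = cohort_segments(1, cohorts)
```

Each resulting tuple corresponds to one horizontal segment on the timeline, with the cohort name selecting the row it is drawn on.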