omopy.testing

Test data generation for OMOP CDM studies — read patient data from Excel/CSV, validate against CDM specifications, construct CdmReference objects, and generate mock test databases.

This module is the Python equivalent of the R TestGenerator package. Excel I/O uses openpyxl; cohort timeline plots use plotly.

Read & Validate

Read patient data from files and validate against CDM specifications.

read_patients

read_patients(
    path: str | Path,
    *,
    cdm_version: str = "5.4",
    test_name: str = "test",
    output_path: str | Path | None = None,
) -> dict[str, pl.DataFrame]

Read patient data from an Excel file or CSV directory.

Auto-detects the format based on the path:

If path ends with .xlsx, reads each sheet as a CDM table (sheet name = table name).
If path is a directory, reads each .csv file as a CDM table (filename stem = table name).
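The two detection rules above can be sketched with the standard library alone. This is a minimal illustration, not omopy's implementation: `detect_patient_source` and `csv_table_names` are hypothetical helper names, and the exact-case suffix check is an assumption.

```python
from __future__ import annotations

import tempfile
from pathlib import Path


def detect_patient_source(path: str | Path) -> str:
    """Hypothetical helper: classify a path by the two documented rules."""
    p = Path(path)
    if p.suffix == ".xlsx":
        return "excel"    # each sheet would be read as one CDM table
    if p.is_dir():
        return "csv_dir"  # each *.csv file would be read as one CDM table
    raise FileNotFoundError(f"Not an .xlsx file or a directory: {p}")


def csv_table_names(directory: str | Path) -> list[str]:
    """Table names implied by a CSV directory: one per filename stem."""
    return sorted(f.stem for f in Path(directory).glob("*.csv"))


# Build a throwaway CSV directory to exercise the helpers.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "person.csv").write_text("person_id\n1\n")
    (Path(tmp) / "observation_period.csv").write_text("person_id\n1\n")
    kind = detect_patient_source(tmp)
    tables = csv_table_names(tmp)
```

The sketch only classifies the path; reading and validating the tables is left to `read_patients` itself.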

The data is validated against the CDM specification. If validation fails, a ValueError is raised with all error messages.

If output_path is provided, writes the data as a JSON file (format: {"table_name": [{col: val, ...}, ...], ...}).

Parameters:

Name	Type	Description	Default
`path`	`str \| Path`	Path to an `.xlsx` file or a directory of `.csv` files.	required
`cdm_version`	`str`	CDM version string (`"5.3"` or `"5.4"`).	`'5.4'`
`test_name`	`str`	Name for this test patient set (used in JSON metadata).	`'test'`
`output_path`	`str \| Path \| None`	Optional path to write JSON output.	`None`

Returns:

Type	Description
`dict[str, DataFrame]`	A dict mapping table names to Polars DataFrames.

Raises:

Type	Description
`ValueError`	If the data fails CDM validation.
`FileNotFoundError`	If the path does not exist or contains no data.

validate_patient_data

validate_patient_data(
    data: dict[str, DataFrame], *, cdm_version: str = "5.4"
) -> list[str]

Validate patient data against the OMOP CDM specification.

Checks that each table name is a valid CDM table, that column names match the CDM field specs, and that required fields are present.

Parameters:

Name	Type	Description	Default
`data`	`dict[str, DataFrame]`	Mapping of table name to Polars DataFrame.	required
`cdm_version`	`str`	CDM version string (`"5.3"` or `"5.4"`).	`'5.4'`

Returns:

Type	Description
`list[str]`	A list of error messages. An empty list means the data is valid.
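To make the three checks concrete (valid table name, known columns, required fields present), here is a rough stdlib-only sketch. `TOY_SPEC` and `validate_columns` are hypothetical; the real specification covers far more tables and fields, and the actual error message wording is not documented here.

```python
from __future__ import annotations

# Hypothetical, heavily trimmed spec: table -> {column: is_required}.
TOY_SPEC: dict[str, dict[str, bool]] = {
    "person": {
        "person_id": True,
        "gender_concept_id": True,
        "year_of_birth": True,
        "birth_datetime": False,
    },
}


def validate_columns(data: dict[str, list[str]]) -> list[str]:
    """Return error messages for a mapping of table name -> column names."""
    errors: list[str] = []
    for table, columns in data.items():
        spec = TOY_SPEC.get(table)
        if spec is None:
            errors.append(f"Unknown CDM table: {table}")
            continue
        # Columns that the spec does not know about.
        for col in columns:
            if col not in spec:
                errors.append(f"{table}: unexpected column {col}")
        # Required columns that were not supplied.
        for col, required in spec.items():
            if required and col not in columns:
                errors.append(f"{table}: missing required column {col}")
    return errors


ok = validate_columns({"person": ["person_id", "gender_concept_id", "year_of_birth"]})
bad = validate_columns({"person": ["person_id", "age"], "visits": []})
```

As with `validate_patient_data`, all problems are collected in one pass and an empty list signals valid input.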

CDM Construction

Build CdmReference objects from JSON test definitions or synthetic data.

patients_cdm

patients_cdm(
    json_path: str | Path,
    *,
    cdm_version: str = "5.4",
    cdm_name: str | None = None,
) -> CdmReference

Load patient data from a JSON file into a CdmReference.

Reads a JSON file with format:

{
    "_meta": {"test_name": "...", "cdm_version": "5.4"},
    "person": [{"person_id": 1, ...}, ...],
    "observation_period": [...]
}

Creates Polars DataFrames for each table and wraps them as CdmTable (or CohortTable when appropriate).

Unlike the R equivalent, which downloads an empty Eunomia CDM, this function creates in-memory tables from the JSON data only. Vocabulary tables are not included; use cdm_from_con with a real database if vocabulary tables are needed.

Parameters:

Name	Type	Description	Default
`json_path`	`str \| Path`	Path to the JSON file.	required
`cdm_version`	`str`	CDM version string (`"5.3"` or `"5.4"`). Overridden by the `_meta.cdm_version` field in the JSON if present.	`'5.4'`
`cdm_name`	`str \| None`	Human-readable name for this CDM. Defaults to the JSON file stem or `_meta.test_name`.	`None`

Returns:

Type	Description
`CdmReference`	A `CdmReference` backed by in-memory Polars DataFrames.
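The JSON layout described above can be produced and inspected with the standard library, which may help when preparing a file by hand. This is a sketch only: the rows are made up, `split_meta` is a hypothetical helper, and storing dates as ISO strings is an assumption rather than a documented requirement.

```python
from __future__ import annotations

import json
import tempfile
from pathlib import Path

# Illustrative payload in the documented layout; the rows are invented.
payload = {
    "_meta": {"test_name": "demo", "cdm_version": "5.4"},
    "person": [{"person_id": 1, "year_of_birth": 1980}],
    "observation_period": [
        {
            "observation_period_id": 1,
            "person_id": 1,
            "observation_period_start_date": "2020-01-01",  # assumed ISO format
            "observation_period_end_date": "2020-12-31",
        }
    ],
}


def split_meta(doc: dict) -> tuple[dict, dict[str, list[dict]]]:
    """Hypothetical helper: separate the _meta block from the table rows."""
    meta = doc.get("_meta", {})
    tables = {name: rows for name, rows in doc.items() if name != "_meta"}
    return meta, tables


with tempfile.TemporaryDirectory() as tmp:
    json_path = Path(tmp) / "demo.json"
    json_path.write_text(json.dumps(payload, indent=2))
    meta, tables = split_meta(json.loads(json_path.read_text()))

# Mirrors the documented rule: _meta.cdm_version overrides the keyword default.
effective_version = meta.get("cdm_version", "5.4")
```

A file written this way is the kind of input `patients_cdm` expects as `json_path`.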

mock_test_cdm

mock_test_cdm(
    *,
    seed: int = 42,
    n_persons: int = 5,
    cdm_version: str = "5.4",
    include_conditions: bool = True,
    include_drugs: bool = True,
    include_measurements: bool = False,
) -> CdmReference

Create a small mock CDM with synthetic data for testing.

Generates realistic-looking synthetic data for person, observation_period, and optionally condition_occurrence, drug_exposure, and measurement tables.

This requires no database or file I/O — everything is created in-memory as Polars DataFrames.

Parameters:

Name	Type	Description	Default
`seed`	`int`	Random seed for reproducibility.	`42`
`n_persons`	`int`	Number of persons to generate.	`5`
`cdm_version`	`str`	CDM version string (`"5.3"` or `"5.4"`).	`'5.4'`
`include_conditions`	`bool`	Whether to generate `condition_occurrence`.	`True`
`include_drugs`	`bool`	Whether to generate `drug_exposure`.	`True`
`include_measurements`	`bool`	Whether to generate `measurement`.	`False`

Returns:

Type	Description
`CdmReference`	A `CdmReference` backed by in-memory Polars DataFrames.
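The role of `seed` can be illustrated with a small stdlib generator. This is not how `mock_test_cdm` builds its tables (that is not documented here); `toy_persons` is a hypothetical function and the value ranges are arbitrary. It shows only the general pattern of a locally seeded random generator yielding identical rows on every call.

```python
from __future__ import annotations

import random
from datetime import date, timedelta


def toy_persons(n_persons: int = 5, seed: int = 42) -> list[dict]:
    """Hypothetical generator: one person row with an observation window each."""
    rng = random.Random(seed)  # local RNG, so global random state is untouched
    rows = []
    for person_id in range(1, n_persons + 1):
        start = date(2015, 1, 1) + timedelta(days=rng.randint(0, 365))
        rows.append(
            {
                "person_id": person_id,
                "year_of_birth": rng.randint(1940, 2000),
                "observation_period_start_date": start,
                # The window is always at least 30 days long.
                "observation_period_end_date": start + timedelta(days=rng.randint(30, 1000)),
            }
        )
    return rows


first = toy_persons(seed=42)
second = toy_persons(seed=42)
```

Calling the generator twice with the same seed yields equal lists, which is the property a test suite relies on when it pins `seed`.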

Template Generation

Generate blank Excel templates with CDM-compliant column headers.

generate_test_tables

generate_test_tables(
    table_names: list[str],
    *,
    cdm_version: str = "5.4",
    output_path: str | Path = ".",
    filename: str | None = None,
) -> Path

Generate an empty Excel file with sheets for specified CDM tables.

Each sheet contains the correct column headers from the CDM specification. Vocabulary tables (concept, concept_ancestor, etc.) are excluded automatically.

Parameters:

Name	Type	Description	Default
`table_names`	`list[str]`	List of CDM table names to include as sheets.	required
`cdm_version`	`str`	CDM version string (`"5.3"` or `"5.4"`).	`'5.4'`
`output_path`	`str \| Path`	Directory where the file will be created.	`'.'`
`filename`	`str \| None`	Output filename. Defaults to `"test_patients_v{version}.xlsx"`.	`None`

Returns:

Type	Description
`Path`	Path to the created Excel file.

Raises:

Type	Description
`ValueError`	If any table name is not a valid CDM table, or if a vocabulary table is requested.
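`generate_test_tables` writes Excel. For the CSV directory layout that `read_patients` also accepts (one file per table, filename stem = table name), a comparable header-only template could be sketched as follows. `TOY_HEADERS` and `write_csv_templates` are hypothetical, and the column lists are an illustrative subset rather than the full CDM specification.

```python
from __future__ import annotations

import csv
import tempfile
from pathlib import Path

# Illustrative subset of CDM columns; a real template would take the full
# column list from the CDM specification for the chosen version.
TOY_HEADERS = {
    "person": ["person_id", "gender_concept_id", "year_of_birth"],
    "observation_period": [
        "observation_period_id",
        "person_id",
        "observation_period_start_date",
        "observation_period_end_date",
    ],
}


def write_csv_templates(table_names: list[str], output_dir: str | Path) -> list[Path]:
    """Hypothetical helper: write one header-only CSV per requested table."""
    unknown = [t for t in table_names if t not in TOY_HEADERS]
    if unknown:
        raise ValueError(f"Not in the toy spec: {unknown}")
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for name in table_names:
        path = out / f"{name}.csv"
        with path.open("w", newline="") as fh:
            csv.writer(fh).writerow(TOY_HEADERS[name])
        paths.append(path)
    return paths


with tempfile.TemporaryDirectory() as tmp:
    created = write_csv_templates(["person", "observation_period"], tmp)
    stems = [p.stem for p in created]
    person_header = created[0].read_text().strip()
```

Each file holds a single header row, ready to be filled with patient rows before being passed to `read_patients` as a directory.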

Visualization

Plot cohort membership timelines for individual patients.

graph_cohort

graph_cohort(
    subject_id: int,
    cohorts: dict[str, DataFrame],
    *,
    style: Any | None = None,
) -> Any

Plot cohort timelines for a single subject.

Each cohort is a named DataFrame with columns cohort_definition_id, subject_id, cohort_start_date, cohort_end_date. This function draws a horizontal segment for each cohort entry for the given subject_id.

Parameters:

Name	Type	Description	Default
`subject_id`	`int`	The subject to visualize.	required
`cohorts`	`dict[str, DataFrame]`	Mapping of cohort name to cohort DataFrame.	required
`style`	`Any \| None`	Optional Plotly layout overrides (dict or `None`).	`None`

Returns:

Type	Description
`Any`	A `plotly.graph_objects.Figure`.

Raises:

Type	Description
`ValueError`	If no cohort records are found for the subject, or if required columns are missing.
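The data preparation behind such a plot (select one subject's rows, one horizontal segment per cohort entry, fail if nothing matches) can be sketched without Plotly or Polars. Plain lists of dicts stand in for the cohort DataFrames, and `cohort_segments` is a hypothetical helper, not part of omopy.

```python
from __future__ import annotations

from datetime import date

REQUIRED = {"cohort_definition_id", "subject_id", "cohort_start_date", "cohort_end_date"}


def cohort_segments(
    subject_id: int, cohorts: dict[str, list[dict]]
) -> list[tuple[str, date, date]]:
    """Hypothetical helper: (cohort name, start, end) per entry for one subject."""
    segments = []
    for name, rows in cohorts.items():
        for row in rows:
            missing = REQUIRED - set(row)
            if missing:
                raise ValueError(f"{name}: missing columns {sorted(missing)}")
            if row["subject_id"] == subject_id:
                segments.append((name, row["cohort_start_date"], row["cohort_end_date"]))
    if not segments:
        raise ValueError(f"No cohort records found for subject {subject_id}")
    return segments


# Invented example data: two cohorts, subject 1 appears in both.
cohorts = {
    "exposure": [
        {"cohort_definition_id": 1, "subject_id": 1,
         "cohort_start_date": date(2020, 1, 1), "cohort_end_date": date(2020, 3, 1)},
        {"cohort_definition_id": 1, "subject_id": 2,
         "cohort_start_date": date(2020, 2, 1), "cohort_end_date": date(2020, 2, 10)},
    ],
    "outcome": [
        {"cohort_definition_id": 2, "subject_id": 1,
         "cohort_start_date": date(2020, 2, 15), "cohort_end_date": date(2020, 2, 20)},
    ],
}
segments = cohort_segments(1, cohorts)
```

Each tuple corresponds to one horizontal segment on the timeline; a plotting layer would map the cohort name to the y-axis and the two dates to the x-axis.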