omopy.generics¶

Core type system for OMOPy — foundational types that all other modules depend on.

This module is the Python equivalent of R's omopgenerics package.

CDM Container Classes¶

CdmReference¶

CdmReference ¶

CdmReference(
    tables: dict[str, CdmTable] | None = None,
    *,
    cdm_version: CdmVersion = CdmVersion.V5_4,
    cdm_name: str = "",
    cdm_source: CdmSource | None = None,
)

Top-level container for an OMOP CDM instance.

Holds a collection of named CDM tables and optional source metadata. Behaves like a dict: cdm["person"] returns the person CdmTable.

Usage:

cdm = CdmReference(
    tables={"person": person_tbl, "observation_period": obs_tbl},
    cdm_version=CdmVersion.V5_4,
    cdm_name="my_cdm",
)
person = cdm["person"]
cdm["my_cohort"] = my_cohort_table  # insert new table

cdm_version `property` ¶

cdm_version: CdmVersion

The OMOP CDM version (5.3 or 5.4).

cdm_name `property` `writable` ¶

cdm_name: str

Human-readable name for this CDM instance.

cdm_source `property` ¶

cdm_source: CdmSource | None

The backend source, if any.

table_names `property` ¶

table_names: list[str]

Names of all tables currently in the CDM.

cohort_tables `property` ¶

cohort_tables: dict[str, CohortTable]

All tables that are CohortTable instances.

snapshot ¶

snapshot() -> dict[str, Any]

Return a summary snapshot of the CDM (table names, row counts, etc.).

select_tables ¶

select_tables(names: list[str]) -> CdmReference

Create a new CdmReference with only the specified tables.
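The dict-like behaviour and select_tables can be pictured with a small stand-in container. This is only a sketch using the standard library: `MiniCdm` is a hypothetical class, not omopy's CdmReference, and the string values stand in for real table objects.

```python
from collections.abc import Iterator, MutableMapping
from typing import Any


class MiniCdm(MutableMapping[str, Any]):
    """Hypothetical stand-in for a dict-like table container (not omopy's class)."""

    def __init__(self, tables: dict[str, Any], *, cdm_name: str = "") -> None:
        self._tables: dict[str, Any] = dict(tables)
        self.cdm_name = cdm_name

    def __getitem__(self, name: str) -> Any:
        return self._tables[name]

    def __setitem__(self, name: str, table: Any) -> None:
        self._tables[name] = table

    def __delitem__(self, name: str) -> None:
        del self._tables[name]

    def __iter__(self) -> Iterator[str]:
        return iter(self._tables)

    def __len__(self) -> int:
        return len(self._tables)

    @property
    def table_names(self) -> list[str]:
        return list(self._tables)

    def select_tables(self, names: list[str]) -> "MiniCdm":
        # Build a new container; the original is left untouched.
        return MiniCdm({n: self._tables[n] for n in names}, cdm_name=self.cdm_name)


cdm = MiniCdm({"person": "person_tbl", "observation_period": "obs_tbl"}, cdm_name="my_cdm")
cdm["my_cohort"] = "cohort_tbl"          # insert a new table
subset = cdm.select_tables(["person"])   # new container with one table
```

Returning a new container from `select_tables` keeps the original usable afterwards, which matches the "create a new CdmReference" wording above.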

CdmSource¶

CdmSource ¶

Bases: Protocol

Protocol for CDM data sources (database backends, local files, etc.).

Implementations in later phases:

- DbSource (Phase 1): database-backed via Ibis/SQLAlchemy
- LocalCdm (Phase 0): in-memory Polars DataFrames

This protocol defines the minimal interface that CdmReference needs from its backend.

source_type `property` ¶

source_type: str

Backend identifier (e.g. 'local', 'duckdb', 'postgres').

list_tables ¶

list_tables() -> list[str]

Return names of all available tables in the source.

read_table ¶

read_table(table_name: str) -> CdmTable

Read a table from the source, returning a CdmTable.

write_table ¶

write_table(
    table: CdmTable, table_name: str | None = None
) -> None

Write/compute a table into the source.

drop_table ¶

drop_table(table_name: str) -> None

Drop a table from the source.

CdmTable¶

CdmTable ¶

CdmTable(
    data: DataFrame | LazyFrame | Any,
    *,
    tbl_name: str,
    tbl_source: str = "local",
    cdm: CdmReference | None = None,
)

A named table in an OMOP CDM, wrapping a concrete data source.

The class preserves three key pieces of metadata through transformations:

- tbl_name: canonical CDM table name (e.g. "person").
- tbl_source: string identifier for the source (e.g. "duckdb").
- cdm: weak back-reference to the parent CdmReference.

Derived tables (filter, join, etc.) should be created with _with_data, which produces a new CdmTable that inherits the metadata.

data `property` ¶

data: DataFrame | LazyFrame | Any

The underlying data (Polars DF/LF or Ibis table expression).

tbl_name `property` ¶

tbl_name: str

Canonical CDM table name.

tbl_source `property` ¶

tbl_source: str

Source identifier (e.g. 'local', 'duckdb', 'postgres').

cdm `property` `writable` ¶

cdm: CdmReference | None

Back-reference to the parent CDM reference, if any.

columns `property` ¶

columns: list[str]

Column names of the underlying data.

schema `property` ¶

schema: dict[str, Any]

Column name -> dtype mapping.

filter ¶

filter(*predicates: Any, **named_predicates: Any) -> Self

Filter rows, preserving CdmTable metadata.

select ¶

select(*exprs: Any, **named_exprs: Any) -> Self

Select columns, preserving CdmTable metadata.

rename ¶

rename(mapping: dict[str, str]) -> Self

Rename columns, preserving CdmTable metadata.

join ¶

join(
    other: CdmTable | DataFrame | LazyFrame | Any,
    on: str | list[str] | None = None,
    how: str = "inner",
    **kwargs: Any,
) -> Self

Join with another table, preserving this table's metadata.

head ¶

head(n: int = 5) -> Self

Return first n rows, preserving metadata.

collect ¶

collect() -> pl.DataFrame

Materialize the data to a Polars DataFrame.

For lazy sources (LazyFrame, Ibis), this triggers execution. Uses PyArrow as the zero-copy interchange format when available.

count ¶

count() -> int

Return the number of rows.

For Ibis-backed tables, uses the database's COUNT(*) rather than materialising the full table.

CohortTable¶

CohortTable ¶

CohortTable(
    data: DataFrame | LazyFrame | Any,
    *,
    tbl_name: str = "cohort",
    tbl_source: str = "local",
    cdm: CdmReference | None = None,
    settings: DataFrame | None = None,
    attrition: DataFrame | None = None,
    cohort_codelist: DataFrame | None = None,
)

Bases: CdmTable

A specialised CDM table representing a generated cohort.

Extends CdmTable with three pieces of companion metadata:

- settings: a DataFrame mapping cohort_definition_id to cohort_name (and possibly other columns).
- attrition: a DataFrame tracking inclusion/exclusion at each step.
- cohort_codelist: a Codelist of concept IDs used to generate each cohort.

These mirror the R cohort_set, cohort_attrition, and cohort_codelist attributes.

settings `property` `writable` ¶

settings: DataFrame

Cohort settings (cohort_set) DataFrame.

attrition `property` `writable` ¶

attrition: DataFrame

Cohort attrition DataFrame.

cohort_codelist `property` `writable` ¶

cohort_codelist: DataFrame

Cohort codelist DataFrame.

cohort_ids `property` ¶

cohort_ids: list[int]

Distinct cohort definition IDs from the settings.

cohort_names `property` ¶

cohort_names: list[str]

Cohort names from the settings.

cohort_count ¶

cohort_count() -> pl.DataFrame

Compute number of records and subjects per cohort definition.

Codelist Types¶

Codelist¶

Codelist ¶

Codelist(
    data: dict[str, list[int]] | None = None,
    /,
    **kwargs: list[int],
)

Bases: dict[str, list[int]]

A named collection of concept ID lists.

Inherits from dict[str, list[int]]. Keys are codelist names, values are lists of integer concept IDs.

Usage:

cl = Codelist({"diabetes": [201826, 442793], "hypertension": [316866]})
assert "diabetes" in cl
assert cl["diabetes"] == [201826, 442793]

names `property` ¶

names: list[str]

Return codelist names.

all_concept_ids `property` ¶

all_concept_ids: set[int]

Return all unique concept IDs across all codelists.
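The two properties above can be approximated with a short dict subclass. `MiniCodelist` is a hypothetical stand-in written for illustration; omopy's Codelist may differ in detail (for example in validation).

```python
class MiniCodelist(dict[str, list[int]]):
    """Hypothetical stand-in for a named collection of concept ID lists."""

    @property
    def names(self) -> list[str]:
        return list(self.keys())

    @property
    def all_concept_ids(self) -> set[int]:
        # Union of the concept IDs from every named list; duplicates collapse.
        return {cid for ids in self.values() for cid in ids}


cl = MiniCodelist({"diabetes": [201826, 442793], "hypertension": [316866, 201826]})
unique_ids = cl.all_concept_ids
```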

ConceptEntry¶

ConceptEntry ¶

Bases: BaseModel

A single concept within a concept set expression.

Matches the ATLAS JSON format:

{
  "concept": {"CONCEPT_ID": 123, "CONCEPT_NAME": "Foo", ...},
  "isExcluded": false,
  "includeDescendants": true,
  "includeMapped": false
}

ConceptSetExpression¶

ConceptSetExpression ¶

ConceptSetExpression(
    data: dict[str, list[ConceptEntry]] | None = None,
    /,
    **kwargs: list[ConceptEntry],
)

Bases: dict[str, list[ConceptEntry]]

A named collection of concept set expressions (with flags).

Each entry includes concept metadata plus is_excluded, include_descendants, and include_mapped flags.

Usage:

cse = ConceptSetExpression({
    "diabetes": [
        ConceptEntry(concept_id=201826, include_descendants=True),
        ConceptEntry(concept_id=442793, is_excluded=True),
    ]
})

to_codelist ¶

to_codelist() -> Codelist

Convert to a simple Codelist.

Drops flags, keeping only included concepts.
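A rough sketch of that conversion, using a hypothetical `Entry` dataclass in place of ConceptEntry. It only filters on the exclusion flag; whether the real method also resolves descendants or mapped concepts is not stated here, so the sketch does not attempt it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """Hypothetical stand-in for a concept entry with its flags."""
    concept_id: int
    is_excluded: bool = False
    include_descendants: bool = False
    include_mapped: bool = False


def to_plain_codelist(expression: dict[str, list[Entry]]) -> dict[str, list[int]]:
    # Keep only concepts that are not flagged as excluded; drop all flags.
    return {
        name: [e.concept_id for e in entries if not e.is_excluded]
        for name, entries in expression.items()
    }


expression = {
    "diabetes": [
        Entry(concept_id=201826, include_descendants=True),
        Entry(concept_id=442793, is_excluded=True),
    ]
}
plain = to_plain_codelist(expression)
```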

Summarised Results¶

SummarisedResult¶

SummarisedResult ¶

SummarisedResult(
    data: DataFrame, *, settings: DataFrame | None = None
)

Standard OHDSI summarised result format.

Wraps a Polars DataFrame with the 13 required columns plus a companion settings DataFrame. Provides methods for:

Suppression (suppress)
Splitting name-level pairs (split_group, split_strata, etc.)
Uniting columns into name-level pairs (unite_group, unite_strata, etc.)
Pivoting estimates (pivot_estimates)
Adding settings (add_settings)
Filtering by settings, strata, or group values

data `property` ¶

data: DataFrame

The underlying result DataFrame.

settings `property` `writable` ¶

settings: DataFrame

The companion settings DataFrame.

suppress ¶

suppress(min_cell_count: int = 5) -> SummarisedResult

Suppress estimate values where counts are below min_cell_count.

Following the R implementation:

1. Identify rows where variable_name is in GROUP_COUNT_VARIABLES and estimate_value < min_cell_count.
2. Mark those result_id + group + strata + variable combinations.
3. Set estimate_value to "-" (suppressed sentinel) for those rows and linked percentage rows.
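The suppression steps can be sketched on plain dictionaries instead of a Polars DataFrame. Several details here are assumptions made for illustration: the function name `suppress_rows`, the choice of key columns, and the rule that only rows with estimate_name "count" are compared against the threshold.

```python
GROUP_COUNT_VARIABLES = ("number subjects", "number records")
# Assumed identity of a "result_id + group + strata + variable" combination.
KEY_COLUMNS = ("result_id", "group_name", "group_level",
               "strata_name", "strata_level", "variable_name")


def suppress_rows(rows: list[dict[str, str]], min_cell_count: int = 5) -> list[dict[str, str]]:
    # Steps 1 and 2: collect the key of every count row below the threshold.
    flagged = set()
    for row in rows:
        if (row["variable_name"] in GROUP_COUNT_VARIABLES
                and row["estimate_name"] == "count"
                and int(row["estimate_value"]) < min_cell_count):
            flagged.add(tuple(row[c] for c in KEY_COLUMNS))

    # Step 3: replace the value on every row sharing a flagged key,
    # which covers the count itself and its linked percentage.
    return [
        {**row, "estimate_value": "-"}
        if tuple(row[c] for c in KEY_COLUMNS) in flagged else dict(row)
        for row in rows
    ]


def make_row(strata_level: str, estimate_name: str, estimate_value: str) -> dict[str, str]:
    return {"result_id": "1", "group_name": "cohort_name", "group_level": "diabetes",
            "strata_name": "sex", "strata_level": strata_level,
            "variable_name": "number subjects",
            "estimate_name": estimate_name, "estimate_value": estimate_value}


rows = [
    make_row("female", "count", "3"),
    make_row("female", "percentage", "2.5"),
    make_row("male", "count", "117"),
    make_row("male", "percentage", "97.5"),
]
suppressed = suppress_rows(rows)
```

In this example the female stratum falls below the default threshold of 5, so both its count and its percentage are replaced, while the male stratum is left as is.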

split_group ¶

split_group() -> pl.DataFrame

Split group_name/group_level into individual columns.

split_strata ¶

split_strata() -> pl.DataFrame

Split strata_name/strata_level into individual columns.

split_additional ¶

split_additional() -> pl.DataFrame

Split additional_name/additional_level into individual columns.

split_all ¶

split_all() -> pl.DataFrame

Split all name-level pair columns.
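Splitting a name-level pair amounts to cutting both strings on the NAME_LEVEL_SEP separator and zipping the pieces. The helper below is a hypothetical, per-row sketch (the real methods work on whole DataFrames), and treating an "overall"/"overall" pair as producing no columns is an assumption.

```python
NAME_LEVEL_SEP = " &&& "
OVERALL = "overall"


def split_name_level(name: str, level: str) -> dict[str, str]:
    # Assumption: "overall" marks the absence of any grouping, so it yields no columns.
    if name == OVERALL and level == OVERALL:
        return {}
    names = name.split(NAME_LEVEL_SEP)
    levels = level.split(NAME_LEVEL_SEP)
    if len(names) != len(levels):
        raise ValueError(f"mismatched name/level pair: {name!r} vs {level!r}")
    return dict(zip(names, levels))


columns = split_name_level("age_group &&& sex", "18 to 65 &&& female")
```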

unite_group ¶

unite_group(columns: list[str]) -> SummarisedResult

Unite columns into group_name/group_level.

unite_strata ¶

unite_strata(columns: list[str]) -> SummarisedResult

Unite columns into strata_name/strata_level.

unite_additional ¶

unite_additional(columns: list[str]) -> SummarisedResult

Unite columns into additional_name/additional_level.
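Uniting is the reverse operation: the chosen column names are joined into the name string and their values into the level string. This per-row helper is a hypothetical sketch, and returning "overall" for an empty column list is an assumption.

```python
NAME_LEVEL_SEP = " &&& "
OVERALL = "overall"


def unite_name_level(row: dict[str, str], columns: list[str]) -> tuple[str, str]:
    # Assumption: with nothing to unite, both name and level fall back to "overall".
    if not columns:
        return OVERALL, OVERALL
    name = NAME_LEVEL_SEP.join(columns)
    level = NAME_LEVEL_SEP.join(str(row[c]) for c in columns)
    return name, level


united = unite_name_level({"age_group": "18 to 65", "sex": "female"}, ["age_group", "sex"])
```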

pivot_estimates ¶

pivot_estimates() -> pl.DataFrame

Pivot estimate_name/estimate_value into wide format.

Each unique estimate_name becomes a column, with values from estimate_value, cast according to estimate_type.
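A minimal sketch of the pivot on plain dictionaries. The function name `pivot_rows`, the `id_columns` parameter, and the mapping from estimate_type strings to Python casts are all assumptions for illustration; unknown types are left as strings.

```python
from typing import Any, Callable

# Assumed mapping from estimate_type to a cast; anything else stays a string.
CASTS: dict[str, Callable[[str], Any]] = {
    "integer": int,
    "numeric": float,
    "percentage": float,
}


def pivot_rows(rows: list[dict[str, str]], id_columns: list[str]) -> list[dict[str, Any]]:
    wide: dict[tuple[str, ...], dict[str, Any]] = {}
    for row in rows:
        key = tuple(row[c] for c in id_columns)
        # One output record per distinct combination of identifying columns.
        record = wide.setdefault(key, {c: row[c] for c in id_columns})
        cast = CASTS.get(row["estimate_type"], str)
        record[row["estimate_name"]] = cast(row["estimate_value"])
    return list(wide.values())


long_rows = [
    {"variable_name": "age", "estimate_name": "mean",
     "estimate_type": "numeric", "estimate_value": "45.2"},
    {"variable_name": "age", "estimate_name": "count",
     "estimate_type": "integer", "estimate_value": "120"},
]
wide_rows = pivot_rows(long_rows, ["variable_name"])
```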

add_settings ¶

add_settings(
    columns: list[str] | None = None,
) -> pl.DataFrame

Join settings columns to the result data.

If columns is None, all settings columns are joined.

filter_settings ¶

filter_settings(**kwargs: Any) -> SummarisedResult

Filter by settings values.

Example:

result.filter_settings(result_type="cohort_count")

filter_group ¶

filter_group(**kwargs: str) -> SummarisedResult

Filter by group name-level pairs.

filter_strata ¶

filter_strata(**kwargs: str) -> SummarisedResult

Filter by strata name-level pairs.

filter_additional ¶

filter_additional(**kwargs: str) -> SummarisedResult

Filter by additional name-level pairs.

tidy ¶

tidy() -> pl.DataFrame

Convert to a tidy DataFrame.

Adds settings, splits all name-level pairs, and pivots the estimates.

Schema Definitions¶

CdmSchema¶

CdmSchema ¶

CdmSchema(version: CdmVersion = CdmVersion.V5_4)

Registry for OMOP CDM schema specifications.

All data is lazily loaded and cached at the class level on first access.

Usage:

schema = CdmSchema(CdmVersion.V5_4)
person_fields = schema.fields_for_table("person")
required_tables = schema.required_table_names()

field_specs `property` ¶

field_specs: tuple[FieldSpec, ...]

All field specs for this CDM version.

table_specs `property` ¶

table_specs: tuple[TableSpec, ...]

All table-level specs for this CDM version.

result_field_specs `property` ¶

result_field_specs: tuple[ResultFieldSpec, ...]

Specs for summarised_result / settings fields.

field_table_columns `property` ¶

field_table_columns: tuple[_FieldTableColumn, ...]

Semantic column mappings for clinical tables.

fields_for_table ¶

fields_for_table(table_name: str) -> tuple[FieldSpec, ...]

Return field specs for a specific table.

required_fields_for_table ¶

required_fields_for_table(
    table_name: str,
) -> tuple[FieldSpec, ...]

Return only required field specs for a table.

table_names ¶

table_names(
    *, table_type: TableType | None = None
) -> tuple[str, ...]

Return all table names, optionally filtered by type.

required_table_names ¶

required_table_names() -> tuple[str, ...]

Return names of tables marked as required at table level.

table_names_in_group ¶

table_names_in_group(group: TableGroup) -> tuple[str, ...]

Return table names belonging to a logical group.

table_spec_for ¶

table_spec_for(table_name: str) -> TableSpec | None

Return the TableSpec for a specific table, or None.

field_column_info ¶

field_column_info(
    table_name: str,
) -> _FieldTableColumn | None

Get semantic column mapping for a clinical table.

validate_columns ¶

validate_columns(
    table_name: str,
    columns: Sequence[str],
    *,
    check_required: bool = True,
) -> list[str]

Validate columns against the spec. Returns list of error messages.

Checks:

1. If check_required, all required columns must be present.
2. (Warning-level) Extra columns not in spec are noted.

FieldSpec¶

FieldSpec ¶

Bases: BaseModel

A single field in a CDM table (from fieldsTables).

varchar_length `property` ¶

varchar_length: int | None

Extract max length from varchar(N) or varchar(max).

TableSpec¶

TableSpec ¶

Bases: BaseModel

Table-level metadata from the CDM spec CSVs.

ResultFieldSpec¶

ResultFieldSpec ¶

Bases: BaseModel

Field specification for a summarised/compared result.

Enums¶

CdmVersion¶

CdmVersion ¶

Bases: StrEnum

Supported OMOP CDM versions.

CdmDataType¶

CdmDataType ¶

Bases: StrEnum

Data types used in OMOP CDM field specifications.

from_spec `classmethod` ¶

from_spec(raw: str) -> CdmDataType

Parse a CDM datatype string like 'varchar(50)' or 'integer'.

TableType¶

TableType ¶

Bases: StrEnum

Classification of CDM table types.

TableGroup¶

TableGroup ¶

Bases: StrEnum

Logical groupings of CDM tables for batch selection.

TableSchema¶

TableSchema ¶

Bases: StrEnum

Database schema a CDM table lives in.

Type Aliases & Constants¶

CdmVersionLiteral¶

CdmVersionLiteral `module-attribute` ¶

CdmVersionLiteral = Literal['5.3', '5.4']

SUPPORTED_CDM_VERSIONS¶

SUPPORTED_CDM_VERSIONS `module-attribute` ¶

SUPPORTED_CDM_VERSIONS: tuple[str, ...] = ('5.3', '5.4')

NAME_LEVEL_SEP¶

NAME_LEVEL_SEP `module-attribute` ¶

NAME_LEVEL_SEP: str = ' &&& '

OVERALL¶

OVERALL `module-attribute` ¶

OVERALL: str = 'overall'

COHORT_REQUIRED_COLUMNS¶

COHORT_REQUIRED_COLUMNS `module-attribute` ¶

COHORT_REQUIRED_COLUMNS: tuple[str, ...] = (
    "cohort_definition_id",
    "subject_id",
    "cohort_start_date",
    "cohort_end_date",
)

SUMMARISED_RESULT_COLUMNS¶

SUMMARISED_RESULT_COLUMNS `module-attribute` ¶

SUMMARISED_RESULT_COLUMNS: tuple[str, ...] = (
    "result_id",
    "cdm_name",
    "group_name",
    "group_level",
    "strata_name",
    "strata_level",
    "variable_name",
    "variable_level",
    "estimate_name",
    "estimate_type",
    "estimate_value",
    "additional_name",
    "additional_level",
)

SETTINGS_REQUIRED_COLUMNS¶

SETTINGS_REQUIRED_COLUMNS `module-attribute` ¶

SETTINGS_REQUIRED_COLUMNS: tuple[str, ...] = (
    "result_id",
    "result_type",
    "package_name",
    "package_version",
)

GROUP_COUNT_VARIABLES¶

GROUP_COUNT_VARIABLES `module-attribute` ¶

GROUP_COUNT_VARIABLES: tuple[str, ...] = (
    "number subjects",
    "number records",
)

Validation Functions¶

assert_character ¶

assert_character(
    value: Any,
    *,
    name: str = "value",
    min_length: int | None = None,
    max_length: int | None = None,
    na_allowed: bool = True,
    null_allowed: bool = False,
) -> None

Assert value is a string or sequence of strings.

assert_choice ¶

assert_choice(
    value: Any,
    choices: Sequence[Any],
    *,
    name: str = "value",
    null_allowed: bool = False,
) -> None

Assert value is one of the given choices.

assert_class ¶

assert_class(
    value: Any,
    cls: type | tuple[type, ...],
    *,
    name: str = "value",
    null_allowed: bool = False,
) -> None

Assert value is an instance of cls.

assert_date ¶

assert_date(
    value: Any,
    *,
    name: str = "value",
    null_allowed: bool = False,
) -> None

Assert value is a datetime.date (or datetime).

assert_list ¶

assert_list(
    value: Any,
    *,
    name: str = "value",
    element_class: type | None = None,
    min_length: int | None = None,
    null_allowed: bool = False,
) -> None

Assert value is a list (or sequence).

assert_logical ¶

assert_logical(
    value: Any,
    *,
    name: str = "value",
    null_allowed: bool = False,
) -> None

Assert value is a boolean.

assert_numeric ¶

assert_numeric(
    value: Any,
    *,
    name: str = "value",
    min_val: int | float | None = None,
    max_val: int | float | None = None,
    null_allowed: bool = False,
) -> None

Assert value is numeric (int or float).

assert_true ¶

assert_true(
    condition: bool, *, msg: str = "Assertion failed"
) -> None

Assert a boolean condition is True.

assert_table_columns ¶

assert_table_columns(
    columns: Sequence[str],
    required: Sequence[str],
    *,
    table_name: str = "table",
) -> None

Assert all required columns are present in columns.
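The column check can be sketched in a few lines. `check_table_columns` is a hypothetical stand-in, and raising ValueError is an assumption, since the exception type used by the real function is not stated here.

```python
from collections.abc import Sequence

COHORT_REQUIRED_COLUMNS = (
    "cohort_definition_id",
    "subject_id",
    "cohort_start_date",
    "cohort_end_date",
)


def check_table_columns(columns: Sequence[str], required: Sequence[str],
                        *, table_name: str = "table") -> None:
    missing = [c for c in required if c not in columns]
    if missing:
        # Assumption: a ValueError naming the table and the missing columns.
        raise ValueError(f"{table_name} is missing required columns: {missing}")


# Extra columns are fine; only missing required ones are reported.
check_table_columns([*COHORT_REQUIRED_COLUMNS, "age"], COHORT_REQUIRED_COLUMNS,
                    table_name="cohort")

try:
    check_table_columns(["subject_id"], COHORT_REQUIRED_COLUMNS, table_name="cohort")
    error_message = ""
except ValueError as exc:
    error_message = str(exc)
```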

I/O Functions¶

export_codelist ¶

export_codelist(
    codelist: Codelist,
    path: str | Path,
    *,
    format: str = "csv",
) -> Path

Export a Codelist to a file.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `codelist` | `Codelist` | The codelist to export. | required |
| `path` | `str \| Path` | Directory to write files to. | required |
| `format` | `str` | `'csv'` writes a single CSV; `'json'` writes one JSON per codelist entry (ATLAS format). | `'csv'` |

import_codelist ¶

import_codelist(
    path: str | Path, *, format: str | None = None
) -> Codelist

Import a Codelist from file(s).

If path is a CSV file, it is read directly and must contain the columns codelist_name and concept_id. If path is a directory, every .json file in it is read as an individual concept set.
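The CSV layout described above (one row per codelist_name and concept_id pair) can be read with the standard library alone. `read_codelist_csv` is a hypothetical helper that returns a plain dict rather than a Codelist, and it reads from an open text handle so the example needs no files on disk.

```python
import csv
import io
from typing import TextIO


def read_codelist_csv(handle: TextIO) -> dict[str, list[int]]:
    codelist: dict[str, list[int]] = {}
    for row in csv.DictReader(handle):
        # Group concept IDs under their codelist name, preserving file order.
        codelist.setdefault(row["codelist_name"], []).append(int(row["concept_id"]))
    return codelist


csv_text = (
    "codelist_name,concept_id\n"
    "diabetes,201826\n"
    "diabetes,442793\n"
    "hypertension,316866\n"
)
loaded = read_codelist_csv(io.StringIO(csv_text))
```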

export_concept_set_expression ¶

export_concept_set_expression(
    cse: ConceptSetExpression,
    path: str | Path,
    *,
    format: str = "json",
) -> Path

Export a ConceptSetExpression to JSON files (one per concept set).

import_concept_set_expression ¶

import_concept_set_expression(
    path: str | Path, *, format: str | None = None
) -> ConceptSetExpression

Import a ConceptSetExpression from JSON file(s) or a CSV.

export_summarised_result ¶

export_summarised_result(
    result: SummarisedResult,
    path: str | Path,
    *,
    min_cell_count: int = 5,
) -> Path

Export a SummarisedResult to a CSV file.

Applies suppression before export. Settings are stored as additional rows in the same CSV with a special marker column.

import_summarised_result ¶

import_summarised_result(
    path: str | Path,
) -> SummarisedResult

Import a SummarisedResult from a CSV file.
