omopy.generics

Core type system for OMOPy — foundational types that all other modules depend on.

This module is the Python equivalent of R's omopgenerics package.

CDM Container Classes

CdmReference

CdmReference(
    tables: dict[str, CdmTable] | None = None,
    *,
    cdm_version: CdmVersion = CdmVersion.V5_4,
    cdm_name: str = "",
    cdm_source: CdmSource | None = None,
)

Top-level container for an OMOP CDM instance.

Holds a collection of named CDM tables and optional source metadata. Behaves like a dict: cdm["person"] returns the person CdmTable.

Usage:

cdm = CdmReference(
    tables={"person": person_tbl, "observation_period": obs_tbl},
    cdm_version=CdmVersion.V5_4,
    cdm_name="my_cdm",
)
person = cdm["person"]
cdm["my_cohort"] = my_cohort_table  # insert new table

cdm_version property

cdm_version: CdmVersion

The OMOP CDM version (5.3 or 5.4).

cdm_name property writable

cdm_name: str

Human-readable name for this CDM instance.

cdm_source property

cdm_source: CdmSource | None

The backend source, if any.

table_names property

table_names: list[str]

Names of all tables currently in the CDM.

cohort_tables property

cohort_tables: dict[str, CohortTable]

All tables that are CohortTable instances.

snapshot

snapshot() -> dict[str, Any]

Return a summary snapshot of the CDM (table names, row counts, etc.).

select_tables

select_tables(names: list[str]) -> CdmReference

Create a new CdmReference with only the specified tables.

CdmSource

Bases: Protocol

Protocol for CDM data sources (database backends, local files, etc.).

Implementations in later phases:

  • DbSource (Phase 1): database-backed via Ibis/SQLAlchemy
  • LocalCdm (Phase 0): in-memory Polars DataFrames

This protocol defines the minimal interface that CdmReference needs from its backend.

source_type property

source_type: str

Backend identifier (e.g. 'local', 'duckdb', 'postgres').

list_tables

list_tables() -> list[str]

Return names of all available tables in the source.

read_table

read_table(table_name: str) -> CdmTable

Read a table from the source, returning a CdmTable.

write_table

write_table(
    table: CdmTable, table_name: str | None = None
) -> None

Write/compute a table into the source.

drop_table

drop_table(table_name: str) -> None

Drop a table from the source.

CdmTable

CdmTable(
    data: DataFrame | LazyFrame | Any,
    *,
    tbl_name: str,
    tbl_source: str = "local",
    cdm: CdmReference | None = None,
)

A named table in an OMOP CDM, wrapping a concrete data source.

The class preserves three key pieces of metadata through transformations:

  • tbl_name: canonical CDM table name (e.g. "person").
  • tbl_source: string identifier for the source (e.g. "duckdb").
  • cdm: weak back-reference to the parent CdmReference.

Derived tables (filter, join, etc.) should be created with _with_data, which produces a new CdmTable that inherits the metadata.

data property

data: DataFrame | LazyFrame | Any

The underlying data (a Polars DataFrame or LazyFrame, or an Ibis table expression).

tbl_name property

tbl_name: str

Canonical CDM table name.

tbl_source property

tbl_source: str

Source identifier (e.g. 'local', 'duckdb', 'postgres').

cdm property writable

cdm: CdmReference | None

Back-reference to the parent CDM reference, if any.

columns property

columns: list[str]

Column names of the underlying data.

schema property

schema: dict[str, Any]

Column name -> dtype mapping.

filter

filter(*predicates: Any, **named_predicates: Any) -> Self

Filter rows, preserving CdmTable metadata.

select

select(*exprs: Any, **named_exprs: Any) -> Self

Select columns, preserving CdmTable metadata.

rename

rename(mapping: dict[str, str]) -> Self

Rename columns, preserving CdmTable metadata.

join

join(
    other: CdmTable | DataFrame | LazyFrame | Any,
    on: str | list[str] | None = None,
    how: str = "inner",
    **kwargs: Any,
) -> Self

Join with another table, preserving this table's metadata.

head

head(n: int = 5) -> Self

Return first n rows, preserving metadata.

collect

collect() -> pl.DataFrame

Materialize the data to a Polars DataFrame.

For lazy sources (LazyFrame, Ibis), this triggers execution. Uses PyArrow as the zero-copy interchange format when available.

count

count() -> int

Return the number of rows.

For Ibis-backed tables, uses the database's COUNT(*) rather than materialising the full table.

CohortTable

CohortTable(
    data: DataFrame | LazyFrame | Any,
    *,
    tbl_name: str = "cohort",
    tbl_source: str = "local",
    cdm: CdmReference | None = None,
    settings: DataFrame | None = None,
    attrition: DataFrame | None = None,
    cohort_codelist: DataFrame | None = None,
)

Bases: CdmTable

A specialised CDM table representing a generated cohort.

Extends CdmTable with three pieces of companion metadata:

  • settings: a DataFrame mapping cohort_definition_id to cohort_name (and possibly other columns).
  • attrition: a DataFrame tracking inclusion/exclusion at each step.
  • cohort_codelist: a DataFrame of the concept IDs used to generate each cohort.

These mirror the R cohort_set, cohort_attrition, and cohort_codelist attributes.

settings property writable

settings: DataFrame

Cohort settings (cohort_set) DataFrame.

attrition property writable

attrition: DataFrame

Cohort attrition DataFrame.

cohort_codelist property writable

cohort_codelist: DataFrame

Cohort codelist DataFrame.

cohort_ids property

cohort_ids: list[int]

Distinct cohort definition IDs from the settings.

cohort_names property

cohort_names: list[str]

Cohort names from the settings.

cohort_count

cohort_count() -> pl.DataFrame

Compute number of records and subjects per cohort definition.
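The computation can be illustrated in plain Python. This is a sketch over a list of row dicts, not the library's Polars-based method; the function name count_cohorts and the output keys number_records and number_subjects are assumptions made for illustration.

```python
from collections import defaultdict

rows = [
    {"cohort_definition_id": 1, "subject_id": 10},
    {"cohort_definition_id": 1, "subject_id": 10},
    {"cohort_definition_id": 1, "subject_id": 11},
    {"cohort_definition_id": 2, "subject_id": 10},
]


def count_cohorts(rows):
    """Count records and distinct subjects per cohort_definition_id."""
    records = defaultdict(int)
    subjects = defaultdict(set)
    for row in rows:
        cohort_id = row["cohort_definition_id"]
        records[cohort_id] += 1                       # every row is one record
        subjects[cohort_id].add(row["subject_id"])    # a set deduplicates subjects
    return [
        {
            "cohort_definition_id": cohort_id,
            "number_records": records[cohort_id],
            "number_subjects": len(subjects[cohort_id]),
        }
        for cohort_id in sorted(records)
    ]


counts = count_cohorts(rows)
```

A subject with several cohort entries contributes to the record count each time but to the subject count only once.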

Codelist Types

Codelist

Codelist(
    data: dict[str, list[int]] | None = None,
    /,
    **kwargs: list[int],
)

Bases: dict[str, list[int]]

A named collection of concept ID lists.

Inherits from dict[str, list[int]]. Keys are codelist names, values are lists of integer concept IDs.

Usage:

cl = Codelist({"diabetes": [201826, 442793], "hypertension": [316866]})
assert "diabetes" in cl
assert cl["diabetes"] == [201826, 442793]

names property

names: list[str]

Return codelist names.

all_concept_ids property

all_concept_ids: set[int]

Return all unique concept IDs across all codelists.
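To show how the two properties could sit on top of a dict, here is a minimal stand-in. CodelistSketch is a hypothetical class written for this example and makes no claim about how the library implements Codelist.

```python
class CodelistSketch(dict):
    """Illustrative dict subclass; not the library's Codelist."""

    @property
    def names(self) -> list[str]:
        # Keys are codelist names, in insertion order.
        return list(self.keys())

    @property
    def all_concept_ids(self) -> set[int]:
        # Flatten every list and deduplicate across codelists.
        return {concept_id for ids in self.values() for concept_id in ids}


cl = CodelistSketch(
    {"diabetes": [201826, 442793]},
    hypertension=[316866, 201826],
)
```

Returning a set means a concept ID shared by two codelists, such as 201826 here, appears only once.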

ConceptEntry

Bases: BaseModel

A single concept within a concept set expression.

Matches the ATLAS JSON format:

{
  "concept": {"CONCEPT_ID": 123, "CONCEPT_NAME": "Foo", ...},
  "isExcluded": false,
  "includeDescendants": true,
  "includeMapped": false
}
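The mapping from the ATLAS keys to Python-style field names can be sketched with the standard library. EntrySketch and entry_from_atlas are hypothetical names; the sketch uses a dataclass rather than the Pydantic model the library is built on, and the False defaults for missing flags are an assumption.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class EntrySketch:
    """Illustrative stand-in for a concept entry (not the library's model)."""

    concept_id: int
    is_excluded: bool = False
    include_descendants: bool = False
    include_mapped: bool = False


def entry_from_atlas(item: dict) -> EntrySketch:
    """Translate one ATLAS-style item into snake_case fields."""
    return EntrySketch(
        concept_id=int(item["concept"]["CONCEPT_ID"]),
        is_excluded=bool(item.get("isExcluded", False)),
        include_descendants=bool(item.get("includeDescendants", False)),
        include_mapped=bool(item.get("includeMapped", False)),
    )


raw = """
{
  "concept": {"CONCEPT_ID": 123, "CONCEPT_NAME": "Foo"},
  "isExcluded": false,
  "includeDescendants": true,
  "includeMapped": false
}
"""
entry = entry_from_atlas(json.loads(raw))
```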

ConceptSetExpression

ConceptSetExpression(
    data: dict[str, list[ConceptEntry]] | None = None,
    /,
    **kwargs: list[ConceptEntry],
)

Bases: dict[str, list[ConceptEntry]]

A named collection of concept set expressions (with flags).

Each entry includes concept metadata plus is_excluded, include_descendants, and include_mapped flags.

Usage:

cse = ConceptSetExpression({
    "diabetes": [
        ConceptEntry(concept_id=201826, include_descendants=True),
        ConceptEntry(concept_id=442793, is_excluded=True),
    ]
})

to_codelist

to_codelist() -> Codelist

Convert to a simple Codelist.

Drops flags, keeping only included concepts.
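A minimal sketch of that conversion, assuming it amounts to filtering out excluded entries and keeping the bare IDs. FlaggedConcept and to_codelist_sketch are hypothetical names, and the sketch does not expand descendants or mapped concepts, which would require vocabulary tables.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlaggedConcept:
    """Illustrative stand-in for a concept entry with flags."""

    concept_id: int
    is_excluded: bool = False
    include_descendants: bool = False
    include_mapped: bool = False


def to_codelist_sketch(expressions: dict[str, list[FlaggedConcept]]) -> dict[str, list[int]]:
    """Keep the IDs of non-excluded entries; the flags are discarded."""
    return {
        name: [entry.concept_id for entry in entries if not entry.is_excluded]
        for name, entries in expressions.items()
    }


expressions = {
    "diabetes": [
        FlaggedConcept(201826, include_descendants=True),
        FlaggedConcept(442793, is_excluded=True),
    ]
}
codelist = to_codelist_sketch(expressions)
```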

Summarised Results

SummarisedResult

SummarisedResult(
    data: DataFrame, *, settings: DataFrame | None = None
)

Standard OHDSI summarised result format.

Wraps a Polars DataFrame with the 13 required columns plus a companion settings DataFrame. Provides methods for:

  • Suppression (suppress)
  • Splitting name-level pairs (split_group, split_strata, etc.)
  • Uniting columns into name-level pairs (unite_group, unite_strata, etc.)
  • Pivoting estimates (pivot_estimates)
  • Adding settings (add_settings)
  • Filtering by settings, strata, or group values

data property

data: DataFrame

The underlying result DataFrame.

settings property writable

settings: DataFrame

The companion settings DataFrame.

suppress

suppress(min_cell_count: int = 5) -> SummarisedResult

Suppress estimate values where counts are below min_cell_count.

Following the R implementation:

  1. Identify rows where variable_name is in GROUP_COUNT_VARIABLES and estimate_value < min_cell_count.
  2. Mark those result_id + group + strata + variable combinations.
  3. Set estimate_value to "-" (the suppressed sentinel) for those rows and for linked percentage rows.
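The three steps can be sketched over plain row dicts. This is a simplified reading, not the library's code: it assumes count rows carry estimate_name "count", that "linked" percentage rows are those sharing the same key with estimate_name "percentage", and that values are stored as strings. The names suppress_rows, KEY, and make_row are hypothetical.

```python
GROUP_COUNT_VARIABLES = ("number subjects", "number records")
SUPPRESSED = "-"
# Assumed key for "result_id + group + strata + variable" combinations.
KEY = ("result_id", "group_name", "group_level",
       "strata_name", "strata_level", "variable_name")


def suppress_rows(rows, min_cell_count=5):
    def key(row):
        return tuple(row[column] for column in KEY)

    # Steps 1 and 2: find small counts and remember their combinations.
    flagged = {
        key(row)
        for row in rows
        if row["variable_name"] in GROUP_COUNT_VARIABLES
        and row["estimate_name"] == "count"
        and float(row["estimate_value"]) < min_cell_count
    }
    # Step 3: blank the count and its linked percentage; copy rows, do not mutate.
    result = []
    for row in rows:
        if key(row) in flagged and row["estimate_name"] in ("count", "percentage"):
            row = {**row, "estimate_value": SUPPRESSED}
        result.append(row)
    return result


def make_row(strata_level, estimate_name, estimate_value):
    return {
        "result_id": 1, "group_name": "cohort_name", "group_level": "diabetes",
        "strata_name": "sex", "strata_level": strata_level,
        "variable_name": "number subjects",
        "estimate_name": estimate_name, "estimate_value": estimate_value,
    }


rows = [
    make_row("Female", "count", "3"), make_row("Female", "percentage", "13.0"),
    make_row("Male", "count", "20"), make_row("Male", "percentage", "87.0"),
]
suppressed = suppress_rows(rows)
```

Suppressing the percentage together with the count matters because a percentage and a known total would otherwise let a reader recover the small count.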

split_group

split_group() -> pl.DataFrame

Split group_name/group_level into individual columns.

split_strata

split_strata() -> pl.DataFrame

Split strata_name/strata_level into individual columns.

split_additional

split_additional() -> pl.DataFrame

Split additional_name/additional_level into individual columns.

split_all

split_all() -> pl.DataFrame

Split all name-level pair columns.
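What splitting a name-level pair means can be shown on a single row. The sketch below assumes names and levels are both joined with NAME_LEVEL_SEP (documented under Type Aliases & Constants); split_name_level is a hypothetical helper, and how the library treats the "overall" placeholder is not covered.

```python
NAME_LEVEL_SEP = " &&& "


def split_name_level(name: str, level: str) -> dict[str, str]:
    """Turn one name/level pair into a {column_name: value} mapping."""
    names = name.split(NAME_LEVEL_SEP)
    levels = level.split(NAME_LEVEL_SEP)
    if len(names) != len(levels):
        raise ValueError(f"{len(names)} names but {len(levels)} levels")
    return dict(zip(names, levels))


columns = split_name_level("sex &&& age_group", "Female &&& 0 to 17")
```

Applied to a whole table, each key of the returned mapping would become its own column.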

unite_group

unite_group(columns: list[str]) -> SummarisedResult

Unite columns into group_name/group_level.

unite_strata

unite_strata(columns: list[str]) -> SummarisedResult

Unite columns into strata_name/strata_level.

unite_additional

unite_additional(columns: list[str]) -> SummarisedResult

Unite columns into additional_name/additional_level.
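Uniting is the inverse of splitting. A minimal single-row sketch follows, using the documented NAME_LEVEL_SEP and OVERALL constants; unite_name_level is a hypothetical helper, and falling back to "overall" when no columns are given is an assumption borrowed from the R package's convention.

```python
NAME_LEVEL_SEP = " &&& "
OVERALL = "overall"


def unite_name_level(row: dict[str, object], columns: list[str]) -> tuple[str, str]:
    """Join column names and their values into one (name, level) pair."""
    if not columns:
        # Assumed convention: nothing to unite means the overall stratum.
        return OVERALL, OVERALL
    name = NAME_LEVEL_SEP.join(columns)
    level = NAME_LEVEL_SEP.join(str(row[column]) for column in columns)
    return name, level


pair = unite_name_level({"sex": "Female", "age_group": "0 to 17"}, ["sex", "age_group"])
```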

pivot_estimates

pivot_estimates() -> pl.DataFrame

Pivot estimate_name/estimate_value into wide format.

Each unique estimate_name becomes a column, with values from estimate_value, cast according to estimate_type.
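The pivot can be sketched over row dicts as follows. The estimate_type values in CASTS ("integer", "numeric", "percentage") and the helper name pivot_rows are assumptions for illustration; the library's own cast table may differ.

```python
# Assumed mapping from estimate_type to a Python cast; unknown types stay strings.
CASTS = {"integer": int, "numeric": float, "percentage": float}


def pivot_rows(rows, id_columns):
    """Collapse long rows into one record per id, one key per estimate_name."""
    wide = {}
    for row in rows:
        key = tuple(row[column] for column in id_columns)
        record = wide.setdefault(key, dict(zip(id_columns, key)))
        cast = CASTS.get(row["estimate_type"], str)
        record[row["estimate_name"]] = cast(row["estimate_value"])
    return list(wide.values())


long_rows = [
    {"variable_name": "age", "estimate_name": "mean",
     "estimate_type": "numeric", "estimate_value": "45.5"},
    {"variable_name": "age", "estimate_name": "count",
     "estimate_type": "integer", "estimate_value": "10"},
]
wide_rows = pivot_rows(long_rows, id_columns=["variable_name"])
```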

add_settings

add_settings(
    columns: list[str] | None = None,
) -> pl.DataFrame

Join settings columns to the result data.


If columns is None, all settings columns are joined.

filter_settings

filter_settings(**kwargs: Any) -> SummarisedResult

Filter by settings values.

Example:

result.filter_settings(result_type="cohort_count")

filter_group

filter_group(**kwargs: str) -> SummarisedResult

Filter by group name-level pairs.

filter_strata

filter_strata(**kwargs: str) -> SummarisedResult

Filter by strata name-level pairs.

filter_additional

filter_additional(**kwargs: str) -> SummarisedResult

Filter by additional name-level pairs.

tidy

tidy() -> pl.DataFrame

Convert to a tidy DataFrame.

Equivalent to adding the settings, splitting all name-level pairs, and pivoting the estimates.

Schema Definitions

CdmSchema

CdmSchema(version: CdmVersion = CdmVersion.V5_4)

Registry for OMOP CDM schema specifications.

All data is lazily loaded and cached at the class level on first access.

Usage:

schema = CdmSchema(CdmVersion.V5_4)
person_fields = schema.fields_for_table("person")
required_tables = schema.required_table_names()

field_specs property

field_specs: tuple[FieldSpec, ...]

All field specs for this CDM version.

table_specs property

table_specs: tuple[TableSpec, ...]

All table-level specs for this CDM version.

result_field_specs property

result_field_specs: tuple[ResultFieldSpec, ...]

Specs for summarised_result / settings fields.

field_table_columns property

field_table_columns: tuple[_FieldTableColumn, ...]

Semantic column mappings for clinical tables.

fields_for_table

fields_for_table(table_name: str) -> tuple[FieldSpec, ...]

Return field specs for a specific table.

required_fields_for_table

required_fields_for_table(
    table_name: str,
) -> tuple[FieldSpec, ...]

Return only required field specs for a table.

table_names

table_names(
    *, table_type: TableType | None = None
) -> tuple[str, ...]

Return all table names, optionally filtered by type.

required_table_names

required_table_names() -> tuple[str, ...]

Return names of tables marked as required at table level.

table_names_in_group

table_names_in_group(group: TableGroup) -> tuple[str, ...]

Return table names belonging to a logical group.

table_spec_for

table_spec_for(table_name: str) -> TableSpec | None

Return the TableSpec for a specific table, or None.

field_column_info

field_column_info(
    table_name: str,
) -> _FieldTableColumn | None

Get semantic column mapping for a clinical table.

validate_columns

validate_columns(
    table_name: str,
    columns: Sequence[str],
    *,
    check_required: bool = True,
) -> list[str]

Validate columns against the spec. Returns list of error messages.

Checks:

  1. If check_required is true, all required columns must be present.
  2. (Warning-level) Extra columns not in the spec are noted.
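The two checks can be sketched as below. The spec is represented as a plain {column: is_required} mapping, which is an assumption, as are the function name validate_columns_sketch, the message wording, and the choice to return extra-column notes in the same list with a "warning:" prefix.

```python
from collections.abc import Sequence


def validate_columns_sketch(
    spec: dict[str, bool],
    columns: Sequence[str],
    *,
    check_required: bool = True,
) -> list[str]:
    """Return human-readable messages; an empty list means no findings."""
    messages: list[str] = []
    present = set(columns)
    if check_required:
        for name, required in spec.items():
            if required and name not in present:
                messages.append(f"missing required column: {name}")
    for name in columns:
        if name not in spec:
            messages.append(f"warning: column not in spec: {name}")
    return messages


person_spec = {"person_id": True, "gender_concept_id": True, "birth_datetime": False}
messages = validate_columns_sketch(person_spec, ["person_id", "extra_col"])
```

Returning messages instead of raising lets a caller collect findings for several tables before deciding how to report them.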

FieldSpec

Bases: BaseModel

A single field in a CDM table (from fieldsTables).

varchar_length property

varchar_length: int | None

Extract max length from varchar(N) or varchar(max).

TableSpec

Bases: BaseModel

Table-level metadata from the CDM spec CSVs.

ResultFieldSpec

Bases: BaseModel

Field specification for a summarised/compared result.

Enums

CdmVersion

Bases: StrEnum

Supported OMOP CDM versions.

CdmDataType

Bases: StrEnum

Data types used in OMOP CDM field specifications.

from_spec classmethod

from_spec(raw: str) -> CdmDataType

Parse a CDM datatype string like 'varchar(50)' or 'integer'.

TableType

Bases: StrEnum

Classification of CDM table types.

TableGroup

Bases: StrEnum

Logical groupings of CDM tables for batch selection.

TableSchema

Bases: StrEnum

Database schema a CDM table lives in.

Type Aliases & Constants

CdmVersionLiteral

CdmVersionLiteral module-attribute

CdmVersionLiteral = Literal['5.3', '5.4']

SUPPORTED_CDM_VERSIONS

SUPPORTED_CDM_VERSIONS module-attribute

SUPPORTED_CDM_VERSIONS: tuple[str, ...] = ('5.3', '5.4')

NAME_LEVEL_SEP

NAME_LEVEL_SEP module-attribute

NAME_LEVEL_SEP: str = ' &&& '

OVERALL

OVERALL module-attribute

OVERALL: str = 'overall'

COHORT_REQUIRED_COLUMNS

COHORT_REQUIRED_COLUMNS module-attribute

COHORT_REQUIRED_COLUMNS: tuple[str, ...] = (
    "cohort_definition_id",
    "subject_id",
    "cohort_start_date",
    "cohort_end_date",
)

SUMMARISED_RESULT_COLUMNS

SUMMARISED_RESULT_COLUMNS module-attribute

SUMMARISED_RESULT_COLUMNS: tuple[str, ...] = (
    "result_id",
    "cdm_name",
    "group_name",
    "group_level",
    "strata_name",
    "strata_level",
    "variable_name",
    "variable_level",
    "estimate_name",
    "estimate_type",
    "estimate_value",
    "additional_name",
    "additional_level",
)

SETTINGS_REQUIRED_COLUMNS

SETTINGS_REQUIRED_COLUMNS module-attribute

SETTINGS_REQUIRED_COLUMNS: tuple[str, ...] = (
    "result_id",
    "result_type",
    "package_name",
    "package_version",
)

GROUP_COUNT_VARIABLES

GROUP_COUNT_VARIABLES module-attribute

GROUP_COUNT_VARIABLES: tuple[str, ...] = (
    "number subjects",
    "number records",
)

Validation Functions

assert_character

assert_character(
    value: Any,
    *,
    name: str = "value",
    min_length: int | None = None,
    max_length: int | None = None,
    na_allowed: bool = True,
    null_allowed: bool = False,
) -> None

Assert value is a string or sequence of strings.

assert_choice

assert_choice(
    value: Any,
    choices: Sequence[Any],
    *,
    name: str = "value",
    null_allowed: bool = False,
) -> None

Assert value is one of the given choices.

assert_class

assert_class(
    value: Any,
    cls: type | tuple[type, ...],
    *,
    name: str = "value",
    null_allowed: bool = False,
) -> None

Assert value is an instance of cls.

assert_date

assert_date(
    value: Any,
    *,
    name: str = "value",
    null_allowed: bool = False,
) -> None

Assert value is a datetime.date (or datetime).

assert_list

assert_list(
    value: Any,
    *,
    name: str = "value",
    element_class: type | None = None,
    min_length: int | None = None,
    null_allowed: bool = False,
) -> None

Assert value is a list (or sequence).

assert_logical

assert_logical(
    value: Any,
    *,
    name: str = "value",
    null_allowed: bool = False,
) -> None

Assert value is a boolean.

assert_numeric

assert_numeric(
    value: Any,
    *,
    name: str = "value",
    min_val: int | float | None = None,
    max_val: int | float | None = None,
    null_allowed: bool = False,
) -> None

Assert value is numeric (int or float).

assert_true

assert_true(
    condition: bool, *, msg: str = "Assertion failed"
) -> None

Assert a boolean condition is True.

assert_table_columns

assert_table_columns(
    columns: Sequence[str],
    required: Sequence[str],
    *,
    table_name: str = "table",
) -> None

Assert all required columns are present in columns.
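A check of this kind might look like the sketch below. The exception type is not documented in this section, so raising ValueError is an assumption, as are the message wording and the name assert_table_columns_sketch.

```python
from collections.abc import Sequence

COHORT_REQUIRED_COLUMNS = (
    "cohort_definition_id",
    "subject_id",
    "cohort_start_date",
    "cohort_end_date",
)


def assert_table_columns_sketch(
    columns: Sequence[str],
    required: Sequence[str],
    *,
    table_name: str = "table",
) -> None:
    """Raise if any required column is absent (ValueError is assumed)."""
    missing = [name for name in required if name not in columns]
    if missing:
        raise ValueError(
            f"{table_name} is missing required columns: {', '.join(missing)}"
        )


# A complete column list passes silently.
assert_table_columns_sketch(COHORT_REQUIRED_COLUMNS, COHORT_REQUIRED_COLUMNS,
                            table_name="cohort")

# An incomplete one reports every missing column at once.
try:
    assert_table_columns_sketch(["subject_id"], COHORT_REQUIRED_COLUMNS,
                                table_name="cohort")
    error_message = ""
except ValueError as exc:
    error_message = str(exc)
```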

I/O Functions

export_codelist

export_codelist(
    codelist: Codelist,
    path: str | Path,
    *,
    format: str = "csv",
) -> Path

Export a Codelist to a file.

Parameters:

  • codelist (Codelist, required): The codelist to export.
  • path (str | Path, required): Directory to write files to.
  • format (str, default 'csv'): 'csv' writes a single CSV; 'json' writes one JSON per codelist entry (ATLAS format).

import_codelist

import_codelist(
    path: str | Path, *, format: str | None = None
) -> Codelist

Import a Codelist from file(s).

If path is a CSV file, reads it (expects codelist_name, concept_id). If path is a directory, reads all .json files as individual concept sets.
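The CSV layout mentioned above (columns codelist_name and concept_id) can be read with the standard library alone. The sketch below works on any text handle so it runs without touching the file system; read_codelist_csv is a hypothetical helper and returns a plain dict rather than a Codelist.

```python
import csv
import io


def read_codelist_csv(handle) -> dict[str, list[int]]:
    """Group concept_id values by codelist_name, preserving row order."""
    codelist: dict[str, list[int]] = {}
    for row in csv.DictReader(handle):
        codelist.setdefault(row["codelist_name"], []).append(int(row["concept_id"]))
    return codelist


sample = io.StringIO(
    "codelist_name,concept_id\n"
    "diabetes,201826\n"
    "diabetes,442793\n"
    "hypertension,316866\n"
)
codelist = read_codelist_csv(sample)
```

With a real file, the same function could be called on the handle returned by open(path, newline="").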

export_concept_set_expression

export_concept_set_expression(
    cse: ConceptSetExpression,
    path: str | Path,
    *,
    format: str = "json",
) -> Path

Export a ConceptSetExpression to JSON files (one per concept set).

import_concept_set_expression

import_concept_set_expression(
    path: str | Path, *, format: str | None = None
) -> ConceptSetExpression

Import a ConceptSetExpression from JSON file(s) or a CSV.

export_summarised_result

export_summarised_result(
    result: SummarisedResult,
    path: str | Path,
    *,
    min_cell_count: int = 5,
) -> Path

Export a SummarisedResult to a CSV file.

Applies suppression before export. Settings are stored as additional rows in the same CSV with a special marker column.

import_summarised_result

import_summarised_result(
    path: str | Path,
) -> SummarisedResult

Import a SummarisedResult from a CSV file.