Cohort Characteristics¶

The omopy.characteristics module provides analytical functions for characterizing cohorts defined in OMOP CDM databases. It is the Python equivalent of the R CohortCharacteristics package.

Overview¶

The module has three layers:

Summarise — compute statistics and return SummarisedResult objects
Table — format results as display-ready tables (via omopy.vis)
Plot — visualize results as plotly figures

Summarise Characteristics¶

The primary function for cohort characterization computes demographics, counts, and clinical variable distributions:

from omopy.connector import cdm_from_con
from omopy.connector import generate_concept_cohort_set
from omopy.generics import Codelist
from omopy.characteristics import summarise_characteristics

# Connect and generate a cohort
cdm = cdm_from_con("path/to/omop.duckdb", cdm_schema="cdm")
codelist = Codelist({"hypertension": [320128]})
cdm = generate_concept_cohort_set(cdm, codelist, name="my_cohort")

# Summarise characteristics with demographics
result = summarise_characteristics(
    cdm["my_cohort"],
    demographics=True,
    counts=True,
)

# Result is a SummarisedResult with 13 standard columns
print(result.data)

The result includes:

Number records and Number subjects per cohort
Age — min, q25, median, q75, max, mean, sd
Sex — count and percentage per category
Prior observation and Future observation — distribution
Days in cohort — distribution

Stratification¶

Add strata columns to break down results by subgroups:

from omopy.profiles import add_sex

# Pre-add the column you want to stratify by
cohort_with_sex = add_sex(cdm["my_cohort"], cdm)

result = summarise_characteristics(
    cohort_with_sex,
    strata=["sex"],
    demographics=True,
)

# Result includes both overall and per-sex strata
strata = result.data["strata_name"].unique().to_list()
# ['overall', 'sex']

Filtering by Cohort ID¶

All summarise functions accept a cohort_id parameter to restrict analysis to specific cohort definitions:

result = summarise_characteristics(
    cdm["my_cohort"],
    cohort_id=[1],  # Only cohort definition ID 1
    demographics=True,
)

Adding Intersections¶

Enrich with clinical intersections before summarising:

result = summarise_characteristics(
    cdm["my_cohort"],
    demographics=True,
    table_intersect_flag=[
        {"table_name": "drug_exposure", "window": (-365, -1)},
    ],
)

Cohort Counts¶

Quick subject and record counts per cohort:

from omopy.characteristics import summarise_cohort_count

counts = summarise_cohort_count(cdm["my_cohort"])

Cohort Attrition¶

Summarise the step-by-step attrition (filtering) that built a cohort:

from omopy.characteristics import summarise_cohort_attrition

attrition = summarise_cohort_attrition(cdm["my_cohort"])

The result has strata_name="reason" and additional_name="reason_id", with variables for number_records, number_subjects, excluded_records, and excluded_subjects.

Cohort Timing¶

Compute the distribution of days between entries across different cohorts for subjects appearing in multiple cohorts:

from omopy.characteristics import summarise_cohort_timing

timing = summarise_cohort_timing(cdm["my_cohort"])

The group_name uses the compound format "cohort_name_reference &&& cohort_name_comparator".

Cohort Overlap¶

Count subjects appearing in one, both, or neither of two cohorts:

from omopy.characteristics import summarise_cohort_overlap

overlap = summarise_cohort_overlap(cdm["my_cohort"])

Returns counts and percentages for "Only in reference cohort", "Only in comparator cohort", and "In both cohorts".

Large-Scale Characteristics¶

Compute concept-level prevalence across time windows:

from omopy.characteristics import summarise_large_scale_characteristics

lsc = summarise_large_scale_characteristics(
    cdm["my_cohort"],
    event_in_window=["condition_occurrence", "drug_exposure"],
    window=[(-365, -1), (0, 0), (1, 365)],
    minimum_frequency=0.01,
)

Cohort Codelist¶

Summarise the codelists used to define each cohort:

from omopy.characteristics import summarise_cohort_codelist

codelist = summarise_cohort_codelist(cdm["my_cohort"])

Tables¶

All table functions wrap omopy.vis.vis_omop_table() with domain-specific formatting defaults. They accept a SummarisedResult and return a formatted table (Polars DataFrame by default, or great_tables.GT):

from omopy.characteristics import (
    summarise_characteristics,
    table_characteristics,
)

result = summarise_characteristics(cdm["my_cohort"], demographics=True)

# Polars DataFrame with formatted columns
df = table_characteristics(result, type="polars")

# great_tables GT object for rich display
gt = table_characteristics(result, type="gt")

Available table functions:

Function	Input result_type
`table_characteristics`	`summarise_characteristics`
`table_cohort_count`	`summarise_cohort_count`
`table_cohort_attrition`	`summarise_cohort_attrition`
`table_cohort_timing`	`summarise_cohort_timing`
`table_cohort_overlap`	`summarise_cohort_overlap`
`table_large_scale_characteristics`	`summarise_large_scale_characteristics`
`table_top_large_scale_characteristics`	`summarise_large_scale_characteristics`

Plots¶

All plot functions return plotly.graph_objects.Figure objects:

from omopy.characteristics import (
    summarise_cohort_count,
    plot_cohort_count,
)

counts = summarise_cohort_count(cdm["my_cohort"])
fig = plot_cohort_count(counts)
fig.show()

Available plot functions:

Function	Chart Type
`plot_characteristics`	Bar, scatter, or box plot
`plot_cohort_count`	Bar chart
`plot_cohort_attrition`	Flowchart (Plotly shapes)
`plot_cohort_timing`	Box or density plot
`plot_cohort_overlap`	Stacked bar chart
`plot_large_scale_characteristics`	Scatter plot
`plot_compared_large_scale_characteristics`	Scatter with diagonal reference

Mock Data¶

Generate mock SummarisedResult objects for testing:

from omopy.characteristics import mock_cohort_characteristics

mock = mock_cohort_characteristics(
    n_cohorts=2,
    seed=42,
)

Working with Results¶

All summarise functions return SummarisedResult objects from omopy.generics. These support standard operations:

# Tidy format (unpack group/strata into named columns)
tidy_df = result.tidy()

# Pivot estimates into wide format
wide_df = result.pivot_estimates()

# Filter by settings
filtered = result.filter_settings(result_type="summarise_characteristics")

# Apply minimum cell count suppression
suppressed = result.suppress(min_cell_count=5)

# Split by group
groups = result.split_group()

See the SummarisedResult reference for full details.