Incidence & Prevalence
The omopy.incidence module estimates incidence rates and prevalence
proportions from OMOP CDM data. It is the Python equivalent of the R
IncidencePrevalence package.
Overview
The analysis pipeline has three stages:
- Denominator generation — build denominator cohorts from observation periods, optionally stratified by age, sex, and prior observation
- Estimation — compute incidence rates or prevalence proportions over calendar intervals with confidence intervals
- Presentation — format results as tables or plots
Step 1: Generate a Denominator
The denominator defines the population at risk. It is built from the
observation_period table, clipped to a study window and filtered by
demographic criteria.
from omopy.connector import cdm_from_con
from omopy.incidence import generate_denominator_cohort_set
cdm = cdm_from_con("path/to/omop.duckdb", cdm_schema="cdm")
cdm = generate_denominator_cohort_set(
    cdm,
    name="denominator",
    study_period_start="2010-01-01",
    study_period_end="2020-12-31",
)
# The denominator is now a CohortTable in the CDM
print(cdm["denominator"].collect())
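The clipping step can be pictured with a small standalone sketch. It is not omopy's implementation: the helper name `clip_to_study_window` and its signature are hypothetical, and it assumes that follow-up starts at the later of the study start and the date on which the required prior observation has accrued.

```python
from datetime import date, timedelta
from typing import Optional, Tuple


def clip_to_study_window(
    obs_start: date,
    obs_end: date,
    study_start: date,
    study_end: date,
    days_prior_observation: int = 0,
) -> Optional[Tuple[date, date]]:
    """Return the (start, end) a person contributes, or None if excluded.

    Hypothetical helper for illustration only.
    """
    # Follow-up cannot begin until the required prior observation has accrued.
    earliest_entry = obs_start + timedelta(days=days_prior_observation)
    start = max(earliest_entry, study_start)
    end = min(obs_end, study_end)
    # An empty window means the person never enters the denominator.
    return (start, end) if start <= end else None


study = (date(2010, 1, 1), date(2020, 12, 31))
print(clip_to_study_window(date(2008, 6, 1), date(2015, 3, 31), *study))
print(clip_to_study_window(date(2021, 2, 1), date(2022, 1, 1), *study))
```

The first person is observed before the study starts, so their entry is moved to the study start; the second is only observed after the study ends and drops out.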
Age and Sex Stratification
Generate multiple denominator cohorts stratified by age groups and sex:
cdm = generate_denominator_cohort_set(
    cdm,
    name="denominator",
    age_group=[(0, 17), (18, 64), (65, 150)],
    sex=["Male", "Female", "Both"],
    days_prior_observation=365,
    study_period_start="2015-01-01",
    study_period_end="2020-12-31",
)
# Settings show one row per stratum combination
print(cdm["denominator"].settings)
Each combination of age_group and sex produces a separate cohort
definition in the returned CohortTable, with attrition tracking showing
how many persons were excluded at each filtering step.
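To see how many cohort definitions a given call produces, the combinations can be enumerated with `itertools.product`. This is only an illustration: the ID numbering and the `age_group` label format below are hypothetical and may not match what omopy writes to the settings.

```python
from itertools import product

age_groups = [(0, 17), (18, 64), (65, 150)]
sexes = ["Male", "Female", "Both"]

# One cohort definition per (age group, sex) pair.
# The ID scheme and label format are assumptions made for this sketch.
strata = [
    {"cohort_definition_id": i, "age_group": f"{low}_{high}", "sex": sex}
    for i, ((low, high), sex) in enumerate(product(age_groups, sexes), start=1)
]

print(len(strata))
print(strata[0])
```

Three age groups and three sex values give nine strata, so the stratified example above would yield nine rows in the settings.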
Target-Based Denominator
If you already have a target cohort (e.g., persons with a specific condition), restrict the denominator to those persons:
from omopy.incidence import generate_target_denominator_cohort_set
cdm = generate_target_denominator_cohort_set(
    cdm,
    target_cohort_table="my_target_cohort",
    name="target_denom",
    study_period_start="2015-01-01",
    study_period_end="2020-12-31",
)
Step 2: Estimate Incidence or Prevalence
Incidence
Incidence measures the rate of new outcome events per person-time at risk:
from omopy.incidence import estimate_incidence
result = estimate_incidence(
    cdm,
    denominator_table="denominator",
    outcome_table="outcome_cohort",
    interval="years",
    outcome_washout=float("inf"),  # Only first event per person
    repeating_events=False,
)
# Result is a SummarisedResult
print(result.data.columns)
Key parameters:
- interval: "weeks", "months", "quarters", "years", or "overall"
- outcome_washout: days to exclude after a prior event (inf = first event only)
- repeating_events: if True, allow the same person to contribute multiple events
- complete_database_intervals: if True, only include intervals where the entire denominator is observable
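The effect of outcome_washout can be sketched as a rule for when a person re-enters time at risk after an event. The helper `next_at_risk_date` is hypothetical, and the exact boundary handling (whether the washout end day itself counts) is an assumption rather than omopy's documented behaviour.

```python
import math
from datetime import date, timedelta
from typing import Optional


def next_at_risk_date(prior_outcome_end: date, outcome_washout: float) -> Optional[date]:
    """Earliest date a person can contribute time at risk again, or None if never.

    Hypothetical helper for illustration only.
    """
    if math.isinf(outcome_washout):
        # Infinite washout: a prior event removes the person for good,
        # so only first events are counted.
        return None
    # Assumption: time at risk resumes once the washout days have elapsed.
    return prior_outcome_end + timedelta(days=int(outcome_washout))


print(next_at_risk_date(date(2016, 3, 1), 180))
print(next_at_risk_date(date(2016, 3, 1), float("inf")))
```

A finite washout only makes a difference when repeating_events is True; with an infinite washout no second event can ever be counted.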
Confidence intervals use the Poisson exact method (chi-squared quantiles
via scipy.stats.chi2). Results are expressed per 100,000 person-years.
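For intuition, the exact Poisson interval can be reproduced without scipy. The bounds below are found by bisection on the Poisson CDF, which gives the same values as the chi-squared form (lower = chi2.ppf(alpha/2, 2n)/2, upper = chi2.ppf(1 - alpha/2, 2n + 2)/2). This is a sketch for checking numbers by hand, not omopy's code, and the function names are hypothetical.

```python
import math
from typing import Callable, Tuple


def poisson_cdf(k: int, mu: float) -> float:
    """P(X <= k) for X ~ Poisson(mu), summed in log space for stability."""
    if k < 0:
        return 0.0
    if mu == 0:
        return 1.0
    total = sum(
        math.exp(i * math.log(mu) - mu - math.lgamma(i + 1)) for i in range(k + 1)
    )
    return min(1.0, total)


def _bisect_decreasing(f: Callable[[float], float], low: float, high: float) -> float:
    """Root of a decreasing function f, positive near low and negative at high."""
    for _ in range(100):
        mid = (low + high) / 2
        if f(mid) > 0:
            low = mid
        else:
            high = mid
    return (low + high) / 2


def exact_poisson_ci(n_events: int, alpha: float = 0.05) -> Tuple[float, float]:
    """Exact two-sided interval for a Poisson mean given an observed count."""
    bracket = n_events + 10 * math.sqrt(n_events + 1) + 10  # safely above the upper bound
    # Lower bound: P(X >= n) = alpha/2, i.e. P(X <= n-1) = 1 - alpha/2.
    lower = 0.0 if n_events == 0 else _bisect_decreasing(
        lambda mu: poisson_cdf(n_events - 1, mu) - (1 - alpha / 2), 0.0, bracket
    )
    # Upper bound: P(X <= n) = alpha/2.
    upper = _bisect_decreasing(
        lambda mu: poisson_cdf(n_events, mu) - alpha / 2, 0.0, bracket
    )
    return lower, upper


def incidence_per_100k(n_events: int, person_years: float, alpha: float = 0.05):
    """Rate and interval scaled to events per 100,000 person-years."""
    lower, upper = exact_poisson_ci(n_events, alpha)
    scale = 100_000 / person_years
    return n_events * scale, lower * scale, upper * scale


rate, ci_low, ci_high = incidence_per_100k(n_events=10, person_years=2500)
print(f"{rate:.1f} ({ci_low:.1f} to {ci_high:.1f}) per 100,000 person-years")
```

The interval is asymmetric around the point estimate, which is expected for small counts; a zero count still yields a non-zero upper bound.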
Point Prevalence
Point prevalence measures the proportion with an active outcome at a specific time point within each interval:
from omopy.incidence import estimate_point_prevalence
result = estimate_point_prevalence(
    cdm,
    denominator_table="denominator",
    outcome_table="outcome_cohort",
    interval="years",
    time_point="start",  # or "middle", "end"
)
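The calculation behind a single point-prevalence estimate can be sketched in plain Python. Both helpers are hypothetical, the data are made up, and treating "middle" as the interval start plus half its length in whole days is an assumption about how the midpoint is chosen.

```python
from datetime import date
from typing import Dict, List, Tuple

Span = Tuple[date, date]


def point_in_interval(start: date, end: date, time_point: str = "start") -> date:
    """Pick the evaluation date inside an interval (midpoint rule is an assumption)."""
    if time_point == "start":
        return start
    if time_point == "end":
        return end
    if time_point == "middle":
        return start + (end - start) // 2
    raise ValueError(f"unknown time_point: {time_point!r}")


def point_prevalence(
    denominator: Dict[int, Span], outcomes: Dict[int, List[Span]], on: date
) -> float:
    """Share of persons in the denominator on `on` who have an active outcome that day."""
    present = [p for p, (s, e) in denominator.items() if s <= on <= e]
    cases = [p for p in present if any(s <= on <= e for s, e in outcomes.get(p, []))]
    return len(cases) / len(present) if present else float("nan")


denominator = {
    1: (date(2015, 1, 1), date(2015, 12, 31)),
    2: (date(2015, 1, 1), date(2015, 12, 31)),
    3: (date(2015, 9, 1), date(2015, 12, 31)),  # enters after mid-year
}
outcomes = {
    1: [(date(2015, 6, 1), date(2015, 8, 31))],
    3: [(date(2015, 10, 1), date(2015, 10, 31))],
}

mid_year = point_in_interval(date(2015, 1, 1), date(2015, 12, 31), "middle")
print(mid_year, point_prevalence(denominator, outcomes, mid_year))
```

Person 3 is not yet in the denominator at mid-year, so only persons 1 and 2 are counted and one of them has an active outcome.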
Period Prevalence
Period prevalence measures the proportion with any active outcome during each interval:
from omopy.incidence import estimate_period_prevalence
result = estimate_period_prevalence(
    cdm,
    denominator_table="denominator",
    outcome_table="outcome_cohort",
    interval="quarters",
)
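Period prevalence replaces the single-day check with an overlap test against the whole interval. The sketch below uses made-up data and a hypothetical helper; it ignores refinements such as requiring a person to be observed for the full interval, which a real implementation may apply.

```python
from datetime import date
from typing import Dict, List, Tuple

Span = Tuple[date, date]


def period_prevalence_counts(
    denominator: Dict[int, Span],
    outcomes: Dict[int, List[Span]],
    start: date,
    end: date,
) -> Tuple[int, int]:
    """Return (cases, population) for one interval. Hypothetical helper."""

    def overlaps(span: Span) -> bool:
        # Two closed date ranges overlap when each starts before the other ends.
        return span[0] <= end and span[1] >= start

    population = [p for p, span in denominator.items() if overlaps(span)]
    cases = [p for p in population if any(overlaps(o) for o in outcomes.get(p, []))]
    return len(cases), len(population)


denominator = {
    1: (date(2015, 1, 1), date(2015, 12, 31)),
    2: (date(2015, 1, 1), date(2015, 12, 31)),
    3: (date(2015, 9, 1), date(2015, 12, 31)),
}
outcomes = {
    1: [(date(2015, 6, 1), date(2015, 8, 31))],
    3: [(date(2015, 10, 1), date(2015, 10, 31))],
}

q3 = period_prevalence_counts(denominator, outcomes, date(2015, 7, 1), date(2015, 9, 30))
print(q3)
```

An outcome that started before the quarter still counts as long as it is active at some point within it.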
Prevalence confidence intervals use the Wilson score method
(scipy.stats.norm).
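The Wilson score interval itself is short enough to write out. The sketch below uses the standard library's `statistics.NormalDist` for the normal quantile instead of scipy; `wilson_ci` is a hypothetical name, and the code illustrates the formula rather than reproducing omopy's implementation.

```python
import math
from statistics import NormalDist
from typing import Tuple


def wilson_ci(n_cases: int, n_population: int, alpha: float = 0.05) -> Tuple[float, float]:
    """Wilson score interval for a proportion. Hypothetical helper."""
    if n_population == 0:
        return float("nan"), float("nan")
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for a 95% interval
    p = n_cases / n_population
    denom = 1 + z**2 / n_population
    center = (p + z**2 / (2 * n_population)) / denom
    half_width = (
        z * math.sqrt(p * (1 - p) / n_population + z**2 / (4 * n_population**2)) / denom
    )
    return center - half_width, center + half_width


low, high = wilson_ci(n_cases=10, n_population=100)
print(f"{low:.4f} to {high:.4f}")
```

Unlike the simple normal approximation, the Wilson interval stays within [0, 1] and remains usable when the observed proportion is 0 or 1.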
Step 3: View Results
Convert to Tidy DataFrames
Pivot the long-form SummarisedResult into a wide, analysis-ready DataFrame:
from omopy.incidence import as_incidence_result, as_prevalence_result
# Wide DataFrame with named columns
inc_df = as_incidence_result(result)
print(inc_df.columns)
# ['denominator_cohort_name', 'outcome_cohort_name', 'incidence_start_date',
# 'incidence_end_date', 'n_events', 'n_persons', 'person_years',
# 'incidence_100000_pys', 'incidence_100000_pys_95ci_lower',
# 'incidence_100000_pys_95ci_upper', ...]
# prev_result: a SummarisedResult returned by a prevalence estimate
prev_df = as_prevalence_result(prev_result)
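The long-to-wide pivot can be illustrated with plain dictionaries. The rows and the `estimate_name` / `estimate_value` keys below are made up for the sketch and only mimic the general shape of a long-form result; they are not taken from omopy's actual output.

```python
# Long form: one row per (interval, estimate) pair. Values are illustrative.
long_rows = [
    {"incidence_start_date": "2015-01-01", "estimate_name": "n_events", "estimate_value": "12"},
    {"incidence_start_date": "2015-01-01", "estimate_name": "person_years", "estimate_value": "2500.0"},
    {"incidence_start_date": "2016-01-01", "estimate_name": "n_events", "estimate_value": "9"},
    {"incidence_start_date": "2016-01-01", "estimate_name": "person_years", "estimate_value": "2400.0"},
]

# Wide form: one row per interval, one column per estimate name.
by_interval = {}
for row in long_rows:
    estimates = by_interval.setdefault(row["incidence_start_date"], {})
    estimates[row["estimate_name"]] = row["estimate_value"]

wide_rows = [{"incidence_start_date": key, **values} for key, values in by_interval.items()]

for row in wide_rows:
    print(row)
```

Each estimate name becomes its own column, which is why the wide frame has columns such as n_events and person_years.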
Tables
Format results as display-ready tables:
from omopy.incidence import table_incidence, table_prevalence
# Polars DataFrame with formatted columns
df = table_incidence(result, type="polars")
# great_tables GT object for rich display
gt = table_incidence(result, type="gt")
Attrition tables show how many persons were excluded at each step:
from omopy.incidence import table_incidence_attrition
attrition_table = table_incidence_attrition(result)
Plots
Visualize incidence and prevalence trends:
from omopy.incidence import plot_incidence, plot_prevalence
# Incidence over time — scatter plot with CI ribbons
fig = plot_incidence(result)
fig.show()
# Prevalence over time (prev_result comes from a prevalence estimate)
fig = plot_prevalence(prev_result)
fig.show()
Population pyramid plots show the denominator size per interval:
from omopy.incidence import plot_incidence_population
fig = plot_incidence_population(result)
fig.show()
Grouping and Faceting
Discover which columns are available for grouping in plots:
from omopy.incidence import available_incidence_grouping
grouping_cols = available_incidence_grouping(result)
# e.g., ['denominator_cohort_name', 'outcome_cohort_name', ...]
Mock Data
Generate synthetic incidence/prevalence data for testing and prototyping:
from omopy.incidence import mock_incidence_prevalence
mock_cdm = mock_incidence_prevalence(
    sample_size=1000,
    seed=42,
)
# mock_cdm has person, observation_period, target, and outcome tables
Benchmarking
Time the full analysis pipeline:
from omopy.incidence import benchmark_incidence_prevalence
timing = benchmark_incidence_prevalence(cdm)
print(f"Total time: {timing['total']:.2f}s")
Working with Results
All estimation functions return SummarisedResult objects from
omopy.generics. These support standard operations:
# Tidy format (unpack group/strata into named columns)
tidy_df = result.tidy()
# Filter by settings
filtered = result.filter_settings(result_type="estimate_incidence")
# Apply minimum cell count suppression
suppressed = result.suppress(min_cell_count=5)
# Split by group
groups = result.split_group()
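Minimum cell count suppression hides small non-zero counts so that individuals cannot be re-identified. The sketch below shows the idea on a list of dictionaries. The helper name, the use of None as the masked value, and the choice to keep zeros visible are all assumptions; the real suppress method may mask related estimates as well.

```python
from typing import Dict, List, Optional


def suppress_counts(
    rows: List[Dict[str, Optional[int]]],
    count_key: str = "n_events",
    min_cell_count: int = 5,
) -> List[Dict[str, Optional[int]]]:
    """Mask counts strictly between 0 and min_cell_count. Hypothetical helper."""
    suppressed = []
    for row in rows:
        count = row[count_key]
        too_small = count is not None and 0 < count < min_cell_count
        # Copy the row so the input is left untouched.
        suppressed.append({**row, count_key: None if too_small else count})
    return suppressed


rows = [{"year": 2015, "n_events": 12}, {"year": 2016, "n_events": 3}, {"year": 2017, "n_events": 0}]
print(suppress_counts(rows))
```

Only the 2016 row is masked here, because its count of 3 is non-zero and below the threshold of 5.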
See the SummarisedResult reference for full details.