Pregnancy Episode Identification

The omopy.pregnancy module identifies pregnancy episodes from OMOP CDM data using the HIPPS algorithm (Smith et al. 2024). It is the Python equivalent of the R PregnancyIdentifier package.

Overview

The module has four layers:

Identify — run the HIPPS algorithm to extract pregnancy episodes
Summarise — convert episodes into the standard SummarisedResult format
Table — format results as publication-ready tables (via omopy.vis)
Plot — visualise outcome distributions, gestational age, and timelines

The HIPPS Algorithm

HIPPS (Hierarchical Identification of Pregnancy Periods and States) combines two complementary approaches, HIP and PPS, then merges their episodes and refines the start dates:

Stage	Name	Description
HIP	Outcome-anchored	Identifies episodes by locating pregnancy outcome codes (live birth, stillbirth, abortion, etc.) and working backwards to estimate the start date
PPS	Gestational-timing	Identifies episodes by locating gestational age markers and estimating start from the timing
Merge	HIPPS merge	Combines HIP and PPS episodes, resolving conflicts
ESD	Episode Start Date	Refines episode start dates using supporting evidence (e.g., LMP records, prenatal visits)
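
The two approaches in the table can be pictured as two ways of working backwards to a start date. The sketch below is illustrative only: the function names and the term lengths in `ASSUMED_TERM_DAYS` are assumptions made for this example, not values or functions taken from omopy or from the HIPPS publication.

```python
from datetime import date, timedelta

# Illustrative assumptions only: rough gestational lengths (days) per
# outcome code. The values used by the real algorithm may differ.
ASSUMED_TERM_DAYS = {"LB": 280, "SB": 196, "SA": 70, "AB": 70, "ECT": 56}


def hip_style_start(outcome_date: date, outcome: str) -> date:
    """Outcome-anchored: subtract an assumed term length from the outcome date."""
    return outcome_date - timedelta(days=ASSUMED_TERM_DAYS[outcome])


def pps_style_start(marker_date: date, gestational_weeks: int) -> date:
    """Gestational-timing: subtract the recorded gestational age from the
    date on which that gestational age was recorded."""
    return marker_date - timedelta(weeks=gestational_weeks)


# A live birth on 2020-10-07 and a "12 weeks" marker on 2020-03-25
# both point back to the same estimated start date.
hip_start = hip_style_start(date(2020, 10, 7), "LB")
pps_start = pps_style_start(date(2020, 3, 25), gestational_weeks=12)
print(hip_start, pps_start)
```

When the two estimates disagree, some rule has to decide which one wins; that is the role of the merge and ESD stages described above.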

Outcome Categories

The algorithm classifies pregnancy outcomes into the following categories:

Code	Category
`LB`	Live birth
`SB`	Stillbirth
`AB`	Induced abortion
`SA`	Spontaneous abortion
`DELIV`	Delivery (unspecified)
`ECT`	Ectopic pregnancy
`PREG`	Pregnancy (ongoing / unspecified outcome)

These are available as the OUTCOME_CATEGORIES constant.
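
For display purposes it can help to map codes to readable labels. The sketch below uses a stand-in dictionary built from the table above; `OUTCOME_LABELS` and `outcome_label` are hypothetical names, and the actual structure of the `OUTCOME_CATEGORIES` constant is not assumed here.

```python
# Stand-in built from the table above. The real OUTCOME_CATEGORIES constant
# may be structured differently (this dictionary is an assumption).
OUTCOME_LABELS = {
    "LB": "Live birth",
    "SB": "Stillbirth",
    "AB": "Induced abortion",
    "SA": "Spontaneous abortion",
    "DELIV": "Delivery (unspecified)",
    "ECT": "Ectopic pregnancy",
    "PREG": "Pregnancy (ongoing / unspecified outcome)",
}


def outcome_label(code: str) -> str:
    """Return the readable label for an outcome code, failing loudly on typos."""
    try:
        return OUTCOME_LABELS[code]
    except KeyError:
        raise ValueError(f"Unknown outcome code: {code!r}") from None


print(outcome_label("ECT"))
```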

Step 1: Connect to CDM

import ibis
from omopy.connector import cdm_from_con

con = ibis.duckdb.connect("my_database.duckdb", read_only=True)
cdm = cdm_from_con(con, cdm_schema="cdm")

Step 2: Identify Pregnancies

import datetime

from omopy.pregnancy import identify_pregnancies

result = identify_pregnancies(
    cdm,
    start_date=datetime.date(2015, 1, 1),  # Study window start
    end_date=datetime.date(2023, 12, 31),  # Study window end
    age_bounds=(10, 55),                   # Age range (years)
    just_gestation=False,                  # Include non-gestation evidence
    min_cell_count=5,                      # Privacy threshold
)

The result is a PregnancyResult object containing the full pipeline output:

# Episode DataFrames (Polars)
result.episodes            # Final episodes (merged, after ESD refinement)
result.hip_episodes        # HIP-only episodes (outcome-anchored)
result.pps_episodes        # PPS-only episodes (gestational-timing)
result.merged_episodes     # Pre-ESD merged episodes

# Metadata
result.metadata            # Dict with pipeline parameters and counts

Key Parameters

Parameter	Default	Description
`start_date`	`None`	Study window start (`datetime.date` or `None` for all)
`end_date`	`None`	Study window end (`datetime.date` or `None` for all)
`age_bounds`	`(10, 55)`	Min/max age at pregnancy start
`just_gestation`	`True`	If `True`, only use gestation-related evidence
`min_cell_count`	`5`	Privacy suppression threshold
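
To make the `age_bounds` parameter concrete, the following sketch computes age in completed years at the estimated pregnancy start and checks it against the bounds. The helper names are hypothetical, and treating both bounds as inclusive is an assumption; the library may handle the boundaries differently.

```python
from datetime import date


def age_on(birth_date: date, on_date: date) -> int:
    """Age in completed years on a given date."""
    had_birthday = (on_date.month, on_date.day) >= (birth_date.month, birth_date.day)
    return on_date.year - birth_date.year - (0 if had_birthday else 1)


def within_age_bounds(birth_date: date, episode_start: date,
                      age_bounds: tuple[int, int] = (10, 55)) -> bool:
    """Assumption: both bounds are inclusive."""
    low, high = age_bounds
    return low <= age_on(birth_date, episode_start) <= high


print(within_age_bounds(date(1990, 6, 15), date(2020, 6, 14)))  # age 29
print(within_age_bounds(date(1960, 1, 1), date(2020, 1, 1)))    # age 60
```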

Episode DataFrame Columns

The result.episodes DataFrame contains:

Column	Type	Description
`person_id`	Int64	Patient identifier
`episode_start_date`	Date	Estimated pregnancy start
`episode_end_date`	Date	Outcome/end date
`outcome_category`	Utf8	One of the outcome codes listed under Outcome Categories
`gestational_days`	Int64	Estimated gestational length in days
`source`	Utf8	`"hip"`, `"pps"`, or `"merged"`
`confidence`	Utf8	Confidence level of the estimate

Step 3: Summarise

Convert episodes to the standard 13-column SummarisedResult format:

from omopy.pregnancy import summarise_pregnancies

summary = summarise_pregnancies(
    result,
    strata=["outcome_category"],  # Optional grouping
)

The summary includes counts, proportions, and gestational age statistics per outcome category (and per stratum if specified).
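
As a rough picture of what such a summary contains, the sketch below computes a count, a proportion, and a median gestational length per outcome category from a handful of made-up rows. It uses only the standard library and hypothetical data; it does not reproduce the SummarisedResult layout or the exact statistics that `summarise_pregnancies` reports.

```python
from collections import defaultdict
from statistics import median

# Hypothetical episodes, reduced to the two columns needed here.
episodes = [
    {"outcome_category": "LB", "gestational_days": 275},
    {"outcome_category": "LB", "gestational_days": 281},
    {"outcome_category": "LB", "gestational_days": 268},
    {"outcome_category": "SA", "gestational_days": 63},
]


def summarise(rows):
    """Count, proportion, and median gestational days per outcome category."""
    by_outcome = defaultdict(list)
    for row in rows:
        by_outcome[row["outcome_category"]].append(row["gestational_days"])
    total = len(rows)
    return {
        outcome: {
            "count": len(days),
            "proportion": len(days) / total,
            "median_gestational_days": median(days),
        }
        for outcome, days in by_outcome.items()
    }


stats = summarise(episodes)
print(stats["LB"])
```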

Step 4: Visualise

Tables

from omopy.pregnancy import table_pregnancies

# Polars DataFrame
tbl = table_pregnancies(summary, type="polars")

# great-tables GT object
tbl = table_pregnancies(summary, type="gt")

Plots

from omopy.pregnancy import plot_pregnancies

# Outcome distribution bar chart
fig = plot_pregnancies(summary, type="outcome_distribution")
fig.show()

# Gestational age box plot
fig = plot_pregnancies(summary, type="gestational_age")

# Timeline plot
fig = plot_pregnancies(summary, type="timeline")

Validation

Validate episode periods for consistency:

from omopy.pregnancy import validate_episodes

episodes = result.episodes
issues = validate_episodes(episodes, max_days=320)
# Returns a DataFrame of episodes with potential issues
# (e.g., gestational period > max_days, overlapping episodes)
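
The two kinds of issue mentioned in the comments can be sketched in plain Python as follows. This is a simplified illustration with a hypothetical `find_issues` helper and made-up rows; it is not the logic of `validate_episodes`, and its return shape (a list of index and reason pairs) is an assumption chosen for brevity.

```python
from datetime import date


def find_issues(episodes, max_days=320):
    """Return (row index, reason) pairs for episodes that look inconsistent."""
    issues = []
    for i, ep in enumerate(episodes):
        length = (ep["episode_end_date"] - ep["episode_start_date"]).days
        if length < 0:
            issues.append((i, "end before start"))
        elif length > max_days:
            issues.append((i, "longer than max_days"))

    # Overlap check (simplified): within each person, compare every episode
    # with the one that starts immediately before it.
    order = sorted(
        range(len(episodes)),
        key=lambda i: (episodes[i]["person_id"], episodes[i]["episode_start_date"]),
    )
    for prev, cur in zip(order, order[1:]):
        a, b = episodes[prev], episodes[cur]
        same_person = a["person_id"] == b["person_id"]
        if same_person and b["episode_start_date"] <= a["episode_end_date"]:
            issues.append((cur, "overlaps previous episode"))
    return issues


rows = [
    {"person_id": 1, "episode_start_date": date(2019, 1, 1), "episode_end_date": date(2019, 10, 1)},
    {"person_id": 1, "episode_start_date": date(2019, 9, 1), "episode_end_date": date(2020, 5, 20)},
    {"person_id": 2, "episode_start_date": date(2020, 1, 1), "episode_end_date": date(2021, 1, 1)},
]
issues = find_issues(rows)
print(issues)
```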

Testing with Mock Data

Generate a synthetic CDM with pregnancy-related records:

from omopy.pregnancy import (
    identify_pregnancies,
    mock_pregnancy_cdm,
    summarise_pregnancies,
)

mock_cdm = mock_pregnancy_cdm(
    seed=42,
    n_persons=50,
)

# Use mock CDM with the full pipeline
result = identify_pregnancies(mock_cdm)
summary = summarise_pregnancies(result)
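
A fixed `seed` is what makes a synthetic dataset repeatable between test runs. The sketch below shows the general idea with a hypothetical `mock_persons` generator built on the standard library; it is not how `mock_pregnancy_cdm` generates its records.

```python
import random
from datetime import date, timedelta


def mock_persons(seed: int, n_persons: int):
    """Generate synthetic persons with pseudo-random birth dates.

    A dedicated Random instance keeps the output independent of any other
    use of the global random state.
    """
    rng = random.Random(seed)
    base = date(1970, 1, 1)
    return [
        {
            "person_id": i + 1,
            "birth_date": base + timedelta(days=rng.randrange(365 * 30)),
        }
        for i in range(n_persons)
    ]


first = mock_persons(seed=42, n_persons=50)
second = mock_persons(seed=42, n_persons=50)
print(first == second)  # same seed, same data
```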

Working with Results

All summarise functions return SummarisedResult objects from omopy.generics. These support standard operations:

# Tidy format (unpack group/strata into named columns)
tidy_df = summary.tidy()

# Filter by settings
filtered = summary.filter_settings(result_type="pregnancy_summary")

# Apply minimum cell count suppression
suppressed = summary.suppress(min_cell_count=5)

# Split by group
groups = summary.split_group()

See the SummarisedResult reference for full details.
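
Minimum cell count suppression can be illustrated on a plain dictionary of counts. In this sketch, counts that are positive but below the threshold are masked with `None`, while zeros and counts at or above the threshold are kept. That rule and the `suppress_counts` helper are assumptions made for illustration; check the SummarisedResult reference for the exact behaviour of `suppress`.

```python
def suppress_counts(counts: dict, min_cell_count: int = 5) -> dict:
    """Mask counts that are positive but below min_cell_count.

    Assumption: zero counts and counts equal to the threshold are reported.
    """
    return {
        key: (None if 0 < n < min_cell_count else n)
        for key, n in counts.items()
    }


masked = suppress_counts({"LB": 120, "ECT": 3, "SB": 0, "SA": 5})
print(masked)
```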

Comparison with R

R (PregnancyIdentifier)	Python (omopy.pregnancy)
`identifyPregnancies()`	`identify_pregnancies()`
R list with data.frames	`PregnancyResult` (Pydantic model, Polars DataFrames)
`summarisePregnancies()`	`summarise_pregnancies()`
`tablePregnancies()`	`table_pregnancies()`
`plotPregnancies()`	`plot_pregnancies()`
`mockPregnancyCdm()`	`mock_pregnancy_cdm()`
`validateEpisodes()`	`validate_episodes()`
`OUTCOME_CATEGORIES`	`OUTCOME_CATEGORIES`
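
The function names in the table follow a regular pattern: the R camelCase name becomes snake_case in Python. The hypothetical helper below captures that pattern and can be handy when porting R scripts; it is not part of omopy, and it is meant for function names only (the `OUTCOME_CATEGORIES` constant keeps its name).

```python
import re


def r_to_python_name(r_name: str) -> str:
    """Convert a camelCase R function name to snake_case.

    Inserts an underscore before every uppercase letter except the first
    character, then lowercases the result.
    """
    return re.sub(r"(?<!^)(?=[A-Z])", "_", r_name).lower()


for r_name in ["identifyPregnancies", "summarisePregnancies", "mockPregnancyCdm"]:
    print(r_name, "->", r_to_python_name(r_name))
```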