Architecture¶
OMOPy is a single Python package (omopy) that consolidates 17 R packages from
the DARWIN-EU ecosystem into a layered monorepo architecture.
Package Structure¶
omopy/
├── omopy.generics ← Core type system (Phase 0)
├── omopy.connector ← Database CDM access (Phase 1+2)
│ └── circe/ ← CIRCE cohort engine
├── omopy.profiles ← Patient-level enrichment (Phase 3A)
├── omopy.codelist ← Vocabulary codelist tools (Phase 3B)
├── omopy.vis ← Visualisation (Phase 3C)
├── omopy.characteristics ← Cohort characterisation (Phase 4A)
├── omopy.incidence ← Incidence & prevalence (Phase 4B)
├── omopy.drug ← Drug utilisation (Phase 5A)
├── omopy.survival ← Cohort survival analysis (Phase 5B)
├── omopy.treatment ← Treatment pathway analysis (Phase 6A)
├── omopy.drug_diagnostics ← Drug exposure diagnostics (Phase 6B)
├── omopy.pregnancy ← Pregnancy episode identification (Phase 7A)
└── omopy.testing ← Test data generation (Phase 8A)
Layer Dependencies¶
Layer 0: omopy.generics
▲
Layer 1: omopy.connector
▲
Layer 2: omopy.profiles ──── omopy.codelist ──── omopy.vis
▲ ▲
Layer 3: omopy.characteristics │
omopy.incidence ─────────┘
omopy.survival
omopy.treatment
▲
Layer 4: omopy.drug
omopy.drug_diagnostics
omopy.pregnancy
Layer 5: omopy.testing (independent — only needs generics + connector)
Each module depends only on modules in lower-numbered layers. omopy.generics has
no internal dependencies and can be used standalone.
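The layering rule can be expressed as a small check. The sketch below copies the layer numbers from the diagram; the `violations` helper and the example import graph are hypothetical and not part of OMOPy. It flags only imports that point to a higher-numbered layer and makes no claim about same-layer imports.

```python
# Layer numbers taken from the diagram above; everything else is illustrative.
LAYERS = {
    "omopy.generics": 0,
    "omopy.connector": 1,
    "omopy.profiles": 2,
    "omopy.codelist": 2,
    "omopy.vis": 2,
    "omopy.characteristics": 3,
    "omopy.incidence": 3,
    "omopy.survival": 3,
    "omopy.treatment": 3,
    "omopy.drug": 4,
    "omopy.drug_diagnostics": 4,
    "omopy.pregnancy": 4,
    "omopy.testing": 5,
}


def violations(imports: dict[str, set[str]]) -> list[tuple[str, str]]:
    """Return (importer, imported) pairs where the import points to a higher layer."""
    bad = []
    for importer, targets in imports.items():
        for target in sorted(targets):
            if LAYERS[target] > LAYERS[importer]:
                bad.append((importer, target))
    return bad


# Hypothetical import graph, for illustration only.
example = {
    "omopy.connector": {"omopy.generics"},
    "omopy.testing": {"omopy.generics", "omopy.connector"},
    "omopy.profiles": {"omopy.drug"},  # deliberate violation: layer 2 importing layer 4
}

print(violations(example))
```

A check like this could run in CI against an import graph extracted from the source tree.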
Technology Stack¶
| Concern | Technology | Role |
|---|---|---|
| Lazy query construction | Ibis | Database-agnostic SQL generation |
| Database connections | SQLAlchemy | Connection URIs and pooling |
| SQL transpilation | sqlglot | Dialect-aware SQL manipulation |
| Local DataFrames | Polars | Primary in-memory DataFrame |
| Compatibility | Pandas | Interop with Ibis .execute() |
| Data models | Pydantic v2 | Frozen, validated data classes |
| Arrow interchange | PyArrow | Zero-copy data transfer |
| Table rendering | great-tables | Publication-ready tables |
| Plot rendering | plotly | Interactive visualisations |
| Statistics | scipy | Confidence intervals |
| Survival analysis | lifelines | Kaplan-Meier estimation |
Design Decisions¶
Lazy by Default¶
All CDM table access returns Ibis expressions. SQL is only executed when you
explicitly call .collect() (Polars) or .execute() (Pandas). This means:
- Queries are composed without hitting the database
- The database optimizer sees the full query plan
- Client-side memory use does not grow with table size until results are collected
- You can work with tables containing billions of rows
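To make the deferred-execution idea concrete, here is a toy sketch using only the standard library. It is not Ibis and not OMOPy's API: `LazyQuery` is a hypothetical class that records filters and sends SQL to SQLite only when `collect()` is called. A trace callback records every statement the database actually receives.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class LazyQuery:
    """Toy lazy table expression: records SQL fragments, runs nothing until collect()."""

    con: sqlite3.Connection
    table: str
    predicates: tuple[str, ...] = ()
    params: tuple = ()

    def filter(self, predicate: str, *params) -> "LazyQuery":
        # Returns a new expression; no SQL is sent to the database here.
        return LazyQuery(
            self.con, self.table, self.predicates + (predicate,), self.params + params
        )

    def sql(self) -> str:
        where = " AND ".join(f"({p})" for p in self.predicates)
        return f"SELECT * FROM {self.table}" + (f" WHERE {where}" if where else "")

    def collect(self) -> list[tuple]:
        # The only place where the database is queried.
        return self.con.execute(self.sql(), self.params).fetchall()


con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE person (person_id INTEGER, year_of_birth INTEGER)")
con.executemany("INSERT INTO person VALUES (?, ?)", [(1, 1950), (2, 1985), (3, 2001)])
con.commit()

executed: list[str] = []
con.set_trace_callback(executed.append)  # log each statement SQLite runs from now on

query = LazyQuery(con, "person").filter("year_of_birth >= ?", 1980).filter("person_id < ?", 3)
statements_before_collect = list(executed)  # still empty: composing ran no SQL

rows = query.collect()
print(query.sql())
print(rows)
```

Because both filters are combined into one statement, the database sees the whole query at once, which is the property the bullet list above relies on.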
Frozen Pydantic Models¶
Configuration and schema objects use model_config = ConfigDict(frozen=True):
- Thread-safe by construction
- Hashable (usable as dict keys and in sets)
- Validated on creation (Pydantic catches type errors early)
- Serialisable to JSON via .model_dump_json()
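A minimal sketch of these properties with Pydantic v2. `SourceConfig` and its fields are hypothetical, chosen only for illustration; they are not OMOPy classes.

```python
import json

from pydantic import BaseModel, ConfigDict, ValidationError


class SourceConfig(BaseModel):
    """Hypothetical configuration model, not part of OMOPy."""

    model_config = ConfigDict(frozen=True)

    cdm_name: str
    cdm_version: str


cfg = SourceConfig(cdm_name="demo", cdm_version="5.4")

# Assigning to a field of a frozen model raises a ValidationError.
try:
    cfg.cdm_version = "5.3"
    mutated = True
except ValidationError:
    mutated = False

# Frozen models are hashable, so they can serve as dict keys.
registry = {cfg: "primary"}

# Invalid input is rejected when the object is created.
try:
    SourceConfig(cdm_name="demo", cdm_version=None)
    rejected = False
except ValidationError:
    rejected = True

round_tripped = json.loads(cfg.model_dump_json())
print(mutated, rejected, round_tripped)
```

Note that `frozen=True` blocks attribute reassignment; a mutable field value such as a list could still be changed in place, so tuples or other immutable field types are the safer choice.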
CdmReference as a Dict-Like Container¶
A CdmReference acts like a dict[str, CdmTable] with metadata:
cdm["person"] # access table by name
cdm.table_names # list all table names
cdm.cdm_version # "5.3" or "5.4"
cdm.cdm_name # data source name
Cohort generation returns a new CdmReference with the cohort table added.
CDM objects are not mutated in place.
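The copy-instead-of-mutate behaviour can be sketched with a read-only mapping. `MiniCdm` and `with_table` below are hypothetical names used for illustration, and the table values are placeholder strings rather than real CdmTable objects.

```python
from collections.abc import Iterator, Mapping
from types import MappingProxyType


class MiniCdm(Mapping):
    """Hypothetical stand-in for a CdmReference: a read-only mapping plus metadata."""

    def __init__(self, tables: dict, cdm_name: str, cdm_version: str) -> None:
        self._tables = MappingProxyType(dict(tables))  # read-only view of a private copy
        self.cdm_name = cdm_name
        self.cdm_version = cdm_version

    def __getitem__(self, name: str):
        return self._tables[name]

    def __iter__(self) -> Iterator[str]:
        return iter(self._tables)

    def __len__(self) -> int:
        return len(self._tables)

    @property
    def table_names(self) -> list[str]:
        return list(self._tables)

    def with_table(self, name: str, table) -> "MiniCdm":
        # Build a new container instead of modifying this one.
        return MiniCdm({**self._tables, name: table}, self.cdm_name, self.cdm_version)


cdm = MiniCdm({"person": "<person table>"}, cdm_name="demo", cdm_version="5.4")
cdm_with_cohort = cdm.with_table("cohort", "<cohort table>")

print(cdm.table_names)
print(cdm_with_cohort.table_names)
```

The original `cdm` still lists only `person`, so code holding a reference to it is unaffected by later cohort generation.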
CIRCE Engine¶
The CIRCE cohort generation engine is a clean-room Python implementation built against the CIRCE JSON specification. It was NOT ported from R source code.
The engine:
- Parses ATLAS JSON into typed Pydantic models (CohortExpression)
- Resolves concept sets against vocabulary tables
- Builds Ibis query plans for primary criteria, inclusion rules, end strategies
- Materialises the final cohort as a database table
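As an illustration of the concept-set resolution step, the sketch below works on in-memory data with the standard library. It is not the OMOPy engine: the JSON field names are assumed to match ATLAS concept-set exports, the concept IDs are made up, `includeMapped` is ignored, and the ancestor pairs stand in for rows of the concept_ancestor vocabulary table.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ConceptSetItem:
    concept_id: int
    is_excluded: bool
    include_descendants: bool


def parse_items(concept_set_json: str) -> list[ConceptSetItem]:
    """Parse the items of one concept set (field names are an assumption)."""
    doc = json.loads(concept_set_json)
    return [
        ConceptSetItem(
            concept_id=item["concept"]["CONCEPT_ID"],
            is_excluded=item.get("isExcluded", False),
            include_descendants=item.get("includeDescendants", False),
        )
        for item in doc["expression"]["items"]
    ]


def resolve(items: list[ConceptSetItem], ancestor_rows: list[tuple[int, int]]) -> set[int]:
    """Included concepts (plus descendants) minus excluded concepts (plus descendants).

    ancestor_rows holds (ancestor_concept_id, descendant_concept_id) pairs.
    """

    def expand(item: ConceptSetItem) -> set[int]:
        ids = {item.concept_id}
        if item.include_descendants:
            ids |= {desc for anc, desc in ancestor_rows if anc == item.concept_id}
        return ids

    included: set[int] = set()
    excluded: set[int] = set()
    for item in items:
        (excluded if item.is_excluded else included).update(expand(item))
    return included - excluded


# Made-up concept IDs, for illustration only.
raw = json.dumps({
    "id": 0,
    "name": "example",
    "expression": {"items": [
        {"concept": {"CONCEPT_ID": 100}, "isExcluded": False, "includeDescendants": True},
        {"concept": {"CONCEPT_ID": 102}, "isExcluded": True, "includeDescendants": False},
    ]},
})
ancestors = [(100, 100), (100, 101), (100, 102), (102, 102)]

resolved = resolve(parse_items(raw), ancestors)
print(sorted(resolved))
```

In the engine described above, the same set logic would be expressed as Ibis joins against the vocabulary tables rather than Python sets.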
Column Naming Conventions¶
All column names use snake_case. The original OMOP CDM column names are
preserved as-is (they already use snake_case). Generated columns from
omopy.profiles follow the same convention, for example:
flag_condition_occurrence_0_to_inf, age, sex.
OMOP CDM Support¶
| Version | Tables | Fields | Status |
|---|---|---|---|
| v5.3 | 37 | 448 | Fully supported |
| v5.4 | 39 | 484 | Fully supported |
Schema specifications are loaded from bundled CSV files and cached per version.
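Per-version caching of this kind can be sketched with `functools.lru_cache`. The CSV layout, column names, and `load_spec` function below are assumptions for illustration; the bundled files are replaced by a small in-memory string so the example runs on its own.

```python
import csv
import io
from functools import lru_cache

# Stand-in for bundled CSV files; a real package would read them from package data.
# Only a few rows of the person table are shown.
_SPEC_CSV = {
    "5.4": (
        "table_name,field_name,data_type\n"
        "person,person_id,integer\n"
        "person,gender_concept_id,integer\n"
        "person,year_of_birth,integer\n"
    ),
}


@lru_cache(maxsize=None)
def load_spec(version: str) -> tuple[tuple[str, str, str], ...]:
    """Parse the spec for one CDM version; repeated calls reuse the cached result."""
    if version not in _SPEC_CSV:
        raise ValueError(f"unsupported CDM version: {version!r}")
    reader = csv.DictReader(io.StringIO(_SPEC_CSV[version]))
    # Return an immutable tuple so the cached value cannot be modified by callers.
    return tuple((r["table_name"], r["field_name"], r["data_type"]) for r in reader)


first = load_spec("5.4")
second = load_spec("5.4")  # served from the cache, no re-parsing
cache_hits = load_spec.cache_info().hits

print(first is second, cache_hits)
```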