This document provides comprehensive documentation for FastSSV's validation rules.
Current registry: 154 rules across 6 categories.
anti_patterns: 20 rules
concept_standardization: 18 rules
data_quality: 22 rules
domain_specific: 48 rules
joins: 36 rules
temporal: 10 rules
Note: The prose descriptions below reflect the stable rule set. Rule
removals and rule renames are tracked in CHANGELOG.md.
If a rule listed here is no longer in the registry, check the CHANGELOG
for the removal rationale. If a rule appears in the registry but not
below, it was added after this document's last full pass — the rule's
own docstring and suggested_fix field are the authoritative source.
For the live registered rule set at any moment, use:
The concept_ancestor table represents hierarchical relationships within domains:
- A Condition ancestor has only Condition descendants
- A Drug ancestor has only Drug descendants
- A Procedure ancestor has only Procedure descendants
Cross-domain relationships exist in concept_relationship (e.g., 'Has indication'),
NOT in concept_ancestor.
Filtering descendant concepts by a different domain_id than the ancestor's
domain will always return zero results.
Common mistake scenarios:
1. Trying to find drugs to treat a condition via concept_ancestor
(should use concept_relationship with 'Has indication')
2. Mixing domains when expanding hierarchies
(e.g., drug ancestor with procedure descendants)
3. Misunderstanding OMOP's domain architecture
SELECTca.descendant_concept_idFROMconcept_ancestorcaJOINconceptcONca.descendant_concept_id=c.concept_idWHEREca.ancestor_concept_id=201820-- Condition: DiabetesANDc.domain_id='Drug'-- ERROR: No Drug descendants!
The concept_ancestor table tracks hierarchical relationships with two distance columns:
- min_levels_of_separation: The shortest path from ancestor to descendant
- max_levels_of_separation: The LONGEST path from ancestor to descendant
Due to multiple inheritance (a concept can have multiple parents), there can be
multiple paths between an ancestor and descendant. The max_levels_of_separation
represents the longest of these paths.
Common misconception:
- Users think max_levels_of_separation = 1 returns "direct children"
- This is WRONG because a concept with multiple paths might have:
* min_levels_of_separation = 1 (direct child via one path)
* max_levels_of_separation = 3 (via a longer alternate path)
Using max_levels_of_separation = 1 will MISS concepts that have direct
relationships but also have longer alternate paths.
Correct usage:
- Use min_levels_of_separation = 1 for direct parent-child relationships
- Use max_levels_of_separation <= N to limit maximum depth to explore
- Use max_levels_of_separation >= N to find distant relationships only
concept_ancestor represents hierarchical relationships:
- descendant_concept_id: The more specific child concept (e.g., "Type 2 Diabetes")
- ancestor_concept_id: The more general parent concept (e.g., "Diabetes Mellitus")
Patient records contain specific diagnoses (descendants), not parent concepts.
To roll up to parent concepts, you must:
1. Join clinical table's concept_id to descendant_concept_id
2. Filter on ancestor_concept_id
Swapping this reverses the hierarchy, causing:
- Missed child concepts
- Incorrect aggregations
- Potentially zero results
The concept_ancestor table includes self-referencing rows where:
- ancestor_concept_id = descendant_concept_id
- min_levels_of_separation = 0
- max_levels_of_separation = 0
This means every concept is its own ancestor at distance 0.
When building concept sets, the anchor concept is AUTOMATICALLY included
in concept_ancestor results. Queries that explicitly add the anchor concept
AND query concept_ancestor will duplicate the anchor.
OMOP drug concepts exist in a specificity hierarchy:
Ingredient (most general - active substance)
↓
Clinical Drug Form (dose form + ingredient)
↓
Clinical Drug (formulation without brand)
↓
Branded Drug (brand name + formulation)
↓
Marketed Product (most specific - commercial package)
When analysts want to group by active ingredient (e.g., "all Metformin prescriptions"),
but filter by concept_class_id = 'Clinical Drug' or 'Branded Drug', they're grouping
at the wrong level of granularity.
Issues with wrong concept_class_id:
1. Clinical Drug: Groups by formulation (500mg tablet vs 850mg tablet counted separately)
2. Branded Drug: Groups by brand (Glucophage vs Fortamet vs generic counted separately)
3. Missing data: Fails to capture all forms of the active ingredient
This leads to:
- Undercounting drug exposure (split across formulations/brands)
- Incorrect prevalence estimates
- Misleading comparative effectiveness analyses
- Fragmented ingredient-level statistics
The concept_synonym table stores synonym names in multiple languages:
- English (language_concept_id = 4180186)
- Spanish, German, French, etc. (other language_concept_id values)
When searching for synonyms by name (LIKE '%heart attack%'), you may
inadvertently retrieve synonyms in multiple languages if you don't
filter by language_concept_id.
Common mistake:
Developers search concept_synonym_name without considering language,
leading to unexpected multilingual results.
Vocabulary tables (concept, concept_relationship) carry an invalid_reason column that marks retired/superseded entries ('D' = deprecated, 'U' = upgraded, NULL = currently valid). Derived vocabulary tables (concept_ancestor, concept_synonym, drug_strength, source_to_concept_map) reference concept IDs but lack their own invalid_reason column. Queries that omit invalid_reason filtering can silently include retired concepts in cohort definitions or analytic outputs.
This rule is gated behind strict mode: silent in default mode, fires as WARNING under --strict (CLI) or strict=True (API). Real-world OMOP queries on the concept table almost always omit this filter, so firing in default mode would dilute every other rule.
A vocabulary table with invalid_reason (concept, concept_relationship) appears in scope, is being used as a source (filtered by vocabulary_id / domain_id / relationship_id / standard_concept / concept_name / concept_code / concept_class_id), and no invalid_reason predicate is asserted; OR
A derived vocabulary table is used as a source for cohort selection. Source-usage detection (for concept_ancestor specifically) recognizes any of:
Primary FROM:FROM concept_ancestor WHERE ….
Direct JOIN:JOIN concept_ancestor ca ON <clinical>.<concept_id_col> = ca.descendant_concept_id (or ca.ancestor_concept_id).
Chained JOIN through concept:JOIN concept c ON … JOIN concept_ancestor ca ON c.concept_id = ca.descendant_concept_id (or ca.ancestor_concept_id).
Multi-ancestor IN-list:WHERE ca.ancestor_concept_id IN (a, b, c, …) of literals.
All four are detected by a single signal: a literal predicate (= or IN (…) of literals) on concept_ancestor.{ancestor,descendant}_concept_id in WHERE / JOIN-ON. That signal is form-agnostic and reliably distinguishes cohort-source usage from lookup-decoration. Lookup-shape JOINs (e.g. FROM concept c JOIN concept_ancestor ca ON ca.ancestor_concept_id = c.concept_id WHERE c.concept_id = 192671 — concept_ancestor's hierarchy columns are equated against another column rather than a literal) correctly stay silent.
The rule is suppressed when any of the following is asserted in the query:
- invalid_reason IS NULL / IS NOT NULL / = '...' / IN (...) / NOT IN (...).
- A date-validity check using both valid_start_date and valid_end_date.
- standard_concept = 'S' (or IN ('S', …)) — standard concepts are nearly always valid, so the additional invalid_reason check would be belt-and-suspenders noise.
- Lookup-join shape: vocabulary table is joined to another vocabulary table and pinned by specific concept_id literals.
-- IN-subquery cohort idiom (semantically equivalent to the JOIN forms)SELECTde.person_idFROMdrug_exposuredeWHEREde.drug_concept_idIN(SELECTdescendant_concept_idFROMconcept_ancestorWHEREancestor_concept_id=43027253);
Correct patterns:
-- JOIN concept and filter invalid_reasonSELECTde.person_idFROMdrug_exposuredeJOINconcept_ancestorcaONde.drug_concept_id=ca.descendant_concept_idJOINconceptcONc.concept_id=ca.descendant_concept_idWHEREca.ancestor_concept_id=43027253ANDc.invalid_reasonISNULL;
-- standard_concept = 'S' is sufficient — standard concepts are nearly always validSELECTconcept_idFROMconceptWHEREvocabulary_id='SNOMED'ANDstandard_concept='S';
ADD: JOIN concept c ON c.concept_id = <table>.<concept_id_col> and WHERE c.invalid_reason IS NULL to exclude deprecated concepts. For queries on concept / concept_relationship directly, just add AND invalid_reason IS NULL to the WHERE clause.
'Maps to' relationships map source concepts to standard concepts, but:
1. Mapping chains can exist (A → B → C) where intermediate concepts are not final
2. Some concept_id_2 targets may have standard_concept = NULL (deprecated)
3. Data quality issues may result in non-standard targets
Without validating that concept_id_2 is actually standard (standard_concept = 'S'),
queries may return:
- Deprecated concepts
- Intermediate non-standard concepts
- Invalid mappings
The concept_relationship table with relationship_id = 'Maps to' is a
one-to-many relationship:
- A source concept can map to multiple standard concepts
- Assuming only one mapping exists leads to:
* Duplicate rows (when joining without DISTINCT)
* Arbitrary/incomplete results (when using scalar subqueries)
* Missing mappings (when using LIMIT 1)
OMOP semantic rule OMOP_022: The _source_concept_id columns store the original source vocabulary concept. For standard analytical queries and cohort identification, use the primary _concept_id (standard concept) rather than *_source_concept_id.
Replace _source_concept_id with corresponding standard _concept_id column. If this is for ETL validation or source exploration, this warning can be ignored.
16. Source Concept ID Should Not Filter Standard Concepts¶
In OMOP CDM, clinical domain tables have two types of concept ID columns:
1. Standard concept IDs (e.g., condition_concept_id):
- Reference standard concepts for analytics
- Should have standard_concept = 'S'
2. Source concept IDs (e.g., condition_source_concept_id):
- Reference the original source vocabulary codes (ICD-10, CPT, etc.)
- Are intentionally NON-standard (standard_concept IS NULL or 'C')
- Preserve the original coding system used in the source data
Filtering standard_concept = 'S' when joining on *_source_concept_id is
semantically wrong and will typically return zero results.
source_to_concept_map contains mappings from many source vocabularies.
The same source_code can exist in multiple vocabularies with different meanings.
Example: Code "250" exists in:
- ICD-9-CM: Diabetes mellitus
- ICD-10-CM: Different condition
- Local hospital codes: Something else entirely
Without source_vocabulary_id filter, you get ALL mappings for "250",
leading to incorrect concept assignments.
When a query reads from a STANDARD OMOP concept field (e.g. condition_concept_id, drug_concept_id), it must establish that the concept IDs in scope are actually standard concepts. Standard fields can hold non-standard values when ETL assumptions break, so cohort definitions without standard-concept enforcement risk silently mixing in classification-only concepts ('C'), invalid entries, or legacy mappings.
The rule fires when a query references a known-standard concept field and none of the following enforcement signals are present:
Explicit standard-concept filter. A predicate of the form concept.standard_concept = 'S' (or IN ('S')) is asserted in WHERE or JOIN ON.
Maps to resolution. The query joins concept_relationship with relationship_id = 'Maps to', indicating the user is resolving source concepts to standard concepts at query time.
Specific concept-id literal filter. The query restricts the standard field to specific concept IDs via = <id> or IN (<id>, …) — the user has chosen specific concepts, presumably with knowledge of their standardness.
Hierarchy-based filter via concept_ancestor. The query restricts the standard field via a direct reference to concept_ancestor's hierarchy. Three equivalent forms are recognized:
Subquery form:<col> IN (SELECT descendant_concept_id FROM concept_ancestor [WHERE …]) (or ancestor_concept_id).
Direct JOIN:JOIN concept_ancestor ca ON <clinical>.<concept_id_col> = ca.descendant_concept_id (or ca.ancestor_concept_id). The more common idiom in OHDSI cohort SQL — avoids a correlated subquery and produces better optimizer plans.
Chained JOIN via concept:JOIN concept c ON <clinical>.<concept_id_col> = c.concept_id JOIN concept_ancestor ca ON c.concept_id = ca.descendant_concept_id (or ca.ancestor_concept_id). Users adopt this shape when they also want to project columns from concept (e.g. concept_name) in the SELECT list. The intermediate concept JOIN is a relay; the standard-concept guarantee is transitive through the chain.
By OMOP CDM definition, concept_ancestor is a hierarchy over Standard Concepts only — both ancestor_concept_id and descendant_concept_id are guaranteed-standard. Feeding rows from concept_ancestor into a *_concept_id slot (via any of the three forms above) transitively guarantees the standard-concept property, so an additional standard_concept = 'S' filter would be redundant.
The concept_ancestor suppression is scope-limited to direct references. CTE-indirected patterns (WITH cte AS (SELECT descendant_concept_id FROM concept_ancestor …) SELECT … WHERE col IN (SELECT concept_id FROM cte)) still fire, because the rule does not currently inline CTEs to verify the indirected guarantee.
-- No enforcement of any kindSELECTcondition_concept_idFROMcondition_occurrence;
-- Joins concept but does not filter standard_conceptSELECTco.person_idFROMcondition_occurrencecoJOINconceptcONco.condition_concept_id=c.concept_idWHEREc.vocabulary_id='SNOMED';
-- Hierarchy-based filter via concept_ancestor — subquery formSELECTde.person_idFROMdrug_exposuredeWHEREde.drug_concept_idIN(SELECTdescendant_concept_idFROMconcept_ancestorWHEREancestor_concept_id=35416207);
-- Hierarchy-based filter via concept_ancestor — direct JOIN formSELECTde.person_idFROMdrug_exposuredeJOINconcept_ancestorcaONde.drug_concept_id=ca.descendant_concept_idWHEREca.ancestor_concept_id=35416207;
-- Hierarchy-based filter via concept_ancestor — chained JOIN through concept-- (when also projecting concept's columns in SELECT)SELECTco.person_id,c.concept_nameFROMcondition_occurrencecoJOINconceptcONco.condition_concept_id=c.concept_idJOINconcept_ancestorcaONc.concept_id=ca.descendant_concept_idWHEREca.ancestor_concept_id=320128;
-- Specific concept ID literalsSELECT*FROMcondition_occurrenceWHEREcondition_concept_idIN(201826,201820);
.` AND `WHERE c.standard_concept = 'S'`to filter to standard concepts. (Skip if the query already restricts viaconcept_ancestor` or specific concept IDs — the rule should not fire in those cases.)
OMOP semantic rule OMOP_037: The standard_concept column only accepts 'S' (Standard), 'C' (Classification), or NULL (non-standard). Filtering with other values like 'Y', 'N', 1, 0 is incorrect.
In OMOP CDM, all standard unit concepts use the UCUM (Unified Code for Units
of Measure) vocabulary. Unit concept columns (*_unit_concept_id) reference
standard concepts from the UCUM vocabulary.
When queries join unit_concept_id to the concept table and filter by
vocabulary_id != 'UCUM', they may:
- Return non-standard units
- Return zero results
- Miss the intended unit concepts
Affected columns:
- measurement.unit_concept_id
- observation.unit_concept_id
- drug_strength.amount_unit_concept_id
- drug_strength.numerator_unit_concept_id
- drug_strength.denominator_unit_concept_id
- specimen.unit_concept_id
- dose_era.unit_concept_id
Clinical tables have care_site_id (foreign key to care_site.care_site_id).
Joining on other columns (e.g., care_site_id to location_id) is semantically
incorrect and produces wrong results.
Clinical tables have care_site_id (foreign key to care_site.care_site_id).
Direct joins to location.location_id bypass the care_site intermediary and are
semantically incorrect (comparing care_site_id with location_id).
Exception: person.location_id is valid - it represents the person's home address.
care_site has location_id (foreign key to location.location_id) to identify
the geographic location of the care site. Joining on other columns (e.g.,
care_site_id to location_id) is semantically incorrect and produces wrong results.
Each clinical event table has its own independent primary key sequence:
- condition_occurrence_id = 123 → A specific condition event
- drug_exposure_id = 123 → A specific drug exposure event
- procedure_occurrence_id = 123 → A specific procedure event
These are completely unrelated - they just happen to have overlapping numeric
values. Joining them is always semantically meaningless.
Joining clinical tables without person_id linkage can match data from different
patients, leading to completely invalid results. For example:
- Patient A has condition_start_date = '2020-01-15'
- Patient B has drug_exposure_start_date = '2020-01-15'
Joining on dates alone would incorrectly associate Patient A's condition with
Patient B's drug exposure.
Clinical tables that require person_id linkage:
- condition_occurrence
- drug_exposure
- procedure_occurrence
- measurement
- observation
- visit_occurrence
- visit_detail
- death
- person
Clinical tables have both visit_occurrence_id and visit_detail_id columns.
These represent different ID spaces:
- visit_occurrence_id: Links to parent visit (visit_occurrence table)
- visit_detail_id: Links to detailed sub-visit (visit_detail table)
Both are integers, so joining visit_occurrence_id to visit_detail_id produces
NO TYPE ERROR, but randomly matches unrelated records where IDs happen to be
equal. This silently corrupts analytical results.
The cohort table is a RESULTS table with unique naming:
- Uses subject_id (not person_id) to identify patients
- This is the ONLY table in OMOP CDM that uses subject_id
- All clinical tables use person_id for patient identity
The ONLY valid join is:
cohort.subject_id = clinical_table.person_id
OMOP clinical tables have both standard concept_id columns (e.g., condition_concept_id)
and source concept_id columns (e.g., condition_source_concept_id). When you need to
join to concept for BOTH, you must use separate aliases. Reusing the same alias causes:
1. Ambiguous references: Which c.concept_id does the ON clause refer to?
2. Last-join-wins: Second JOIN overwrites/conflicts with first JOIN
3. Wrong data returned: You get source when you wanted standard (or vice versa)
4. Silent errors: SQL doesn't error, but results are semantically incorrect
concept_ancestor has two concept_id columns:
- ancestor_concept_id: The parent/higher-level concept
- descendant_concept_id: The child/lower-level concept
When joining to the concept table to retrieve concept_name, the join column
determines WHICH concept's name you get.
The concept table has a primary key (concept_id) and several descriptive columns
(concept_name, concept_code, vocabulary_id, domain_id, etc.). Joining on
descriptive columns instead of the primary key causes:
1. Non-unique matches: concept_name is not unique, causing cartesian joins
2. String matching issues: Case sensitivity, trailing spaces, encoding
3. Performance: String joins are much slower than integer joins
4. Semantic incorrectness: Foreign keys should reference primary keys
concept_relationship has two concept_id columns:
- concept_id_1: The source/origin concept (what you're mapping FROM)
- concept_id_2: The target/destination concept (what you're mapping TO)
When joining to the concept table twice to retrieve names for both concepts,
developers often swap the join columns, causing:
- The "source" alias to actually show the target concept's name
- The "target" alias to actually show the source concept's name
- Completely reversed mapping semantics
The relationship table is a reference table in OMOP that describes relationship types
(e.g., 'Maps to', 'Subsumes', 'Is a', 'Has form'). The concept_relationship table
references relationship via the relationship_id column (VARCHAR FK to
relationship.relationship_id).
The concept_synonym table provides alternative names for concepts:
- concept.concept_id = 123 → "Type 2 diabetes mellitus" (primary name)
- concept_synonym.concept_id = 123, concept_synonym_name = "Diabetes mellitus type 2"
- concept_synonym.concept_id = 123, concept_synonym_name = "T2DM"
Joining on name strings is unreliable because:
1. Names are not unique identifiers
2. Synonyms may match concept names from different concepts
3. String matching is error-prone (case sensitivity, whitespace)
The concept_class table is a reference table in OMOP that describes concept classes
(e.g., 'Clinical Drug', 'Ingredient', 'Procedure', 'Clinical Finding'). The concept
table references concept_class via the concept_class_id column (VARCHAR FK to
concept_class.concept_class_id).
The domain table is a reference table in OMOP that describes domains
(e.g., 'Condition', 'Drug', 'Procedure', 'Measurement'). The concept table
references domain via the domain_id column (VARCHAR FK to domain.domain_id).
The vocabulary table is a reference table in OMOP that describes vocabularies
(e.g., 'SNOMED', 'ICD10CM', 'RxNorm'). The concept table references vocabulary
via the vocabulary_id column (VARCHAR FK to vocabulary.vocabulary_id).
Without the domain filter, a join can produce incorrect results because
cost_event_id is a polymorphic foreign key that can reference different tables.
Example: drug_exposure_id = 123 and procedure_occurrence_id = 123 are DIFFERENT
events, but without cost_domain_id filter, both would match cost_event_id = 123.
The death table has a unique structure with person_id as both primary key
and the ONLY foreign key to other clinical tables. It has NO visit_occurrence_id,
provider_id, or care_site_id columns.
The ONLY valid join is:
death.person_id = visit_occurrence.person_id
SELECTd.person_id,vo.visit_occurrence_id,d.death_date,vo.visit_end_dateFROMdeathdJOINvisit_occurrencevoONd.person_id=vo.person_idWHEREd.death_date=vo.visit_end_date-- Temporal filter in WHERE
The drug_strength table is a vocabulary table that contains strength information
for drug formulations. It has NO clinical event columns like drug_exposure_id,
person_id, or visit_occurrence_id.
The ONLY valid join is:
drug_exposure.drug_concept_id = drug_strength.drug_concept_id
Era tables are DERIVED/AGGREGATED tables built from event tables:
- condition_era is derived from condition_occurrence
- drug_era is derived from drug_exposure
- dose_era is derived from drug_exposure
They represent continuous time periods, not discrete clinical events.
Era tables have NO foreign keys to visit, provider, or care_site.
They ONLY have:
- person_id (FK to person)
- *_concept_id (FK to concept)
Any join to visit/provider/care_site is semantically impossible.
fact_relationship links ANY two facts across the entire OMOP CDM.
Without domain filtering, ID values collide across tables:
- measurement_id = 123 (in measurement table)
- condition_occurrence_id = 123 (in condition_occurrence table)
- procedure_occurrence_id = 123 (in procedure_occurrence table)
When a query JOINs concept (or concept_relationship) to a clinical fact table, the join's ON-clause must connect them via concept_id (not domain_id, vocabulary_id, or any other column). A wrong-key join produces an implicit cross-join or silently empty results with no database error.
The query references a clinical fact table's standard concept field (condition_concept_id, drug_concept_id, etc.), AND
The query references concept or concept_relationship, AND
None of the recognized linkage patterns are present:
Direct JOIN: clinical.<x>_concept_id = concept.concept_id.
Implicit (comma) join with the same equality in WHERE.
Indirect bridge through a CTE that selects a concept_id column from a vocabulary source.
JOIN with unqualified concept_id in the ON clause (when concept is the join target).
Suppression for subquery-only usage. When concept (or concept_relationship) appears only inside a Subquery node — never in the outer query's FROM/JOIN list — the rule suppresses. Examples of suppressed shapes:
In these shapes there's no JOIN to validate; the relevant data-quality concerns (deprecated concepts, trailing whitespace in concept_name, ambiguous multi-row lookups) are covered by other rules: invalid_reason_enforcement, concept_name_whitespace, and standard_concept_enforcement. Suppressing here avoids issuing a join-shaped warning on a non-join query.
LEFT JOIN semantics:
- Returns ALL rows from the left table
- Matched rows from right table have values
- Unmatched rows from right table have NULL values
When a WHERE clause filters on a right table column with non-NULL conditions:
- Rows where right table column is NULL are filtered out
- This defeats the purpose of LEFT JOIN
- Effectively converts LEFT JOIN to INNER JOIN
- Developer likely intended INNER JOIN or should move filter to JOIN ON
This is a common SQL anti-pattern that produces unexpected results.
OMOP semantic rule: Verify that 'Maps to' relationship is used in the correct direction: - concept_id_1 should be the source concept - concept_id_2 should be the standard concept
The note_nlp table contains NLP-extracted entities from clinical notes.
It's a vocabulary-like extension table that has NO direct patient context
columns (no person_id, visit_occurrence_id, provider_id).
The ONLY valid join is:
note_nlp.note_id = note.note_id
A patient can have multiple insurance coverage periods over time:
- person_id = 12345, coverage from 2020-01-01 to 2020-12-31
- person_id = 12345, coverage from 2021-01-01 to 2022-06-30
- person_id = 12345, coverage from 2022-07-01 to 2024-12-31
If you join only on person_id, a drug exposure on 2021-06-15 will match
ALL THREE insurance periods, not just the active one!
person has location_id (foreign key to location.location_id) to identify
the patient's home address. Joining on other columns (e.g., person_id to
location_id) is semantically incorrect and produces wrong results.
preceding_visit_occurrence_id is a self-referential foreign key that links
to the previous visit for a patient. It MUST join to visit_occurrence.visit_occurrence_id.
Patient visit chain example:
- Visit ID 1: ER visit (preceding_visit_occurrence_id = NULL)
- Visit ID 2: Inpatient (preceding_visit_occurrence_id = 1)
- Visit ID 3: Follow-up (preceding_visit_occurrence_id = 2)
Clinical tables have provider_id (foreign key to provider.provider_id).
Joining on other columns (e.g., person_id to provider_id) is semantically
incorrect and produces wrong results.
provider has care_site_id (foreign key to care_site.care_site_id) to identify
the provider's practice location. Joining on other columns (e.g., provider_id
to care_site_id) is semantically incorrect and produces wrong results.
OMOP semantic rule OMOP_034: visit_detail records are nested within visit_occurrence. Queries using visit_detail should join to visit_occurrence via visit_detail.visit_occurrence_id = visit_occurrence.visit_occurrence_id, not via person_id alone.
Even if numeric values overlap (visit_occurrence_id=123, person_id=123 both exist),
they represent completely different entities:
- visit_occurrence_id = 123 → A specific visit/encounter
- person_id = 123 → A patient identifier
These are unrelated despite having the same number.
Detects filtering logic that implies clinical event dates occur in the future. Future event dates may indicate data quality issues or incorrect query logic.
BETWEEN with datetime columns and date literals excludes non-midnight times
on the end date:
WHERE measurement_datetime BETWEEN '2023-01-01' AND '2023-01-31'
-- '2023-01-31' is interpreted as '2023-01-31 00:00:00'
-- Excludes: '2023-01-31 08:30:00', '2023-01-31 23:59:59', etc.
-- SILENT DATA LOSS!
This is a common mistake that's hard to catch because:
- Query executes without error
- Returns results (just incomplete)
- Easy to miss in testing
OMOP Context:
OMOP CDM has parallel DATE and DATETIME columns:
Datetime columns (affected):
- condition_start_datetime, condition_end_datetime
- drug_exposure_start_datetime, drug_exposure_end_datetime
- measurement_datetime
- observation_datetime
- visit_start_datetime, visit_end_datetime
- procedure_datetime
- device_exposure_start_datetime, device_exposure_end_datetime
Date columns (safe with BETWEEN):
- condition_start_date, drug_exposure_start_date, etc.
A person cannot die before they are born. Queries filtering for death_date
before birth_datetime (or death year before year_of_birth) represent:
- Data quality issues (incorrect dates)
- Logic errors in the query
- Incorrect join conditions
Death dates in the future are impossible and represent:
- Data quality issues (incorrect death dates)
- Data entry errors (wrong year, wrong century)
- Logic errors in the query (wrong comparison operator)
Many OMOP tables have start_date and end_date columns representing
temporal events. A query with static date filters that forces the end
date to be before the start date is logically impossible and indicates
a WHERE clause error.
Example Violations:
-- ERROR: Start must be after June, but end must be before January
WHERE condition_start_date > '2023-06-01'
AND condition_end_date < '2023-01-01'
-- ERROR: Start >= June 1, but end < June 1
WHERE drug_exposure_start_date >= '2023-06-01'
AND drug_exposure_end_date < '2023-06-01'
-- ERROR: Start = June 15, but end = May 1
WHERE visit_start_date = '2023-06-15'
AND visit_end_date = '2023-05-01'
Valid Patterns (no violation):
-- OK: Overlapping range is possible
WHERE condition_start_date > '2023-01-01'
AND condition_end_date < '2023-12-31'
-- OK: Could start and end on same day
WHERE visit_start_date >= '2023-06-01'
AND visit_end_date >= '2023-06-01'
-- OK: Dynamic comparison (not static date literals)
WHERE condition_end_date < condition_start_date + INTERVAL '30 days'
OMOP semantic rules CLIN_011, CLIN_045, OMOP_052, OMOP_244, OMOP_529, OMOP_551:
Detects logically impossible date constraints where static filters force
end_date < start_date for the same record
When a query compares date columns across two different clinical event tables (e.g. co.condition_start_date > de.drug_exposure_start_date), the later event must be bounded by observation_period_end_date. Without that bound, the comparison reaches beyond the patient's observed follow-up window — producing immortal-time bias and similar follow-up-window errors in cohort analyses.
The rule is named "future information leakage" for historical reasons; the underlying problem is a missing right-censoring/follow-up-window bound, not ML-style look-ahead leakage.
The rule looks for inequality comparisons (<, <=, >, >=) in WHERE / JOIN ON clauses where:
Both sides are columns from clinical fact tables (different tables on each side).
Both columns are date or datetime columns.
No upper-bound predicate against observation_period_end_date exists anywhere in the same query (BETWEEN ... AND observation_period_end_date also satisfies the bound).
Suppression contract. If observation_period is not joined at all, this rule stays silent and defers to temporal.observation_period_anchoring. The anchoring rule already fires for the same root cause and ships a coherent fix that introduces the JOIN. Firing both would double-count and would emit a patch referencing an op. alias the query never defines.
When observation_period IS joined but no upper bound is asserted, this rule fires with a self-contained patch: it resolves the actual alias the query uses for observation_period (e.g. op, or the bare table name when no alias is given) and substitutes it directly into the ADD patch — no <op> placeholder.
ADD: AND <later_qual>.<later_col> <= <op_alias>.observation_period_end_date — bounds the later event by the patient's observed follow-up window. The rule resolves <op_alias> from the query's actual JOIN list, so the emitted patch is directly applyable.
End date columns are frequently NULL because:
- Ongoing/chronic conditions without resolution
- Drug exposures with unknown end dates
- Single-point-in-time procedures
- Incomplete/ongoing visits
NULL in date arithmetic returns NULL, silently excluding rows from aggregations.
Example impact:
SELECT AVG(DATEDIFF(day, start_date, end_date))
FROM condition_occurrence
-- Returns NULL for rows with NULL end_date
-- Aggregation excludes these rows → biased results
JOIN observation_period op ON clinical_table.person_id = op.person_id AND clinical_table.date BETWEEN op.observation_period_start_date AND op.observation_period_end_date
OMOP semantic rule OMOP_033: When using observation_period to validate patient enrollment, the clinical event date must fall BETWEEN observation_period_start_date AND observation_period_end_date.
Many clinical tables have multiple temporal columns with different nullability:
- A required date column (NOT NULL, always populated)
- Optional datetime columns (may be NULL)
- Optional end date columns (may be NULL for ongoing events)
Using nullable columns for temporal filtering can silently exclude records
where those columns are NULL, leading to incomplete result sets.
Covered Tables and Columns:
condition_occurrence:
- condition_start_date: Required (NOT NULL)
- condition_start_datetime: Optional (nullable)
- condition_end_date: Optional (nullable, often NULL for ongoing conditions)
Example impact:
-- BAD: Uses nullable column
SELECT COUNT(*) FROM drug_exposure
WHERE drug_exposure_end_date BETWEEN '2023-01-01' AND '2023-12-31'
-- May exclude records where end_date is NULL
-- GOOD: Uses required column
SELECT COUNT(*) FROM drug_exposure
WHERE drug_exposure_start_date BETWEEN '2023-01-01' AND '2023-12-31'
-- Includes all records (start_date is always populated)
Correct patterns (no violation):
-- Use required date column
WHERE condition_start_date BETWEEN '2023-01-01' AND '2023-12-31'
-- Or use COALESCE if datetime precision is needed
WHERE COALESCE(condition_start_datetime, condition_start_date) > '2023-01-01'
-- Or explicitly handle NULLs
WHERE condition_start_datetime > '2023-01-01'
AND condition_start_datetime IS NOT NULL
Clinical event dates before 1900 are implausible and represent:
- Data quality issues (incorrect event dates)
- Data entry errors (wrong year, wrong century)
- Logic errors in the query (accidentally filtering for ancient dates)
OMOP semantic rules OMOP_004, OMOP_005, OMOP_024, OMOP_025, OMOP_026: Validates that column data types are compatible in JOIN conditions and WHERE clauses.
All columns ending in _concept_id store integer values representing OMOP concepts.
Comparing these columns with quoted string literals forces the database to perform
implicit type conversion, which:
- Degrades query performance (string-to-integer conversion for every row)
- May fail on some database engines with strict type checking
- Indicates sloppy coding practices
- Can produce unexpected results depending on database collation/casting rules
OMOP vocabulary data sometimes contains concept names with trailing whitespace:
- 'Type 2 diabetes mellitus ' (note the trailing space)
- 'Metformin ' (multiple trailing spaces)
When queries use exact equality (=) without trimming, they may fail to match:
- concept_name = 'Metformin' won't match 'Metformin ' (with trailing spaces)
- This causes silent failures - no error, just missing results
This is particularly problematic because:
- Users don't expect whitespace in concept names
- The mismatch is invisible in most query tools
- Data quality varies across vocabulary versions
The episode table in OMOP CDM represents aggregated clinical events spanning
multiple dates (e.g., treatment regimens, disease episodes, hospitalizations).
The episode_concept_id column defines the TYPE of episode being tracked.
Querying the episode or episode_event tables without filtering by
episode_concept_id can lead to:
- Poor query performance (scanning all episode types)
- Semantic ambiguity (unclear query intent)
- Logical errors (mixing incompatible episode types)
Common episode types include:
- Disease Episode (concept_id 32533)
- Treatment Episode
- Hospitalization Episode
- Drug Era Episode
The fact_relationship table links two clinical events together via a relationship.
In most cases, linking an event to itself doesn't make semantic sense:
- A measurement "preceded by" itself
- A condition "followed by" itself
- A procedure "causally related to" itself
While there might be extremely rare valid cases, queries that explicitly
filter for or create self-referential relationships typically indicate:
- Data quality issues
- Logic errors in ETL processes
- Incorrect query logic
The fact_relationship table links facts across different domain tables using
relationship types defined by relationship_concept_id. Querying fact_relationship
without filtering by relationship_concept_id can lead to:
- Poor query performance (scanning all relationship types)
- Semantic ambiguity (unclear query intent)
- Logical errors (mixing incompatible relationship types)
The fact_relationship table contains three concept_id columns that reference
the concept table:
- domain_concept_id_1: Domain of the first fact
- domain_concept_id_2: Domain of the second fact
- relationship_concept_id: Type of relationship between facts
When joining to the concept table to validate or filter these concept IDs,
queries must check invalid_reason to ensure only valid (current) concepts
are used. Invalid concepts may represent deprecated relationships or domains.
The note_nlp.offset column stores character positions as VARCHAR, not INTEGER.
Developers often mistakenly treat it as a numeric column because:
1. Positions are conceptually numeric
2. Most OMOP CDM position/ID fields are integers
3. The name "offset" suggests a numeric value
This leads to:
- String vs numeric comparison semantics (e.g., '9' > '100' is true)
- Using it in JOINs (semantically incorrect - it's a position, not a key)
- Arithmetic operations without proper casting
The note_nlp.nlp_date column stores when the NLP processing was performed,
NOT when the clinical event occurred. This is a critical semantic distinction:
- nlp_date: Processing timestamp (e.g., when cTAKES ran on the note)
- note.note_date: Actual clinical date of the note/event
Using nlp_date for temporal filtering produces incorrect results because:
1. NLP processing often happens in batches, long after the clinical event
2. Re-running NLP changes nlp_date but not the clinical date
3. The same note processed twice would have different nlp_date values
4. Cohorts defined by nlp_date are non-reproducible across NLP runs
OMOP CDM schema validation: Validates that columns referenced in SQL queries exist in the OMOP CDM schema. Catches common errors like using concept_ancestor columns on concept_relationship table.
UNION removes duplicates by sorting and deduplicating results. For clinical
event data, this is almost always wrong because:
1. Legitimate duplicates: Two events that look identical are still separate
- Two ER visits on same day (morning: chest pain, evening: injury)
- Two drug prescriptions on same day (different physicians/episodes)
- Multiple measurements on same day (repeated tests, different encounters)
- Two procedures on same day (staged surgeries, emergency + planned)
2. Silent data loss: UNION drops events without warning or error
3. Incorrect counts: Event counting becomes inaccurate
- Patient had 5 visits, but UNION shows 3
- Cost analysis missing events
4. Performance: UNION is slower (must sort and deduplicate)
Clinical events have unique primary keys (condition_occurrence_id,
drug_exposure_id, etc.) that make them distinct even if other columns
appear identical. UNION operates on selected columns only, not primary keys.
Each OMOP domain has its own concept_id column:
- condition_occurrence.condition_concept_id
- drug_exposure.drug_concept_id
- procedure_occurrence.procedure_concept_id
- measurement.measurement_concept_id
- observation.observation_concept_id
UNION queries that mix these without domain labels create ambiguous results.
Example impact:
SELECT condition_concept_id AS concept_id
FROM condition_occurrence
UNION ALL
SELECT drug_concept_id AS concept_id
FROM drug_exposure
-- Returns: [201826, 1545958, 313217, ...]
-- Which are conditions? Which are drugs? UNKNOWN!
-- Results are uninterpretable without domain context
OMOP semantic rule: When filtering clinical tables by specific *_concept_id values, warn if concept_id = 0 (unmapped records) is not explicitly handled.
Vocabulary tables contain standardized reference data that all analytical queries
depend on. Modifying these tables can:
- Break all downstream queries that reference affected concepts
- Corrupt the vocabulary structure
- Require a full vocabulary reload to recover
Unqualified column references in multi-table queries create ambiguity:
SELECT person_id, condition_concept_id
FROM condition_occurrence co
JOIN person p ON co.person_id = p.person_id
WHERE person_id = 12345
-- WRONG: Which person_id? co.person_id or p.person_id?
Common ambiguous columns in OMOP:
- person_id: In nearly every clinical table + person table
- provider_id: In clinical event tables + provider table
- care_site_id: In clinical events, visit_occurrence, care_site, person
- visit_occurrence_id: In clinical events + visit_occurrence
- visit_detail_id: In clinical events + visit_detail
This causes:
1. SQL errors: Database rejects query due to ambiguous column
2. Unpredictable behavior: Database picks wrong table's column
3. Silent bugs: Query executes but returns wrong data
attribute_definition is a metadata table with:
- Single column: attribute_definition_id (primary key)
- No foreign keys: edges = {} in CDM schema
- No semantic relationships to clinical or vocabulary data
- Legacy/optional status: May not be used in modern CDM implementations
Any JOIN involving attribute_definition is semantically incorrect because there
are no valid foreign key relationships. The table cannot be meaningfully joined
to person, condition_occurrence, concept, or any other OMOP table.
Comma-separated FROM clauses without proper join conditions create
Cartesian products (cross joins):
SELECT * FROM condition_occurrence, drug_exposure
WHERE condition_concept_id = 201826
-- WRONG: No join condition! Creates 10M × 50M = 500 BILLION rows!
Each clinical table in OMOP has millions of rows:
- condition_occurrence: ~10M rows
- drug_exposure: ~50M rows
- measurement: ~100M rows
- observation: ~50M rows
Without a join condition (co.person_id = de.person_id), the query
creates every possible combination of rows from both tables.
This causes:
- Out of memory errors
- Database crashes
- Production system locks
- Hours of wasted compute time
The concept_ancestor table is a pre-computed transitive closure table
that contains ALL hierarchical relationships across all levels:
- Direct parent-child relationships (1 hop)
- Grandparent-grandchild relationships (2 hops)
- All deeper ancestor-descendant relationships (N hops)
This table is automatically built by traversing the concept_relationship
table and following all hierarchical relationships where:
- relationship_id = 'Is a', 'Subsumes', or other relationships
- relationship.defines_ancestry = 1
When a query already uses concept_ancestor for hierarchy traversal,
also joining concept_relationship with hierarchical relationship_id
filters is redundant because:
1. concept_ancestor already contains this information
2. Mixing both tables may cause duplicate rows
3. It may produce incorrect counts or aggregations
4. It adds unnecessary complexity and performance overhead
SELECTDISTINCTca.descendant_concept_idFROMconcept_ancestorcaJOINconcept_relationshipcrONca.descendant_concept_id=cr.concept_id_1WHEREca.ancestor_concept_id=201820ANDcr.relationship_id='Is a'
SELECTcr.concept_id_2FROMconcept_ancestorcaJOINconceptcONca.descendant_concept_id=c.concept_idJOINconcept_relationshipcrONc.concept_id=cr.concept_id_1WHEREca.ancestor_concept_id=201820ANDcr.relationship_id='RxNorm has dose form'
OMOP vocabulary rule: concept_code is unique only within a vocabulary. Any filter on concept_code must also include a vocabulary_id filter in the same scope, otherwise the query may silently match unintended concepts from other vocabularies.
OMOP vocabulary rule: Filtering by concept_name is an anti-pattern because: 1. Concept names are not guaranteed to be unique (multiple concepts can share a name) 2. Concept names can change across vocabulary versions, breaking queries silently 3. Concept names may have variations (spelling, abbreviations, etc.)
concept_relationship contains direct relationships only:
- Concept A "Subsumes" Concept B (one hop)
- Concept B "Subsumes" Concept C (one hop)
To find all descendants of A (both B and C), users sometimes chain joins:
cr1 → cr2 → cr3 (each hop requires another join)
Issues with manual chaining:
1. Incomplete: Only gets descendants at specific depth (e.g., exactly 3 hops)
2. Fragile: Misses concepts with multiple inheritance paths
3. Performance: Multiple self-joins are slow
4. Complexity: Hard to maintain and understand
The concept_ancestor table pre-computes ALL transitive hierarchical paths
and is optimized for hierarchy traversal queries.
Analysts may accidentally run destructive operations on production data:
DELETE FROM measurement WHERE measurement_date < '2010-01-01'
-- Just deleted thousands of historical measurements!
UPDATE condition_occurrence SET condition_end_date = condition_start_date
-- Modified production patient data without governance!
TRUNCATE TABLE drug_exposure
-- Deleted ALL drug exposure records!
DROP TABLE visit_occurrence
-- DISASTER! Lost all visit data!
In SQL, HAVING without GROUP BY treats the entire result set as a single group.
While syntactically valid in some databases (MySQL, PostgreSQL), this pattern
is almost always a mistake in OMOP queries because:
- HAVING is meant to filter aggregated groups created by GROUP BY
- Without GROUP BY, you should use WHERE instead for filtering
- This indicates the developer forgot to add GROUP BY
- Results in unexpected behavior where aggregate functions apply to entire table
Why this is wrong:
The intent is usually to group by some column and filter those groups,
but the developer forgot the GROUP BY clause. This produces incorrect results
where the HAVING condition applies to the entire dataset as one group.
The metadata table is a metadata table with no primary key and no foreign key
relationships to clinical data. It stores CDM instance metadata such as:
- ETL provenance information
- Data characterization results
- CDM instance-level metrics
It has no relationship to patient-level clinical data.
Joining metadata to clinical tables (person, condition_occurrence, etc.)
is semantically incorrect and indicates confusion about the table's purpose.
Why this is wrong:
- metadata has no person_id or any FK to clinical tables
- metadata_id does NOT link to clinical event IDs
- The table stores instance-level metadata, not patient data
- Joining creates meaningless results
Query the metadata table standalone to retrieve CDM instance information. Do not join it to clinical tables like person, condition_occurrence, or drug_exposure.
Primary key columns in OMOP CDM tables are unique by definition:
- condition_occurrence_id, drug_exposure_id, procedure_occurrence_id, etc.
Using DISTINCT on these columns is either:
1. Redundant: If querying a single table without joins
2. Hiding a join problem: If joins introduce duplicates (Cartesian product)
The presence of DISTINCT on a primary key suggests:
- Misunderstanding of data uniqueness
- Missing or incorrect join conditions
- Unnecessary performance overhead
The standard_concept column has three values:
- 'S': Standard concept (for use in clinical data)
- 'C': Classification concept (hierarchical grouping only)
- NULL: Non-standard concept (deprecated, source-specific)
Classification concepts ('C') are high-level groupings and should NOT be used
in clinical table *_concept_id fields. They're meant for vocabulary hierarchy
and navigation, not patient data.
Using OR logic to include both 'S' and 'C' dilutes data quality by mixing
clinical concepts with non-clinical hierarchy concepts.
Common mistake scenarios:
1. Concept set building with overly permissive filters
(standard_concept = 'S' OR standard_concept = 'C')
2. Using IN clause with both values
standard_concept IN ('S', 'C')
3. Misunderstanding that 'C' concepts are not for clinical use
Type concept columns (*_type_concept_id) reference concepts that describe the
provenance or type of a clinical record:
- condition_type_concept_id: EHR record, Insurance claim, etc.
- drug_type_concept_id: Prescription written, Dispensed in pharmacy, etc.
These type concepts have domain_id = 'Type Concept', not clinical domains like
'Condition' or 'Drug'. When queries join type_concept_id to concept and filter
by clinical domains, they return zero results.
OMOP semantic rule (OMOP_014): The *_type_concept_id columns (e.g., condition_type_concept_id, drug_type_concept_id) represent the provenance of the record (e.g., EHR, claim, patient-reported), not clinical categories. Do not use them to filter for clinical subtypes.
Use the primary concept_id column (e.g., condition_concept_id) for clinical filtering. type_concept_id should only be used to understand data source/provenance.
Use cohort_definition_name or cohort_definition_id for filtering. cohort_definition_syntax should only be retrieved, not filtered with string patterns.
Joining person to condition_occurrence without aggregation can produce
multiple rows per person, leading to incorrect counts or analysis. For example:
- Patient A has 3 condition_occurrence records for diabetes (across 3 visits)
- Query joins person to condition_occurrence: SELECT p.person_id, co.condition_start_date
- Result: 3 rows for Patient A instead of 1
- Counting rows gives "3 patients" when only 1 patient exists
Detection heuristics:
- Query joins person to condition_occurrence on person_id
- No GROUP BY clause present
- No DISTINCT in SELECT
- No aggregation functions (COUNT, MIN, MAX, etc.)
visit_detail records are nested within visit_occurrence records. A condition can
be linked to a visit_detail, and that visit_detail belongs to a visit_occurrence.
If a query joins condition_occurrence to visit_detail but then tries to access
visit_occurrence columns without properly joining to visit_occurrence, it will
either fail (if the table isn't in FROM/JOIN) or produce incorrect results.
Example violation:
-- BAD: References vo.visit_start_date without joining visit_occurrence
SELECT co.*, vo.visit_start_date
FROM condition_occurrence co
JOIN visit_detail vd ON co.visit_detail_id = vd.visit_detail_id
-- ERROR: vo referenced but not joined
Correct pattern:
-- GOOD: Properly joins through visit_occurrence
SELECT co.*, vo.visit_start_date
FROM condition_occurrence co
JOIN visit_detail vd ON co.visit_detail_id = vd.visit_detail_id
JOIN visit_occurrence vo ON vd.visit_occurrence_id = vo.visit_occurrence_id
The cost table has a payer_plan_period_id column (INTEGER FK) that references
payer_plan_period.payer_plan_period_id (INTEGER PK). This is the correct join key.
Using death_cause_source_concept_id in WHERE clauses or JOINs is incorrect for
analytical work because:
- It represents source/local vocabulary codes, not standardized OMOP concepts
- Analytical queries should use standardized concepts for reproducibility
- Source concepts are intended for ETL validation, mapping verification, or provenance tracking
The death table has a minimal schema:
- person_id (FK to person)
- death_date
- death_datetime
- death_type_concept_id
- death_cause_concept_id
- death_cause_source_value
- death_cause_source_concept_id
It does NOT have foreign keys to clinical event tables like:
- visit_occurrence_id
- condition_occurrence_id
- drug_exposure_id
- procedure_occurrence_id
The only valid join from death to other tables is via person_id.
Why this is wrong:
Developers sometimes mistakenly try to join death directly to clinical event
tables using incorrect column mappings, such as:
- death.person_id = visit_occurrence.visit_occurrence_id (wrong types)
- death.death_type_concept_id = condition_occurrence.condition_concept_id (semantically wrong)
Counting rows in drug_exposure without awareness of multiple records per person
per drug can produce misleading statistics. For example:
- Patient A has 3 drug_exposure records for metformin (3 refills)
- Query: SELECT drug_concept_id, COUNT(*) FROM drug_exposure GROUP BY drug_concept_id
- Result: exposure_count = 3 for metformin
- Misleading: This counts prescription fills, not unique patients
Common mistake: Using COUNT(*) when you want to count unique patients.
Detection patterns:
- Query uses COUNT(*) or COUNT(column) on drug_exposure table
- COUNT does NOT use DISTINCT person_id
- Suggests using COUNT(DISTINCT person_id) or drug_era table
The sig field contains unstructured free-text that varies widely across sites:
- "1 tab bid"
- "take one tablet twice daily"
- "Take 1-2 tablets by mouth every 4-6 hours as needed"
Parsing this for structured dosing data is unreliable and error-prone.
The drug_strength table provides standardized, structured dosing information.
quantity in the drug_exposure table represents the amount dispensed (e.g., 30 tablets).
Negative values are never valid and indicate data quality issues or query logic errors.
drug_strength is a vocabulary table with temporal validity:
- Same drug_concept_id may have multiple strength records over time
- Formulations change (new strengths, discontinued strengths)
- invalid_reason IS NULL indicates currently valid records
Querying without validity filters returns BOTH current AND historical strengths
→ incorrect calculations, duplicate results
Example impact:
-- Returns 3 rows: old formulation (100mg), current (150mg), deprecated (200mg)
SELECT drug_concept_id, amount_value, amount_unit_concept_id
FROM drug_strength
WHERE drug_concept_id = 19078461
-- Which strength is correct? Calculation uses wrong value!
The location_history table tracks location changes for different entity types
using a polymorphic foreign key pattern:
- entity_id: Polymorphic FK to person_id, provider_id, or care_site_id
- domain_id: Discriminator identifying which table entity_id refers to
- 'Person' → entity_id refers to person.person_id
- 'Provider' → entity_id refers to provider.provider_id
- 'Care Site' → entity_id refers to care_site.care_site_id
Without filtering on domain_id, joins on entity_id are ambiguous because:
- Integer IDs can collide across tables (person_id=123, provider_id=123)
- Query may incorrectly match entities from wrong domain
- Results will include mixed entity types
SELECTperson_id,AVG(value_as_number)FROMmeasurementWHEREmeasurement_concept_id=3004410ANDunit_concept_idIN(8753,8840)-- mmol/L and mg/dLGROUPBYperson_id
Correct patterns:
SELECTAVG(value_as_number)ASavg_glucose_mmolFROMmeasurementWHEREmeasurement_concept_id=3004410ANDunit_concept_id=8753-- mmol/L only
The measurement table can contain duplicate records with the same:
- person_id
- measurement_concept_id
- measurement_date
These duplicates can occur from:
- ETL processing errors (same source record loaded multiple times)
- Data quality issues (duplicate submissions from source systems)
- Integration artifacts (same measurement from different data feeds)
- Multiple precision levels (e.g., 5.5 and 5.52 for same measurement)
Unlike drug_exposure (where refills are expected) or condition_occurrence
(where recurring diagnoses are expected), duplicate measurements at the
same date typically indicate data quality issues rather than clinical reality.
Detection patterns:
Query performs aggregations or counts on measurement table without:
- Grouping by natural key (person_id, measurement_concept_id, measurement_date)
- Using DISTINCT
- Using deduplication logic (ROW_NUMBER, etc.)
Violation example:
-- BAD: Counts may include duplicates
SELECT person_id, AVG(value_as_number) AS avg_glucose
FROM measurement
WHERE measurement_concept_id = 3004410
GROUP BY person_id
-- If person has 2 identical measurements on same date, average is skewed
Correct patterns:
-- GOOD: Group by natural key to handle duplicates
SELECT person_id, measurement_date, measurement_concept_id,
AVG(value_as_number) AS avg_value
FROM measurement
WHERE measurement_concept_id = 3004410
GROUP BY person_id, measurement_date, measurement_concept_id
-- GOOD: Use ROW_NUMBER to deduplicate
WITH ranked AS (
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY person_id, measurement_concept_id, measurement_date
ORDER BY measurement_datetime NULLS LAST
) AS rn
FROM measurement
)
SELECT * FROM ranked WHERE rn = 1
-- GOOD: Use DISTINCT
SELECT DISTINCT person_id, measurement_concept_id, measurement_date
FROM measurement
Group by natural key (person_id, measurement_concept_id, measurement_date), use DISTINCT, or apply explicit deduplication logic like ROW_NUMBER() OVER (PARTITION BY person_id, measurement_concept_id, measurement_date).
operator_concept_id indicates the comparison operator for value_as_number.
Only 5 specific concept_ids are valid operators in OMOP CDM.
Using any other concept_id is incorrect and will cause data integrity issues.
In the measurement table, range_low and range_high represent the normal
reference range for a measurement. By definition, range_low must be ≤ range_high.
A query that filters for range_low > range_high is logically impossible
unless it's a data quality check.
Add a unit_concept_id constraint alongside the numeric threshold: AND m.unit_concept_id = . Look up the correct UCUM unit concept ID in the OMOP vocabulary (e.g. SELECT concept_id FROM concept WHERE concept_code = '%' AND vocabulary_id = 'UCUM').
Both CAN be populated simultaneously for the same measurement, but they typically
represent different aspects of the result. Filtering both with AND is usually
a logic error that will be overly restrictive.
note_nlp columns and their purposes:
- snippet: Short text excerpt around the NLP-extracted term (for context)
- lexical_variant: The exact text string found in the note
- note_nlp_concept_id: Standardized OMOP concept_id for the extracted entity
The observation table stores observation results in multiple format columns:
- value_as_number: for numeric results (e.g., 98.6, 120, 7.2)
- value_as_string: for text results (e.g., "Positive", "Negative", "High")
- value_as_concept_id: for coded results (e.g., concept_id for "Normal")
For a given observation row, typically only ONE of these columns is populated;
the others are NULL. They represent alternative formats for the same result,
not complementary data points.
observation_concept_id = "What are you measuring?" (e.g., "Blood pressure")
value_as_concept_id = "What is the answer?" (e.g., "High", "Low", "Normal")
Using the same concept_id for both means you're saying:
"I'm measuring Blood Pressure, and the answer is Blood Pressure" - which is nonsensical.
value_as_string is a VARCHAR field. Applying numeric comparison
operators directly is semantically incorrect:
- The database may silently coerce strings to numbers, returning
wrong results or errors when non-numeric strings are encountered.
- The proper column for numeric comparisons is value_as_number.
The episode_event table is a linking table that connects episodes to their
constituent clinical events. The table schema is:
- episode_id (FK to episode.episode_id)
- event_id (polymorphic ID to clinical event)
- episode_event_field_concept_id (indicates which domain event_id refers to)
The episode_event table does NOT have a person_id column.
Developers sometimes mistakenly try to:
1. Join episode_event directly to person on person_id (column doesn't exist)
2. Filter episode_event by person_id (column doesn't exist)
3. Select episode_event.person_id (column doesn't exist)
4. Use person_id in WHERE/ORDER BY/GROUP BY with episode_event
Why this is wrong:
The episode_event table is intentionally designed without person_id to avoid
denormalization. Person information is accessed through the episode table:
- episode_event contains the event linkages
- episode contains the person_id and episode metadata
- This ensures data consistency and proper normalization
Attempting to use person_id on episode_event:
- Causes SQL errors (column does not exist)
- Indicates misunderstanding of episode_event table structure
- Breaks query execution
The specimen table has a confusing naming pattern:
- specimen_id (INTEGER): OMOP primary key
- specimen_source_id (VARCHAR): Source system identifier (free-text)
The naming confusion:
- Most OMOP *_id columns are INTEGER foreign keys or primary keys
- Most source identifiers use *_source_value (VARCHAR)
- But specimen_source_id breaks this pattern - it LOOKS like a FK but is VARCHAR
This is the ONLY column in OMOP CDM that ends with _source_id as a free-text field.
All other *_source_id columns would be *_source_concept_id (INTEGER FKs to concept).
Common mistake:
Developers see "specimen_source_id" and assume it's a numeric foreign key
that can be joined to other tables. This is incorrect.
The admission source and discharge destination columns represent WHERE the
patient came from and WHERE they went, not WHAT condition they had or WHAT
was done to them.
Correct domains:
- Visit: Emergency Room Visit, Inpatient Visit, Outpatient Visit
- Place of Service: Home, Skilled Nursing Facility, Hospice, Rehabilitation
Incorrect domains:
- Condition: Diabetes, Myocardial Infarction (these are clinical diagnoses)
- Drug: Aspirin, Metformin (these are medications)
- Procedure: Appendectomy, Chemotherapy (these are treatments)
When queries use hardcoded concept IDs in these columns without verifying
the domain, they risk using clinical concepts as location concepts.
Queries that filter for visit_detail dates OUTSIDE the parent visit range
indicate a logic error or misunderstanding of the visit hierarchy:
- visit_detail_start_date should be >= visit_start_date
- visit_detail_end_date should be <= visit_end_date
The OMOP CDM has two separate temporal chains for visits:
1. visit_occurrence temporal chain:
- Uses preceding_visit_occurrence_id to link to previous visits
- visit_occurrence.preceding_visit_occurrence_id → visit_occurrence.visit_occurrence_id
2. visit_detail temporal chain:
- Uses preceding_visit_detail_id to link to previous visit details
- visit_detail.preceding_visit_detail_id → visit_detail.visit_detail_id
The visit_detail table does NOT have a preceding_visit_occurrence_id column.
visit_detail provides granular sub-visit information (ICU stay, ward transfer,
operating room), but critical context is stored in visit_occurrence:
- Overall visit type (inpatient, outpatient, ER)
- Visit-level dates (visit_start_date, visit_end_date)
- Visit-level provider and care site
- Admission source and discharge destination
Analyzing visit_detail without referencing visit_occurrence loses this context.
Clinical events are linked to visits via visit_occurrence_id. These events
should occur during the visit:
- event_date >= visit_start_date
- event_date <= visit_end_date (if not NULL)
If a query filters for event_date < visit_start_date, this suggests:
1. Wrong visit_occurrence_id (join error)
2. Data quality issue in the source data
3. Logic error in the query
The relationship vocabulary table uses boolean flags to indicate:
- is_hierarchical: Whether the relationship represents a hierarchy
- defines_ancestry: Whether the relationship defines ancestry paths
When filtering these columns, developers sometimes use incorrect value types:
- String literals: 'true', 'false', '1', '0'
- Invalid integers: 2, -1, or any value other than 0 or 1
This causes issues:
1. Type mismatch errors in strongly-typed databases
2. Incorrect comparisons (boolean vs string comparison semantics differ)
3. Performance problems (prevents index usage)
4. Silent failures or unexpected results
Why this is wrong:
Boolean columns should be compared with boolean-compatible values:
- In most SQL dialects, booleans are represented as 1 (TRUE) or 0 (FALSE)
- Comparing with strings requires implicit conversion that may fail
- Using invalid integers (2, -1, etc.) is semantically meaningless
- String comparisons have different semantics than boolean comparisons