Bias in studies of prenatal exposures using real-world data due to pregnancy identification method
Abstract
Background: Researchers typically identify pregnancies in healthcare data based on observed outcomes (e.g., delivery). This outcome-based approach misses pregnancies that received prenatal care but whose outcomes were not recorded (e.g., at-home miscarriage), potentially inducing selection bias in effect estimates for prenatal exposures. Alternatively, prenatal encounters can be used to identify pregnancies, including those with unobserved outcomes. However, this prenatal approach requires methods to address missing data.
Methods: We simulated 10,000,000 pregnancies and estimated the total effect of initiating treatment on the risk of preeclampsia. We generated data for 36 scenarios in which we varied the effect of treatment on miscarriage and/or preeclampsia; the percentage of pregnancies with missing outcomes (5% or 20%); and the cause of missingness: (1) measured covariates, (2) unobserved miscarriage, or (3) a mix of both. We then created three analytic samples to address missing pregnancy outcomes: observed deliveries, observed deliveries and miscarriages, and all pregnancies. Treatment effects were estimated using non-parametric direct standardization.
Results: Risk differences (RDs) and risk ratios (RRs) from the three analytic samples were similarly biased when all missingness was due to unobserved miscarriage (log-transformed RR bias range: -0.12 to 0.33 among observed deliveries; -0.11 to 0.32 among observed deliveries and miscarriages; and -0.11 to 0.32 among all pregnancies). When predictors of missingness were measured, only the all-pregnancies approach was unbiased (-0.27 to 0.33 among observed deliveries; -0.29 to 0.03 among observed deliveries and miscarriages; and -0.02 to 0.01 among all pregnancies).
Conclusions: When all missingness was due to miscarriage, the three analytic samples returned similar effect estimates. Only among all pregnancies did bias decrease as the proportion of missingness due to measured variables increased.
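For readers unfamiliar with the estimator named in the Methods, the following is a minimal sketch of non-parametric direct standardization for a binary treatment and a binary outcome. It is an illustration under stated assumptions, not the authors' code. The function names (`standardized_risks`, `log_rr_bias`, `expand`), the single categorical covariate, and the toy counts are all hypothetical. Defining bias on the log scale as the estimated log RR minus the true log RR is likewise an assumption about how "log-transformed RR bias" is computed.

```python
from collections import defaultdict
from math import log


def standardized_risks(records):
    """Return (risk if all treated, risk if all untreated).

    records: iterable of (stratum, treated, outcome) with treated and outcome in {0, 1}.
    Stratum-specific risks are averaged over the covariate distribution
    of the whole sample (direct standardization to the total population).
    """
    # stratum -> treatment arm -> [number of events, number of pregnancies]
    counts = defaultdict(lambda: {0: [0, 0], 1: [0, 0]})
    total = 0
    for stratum, treated, outcome in records:
        cell = counts[stratum][treated]
        cell[0] += outcome
        cell[1] += 1
        total += 1

    risk = {0: 0.0, 1: 0.0}
    for stratum, arms in counts.items():
        weight = (arms[0][1] + arms[1][1]) / total  # share of sample in this stratum
        for arm in (0, 1):
            events, n = arms[arm]
            if n == 0:
                # Positivity violation: no one in this stratum received this arm.
                raise ValueError(f"no observations for arm {arm} in stratum {stratum!r}")
            risk[arm] += weight * events / n
    return risk[1], risk[0]


def log_rr_bias(estimated_rr, true_rr):
    """Bias on the log scale (assumed definition: log estimate minus log truth)."""
    return log(estimated_rr) - log(true_rr)


def expand(stratum, treated, events, n):
    """Hypothetical helper: turn aggregate counts into individual records."""
    return [(stratum, treated, 1)] * events + [(stratum, treated, 0)] * (n - events)


# Toy data, invented for illustration only (not from the simulation study).
records = (
    expand("low_risk", 1, 2, 10) + expand("low_risk", 0, 3, 30)
    + expand("high_risk", 1, 15, 30) + expand("high_risk", 0, 4, 10)
)

risk_treated, risk_untreated = standardized_risks(records)
rd = risk_treated - risk_untreated
rr = risk_treated / risk_untreated
print(f"RD = {rd:.3f}, RR = {rr:.3f}")
```

In this toy sample the crude risks are 0.425 (treated) and 0.175 (untreated), whereas the standardized risks are about 0.35 and 0.25, because treatment is concentrated in the higher-risk stratum. The sketch adjusts only for a measured covariate; it does nothing on its own about outcomes that are missing because a miscarriage went unobserved.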