When addressing environmental health questions, researchers can usually collect only non-randomized observational data, for ethical or practical reasons. The lack of randomized exposure hinders the identification of causal effects on health outcomes. This barrier has led the field of environmental epidemiology to rely mainly on associational models to examine relationships between environmental exposures and health outcomes. However, we believe that epidemiological research should focus on estimating the causal effects of plausible hypothetical interventions (e.g., reducing PM10) and on suggesting preventive environmental actions. Statistical matching techniques exist to reconstruct plausible randomized experiments, which are the "gold standard" for establishing causality. Based on an environmental epidemiology example, we will present (1) how the conceptual and design stages can be performed prior to statistical estimation of causal effects, and (2) why this conceptual and computational work is relevant for making policy recommendations. A case-crossover study by Jeanjean et al. (2017) reported significant associations between multiple sclerosis (MS) relapse incidence and exposures to NO2, PM10, and O3. With the same data, we will show how constructing hypothetical experiments, thereby creating comparable groups of MS patients, can provide results that are relevant for policy makers. Several epidemiological studies have reported significant associations between air pollution and multiple sclerosis (Oikonen et al. 2003, Gregory et al. 2008, Heydarpour et al. 2014, Angelici et al. 2016, Jeanjean et al. 2017), whereas others did not find any (Palacios et al. 2017). These conflicting findings demonstrate the gap in knowledge about the causal relationship between air pollution and MS. Our objective is to show that carefully avoiding confounding of the exposure assignment prior to any analysis can help to examine whether observed associations are truly causal.
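
The idea of creating comparable groups by matching can be illustrated with a minimal sketch. It is not the procedure used in the study: it assumes a single background covariate (a hypothetical daily temperature), a hypothetical caliper, and a simple greedy 1:1 nearest-neighbor rule, whereas a real design stage would typically balance many covariates at once.

```python
from typing import List, Tuple


def greedy_match(exposed: List[float], unexposed: List[float],
                 caliper: float) -> List[Tuple[int, int]]:
    """Pair each exposed unit with its closest unused unexposed unit.

    Inputs hold one background covariate per unit (an assumption made
    for illustration). A pair is kept only if the covariate distance
    is within `caliper`; unmatched units are dropped from the design.
    """
    available = set(range(len(unexposed)))
    pairs = []
    for i, x in enumerate(exposed):
        if not available:
            break
        # Closest remaining unexposed unit; ties broken by lowest index.
        j = min(available, key=lambda k: (abs(unexposed[k] - x), k))
        if abs(unexposed[j] - x) <= caliper:
            pairs.append((i, j))
            available.remove(j)
    return pairs


# Hypothetical temperatures (degrees C) on high- and low-PM10 days.
high_pm10_days = [12.0, 18.5, 25.0]
low_pm10_days = [11.5, 30.0, 18.0, 12.2]
pairs = greedy_match(high_pm10_days, low_pm10_days, caliper=1.0)
print(pairs)  # index pairs (exposed, unexposed) retained by the design
```

The third high-PM10 day has no low-PM10 day within the caliper, so it is excluded rather than paired with a poor match; this is how the design stage trades sample size for comparability.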

Although the general population is exposed to multiple chemicals, most environmental epidemiology studies consider each chemical separately when estimating the adverse effects of environmental exposures, partly because of the lack of software accessible to environmental epidemiologists. Objective: We developed a statistical R package, accessible to environmental health scientists, that can provide valid estimates of the effects of environmental contaminants, dose-response relationships, and interactions in a multi-pollutant setting. Methods: This package implements the work of Oulhote et al. (2017), which combines the G-formula, a maximum likelihood estimator, with the ensemble learning technique Super Learner. It incorporates a wide range of statistical techniques (e.g., generalized linear and additive models, random forest, extreme gradient boosting) to estimate effects and provide valid inference in multi-pollutant settings, and to reconstruct dose-response relationships non-parametrically. Results: This package will facilitate multi-pollutant analyses in environmental epidemiology. We also developed a user-friendly Shiny application that can be used by biomedical researchers modeling complex mixtures. We ran multiple simulations based on real scenarios, and the proposed method yielded promising results across all scenarios. Our approach was also able to recover the true underlying structure of the data. We will present a tutorial describing the package functions and visualizations to illustrate the methods. Conclusion: This package will help to unravel the effects of chemical mixtures and their interactions in epidemiological studies. Youssef Oulhote, Marie-Abele Bind, Brent Coull, Chirag Patel, Philippe Grandjean. 2017. Combining Ensemble Learning Techniques and G-Computation to Investigate Chemical Mixtures in Environmental Epidemiology Studies. bioRxiv. https://www.biorxiv.org/content/early/2017/06/30/147413.article-info
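
As a rough illustration of the G-formula step only, the sketch below standardizes stratum-specific mean outcomes over the confounder distribution. It is written in Python rather than R, it is not the package's API, and it swaps the Super Learner outcome model for plain stratum means with a single binary exposure and one discrete confounder; the data and all names are hypothetical.

```python
from collections import defaultdict
from typing import List, Tuple

# (exposure a, confounder stratum c, outcome y); hypothetical layout
Record = Tuple[int, int, float]


def g_formula_mean(data: List[Record], a: int) -> float:
    """Mean outcome had every unit received exposure level `a`.

    Computes sum_c E[Y | A=a, C=c] * P(C=c), with stratum-specific
    sample means standing in for a fitted outcome model.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    stratum_sizes = defaultdict(int)
    for exposure, c, y in data:
        stratum_sizes[c] += 1
        if exposure == a:
            sums[c] += y
            counts[c] += 1
    total = 0.0
    for c, size in stratum_sizes.items():
        if counts[c] == 0:
            raise ValueError(f"no units with exposure {a} in stratum {c}")
        total += (sums[c] / counts[c]) * (size / len(data))
    return total


# Toy data in which exposure is more common in stratum 1.
data = [
    (1, 0, 3.0), (0, 0, 1.0), (0, 0, 1.0), (0, 0, 1.0),
    (1, 1, 6.0), (1, 1, 6.0), (1, 1, 6.0), (0, 1, 4.0),
]
effect = g_formula_mean(data, 1) - g_formula_mean(data, 0)

# Unadjusted contrast, for comparison.
exposed = [y for a, _, y in data if a == 1]
unexposed = [y for a, _, y in data if a == 0]
naive = sum(exposed) / len(exposed) - sum(unexposed) / len(unexposed)
print(effect, naive)
```

In this toy example the standardized contrast differs from the unadjusted one because exposure is unevenly distributed across strata. A flexible learner would replace the stratum means when exposures are continuous or numerous.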

Causal inference in air pollution epidemiology often violates the no-interference assumption, which states that the outcome for any unit is independent of the treatment assigned to other units. In air pollution studies, this assumption does not hold when pollutants are carried downwind. Instrumental variables (IVs) are used in causal analyses of observational studies when treatment assignment cannot be considered random. An IV is considered valid when (i) the IV can explain the treatment, e.g., air pollution, conditional on covariates, and (ii) the IV affects the outcome only through the treatment, conditional on covariates. We use incoming wind speed (WS) in areal units as an IV and propose a novel way to estimate direct and spillover effects of air pollution exposure on health outcomes based on compliance groups. We consider compliers to be units with high incoming WS and high pollution, or low incoming WS and low pollution. Defiers are units with low incoming WS and high pollution, or high incoming WS and low pollution. Direct and spillover effects can be estimated within the complier and defier groups. We apply our Bayesian model to a 500 Cities dataset to estimate the effects of particulate matter on asthma.
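
The complier and defier definitions above can be written down directly. The sketch below only labels units; it does not implement the Bayesian model or the effect estimation. The cutoffs separating "high" from "low", the units of measurement, and the function names are assumptions made for illustration.

```python
from typing import List


def compliance_group(high_wind: bool, high_pollution: bool) -> str:
    """Complier if wind and pollution levels agree, defier otherwise."""
    return "complier" if high_wind == high_pollution else "defier"


def label_units(wind_speed: List[float], pollution: List[float],
                ws_cutoff: float, pm_cutoff: float) -> List[str]:
    """Label each areal unit from its incoming wind speed and pollution.

    Both cutoffs are hypothetical; the text does not say how
    'high' and 'low' are defined.
    """
    return [
        compliance_group(ws > ws_cutoff, pm > pm_cutoff)
        for ws, pm in zip(wind_speed, pollution)
    ]


# Hypothetical values for four areal units.
wind_speed = [2.0, 7.5, 6.0, 1.5]
particulate = [6.0, 14.0, 5.0, 12.0]
groups = label_units(wind_speed, particulate, ws_cutoff=5.0, pm_cutoff=10.0)
print(groups)
```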

Many causal inference methods for observational studies focus on a subset of units that exhibit covariate balance among treatment groups and thus plausibly reconstruct a hypothetical randomized experiment. For example, matching methodologies match units that are similar with respect to background covariates, and regression discontinuity designs restrict analyses to units around the discontinuity with covariate balance. For these methods, it is common to (1) restrict analyses to a single subset, and (2) analyze that subset as if it were from a completely randomized experiment, regardless of the level of covariate balance in that subset. Instead, we propose a method that (1) finds multiple subsets that plausibly reconstruct a hypothetical randomized experiment, (2) analyzes each subset conditional on the level of covariate balance that subset achieves, and (3) combines these analyses into a single point estimate and uncertainty interval. Using an empirical example, we show that our procedure yields more precise inferences than standard methodologies by utilizing more units in the study as well as taking advantage of assignment mechanisms that account for covariate balance.
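
One common way to quantify the covariate balance of a candidate subset is the standardized mean difference (SMD). The abstract does not say which balance measure the proposed method conditions on, so the sketch below is an assumption for illustration, using hypothetical covariate values.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence


def standardized_mean_difference(treated: Sequence[float],
                                 control: Sequence[float]) -> float:
    """Difference in covariate means scaled by the pooled standard deviation.

    Uses sample variances; values near zero indicate that the covariate
    is similarly distributed in the two treatment groups.
    """
    pooled_sd = sqrt((variance(treated) + variance(control)) / 2)
    return (mean(treated) - mean(control)) / pooled_sd


# Hypothetical covariate values in one candidate subset.
treated_age = [1.0, 2.0, 3.0]
control_age = [2.0, 3.0, 4.0]
smd = standardized_mean_difference(treated_age, control_age)
print(smd)
```

Computing such a measure for each candidate subset gives a concrete notion of "the level of covariate balance that subset achieves", although how the method uses that level in the analysis is not specified here.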

Consider an analysis that aims to quantify the causal effects of exposure to multiple pesticides on a health outcome. For ethical reasons, this important question cannot be answered by randomly assigning participants to different levels of pesticides. However, a precise and rigorous analysis of observational data can help answer this causal question. The reconstruction of hypothetical factorial randomized experiments is complex due to the lack of randomization in observational studies. Certain treatment combinations may also be so rare that no outcomes are measured for them. We propose to recreate a hypothetical fractional factorial experiment instead of a full factorial experiment, and we define the causal estimands of interest. We illustrate our method using biomedical data from NHANES and estimate the effects of four common pesticides on body mass index and C-reactive protein in a matched dataset. Finally, we contrast our method with standard model-based approaches applied to the original dataset.
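
To make the contrast between a full and a fractional factorial design concrete, the sketch below enumerates a half fraction of a two-level design with four factors. The choice of a half fraction, the generator (the last factor set to the product of the others), and the labels A to D are assumptions for illustration; the abstract does not state which fraction is used.

```python
from itertools import product
from math import prod
from typing import List, Tuple


def half_fraction(n_factors: int = 4) -> List[Tuple[int, ...]]:
    """Half fraction of a two-level factorial design in -1/+1 coding.

    The first n_factors - 1 factors take all level combinations; the
    last factor is set to their product (generator D = ABC when
    n_factors is 4), so only half of the full design is needed.
    """
    runs = []
    for levels in product((-1, 1), repeat=n_factors - 1):
        runs.append(levels + (prod(levels),))
    return runs


# Four hypothetical pesticides A, B, C, D; +1 = high, -1 = low exposure.
design = half_fraction(4)
for run in design:
    print(dict(zip("ABCD", run)))
```

The full design would require all 16 exposure combinations to be observed, while this fraction requires 8. The cost is that some effects are aliased with one another, which is why the estimands need to be defined carefully.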