On the Application of Bayesian Leave-One-Out Cross-Validation to Exoplanet Atmospheric Analysis
Abstract
Over the last decade exoplanetary transmission spectra have yielded an unprecedented understanding about the physical and chemical nature of planets outside our solar system. Physical and chemical knowledge is mainly extracted via fitting competing models to spectroscopic data, based on some goodness-of-fit metric. However, current employed metrics shed little light on how exactly a given model is failing at the individual data point level and where it could be improved. As the quality of our data and complexity of our models increases, there is an urgent need to better understand which observations are driving our model interpretations. Here we present the application of Bayesian leave-one-out cross validation to assess the performance of exoplanet atmospheric models and compute the expected log pointwise predictive density (elpd). elpd estimates the out of sample predictive accuracy of an atmospheric model at data point resolution providing interpretable model criticism. We introduce and demonstrate this method on synthetic HST transmission spectra of a hot Jupiter. We apply elpd to interpret current observations of HAT-P-41b and assess the reliability of recent inferences of H- in its atmosphere. We find that previous detections of H- are dependent solely on a single data point. This new metric for exoplanetary retrievals complements and expands our repertoire of tools to better understand the limits of our models and data. elpd provides the means to interrogate models at the single data point level, a prerequisite for robustly interpreting the imminent wealth of spectroscopic information coming from JWST.
1 Introduction
Our understanding of exoplanet atmospheres has fundamentally changed in the last decade due to the plethora of space- and ground-based spectroscopic observations of transiting exoplanets from facilities such as the Hubble Space Telescope (HST, e.g., Deming et al., 2013; McCullough et al., 2014; Sing et al., 2016), the Spitzer Space Telescope (e.g., Désert et al., 2011; Pont et al., 2013; Stevenson et al., 2017), the Very Large Telescope (VLT, e.g., Nikolov et al., 2016; Sedaghati et al., 2017), among others (e.g., Chen et al., 2017; Rackham et al., 2017; Chen et al., 2018; Espinoza et al., 2019; von Essen et al., 2019). In particular, the transmission spectra of exoplanets, which encode the apparent change in planetary size as a function of wavelength, have provided us with important constraints on the chemical composition of exoplanet atmospheres (e.g., Charbonneau et al., 2002; Deming et al., 2013; Sing et al., 2016), information regarding the prevalence of clouds and hazes (e.g., Pont et al., 2008; Nikolov et al., 2016; Kreidberg et al., 2014a; Benneke et al., 2019), and other atmospheric properties at the day-night terminator of the planet (see e.g., Madhusudhan, 2019, for a review).
The interpretation of these observations is routinely performed by means of model fitting and comparison techniques, i.e., atmospheric retrievals. Currently, atmospheric retrievals using Bayesian inference methods are widely employed to obtain parameter estimates from transit spectra (see e.g., Madhusudhan, 2018, for a review). Through comparing competing atmospheric models, multiple chemical species have been identified and proposed as the cause of observed absorption features (e.g., Nikolov et al., 2016; Sedaghati et al., 2017; Chen et al., 2017). The retrieved constraints have enabled initial results on the abundances of these chemical species (e.g., Kreidberg et al., 2014b; Barstow et al., 2017; MacDonald & Madhusudhan, 2017), as well as comparative studies searching for trends in the composition of exoplanet atmospheres (e.g., Madhusudhan et al., 2014; Pinhas et al., 2018; Fisher & Heng, 2018; Welbanks et al., 2019). Fundamentally, the conclusions drawn from exoplanetary atmospheric retrievals always depend on a comparison between competing models.
However, the inferences and conclusions made using these techniques can be conflicting. For instance, while a chemical species may be inferred using a specific data set and modeling strategy (e.g., Sedaghati et al., 2017; Chen et al., 2017), observations of the same planet obtained with different facilities and interpreted with different models, may not lead to the same conclusions (e.g., Espinoza et al., 2019; Spake et al., 2021). Likewise, the resulting best-fit solutions can sometimes be incompatible with the observations according to some goodness-of-fit metrics (e.g., MacDonald & Madhusudhan, 2019; Colón et al., 2020). Interpretations of the same observations using different models and model fitting techniques have also led to opposing conclusions (e.g., Sing et al., 2016; Barstow et al., 2017; Pinhas et al., 2019). Overall, these apparent disagreements highlight the need for a thorough understanding of the limits of our models and the sensitivity of our inferences on the spectroscopic observations used.
It is desirable to have a method to assess the robustness of model inferences and their sensitivity to individual observations. However, existing model assessment metrics such as the Bayesian evidence, , p-value, or Bayesian Information Criterion (BIC), provide a single global summary value for the performance of the model conditioned on the entire dataset (e.g., Madhusudhan & Seager, 2009; Benneke & Seager, 2013; Colón et al., 2020; Alderson et al., 2022), making it difficult to shed light on which specific observations affect the model performance. Furthermore, these commonly used metrics can be hard to interpret as their assessment of the model performance may be at odds with each other (e.g., Colón et al., 2020; Lewis et al., 2020; Sheppard et al., 2021). Besides these metrics, some studies have used an information content based approach (i.e., quantifying the change in the state of knowledge by making a measurement, e.g., Line et al., 2012; Batalha & Line, 2017) to estimate the wavelengths and the precisions required to better inform the parameters in atmospheric models of transmission spectra. However, these efforts have been limited to simple atmospheric models with a small number of parameters. Additionally, the information content based approach has yet to be successfully implemented as a tool for model comparison and model assessment in atmospheric retrievals of transmission spectra.
There is a need for complementary model performance metrics that can assess performance both at the dataset and individual data point level. These metrics must take into account the sources of uncertainty in our models, and be generally applicable to non-linear atmospheric models of increasing complexity. Crucially, for their successful application to atmospheric retrievals, these metrics must be computationally inexpensive. The advent of such tools is paramount upon the imminent data release from the James Webb Space Telescope (JWST) at wavelengths inaccessible by existing facilities and at unprecedented precisions (e.g., Greene et al., 2016). Such a model performance metric is also key to ensure reliable and robust inferences and conclusions from current and upcoming observations.
In this paper, we explore the application of the expected log pointwise predictive density estimated by Bayesian leave-one-out cross-validation (elpd) as an interpretable model selection, comparison, and criticism tool. While elpd is starting to be applied in some areas of astronomy (e.g., Morris et al., 2021; Meier Valdés et al., 2022; McGill et al., 2022; Neil et al., 2022), has yet to be used in the context of atmospheric retrievals of transmission spectra. elpd (e.g., Vehtari & Ojanen, 2012) is a metric that quantifies the out of sample predictive accuracy of a model, e.g., how well does the model predict unseen data. This Bayesian metric can provide an indication of model performance at the data point or dataset level. Besides single-model assessment, elpd is a powerful tool for model comparison as it can highlight the relative performance of two models at the data point resolution. Using the Pareto Smoothed Importance Sampling (PSIS; Vehtari et al., 2015) approximation for elpd we demonstrate that this metric is computationally feasible for exoplanet atmospheric models.
We begin by summarizing the method of atmospheric retrievals as well as the advantages and disadvantages of the statistical metrics currently employed for model selection in Section 2. In Section 3 we introduce the concept of Bayesian cross validation and the algorithm to compute elpd. We explain our atmospheric retrieval setup in Section 4 and in Section 5 we demonstrate the power of elpd in atmospheric retrievals of transmission spectra by testing elpd on synthetic observations of a hot Jupiter. Then, we use elpd to assess the robustness of recent inferences of the hydrogen anion H- in the atmosphere of the hot Jupiter HAT-P-41b in Section 6. We conclude by summarizing our findings and exploring future research directions in Section 7.
2 Model Assessment in Exoplanet Atmospheric Retrievals
Atmospheric retrieval frameworks are powerful tools to infer the atmospheric properties of an exoplanet from its spectrum. At their core these tools couple an atmospheric model describing the physical state of the planetary atmosphere, a prior distribution over the possible values of the model parameters, and a parameter estimation scheme. Although a wide range of parameter estimation schemes have been used (see e.g., Madhusudhan, 2018, for a review), nested sampling (e.g., Skilling, 2006; Feroz et al., 2009; Buchner et al., 2014) prevails in the literature.
2.1 Frequentist Hypothesis Testing
Fit quality metrics like and corresponding p-values are commonly used in retrieval studies to assess whether or not a model adequately explains the data (e.g., Colón et al., 2020; Lewis et al., 2020; Sheppard et al., 2021; Alderson et al., 2022). Studies generally compute the reduced , = where is the degrees of freedom, and values greater than one are considered “poor” fits, while values smaller than one are considered an overfit (Andrae et al., 2010). Similarly, the p-value is interpreted as the probability that random noise can result in the observed data given that the model being tested is true, with some studies considering p-values indicative of “poor” fits (e.g., Colón et al., 2020).
However, these fit quality metrics have two main pitfalls within the context of atmospheric retrievals. First, the atmospheric models used in exoplanetary retrievals are not linear in their parameters. As a result, estimating the true number of degrees of freedom of a given model to compute is not possible. Second, current studies ignore the uncertainty in the estimate of which in practice can be difficult to estimate. These limitations affect estimates of the p-value as this metric relies on the assumption that statistics are appropriate for the model being tested. Furthermore, the p-value penalizes complex models with a large number of parameters, likely resulting in a preference for simpler and possibly incomplete models.
2.2 Bayesian Hypothesis Testing
More recently, model comparison/selection using the Bayesian Evidence111The Bayesian Information Criterion (BIC), which is an approximation to the Bayesian Evidence, has also been used in the field to perform model comparison/selection (e.g., Sotzen et al., 2020; Spake et al., 2021; Alderson et al., 2022)(e.g., Benneke & Seager, 2013; Sedaghati et al., 2017; Welbanks & Madhusudhan, 2021a) has become customary due to nested sampling techniques which permit its computation (e.g., Skilling, 2006; Feroz et al., 2009; Buchner et al., 2014). This metric is computed by taking the ratio of evidences (also know as the Bayes’ factor) between two models. Additionally, the difference in evidence between models has been mapped to a commonly used ‘sigma’ scale and introduced to the field of exoplanetary retrievals by Benneke & Seager (2013). Besides its straightforward computation, Bayesian model evidence has the appealing property of penalizing model complexity that is not supported by the data effectively applying Occam’s razor to model selection.
Bayesian metrics allow the incorporation of prior information about a parameter when calculating the probability distribution of interest. Under pathological circumstances, this key feature could be misused by a practitioner. For instance, a different choice of prior, e.g., to one more or less informative, will result in a different Bayesian evidence due to the associated change in prior space, and may lead to a different interpretation regarding the model preference. This presents a challenge in the context of exoplanet atmosphere because commonly, the priors for the model parameters (e.g., chemical abundances, P-T profile parameters, etc.) have somewhat arbitrary boundaries and vary significantly different between studies. For instance, changing the priors on a given chemical species (e.g., H2O) from uniform in log-space (e.g., -1 to -12, see e.g., Welbanks & Madhusudhan, 2019) to a prior motivated by the species’ thermo-chemical equilibrium expectations (e.g., logarithmic from -2 to -4) would significantly reduce the prior space and could result in a very different model evidence and associated interpretation, without necessarily changing any posterior inferences. Therefore, improper use of these Bayesian metrics through e.g., inappropriate prior choices, can propagate through to the Bayes factor, and potentially affect the reported detection significance.
3 Leave-One-Out Cross-Validation
The model assessment metrics described in Section 2 provide a single value shedding little light on exactly where and how each model is failing or performing well. From this single number, it is non-trivial to estimate whether the models would perform better or worse at other wavelengths, or to estimate which parts of the data are driving our inferences. To overcome these challenges, we propose the application of Bayesian Leave-one-out cross-validation to compute the expected log pointwise predictive density (elpd) to aid interpretation of atmospheric retrievals. In the rest of this section we closely follow Vehtari et al. (2017), and outline the theory behind elpd.
Consider an atmospheric model , which is being fit to a spectroscopic data set . Here is the set of spectroscopic data points where, comprised of a transit depth (), and transit depth error () pair, at a wavelength . Then, using Bayes’ theorem the posterior distribution for each of the model parameters () for a given model () is
(1) |
Here, is the likelihood and represents the probability of observing the data () given a specific set of model parameters (). The likelihood is multiplied by the prior distribution and divided by the Bayesian model evidence . A retrieval uses the likelihood and prior to compute estimates for the Bayesian model evidence and posterior distributions for the model’s parameters.
To complement existing model selection and criticism, we introduce the leave-one-out expected log posterior predictive density elpd obtained by Bayesian cross-validation. Bayesian leave-one-out cross-validation is one way to estimate the out of sample predictive accuracy of a model (i.e., the expected log pointwise predictive density, see e.g., Gelman et al., 2014, for a review). Bayesian leave-one-out cross-validation works by fitting a given model to , the full data set () with the th data point, , left out. The posterior distribution of the fit it used to assess how well the left out data point is predicted. This procedure is then repeated for each data point in turn. Ultimately, this allows each data point to be scored on how well it is predicted by a given model conditioned on the rest of the dataset. These data point scores can then be compared pairwise between competing models or totaled over the data set to give an indication of overall model performance. The elpd for is the expected log posterior predictive density,
(2) |
The posterior predictive density can be calculated by taking the expectation of the likelihood of the with respect to the posterior of the model fitted to ,
(3) |
The elpd score quantifies how well the model fit to the data (i.e., trained model under the chosen prior), performs on unseen data where each data point is left out in turn. On the other hand, the Bayesian evidence quantifies the probability of the data agreeing with the specified model under the chosen prior. Therefore, they are fundamentally different quantities that when used together can enrich our understanding of our models and their ability to explain the data.
In practice, if we have samples from the posterior distribution obtained by fitting the model to (), we can estimate the posterior predicted density of with,
(4) |
All data point scores can be totaled over the full data set to give a indication of overall model performance,
(5) |
Two competing model’s overall performance can be computed by taking the difference in elpd score with,
(6) |
Here, a positive would mean that has the better out of sample predictive performance.
In principle, calculating the exact elpd score would require fitting the model to infinite data. Therefore, the assumption here is that the data points used to calculate the elpd score come from a larger population making it possible to calculate their standard error (SE). The SE of the estimated elpd score is
(7) |
while the SE for the difference between two scores is given by,
(8) |
Here, is the variance operator.
Naive computation of all terms for a given model would be computationally expensive. This is because each term requires a full Bayesian refit of the model to obtain posterior samples. Overall, full Bayesian fits would be required. In the case of exoplanet atmospheric modeling, and one Bayesian refit, typically performed with nested sampling, can take Central Processing Unit (CPU) hours (e.g., Zhang et al., 2020). This required computation is clearly prohibitive, therefore, we have to turn to the Pareto Smoothed Importance Sampling (PSIS; Vehtari et al., 2015) approximation outlined in (Vehtari et al., 2017) to calculate 222We used the code available at https://github.com/avehtari/PSIS for the PSIS approximation. The relevant PSIS methods are also available in the Arviz python package (Kumar et al., 2019)..
We defer a detailed explanation of the PSIS approximation to Appendix A.1, but briefly outline the method here following Vehtari et al. (2015) and Vehtari et al. (2017). PSIS allows us to approximately compute all for a model without having to refit the model times. When using the PSIS approximation, the model is fit to the whole data once (e.g., one retrieval) and the posterior samples from this single inference are re-weighted using importance sampling to approximate the effect of leaving each data point out in turn. Instead of raw importance weights being using in the approximation, the distribution of importance weights is fitted with a Pareto distribution (Zhang & Stephens, 2009) and a set of smoothed importance weights is drawn from this distribution and used. This is called PSIS and it serves two purposes. Firstly, PSIS acts to stabilize the importance sampling approximation by removing extreme weights that can dominate and make the importance sampling approximation unreliable (Vehtari et al., 2015). Secondly, the fitted shape parameter of the Pareto distribution, , traces the number of finite moments of the importance weights distribution and acts as diagnostics on the reliability of the importance sampling approximation. For LOO, identifies cases when the posterior distribution changes significantly upon removing a data point and, therefore, the importance sampling approximation would be unreliable and should not be used. Vehtari et al. (2015) determined empirically that PSIS estimates were reliable for . In the case of LOO, the PSIS approximation is used for each data point where . For data points where , the PSIS approximation is likely to be unreliable, so a full refit of the model is performed and used to calculate . Overall, this allows all terms to be computed with less than refits of the model. For brevity, in what follows .
3.1 Limitations and other considerations
Bayesian leave-one-out cross-validation presents an additional tool for model criticism and comparison within the context of atmospheric retrievals. This tool allows us to compare two models on a per-data-point scale, but as with any other model comparison metric, it cannot be used in isolation to argue definitively for or against a detection of a spectral feature in an exoplanet atmosphere. The reliability of an inference can only be robustly established by considering multiple model comparison and criticism tools and metrics.
For instance, does not penalize for model complexity. Therefore, when trying to assess two models with varying degrees of complexity, Bayesian leave-one-out cross-validation could be used alongside metrics that have a built in Occam’s razor such as the Bayesian evidence. Additionally, as is the case for existing analysis using the Bayesian evidence, when the set of possible models is large, model comparison with may result in a combinatorial explosion of possible pairwise combinations.
Additionally, the difference in score (Equation 6) between different pairs of models can be difficult to interpret and has not been commonly mapped to a ‘sigma’ scale as other model comparision metrics have. Moreover, the standard error for the difference between two scores (Equation 8) comes from the consideration that we are estimating the LOO predictive accuracy of a model using a finite sample of data points (see e.g., Vehtari et al., 2017). As such, the standard error estimate may be inaccurate if the number of observations is limited (see e.g., Sivula et al., 2020, for a discussion on the uncertainty in elpd estimates). Therefore, we caution against using the difference of scores in units of standard errors as a proxy for sigma significance, but it can be useful to visualize the relative performance of different models. We further explore this and other future considerations in Section 7.2.
4 Atmospheric Retrieval Setup
First, we demonstrate the use of elpd on synthetic HST STIS and HST WFC3 observations, the common observational setup in the pre-JWST era, to obtain intuition for the information that this metric can provide as well as demonstrate its conformance with prior expectations. Then, we apply this tool to the interpretation of current observations of the transiting hot Jupiter HAT-P-41b (Hartman et al., 2012) obtained with HST WFC3 UVIS G280, HST WFC3 G141, and Spitzer IRAC from Wakeford et al. (2020) and Lewis et al. (2020) spanning the wavelength range 0.2 to 5.0 m. Particularly, we assess the previous detection of H- absorption in the planet’s atmosphere – a result in significant discrepancy with thermo-chemical expectations.
The observations are interpreted using Aurora (Welbanks & Madhusudhan, 2021a) with the atmospheric forward model following the standard prescription in Welbanks & Madhusudhan (2019); Welbanks et al. (2019) and Welbanks & Madhusudhan (2021a). The model atmosphere is divided into 100 layers equally spaced logarithmically from 10-6- bar. The Bayesian inference is performed using the nested sampling algorithm (Skilling, 2004) through MultiNest (Feroz et al., 2009) and the Python package PyMultiNest (Buchner et al., 2014).
4.1 Synthetic Observations
We generate synthetic HST-STIS and HST-WFC3 synthetic observations assuming spectral resolutions and precisions comparable to existing observations (e.g., Sing et al., 2016), generally following the procedure described in Welbanks & Madhusudhan (2021b). First, we generate spectra at a higher resolution of on average, from – m sampled line-by-line, and solving radiative transfer in transmission geometry. The resulting spectra is used to produce a binned spectrum following the methods outlined in Pinhas et al. (2018). The resulting binned spectra has a resolution and precision of and ppm, respectively, for HST-STIS, and and ppm for HST-WFC3. The synthetic observations include Gaussian scatter.
The synthetic observations are generated for a planet with the same system parameters as HD 209458b (e.g., , , and Torres et al., 2008). The model generating the synthetic data assumes a clear, H2-rich, isothermal atmosphere at the equilibrium temperature of the planet , with solar abundances of H2O, Na, and K (i.e., , , , e.g., Asplund et al., 2009). We assume a reference pressure for the planetary radius of 10 bar.
4.2 Atmospheric Model
Our atmospheric model computes radiative transfer in a parallel-plane planet atmosphere in transmission geometry in hydrostatic equilibrium. The chemical absorbers considered in this work are H2O (Rothman et al., 2010), Na (Allard et al., 2019), K (Allard et al., 2016), CrH (Bauschlicher et al., 2001), AlO (Patrascu et al., 2015), VO (McKemmish et al., 2016), bound-free H- (John, 1988), and H2-H2 and H2-He collision induced absorption (CIA; Richard et al., 2012).
First, to test the implementation of elpd, we use a similar model to the one that generated the synthetic observations. Namely, the atmospheric model considers absorption due to H2O, Na, and K, with the volume mixing ratio of each species as a free parameter. The Pressure-Temperature (P-T) profile is assumed to be isothermal, with 1 free parameter for the atmospheric temperature and 1 free parameter for the reference pressure () at the assumed planet radius of . Instead of assuming a clear atmosphere, the model considers the presence of inhomogeneous clouds and hazes following the generalized parameterization in Welbanks & Madhusudhan (2021a) using one fraction for the combined effects of clouds and hazes. In total, this model has 9 free parameters: 3 chemical species, 1 isothermal temperature, 1 reference pressure, and 4 parameters for inhomogeneous clouds and hazes.
Second, we implement the same atmospheric model used by Lewis et al. (2020) to interpret the HAT-P-41b observations and used to infer a detection of H- (i.e., their “minimal” model setup). This atmospheric model has an isothermal, clear atmosphere with absorption due to H2O, Na, CrH, AlO, VO, and H-. Following Lewis et al. (2020), we retrieve the planetary radius for a reference pressure at 10 bar. In total, this model has 8 parameters: 6 chemical species, 1 for the isothermal temperature of the atmosphere, and 1 for . We follow the retrieval setup from Lewis et al. (2020) to enable a direct comparison with their results suggesting the presence of significant H- opacity in the atmosphere of HAT-P-41b.
5 Interrogating a known model

Before applying elpd to real spectroscopic data, we first build intuition by computing this metric on the synthetic data described in Section 4.1. In this case we know the true underlying model that generated the data. By knowing the true model and the physical properties it represents, we can sense check the elpd score in the context of exoplanet atmospheric models, and build intuition on how to apply it to real data sets. In this section we use the atmospheric model explained in Section 4.2 considering 3 chemical absorbers, an isothermal atmosphere, an the possibility of inhomogeneous clouds and hazes. In what follows we compute the elpd score using the PSIS approximation and found it to be accurate and significantly faster than the naive method of computing elpd saving CPU hours of computation (see Appendix A.2).
As commonly performed in the exoplanetary atmospheric retrieval literature, we compare a ‘reference model’ with the highest degree of complexity, to a series of less complex models for which one or more parameters, or model components, have been removed. Here, our ‘reference model’ is described in Section 4.2 and is similar to the model that generated the synthetic observations. Namely, our reference model considers absorption due to H2O, Na, K, an isothermal P-T profile, and the possibility of inhomogeneous clouds and hazes. Then, we compare this reference model to four other models: 1) same as reference model but without H2O absorption, 2) same as reference model but without Na absorption, 3) same as reference model but without K absorption, and 4) same as reference model but without the presence of inhomogenous clouds and hazes.
Performing the model comparison using the difference of the Bayesian evidence and converting it to a ‘detection significance’ as commonly performed in the literature (e.g., Benneke & Seager, 2013; MacDonald & Madhusudhan, 2017; Welbanks & Madhusudhan, 2021a) results in a model preference for H2O of , Na of , K of , while the inclusion of inhomogeneous clouds in the model is not preferred at a level. This comparison of the Bayesian evidence supports the known input of the model generating the data: the presence of H2O, Na, and K. Additionally, the Bayesian evidence comparison indicates that the additional model complexity associated with including the presence of inhomogeneous clouds and hazes is not supported by the data.
Despite the three chemical components being present in the true model generating the data, their associated ‘detection significances’ are substantially different. This may be due to the fact that a large number of data points cover the broad H2O absorption features while the main features due to Na and K are only encompassed by a few spectral bins. However, this information is not provided by the model comparison metric by itself. Additionally, the interpretation of a ‘detection significance’ higher than (i.e., a probability of 1-7, e.g., MacDonald & Madhusudhan, 2019) is difficult as it represents the relative preference between two models which may be equally incorrect. Finally, the difference in the Bayesian evidence does not provide any information at the data-point (or spectral feature) resolution about the strengths or deficiencies of the model.

Detailed information about the model performance at the per-data-point level can be teased out using the elpd metric. Figure 1 shows the retrieved median model and the confidence intervals of the retrieval using the reference model on the synthetic observations. The synthetic observations have been color coded by their elpd score. The darker blue spectral points in Figure 1 have a lower elpd score than the bright yellow points, and are comparatively worse predicted by the atmospheric model, when left out of the fit. For instance, the elpd score suggests that the spectral point at , which has the lowest elpd score of , is the worst explained by the model as shown by the data fit, where it is far away from the model.
The primary utility of elpd lies in model comparison. As such, Figure 2 shows the per-data-point difference in elpd score (elpd score) between the reference model and the less complex models used333The elpd scores for all models are calculated using the PSIS approximation. The Pareto values for all data points for all models are indicating that the approximation is reliable.. Additionally, Figure 2 shows the retrieved confidence interval for each of the models compared. The top left panel, showing the comparison of the models including and not including H2O absorption, has the highest difference in elpd scores. The magnitude of the difference is then followed by the models comparing the presence of Na, then K, and finally the presence of inhomogeneous clouds and hazes.

In agreement with intuition, the top left panel of Figure 2 shows that the spectral signatures of H2O in the near-infrared are responsible for the better performance of our model in the HST-WFC3 region () of our synthetic observations. Similarly, the top right and bottom left panels of Figure 2 show that the models with Na and K absorption improve the model performance near m and m, respectively. Here, and unlike the H2O case for which the largest difference in elpd scores closely follow the influence of the broadband molecular features, the largest difference in elpd score follows the points near the sharp Na and K doublets. Finally, the presence of inhomogeneous clouds and hazes do not improve the model performance significantly and have the smallest difference in elpd scores per data point of all models considered.
The difference in elpd scores as shown in Figure 2 allows us to diagnose the performance of the models on a per-data-point level and understand the regions of the data where our model may be under-performing relative to other models. A smaller difference in elpd scores indicates a similar performance between the models considered. Additionally, this metric allows us to infer which subset of the data are responsible for the preference of one model over the other, e.g., infrared HST-WFC3 observations are responsible for the preference of models with H2O absorption.
Model | DOF | p-value | Log Evidence | DS | BIC | elpd Score | elpd Score (SE) | |
---|---|---|---|---|---|---|---|---|
Reference | Ref. | Ref. | ||||||
No H2O | ||||||||
No Na | ||||||||
No K | ||||||||
Clear atmosphere | N/A |
Note. — DS means Detection Significance. The DS is defined for a pair of models with a ratio of evidences (e.g., Bayes factor) (e.g., Trotta, 2008). N/A means that the DS is not defined since the reference model has a smaller evidence than the model being tested, e.g., B1. The clear atmosphere model is preferred over the reference model at . The p-value for the model without H2O is too small and tends to zero.
Finally, we can compare the difference in elpd scores between two or more pairs of models. Figure 3 shows the total difference in elpd scores between the reference model and the less complex models in units of their standard error (i.e., Equations 6 normalized by Equation 8). Figure 3 shows that the model component responsible for the largest increase in the predictive performance of our reference model is H2O absorption. Relative to no improvement in model performance (i.e., a elpd difference of zero), the inclusion of H2O absorption improves the out of sample predictive performance of the model by standard errors. This improvement is followed by Na ( standard errors) and K ( standard errors). On the other hand, the inclusion of inhomogeneous clouds and hazes decreased the predictive performance of the model at standard errors.
Table 1 summarizes these results alongside other common model comparison metrics for our analysis of these synthetic observations. Despite the similarities between the Log Evidence and the elpd score, or the elpd score divided by its standard error and the ‘detection significance’ in –units, we caution against interpreting them as interchangeable quantities as explained in Section 3. Importantly, we compute the standard error (i.e., the uncertainty) of the reported elpd score, unlike the detection significance which is rarely quantified by its numerical uncertainty444By numerical uncertainty, we mean the uncertainty in the model evidence due to the algorithm used to estimate this quantity, e.g., the log evidence uncertainty as determined by the nested sampling output. Accounting for these numerical uncertainties can change the reported detection significance by up to in some cases.(see Section 3).
The information provided by the elpd metric for this atmospheric model for which we know the true input agrees with our expectations. Using this metric enables us to identify deficiencies in the models as a function of wavelength, e.g., the lack of H2O absorption fails to explain the observations in the near-infrared. Additionally, it provides us a way to quantify the improvement in our models by quantifying the out of sample predictive performance of each model and comparing them. This demonstration on a well understood atmospheric scenario showcases the power of this metric to provide us information at the per-data-point level and highlight the strengths and deficiencies of our models. Besides its application to well understood atmospheric scenarios, elpd can help us diagnose the impact of less understood model considerations such as planet inhomogeneities (e.g., clouds, hazes, and multi-dimensional effects), chemical detections of species with unclear spectroscopic features (e.g., metal hydrides in the optical, or H- opacity), and the effects of instrument systematics (e.g., correlated and non-Gaussian noise).
6 The case of H- in HAT-P-41b
Having built intuition for the elpd metric, we proceed to implement it to the transmission spectrum of the hot Jupiter HAT-P-41b spanning the UV to infrared wavelengths from 0.2-5.0 m. HAT-P-41b is the first hot Jupiter for which significant H- absorption has been invoked as the most robust explanation for its transmission spectrum by Lewis et al. (2020). With a retrieved H- abundance several orders of magnitude larger than thermochemical equilibrium expectations (e.g., Kitzmann et al., 2018; Parmentier et al., 2018), understanding the robustness and reliability of this inference is paramount. By implementing the elpd metric to the same data and atmospheric setup used to infer the presence of H-, we seek to identify which data points drive this possible detection.
We perform a retrieval following the “minimal” model in Lewis et al. (2020), as explained in Section 4.2, as this is the setup for which H- was inferred at a level. We perform the retrieval on the combination of HST WFC3 G141, Spitzer IRAC observations and the HST UVIS G280 data from the jitter decorrelation or systematic marginalization treatments in Wakeford et al. (2020). We compare our retrieved H- abundances and isothermal temperatures to those from Lewis et al. (2020) for the same data and model configurations. Furthermore, we compare our derived and difference in Bayesian evidence, translated to a ‘detection significance’ using the formalism in Benneke & Seager (2013) as explained in Welbanks & Madhusudhan (2021a), to those reported in Lewis et al. (2020).
When using the UVIS jitter decorrelation data, our retrieval does not find evidence of H- absorption (). Our derived ‘detection significance’ of 1.6 is lower by more than to the reported value of 2.9 by Lewis et al. (2020). Nonetheless, the retrieved and isothermal temperature of T K are consistent with the derived properties by Lewis et al. (2020). Additionally, our is very close to in Lewis et al. (2020). The agreement in retrieved properties and alongside a disagreement in associated ‘detection significance’ may be the result of dissimilar priors and subtle model assumptions between our work and those employed in Lewis et al. (2020).
On the other hand, when considering the systematic marginalization HST UVIS G280 observations our retrieval finds a 2.7 model preference for H- absorption, comparable to the 2.6 for the same model and data from Lewis et al. (2020). This configuration retrieves for an isothermal temperature of T K consistent with Lewis et al. (2020). Likewise, the derived is comparable to the value of from Lewis et al. (2020). This configuration reproduces the weak-to-moderate evidence of H- in HAT-P-41b.
With the reproduction of the weak-to-moderate detection of H- in HAT-P-41b in mind, we proceed to calculate the approximated PSIS elpd values, as explained in Section 3, to quantify out of sample predictive performance for the reference model relative to the model without H- absorption. Figure 4 shows the retrieved transmission spectrum and confidence intervals for this retrieval, with the data points color coded by their score (Equation 6) between the reference model including H- absorption and the model without H- absorption. Points with a large positive value (redder points) are better explained by the reference model with H- (i.e., the reference model has a better out of sample predictive performance), while the large negative values (bluer points) are better explained by the model without H- absorption (i.e., the reference model has a worse out of sample predictive performance).

The data point with the largest in Figure 4 is the Spitzer IRAC point at 3.6 m, while the point with the lowest is at m from HST WFC3 G141. Of the four points with the largest , two are from the HST WFC3 G141 observations (i.e., points at m and m) covering wavelengths at which the spectral contribution of H- may be expected (see e.g., Figure 7 in Lewis et al., 2020), while two fall outside of this spectral range (i.e., points at m and m).
The Spitzer IRAC point at 3.6 m, highlighted by an arrow in Figure 4, is the point with highest score. To assess the robustness of the H- detection in HAT-P-41b, we perform an additional retrieval with the same “minimal” model as before but without the data point with the highest . We find that without the Spitzer IRAC point at 3.6 m the reference model with H- absorption is not preferred () over the model without H-. This means that the 2.7 ‘detection significance’ previously quoted depends on the transit depth at 3.6 m being reliable. Unsurprisingly, if both Spitzer photometric points are removed from the retrieval (e.g., only HST UVIS G280 and HST WFC3 G141 observations are considered) we do not find a preference for H- absorption.
When the Spitzer IRAC point at 3.6 m is not considered in the retrieval, the retrieved abundance of H- is and the retrieved isothermal temperature is T K. Additionally, the derived is comparable to that of Lewis et al. (2020). Overall, while the inferred properties of HAT-P-41b remain largely consistent, removing this single transit depth results in the non-detection of H- absorption in the atmosphere of HAT-P-41b using current observations.
Besides the presence of H-, absorption due to H2O has been previously inferred or detected in HAT-P-41b (e.g., Tsiaras et al., 2018; Fisher & Heng, 2018; Lewis et al., 2020; Sheppard et al., 2021). Specifically, Lewis et al. (2020) find a detection of H2O for their HST WFC3/UVIS, WFC3 G141, and Spitzer IRAC observations, considered in this work, although the model configuration associated with this detection is unclear. For the same set of observations including the systematic marginalization HST UVIS G280, and the “minimal” model considered above, our retrieval finds a model preference for H2O absorption at , similar to that reported by Lewis et al. (2020).
Figure 7 in the Appendix shows the for the reference model including H2O absorption and the model without H2O. Following our intuition, the points with the largest positive scores, i.e., those better explained by the presence of H2O absorption, fall within the wavelength range of the HST WFC3 G141 observations (m) where H2O has a prominent absorption feature. The point at m has the largest score, almost the highest score for the H- model comparison.

We sum all scores over the full data set and calculate the standard error of this difference (Equation 8) to quantify the overall performance of each of the models considered. Figure 5 shows the overall score for each of the models considered divided by their calculated standard error. As before, a larger and positive means that adding the model component, e.g., H2O or H- absorption, increased the predictive performance of the model. We find that including H2O absorption increases the predictive performance of the model by 1.7 standard errors, while including H- absorption increases the predictive performance by 1.0 standard error. This for H- suggests that H- absorption does not significantly improve the predictive performance of the model.
Overall, implementing elpd as a model performance metric on retrievals of transmission spectrum of HAT-P-41b provides complementary per-data-point information, enabling further scrutiny on the possible detection of H- in the atmosphere of the planet. Particularly, elpd indicates that the weak/moderate () detection of H- is contingent on the transit depth at a single data point (i.e., 3.6 m). Additionally, the inclusion of H- on the atmospheric models does not significantly improve the predictive performance of the model. Complementing retrieval studies with elpd analysis, as performed here for the UV to IR spectrum of HAT-P-41b, can provide key information to investigate the robustness of our findings, understand the limits of the data, and contextualize our inferences by the observations.
7 Summary and Discussion
Understanding the limitations of our data and our models has become a key need for the robust and reliable interpretation of spectroscopic observations of exoplanet atmospheres. Particularly, as the resolution and precision of our observations improves in the upcoming decade, and atmospheric models include previously neglected physical effects, there is the need for diagnostic metrics that can provide insights into strength and weakness of a model. To that effect we have incorporated elpd to exoplanet atmospheric retrievals.
Unlike conventional model assessment metrics employed in atmospheric retrievals such as the Bayesian evidence or , elpd provides an assessment of the model performance at the per-data-point level as well as for the model over the entire data set. At its core, elpd provides an estimate for the ability of a model to predict (i.e., explain) each left out data point. By understanding which data points are better explained by each of the models considered, we can quantify the dependency of our inferences on the observations obtained. In other words, elpd provides the tools to contextualize the atmospheric inferences and detections as a function of the observations. We briefly summarize the main advantages of this tool:
-
•
elpd allows each data point to be scored on how well it is predicted by a given model when it is left out of the model fit. This gives us information about the model performance at the per-data-point level.
-
•
The elpd score can also be totaled over the entire dataset, giving an estimate of the overall out of sample predictive performance of the model.
-
•
The elpd score between different models is an additional metric to assess the need for using more complex models over simpler models with individual components removed.
-
•
elpd helps assess the sensitivity of resulting inferences on specific data points in the dataset.
As with any other model comparison metric, elpd has drawbacks and areas that need to be further explored within the context of atmospheric retrievals. These include:
-
•
elpd as a metric does not penalize model complexity. Therefore, when assessing models of increasing complexity, elpd should be used alongside metrics like the Bayesian evidence which do penalize model complexity.
-
•
elpd score comparison may result in a large number of pairs of models being examined. This could lead, in principle, to a combinatorial explosion of possible pairwise model comparisons. This problem is similar to those of other metrics such as the Bayesian evidence.
-
•
elpd should be used alongside existing model assessment and comparison metrics. Together these metrics can provide a holistic picture of the reliability of the data-model interpretation.
7.1 Is there H- in the Atmosphere of HAT-P-41b?
As we approach the era of JWST observations, covering wavelengths never probed before with unprecedented precision, it is paramount to determine what constitutes a robust and reliable detection. To date, there is still debate as of what constitutes a clear atmospheric signature, especially for chemical species or atmospheric processes with uncertain spectroscopic signatures. For example, a source of continuum opacity such as H- does not provide strong spectral features as other prevalent species in exoplanet atmospheres such as H2O, Na, or K. Therefore, previous studies have resorted to calculating the difference in Bayesian evidence for models with and without a model component to quantify its presence, a common practice in the field.
Indeed, as explained in Section 6, Lewis et al. (2020) report the tantalizing need for significant H- absorption in the atmosphere of HAT-P-41b to best explain its transmission spectrum. If confirmed, this would suggest strong chemical disequilibrium in the atmosphere of the planet and would imply the need for new atmospheric models capable of capturing more complex physical effects such as photochemical processes. On the other hand, if this suggestion is invalidated this would highlight the challenges in understanding the limitations of our models and our data. Below we contextualize the additional information obtained by using elpd on the transmission spectrum of HAT-P-41b.
Our retrieval analysis in Section 6 can reproduce the weak-to-moderate detection () of H- in HAT-P-41b using the same “minimal” model and the HST WFC3/UVIS G280 with systematic marginalization, HST WFC3 G141, and Spitzer IRAC observations as Lewis et al. (2020), by means of Bayesian evidence comparison. However, when using the jitter decorrelation HST WFC3/UVIS G280 and the same “minimal” model we obtain a non-significant () preference for H- absorption. Our analysis suggests that the presence of H- is only supported by the combination using the systematic marginalization data.
Our elpd analysis on the retrieval configuration that supports the results from Lewis et al. (2020) provides two main insights. First, the best explained data point by the model including H- absorption is at a wavelength at which signatures of H- contribution are not strong. If the data point with the highest score is removed from the retrieval analysis, the presence of H- absorption is no longer preferred by our models. Second, the increase in predictive performance of the models with H-, i.e., how much better does the model with H- explain the observations relative to the model without H-, is not significant ( SE). In comparison the presence of H2O absorption, the other species inferred/detected by several studies (e.g., Tsiaras et al., 2018; Fisher & Heng, 2018; Lewis et al., 2020; Sheppard et al., 2021), is detected at , better explains the observations at wavelengths where H2O absorption is expected, and increases the predictive performance of the models more significantly ( SE).
The sensitivity of the H- detection to the Spitzer IRAC point at 3.6 m, may be the result of changes to the continuum level of the spectrum when H- absorption is included or not in the model. When included, the bound-free H- absorption provides a continuum in the optical and infared wavelengths (i.e.,m) that can help explain the HST UVIS and HST WFC3 G141 observations, but due to its decline in absorption at longer wavelengths allows for a lower transit depth in the 3.6 m datapoint. This opacity drop in the mid-infrared is not present when species other than H-, in combination with other model components (e.g., retrieved radius at 10 bar), are used to explain the HST UVIS and HST WFC3 G141 observations, resulting in a slightly higher transit depth at 3.6 m. The later is less compatible with the Spitzer IRAC 3.6 m observation. This level of model criticism is achievable for the first time thanks to the implementation of Bayesian leave-one-out cross validation.
Therefore, our analysis suggests that H- is weakly detected in the atmosphere of HAT-P-41b if systematic marginalization treatment of the HST WFC3/UVIS G280 is correct, and the transit depth of the Spitzer IRAC photometric point at 3.6 m is precise and accurate. Existing and future observations at these and complementary wavelengths can help verify this inference. For example, the retrieval analysis of Sheppard et al. (2021) using HST STIS observations (m) at similar wavelengths does not require significant H- absorption to explain the transmission spectrum of HAT-P-41b. Future studies could perform a consistent reanalysis of these HST STIS, HST WFC3/UVIS, HST WFC3 G141, and Spitzer IRAC observations to better assess the need for H- absorption in HAT-P-41b. Finally, future JWST observations with NIRSPEC or NIRCAM could provide important insights in the near-IR where the interaction of electrons and neutral H, not considered by Lewis et al. (2020) or this work, can have significant contributions to the observed spectrum.
7.2 Future Considerations
For the work in the paper, including the PSIS elpd approximation, we have operated under the assumption of additive white Gaussian noise for all models. Recent work by Ih & Kempton (2021) suggests that the presence of more complex and correlated noise of instrument or stellar origin could impact atmospheric parameter estimation. The methods in the paper can be generalized to general correlated Gaussian noise models (Sundararajan & Keerthi, 2001; Bürkner et al., 2020). A future direction of research would be to investigate correlated Gaussian noise models with PSIS-elpd using the Gaussian Process (Rasmussen & Williams, 2005; Ambikasaran et al., 2015; Foreman-Mackey et al., 2017; Foreman-Mackey, 2018) noise model implemented in Aurora (Welbanks & Madhusudhan, 2021a).
Further cross-validation schemes specific to exoplanet atmospheric data sets could also be developed in the future (see e.g., Bürkner et al., 2020, for an example of a cross-validation scheme appropriate for time-series data). In these schemes, instead of leaving out one data point all the data points coming from a single instrument could be left out. This particular cross-validation scheme could be used to diagnose systematic instrument effects. Moreover, a cross-validation scheme leaving out several data points clustered around a spectral feature could be explored to differentiate between localized signatures of specific chemical species, and non-localized effects like clouds and hazes.
Bayesian leave-one-out cross validation can help us determine the observational requirements (e.g., resolution, wavelength coverage, signal-to-noise) required for upcoming observational campaigns (e.g., upcoming HST and JWST GO Cycles) and the next generation of large mission concept studies and probe missions (e.g., National Academies of Sciences & Medicine, 2021). Particularly, computation of the elpd score using simulated observations can help us understand which observations will be decisive in detecting different atmospheric processes. Furthermore, PSIS (see Section 3 and Appendix A.1) can help us understand the sensitivity of the derived posterior distributions (e.g., the derived atmospheric abundances) to specific observations.
Finally, although we have demonstrated elpd on low-resolution transmission spectra, the methods introduced here are readily applicable to emission spectra and high-resolution studies. For instance, future studies may investigate the robustness of high-resolution inferences to individual lines or parts of the spectrum. Furthermore, future studies can use elpd to assess the need for additional observations in order to better constrain or confirm previous atmospheric inferences.
7.3 Concluding Remarks
We are about to embark into journey of discovery. The imminent high-precision infrared data from JWST (Greene et al., 2016), is bound to revolutionize our understanding of exoplanet atmospheres. These long-awaited observations may help us infer previously unseen chemical species, directly ascertain the 3D properties of exoplanets, and deduce disequilibrium processes that will build towards our search for life in other worlds. These radical discoveries can only be trusted if we acquire the ability to reliably extract the atmospheric properties from an observed spectrum with a thorough understanding of the limits of the data and models. In this context, elpd opens our eyes to a new dimension where we can clearly identify the specific data points upon which our inferences hinge. This new tool helps us revisit our data analysis and check for its robustness, identify the limitations of our models, and plan for the observations required to confirm our discoveries. The diverse and complementary set of tools to assess the robustness and reliability of our discoveries will help us navigate this exploration into the unknown.
Appendix A Pareto Smoothed Importance Sampling Approximation
A.1 Method
Due to the significant computational cost required to calculate naively for exoplanet atmospheric models and data sets, we use the Pareto Smoothed Importance Sampling (PSIS; Vehtari et al., 2015) approximation outlined in (Vehtari et al., 2017). The following sections closely follow Vehtari et al. (2015) and Vehtari et al. (2017).
Generally, the PSIS approximation works by instead of refitting the model times, the model is fit to the whole data set and then the posterior samples are re-weighted via importance sampling to estimate the effect of leaving each data point out. The model is only refit when the PSIS approximation is deemed poor. The procedure allows to be estimated with significantly less refits of the model and makes computing elpd feasible.
In general, importance sampling is a Monte Carlo method that allows the expectation of some hard to sample distribution () to be computed,
(A1) |
Since is hard to sample from, we wish to use samples from a different distribution () that is easy to sample from, or has samples readily available, to evaluate Equation A1. Providing the ratio of and is known up to some normalization for the samples from (), Equation A1 can be estimated with importance sampling as,
(A2) |
Here, is the ratio of and up to some normalization constant, and the samples from are being re-weighted by the importance weights .
The standard importance sampling approximation in Equation A1, however, is typically unstable due to extreme values of the importance weights dominating the approximation (Vehtari et al., 2015). To remedy this problem PSIS is used instead of standard importance sampling. In PSIS, the raw importance weights () in Equation A1 are replaced with a set of “smoothed” weights () which stabilize the approximation. The smoothed weights are calculated by fitting a Pareto distribution (Zhang & Stephens, 2009) to the tail of the raw weights and replacing the extreme raw weights with draws from the fitted Pareto distribution.
In addition to stabilizing the importance sampling approximation, PSIS provides a diagnostic on how well the importance sampling approximation will work. Essentially PSIS fails loudly. When the posterior distribution from fitting the full dataset is not a good proposal distribution (i.e., when the posterior distribution changes significantly upon removing the data point in question) the importance sampling will fail. The loud-failing diagnostic is the value of the shape parameter of the fitted Pareto distribution (Pareto ). The Pareto value traces the number of finite moments of the raw weights distribution and therefore the quality of the importance sampling approximation. Vehtari et al. (2015) empirically find that if , the PSIS approximation is likely to be reliable, whereas indicates the PSIS approximation is unreliable and should not be used.
To compute elpd approximately and quickly, PSIS is used in the following way (Vehtari et al., 2017). First the model is fit to the full data set. We then have samples readily available from the posterior distribution where in Equation A2. We wish to re-weight these samples using PSIS to approximate Equation 3. To use PSIS, we see by comparing Equations 3 and A2 that we need to compute the ratio,
(A3) | ||||
(A4) |
Here, the proportionality arises from Bayes theorem which states that the posterior equals the prior likelihood up to a normalization constant (i.e., the evidence). All models in this work have a white Gaussian noise model which factorizes over the data points which means the ratio simplifies to,
(A5) |
Putting everything together and denoting as one of samples from the full posterior we have,
(A6) |
For each of the left of data points the raw weights are calculated via Equation A5 and are then smoothed. If we deem PSIS reliable and use it to calculate via Equation A6. For each data point where smoothing results in we do not use the PSIS approximation. Instead we perform a full Baysian refit with the data point left out and evaluate exactly via Equation 4. Overall, this procedure allows the computation of all terms with only a small number of expensive refits of the model.
A.2 Validating the PSIS Approximation

In this section we assess the accuracy of the PSIS approximation for an exoplanet atmospheric model. Specifically, we use the atmospheric model explained in Section 4.2 considering 3 chemical absorbers, an isothermal atmosphere, an the possibility of inhomogeneous clouds and hazes. First we calculate the elpd score exactly by performing independent retrievals in which each individual spectroscopic data point is removed in turn and remains ‘unseen’ (i.e., refits of the model or retrievals). The resulting exact terms are evaluated for each data point using Equation 4. Then, we calculate the same terms using the PSIS approximation (Equation A6) The resulting approximated PSIS and exact are shown Figure 6, along with the Pareto values from the PSIS approximation.
Figure 6 shows the approximated PSIS elpd scores and exact elpd scores lie along a unity line and are in agreement with one another. The middle panel of Figure 6 shows the Pareto values for the approximation, with all of them below the empirical value of . As a result, the PSIS approximation only required one refit of the model compared to the refits required to compute all the elpd terms exactly, and still provided accurate results. Overall, the PSIS approximation was accurate and saved CPU hours of computation in this case, i.e., the PSIS approximation took of the computation of the exact calculation. For completeness, we include in the right panel of Figure 6 the Pareto values for the approximations employed in Section 6.
Appendix B Auxiliary retrieval information
Table 2 shows the priors used for the retrievals in Sections 5 and 6. Figure 7 shows the retrieved transmission spectrum of HAT-P-41b on the HST UVIS/G280 systematic marginalization, HST WFC3 G141, and Spitzer IRAC observations from Wakeford et al. (2020) and Lewis et al. (2020). The data points are color coded by their score between the reference “minimal” model with H2O absorption and the model without H2O absorption. A larger positive value (redder point) indicates that the point is better explained by the presence of H2O absorption (i.e., better out of sample predictive performance by the reference model) while a larger negative value (bluer point) indicates that the datum is better explained by the model without H2O absorption (i.e., worse out of sample predictive performance by the reference model). As expected, the points with the largest score lie within the HST WFC3 observations (e.g., m) where H2O has a prominent absorption feature.
Parameter | Prior Distribution | Synthetic Observations | HAT-P-41b Minimal Model |
---|---|---|---|
Log-uniform | – | – | |
Uniform | – K | – K | |
Uniform | N/A | – | |
Log-uniform | – bar | N/A | |
Log-uniform | – bar | N/A | |
Log-uniform | – | N/A | |
Uniform | – | N/A | |
Uniform | – | N/A |
Note. — N/A means that the parameter was not considered in the model by construction.

References
- Alderson et al. (2022) Alderson, L., Wakeford, H. R., MacDonald, R. J., et al. 2022, MNRAS, 512, 4185, doi: 10.1093/mnras/stac661
- Allard et al. (2016) Allard, N. F., Spiegelman, F., & Kielkopf, J. F. 2016, A&A, 589, A21, doi: 10.1051/0004-6361/201628270
- Allard et al. (2019) Allard, N. F., Spiegelman, F., Leininger, T., & Molliere, P. 2019, A&A, 628, A120, doi: 10.1051/0004-6361/201935593
- Ambikasaran et al. (2015) Ambikasaran, S., Foreman-Mackey, D., Greengard, L., Hogg, D. W., & O’Neil, M. 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence, 38, 252, doi: 10.1109/TPAMI.2015.2448083
- Andrae et al. (2010) Andrae, R., Schulze-Hartung, T., & Melchior, P. 2010, arXiv e-prints, arXiv:1012.3754. https://arxiv.org/abs/1012.3754
- Asplund et al. (2009) Asplund, M., Grevesse, N., Sauval, A. J., & Scott, P. 2009, ARA&A, 47, 481, doi: 10.1146/annurev.astro.46.060407.145222
- Barstow et al. (2017) Barstow, J. K., Aigrain, S., Irwin, P. G. J., & Sing, D. K. 2017, ApJ, 834, 50, doi: 10.3847/1538-4357/834/1/50
- Batalha & Line (2017) Batalha, N. E., & Line, M. R. 2017, AJ, 153, 151, doi: 10.3847/1538-3881/aa5faa
- Bauschlicher et al. (2001) Bauschlicher, C. W., Ram, R. S., Bernath, P. F., Parsons, C. G., & Galehouse, D. 2001, J. Chem. Phys., 115, 1312, doi: 10.1063/1.1377892
- Benneke & Seager (2013) Benneke, B., & Seager, S. 2013, ApJ, 778, 153, doi: 10.1088/0004-637X/778/2/153
- Benneke et al. (2019) Benneke, B., Knutson, H. A., Lothringer, J., et al. 2019, \natas, 3, 813, doi: 10.1038/s41550-019-0800-5
- Buchner et al. (2014) Buchner, J., Georgakakis, A., Nandra, K., et al. 2014, A&A, 564, A125, doi: 10.1051/0004-6361/201322971
- Bürkner et al. (2020) Bürkner, P.-C., Gabry, J., & Vehtari, A. 2020, Computational Statistics, 1
- Charbonneau et al. (2002) Charbonneau, D., Brown, T. M., Noyes, R. W., & Gilliland, R. L. 2002, ApJ, 568, 377, doi: 10.1086/338770
- Chen et al. (2017) Chen, G., Guenther, E. W., Pallé, E., et al. 2017, A&A, 600, A138, doi: 10.1051/0004-6361/201630228
- Chen et al. (2018) Chen, G., Pallé, E., Welbanks, L., et al. 2018, A&A, 616, A145, doi: 10.1051/0004-6361/201833033
- Colón et al. (2020) Colón, K. D., Kreidberg, L., Welbanks, L., et al. 2020, AJ, 160, 280, doi: 10.3847/1538-3881/abc1e9
- Deming et al. (2013) Deming, D., Wilkins, A., McCullough, P., et al. 2013, ApJ, 774, 95, doi: 10.1088/0004-637X/774/2/95
- Désert et al. (2011) Désert, J. M., Sing, D., Vidal-Madjar, A., et al. 2011, A&A, 526, A12, doi: 10.1051/0004-6361/200913093
- Espinoza et al. (2019) Espinoza, N., Rackham, B. V., Jordán, A., et al. 2019, MNRAS, 482, 2065, doi: 10.1093/mnras/sty2691
- Feroz et al. (2009) Feroz, F., Hobson, M. P., & Bridges, M. 2009, MNRAS, 398, 1601, doi: 10.1111/j.1365-2966.2009.14548.x
- Feroz et al. (2013) Feroz, F., Hobson, M. P., Cameron, E., & Pettitt, A. N. 2013, arXiv e-prints, arXiv:1306.2144. https://arxiv.org/abs/1306.2144
- Fisher & Heng (2018) Fisher, C., & Heng, K. 2018, MNRAS, 481, 4698, doi: 10.1093/mnras/sty2550
- Foreman-Mackey (2018) Foreman-Mackey, D. 2018, Research Notes of the American Astronomical Society, 2, 31, doi: 10.3847/2515-5172/aaaf6c
- Foreman-Mackey et al. (2017) Foreman-Mackey, D., Agol, E., Ambikasaran, S., & Angus, R. 2017, AJ, 154, 220, doi: 10.3847/1538-3881/aa9332
- Gelman et al. (2014) Gelman, A., Hwang, J., & Vehtari, A. 2014, Statistics and computing, 24, 997
- Greene et al. (2016) Greene, T. P., Line, M. R., Montero, C., et al. 2016, ApJ, 817, 17, doi: 10.3847/0004-637X/817/1/17
- Harris et al. (2020) Harris, C. R., Millman, K. J., van der Walt, S. J., et al. 2020, Nature, 585, 357, doi: 10.1038/s41586-020-2649-2
- Hartman et al. (2012) Hartman, J. D., Bakos, G. Á., Béky, B., et al. 2012, AJ, 144, 139, doi: 10.1088/0004-6256/144/5/139
- Hunter (2007) Hunter, J. D. 2007, Computing in Science & Engineering, 9, 90, doi: 10.1109/MCSE.2007.55
- Ih & Kempton (2021) Ih, J., & Kempton, E. M. R. 2021, AJ, 162, 237, doi: 10.3847/1538-3881/ac173b
- John (1988) John, T. L. 1988, A&A, 193, 189
- Kitzmann et al. (2018) Kitzmann, D., Heng, K., Rimmer, P. B., et al. 2018, ApJ, 863, 183, doi: 10.3847/1538-4357/aace5a
- Kreidberg et al. (2014a) Kreidberg, L., Bean, J. L., Désert, J.-M., et al. 2014a, Nature, 505, 69, doi: 10.1038/nature12888
- Kreidberg et al. (2014b) —. 2014b, ApJ, 793, L27, doi: 10.1088/2041-8205/793/2/L27
- Kumar et al. (2019) Kumar, R., Carroll, C., Hartikainen, A., & Martin, O. 2019, Journal of Open Source Software, 4, 1143, doi: 10.21105/joss.01143
- Lewis et al. (2020) Lewis, N. K., Wakeford, H. R., MacDonald, R. J., et al. 2020, ApJ, 902, L19, doi: 10.3847/2041-8213/abb77f
- Line et al. (2012) Line, M. R., Zhang, X., Vasisht, G., et al. 2012, ApJ, 749, 93, doi: 10.1088/0004-637X/749/1/93
- MacDonald & Madhusudhan (2017) MacDonald, R. J., & Madhusudhan, N. 2017, MNRAS, 469, 1979, doi: 10.1093/mnras/stx804
- MacDonald & Madhusudhan (2019) —. 2019, MNRAS, 486, 1292, doi: 10.1093/mnras/stz789
- Madhusudhan (2018) Madhusudhan, N. 2018, in Handbook of Exoplanets, ed. H. J. Deeg & J. A. Belmonte (Springer), 104, doi: 10.1007/978-3-319-55333-7_104
- Madhusudhan (2019) Madhusudhan, N. 2019, ARA&A, 57, 617, doi: 10.1146/annurev-astro-081817-051846
- Madhusudhan et al. (2014) Madhusudhan, N., Crouzet, N., McCullough, P. R., Deming, D., & Hedges, C. 2014, ApJ, 791, L9, doi: 10.1088/2041-8205/791/1/L9
- Madhusudhan & Seager (2009) Madhusudhan, N., & Seager, S. 2009, ApJ, 707, 24, doi: 10.1088/0004-637X/707/1/24
- McCullough et al. (2014) McCullough, P. R., Crouzet, N., Deming, D., & Madhusudhan, N. 2014, ApJ, 791, 55, doi: 10.1088/0004-637X/791/1/55
- McGill et al. (2022) McGill, P., Anderson, J., Casertano, S., et al. 2022, arXiv e-prints, arXiv:2206.01814. https://arxiv.org/abs/2206.01814
- McKemmish et al. (2016) McKemmish, L. K., Yurchenko, S. N., & Tennyson, J. 2016, MNRAS, 463, 771, doi: 10.1093/mnras/stw1969
- Meier Valdés et al. (2022) Meier Valdés, E. A., Morris, B. M., Wells, R. D., Schanche, N., & Demory, B. O. 2022, arXiv e-prints, arXiv:2205.08560. https://arxiv.org/abs/2205.08560
- Morris et al. (2021) Morris, B. M., Delrez, L., Brandeker, A., et al. 2021, A&A, 653, A173, doi: 10.1051/0004-6361/202140892
- National Academies of Sciences & Medicine (2021) National Academies of Sciences, E., & Medicine. 2021, Pathways to Discovery in Astronomy and Astrophysics for the 2020s, doi: 10.17226/26141
- Neil et al. (2022) Neil, A. R., Liston, J., & Rogers, L. A. 2022, ApJ, 933, 63, doi: 10.3847/1538-4357/ac609b
- Nikolov et al. (2016) Nikolov, N., Sing, D. K., Gibson, N. P., et al. 2016, ApJ, 832, 191, doi: 10.3847/0004-637X/832/2/191
- Parmentier et al. (2018) Parmentier, V., Line, M. R., Bean, J. L., et al. 2018, A&A, 617, A110, doi: 10.1051/0004-6361/201833059
- Patrascu et al. (2015) Patrascu, A. T., Yurchenko, S. N., & Tennyson, J. 2015, MNRAS, 449, 3613, doi: 10.1093/mnras/stv507
- Pinhas et al. (2019) Pinhas, A., Madhusudhan, N., Gandhi, S., & MacDonald, R. 2019, MNRAS, 482, 1485, doi: 10.1093/mnras/sty2544
- Pinhas et al. (2018) Pinhas, A., Rackham, B. V., Madhusudhan, N., & Apai, D. 2018, MNRAS, 480, 5314, doi: 10.1093/mnras/sty2209
- Pont et al. (2008) Pont, F., Knutson, H., Gilliland, R. L., Moutou, C., & Charbonneau, D. 2008, MNRAS, 385, 109, doi: 10.1111/j.1365-2966.2008.12852.x
- Pont et al. (2013) Pont, F., Sing, D. K., Gibson, N. P., et al. 2013, MNRAS, 432, 2917, doi: 10.1093/mnras/stt651
- Rackham et al. (2017) Rackham, B., Espinoza, N., Apai, D., et al. 2017, ApJ, 834, 151, doi: 10.3847/1538-4357/aa4f6c
- Rasmussen & Williams (2005) Rasmussen, C., & Williams, C. 2005, Gaussian Processes for Machine Learning, Adaptive Computation and Machine Learning series (MIT Press). https://books.google.com/books?id=Tr34DwAAQBAJ
- Richard et al. (2012) Richard, C., Gordon, I. E., Rothman, L. S., et al. 2012, J. Quant. Spec. Radiat. Transf., 113, 1276, doi: 10.1016/j.jqsrt.2011.11.004
- Rothman et al. (2010) Rothman, L. S., Gordon, I. E., Barber, R. J., et al. 2010, J. Quant. Spec. Radiat. Transf., 111, 2139, doi: 10.1016/j.jqsrt.2010.05.001
- Sedaghati et al. (2017) Sedaghati, E., Boffin, H. M. J., MacDonald, R. J., et al. 2017, Nature, 549, 238, doi: 10.1038/nature23651
- Sheppard et al. (2021) Sheppard, K. B., Welbanks, L., Mandell, A. M., et al. 2021, AJ, 161, 51, doi: 10.3847/1538-3881/abc8f4
- Sing et al. (2016) Sing, D. K., Fortney, J. J., Nikolov, N., et al. 2016, Nature, 529, 59, doi: 10.1038/nature16068
- Sivula et al. (2020) Sivula, T., Magnusson, M., Alonzo Matamoros, A., & Vehtari, A. 2020, arXiv e-prints, arXiv:2008.10296. https://arxiv.org/abs/2008.10296
- Skilling (2004) Skilling, J. 2004, in American Institute of Physics Conference Series, Vol. 735, American Institute of Physics Conference Series, ed. R. Fischer, R. Preuss, & U. V. Toussaint, 395–405, doi: 10.1063/1.1835238
- Skilling (2006) Skilling, J. 2006, BayAn, 1, 833, doi: 10.1214/06-BA127
- Sotzen et al. (2020) Sotzen, K. S., Stevenson, K. B., Sing, D. K., et al. 2020, AJ, 159, 5, doi: 10.3847/1538-3881/ab5442
- Spake et al. (2021) Spake, J. J., Sing, D. K., Wakeford, H. R., et al. 2021, MNRAS, 500, 4042, doi: 10.1093/mnras/staa3116
- Stevenson et al. (2017) Stevenson, K. B., Line, M. R., Bean, J. L., et al. 2017, AJ, 153, 68, doi: 10.3847/1538-3881/153/2/68
- Sundararajan & Keerthi (2001) Sundararajan, S., & Keerthi, S. S. 2001, Neural computation, 13, 1103
- Torres et al. (2008) Torres, G., Winn, J. N., & Holman, M. J. 2008, ApJ, 677, 1324, doi: 10.1086/529429
- Trotta (2008) Trotta, R. 2008, ConPh, 49, 71, doi: 10.1080/00107510802066753
- Tsiaras et al. (2018) Tsiaras, A., Waldmann, I. P., Zingales, T., et al. 2018, AJ, 155, 156, doi: 10.3847/1538-3881/aaaf75
- Vehtari et al. (2017) Vehtari, A., Gelman, A., & Gabry, J. 2017, Statistics and computing, 27, 1413
- Vehtari & Ojanen (2012) Vehtari, A., & Ojanen, J. 2012, Statistics Surveys, 6, 142
- Vehtari et al. (2015) Vehtari, A., Simpson, D., Gelman, A., Yao, Y., & Gabry, J. 2015, arXiv preprint arXiv:1507.02646
- Virtanen et al. (2020) Virtanen, P., Gommers, R., Oliphant, T. E., et al. 2020, Nature Methods, 17, 261, doi: https://doi.org/10.1038/s41592-019-0686-2
- von Essen et al. (2019) von Essen, C., Mallonn, M., Welbanks, L., et al. 2019, A&A, 622, A71, doi: 10.1051/0004-6361/201833837
- Wakeford et al. (2020) Wakeford, H. R., Sing, D. K., Stevenson, K. B., et al. 2020, AJ, 159, 204, doi: 10.3847/1538-3881/ab7b78
- Welbanks & Madhusudhan (2019) Welbanks, L., & Madhusudhan, N. 2019, AJ, 157, 206, doi: 10.3847/1538-3881/ab14de
- Welbanks & Madhusudhan (2021a) —. 2021a, ApJ, 913, 114, doi: 10.3847/1538-4357/abee94
- Welbanks & Madhusudhan (2021b) —. 2021b, arXiv e-prints, arXiv:2112.09125. https://arxiv.org/abs/2112.09125
- Welbanks et al. (2019) Welbanks, L., Madhusudhan, N., Allard, N. F., et al. 2019, ApJ, 887, L20, doi: 10.3847/2041-8213/ab5a89
- Zhang & Stephens (2009) Zhang, J., & Stephens, M. A. 2009, Technometrics, 51, 316
- Zhang et al. (2020) Zhang, M., Chachan, Y., Kempton, E. M. R., Knutson, H. A., & Chang, W. H. 2020, ApJ, 899, 27, doi: 10.3847/1538-4357/aba1e6