Bayesian Inference | Towards Data Science

Are You Sure Your Posterior Makes Sense?

Fri, 11 Apr 2025 18:38:41 +0000

This article is co-authored by Felipe Bandeira, Giselle Fretta, Thu Than, and Elbion Redenica. We also thank Prof. Carl Scheffler for his support.

Introduction

Parameter estimation has been for decades one of the most important topics in statistics. While frequentist approaches, such as Maximum Likelihood Estimations, used to be the gold standard, the advance of computation has opened space for Bayesian methods. Estimating posterior distributions with Mcmc samplers became increasingly common, but reliable inferences depend on a task that is far from trivial: making sure that the sampler — and the processes it executes under the hood — worked as expected. Keeping in mind what Lewis Caroll once wrote: “If you don’t know where you’re going, any road will take you there.”

This article is meant to help data scientists evaluate an often overlooked aspect of Bayesian parameter estimation: the reliability of the sampling process. Throughout the sections, we combine simple analogies with technical rigor to ensure our explanations are accessible to data scientists with any level of familiarity with Bayesian methods. Although our implementations are in Python with PyMC, the concepts we cover are useful to anyone using an MCMC algorithm, from Metropolis-Hastings to NUTS.

Key Concepts

No data scientist or statistician would disagree with the importance of robust parameter estimation methods. Whether the objective is to make inferences or conduct simulations, having the capacity to model the data generation process is a crucial part of the process. For a long time, the estimations were mainly performed using frequentist tools, such as Maximum Likelihood Estimations (MLE) or even the famous Least Squares optimization used in regressions. Yet, frequentist methods have clear shortcomings, such as the fact that they are focused on point estimates and do not incorporate prior knowledge that could improve estimates.

As an alternative to these tools, Bayesian methods have gained popularity over the past decades. They provide statisticians not only with point estimates of the unknown parameter but also with confidence intervals for it, all of which are informed by the data and by the prior knowledge researchers held. Originally, Bayesian parameter estimation was done through an adapted version of Bayes’ theorem focused on unknown parameters (represented as θ) and known data points (represented as x). We can define P(θ|x), the posterior distribution of a parameter’s value given the data, as:

\[ P(\theta|x) = \frac{P(x|\theta) P(\theta)}{P(x)} \]

In this formula, P(x|θ) is the likelihood of the data given a parameter value, P(θ) is the prior distribution over the parameter, and P(x) is the evidence, which is computed by integrating all possible values of the prior:

\[ P(x) = \int_\theta P(x, \theta) d\theta \]

In some cases, due to the complexity of the calculations required, deriving the posterior distribution analytically was not possible. However, with the advance of computation, running sampling algorithms (especially MCMC ones) to estimate posterior distributions has become easier, giving researchers a powerful tool for situations where analytical posteriors are not trivial to find. Yet, with such power also comes a large amount of responsibility to ensure that results make sense. This is where sampler diagnostics come in, offering a set of valuable tools to gauge 1) whether an MCMC algorithm is working well and, consequently, 2) whether the estimated distribution we see is an accurate representation of the real posterior distribution. But how can we know so?

How samplers work

Before diving into the technicalities of diagnostics, we shall cover how the process of sampling a posterior (especially with an MCMC sampler) works. In simple terms, we can think of a posterior distribution as a geographical area we haven’t been to but need to know the topography of. How can we draw an accurate map of the region?

One of our favorite analogies comes from Ben Gilbert. Suppose that the unknown region is actually a house whose floorplan we wish to map. For some reason, we cannot directly visit the house, but we can send bees inside with GPS devices attached to them. If everything works as expected, the bees will fly around the house, and using their trajectories, we can estimate what the floor plan looks like. In this analogy, the floor plan is the posterior distribution, and the sampler is the group of bees flying around the house.

The reason we are writing this article is that, in some cases, the bees won’t fly as expected. If they get stuck in a certain room for some reason (because someone dropped sugar on the floor, for example), the data they return won’t be representative of the entire house; rather than visiting all rooms, the bees only visited a few, and our picture of what the house looks like will ultimately be incomplete. Similarly, when a sampler does not work correctly, our estimation of the posterior distribution is also incomplete, and any inference we draw based on it is likely to be wrong.

Monte Carlo Markov Chain (MCMC)

In technical terms, we call an MCMC process any algorithm that undergoes transitions from one state to another with certain properties. Markov Chain refers to the fact that the next state only depends on the current one (or that the bee’s next location is only influenced by its current place, and not by all of the places where it has been before). Monte Carlo means that the next state is chosen randomly. MCMC methods like Metropolis-Hastings, Gibbs sampling, Hamiltonian Monte Carlo (HMC), and No-U-Turn Sampler (NUTS) all operate by constructing Markov Chains (a sequence of steps) that are close to random and gradually explore the posterior distribution.

Now that you understand how a sampler works, let’s dive into a practical scenario to help us explore sampling problems.

Case Study

Imagine that, in a faraway nation, a governor wants to understand more about public annual spending on healthcare by mayors of cities with less than 1 million inhabitants. Rather than looking at sheer frequencies, he wants to understand the underlying distribution explaining expenditure, and a sample of spending data is about to arrive. The problem is that two of the economists involved in the project disagree about how the model should look.

Model 1

The first economist believes that all cities spend similarly, with some variation around a certain mean. As such, he creates a simple model. Although the specifics of how the economist chose his priors are irrelevant to us, we do need to keep in mind that he is trying to approximate a Normal (unimodal) distribution.

\[
x_i \sim \text{Normal}(\mu, \sigma^2) \text{ i.i.d. for all } i \\
\mu \sim \text{Normal}(10, 2) \\
\sigma^2 \sim \text{Uniform}(0,5)
\]

Model 2

The second economist disagrees, arguing that spending is more complex than his colleague believes. He believes that, given ideological differences and budget constraints, there are two kinds of cities: the ones that do their best to spend very little and the ones that are not afraid of spending a lot. As such, he creates a slightly more complex model, using a mixture of normals to reflect his belief that the true distribution is bimodal.

\[
x_i \sim \text{Normal-Mixture}([\omega, 1-\omega], [m_1, m_2], [s_1^2, s_2^2]) \text{ i.i.d. for all } i\\
m_j \sim \text{Normal}(2.3, 0.5^2) \text{ for } j = 1,2 \\
s_j^2 \sim \text{Inverse-Gamma}(1,1) \text{ for } j=1,2 \\
\omega \sim \text{Beta}(1,1)
\]

After the data arrives, each economist runs an MCMC algorithm to estimate their desired posteriors, which will be a reflection of reality (1) if their assumptions are true and (2) if the sampler worked correctly. The first if, a discussion about assumptions, shall be left to the economists. However, how can they know whether the second if holds? In other words, how can they be sure that the sampler worked correctly and, as a consequence, their posterior estimations are unbiased?

Sampler Diagnostics

To evaluate a sampler’s performance, we can explore a small set of metrics that reflect different parts of the estimation process.

Quantitative Metrics

R-hat (Potential Scale Reduction Factor)

In simple terms, R-hat evaluates whether bees that started at different places have all explored the same rooms at the end of the day. To estimate the posterior, an MCMC algorithm uses multiple chains (or bees) that start at random locations. R-hat is the metric we use to assess the convergence of the chains. It measures whether multiple MCMC chains have mixed well (i.e., if they have sampled the same topography) by comparing the variance of samples within each chain to the variance of the sample means across chains. Intuitively, this means that

\[
\hat{R} = \sqrt{\frac{\text{Variance Between Chains}}{\text{Variance Within Chains}}}
\]

If R-hat is close to 1.0 (or below 1.01), it means that the variance within each chain is very similar to the variance between chains, suggesting that they have converged to the same distribution. In other words, the chains are behaving similarly and are also indistinguishable from one another. This is precisely what we see after sampling the posterior of the first model, shown in the last column of the table below:

Figure 1. Summary statistics of the sampler highlighting ideal R-hats.

The r-hat from the second model, however, tells a different story. The fact we have such large r-hat values indicates that, at the end of the sampling process, the different chains had not converged yet. In practice, this means that the distribution they explored and returned was different, or that each bee created a map of a different room of the house. This fundamentally leaves us without a clue of how the pieces connect or what the complete floor plan looks like.

Figure 2. Summary statistics of the sampler showcasing problematic R-hats.

Given our R-hat readouts were large, we know something went wrong with the sampling process in the second model. However, even if the R-hat had turned out within acceptable levels, this does not give us certainty that the sampling process worked. R-hat is just a diagnostic tool, not a guarantee. Sometimes, even if your R-hat readout is lower than 1.01, the sampler might not have properly explored the full posterior. This happens when multiple bees start their exploration in the same room and remain there. Likewise, if you’re using a small number of chains, and if your posterior happens to be multimodal, there is a probability that all chains started in the same mode and failed to explore other peaks.

The R-hat readout reflects convergence, not completion. In order to have a more comprehensive idea, we need to check other diagnostic metrics as well.

Effective Sample Size (ESS)

When explaining what MCMC was, we mentioned that “Monte Carlo” refers to the fact that the next state is chosen randomly. This does not necessarily mean that the states are fully independent. Even though the bees choose their next step at random, these steps are still correlated to some extent. If a bee is exploring a living room at time t=0, it will probably still be in the living room at time t=1, even though it is in a different part of the same room. Due to this natural connection between samples, we say these two data points are autocorrelated.

Due to their nature, MCMC methods inherently produce autocorrelated samples, which complicates statistical analysis and requires careful evaluation. In statistical inference, we often assume independent samples to ensure that the estimates of uncertainty are accurate, hence the need for uncorrelated samples. If two data points are too similar to each other, the correlation reduces their effective information content. Mathematically, the formula below represents the autocorrelation function between two time points (t1 and t2) in a random process:

\[
R_{XX}(t_1, t_2) = E[X_{t_1} \overline{X_{t_2}}]
\]

where E is the expected value operator and X-bar is the complex conjugate. In MCMC sampling, this is crucial because high autocorrelation means that new samples don’t teach us anything different from the old ones, effectively reducing the sample size we have. Unsurprisingly, the metric that reflects this is called Effective Sample Size (ESS), and it helps us determine how many truly independent samples we have.

As hinted previously, the effective sample size accounts for autocorrelation by estimating how many truly independent samples would provide the same information as the autocorrelated samples we have. Mathematically, for a parameter θ, the ESS is defined as:

\[
ESS = \frac{n}{1 + 2 \sum_{k=1}^{\infty} \rho(\theta)_k}
\]

where n is the total number of samples and ρ(θ)k is the autocorrelation at lag k for parameter θ.

Typically, for ESS readouts, the higher, the better. This is what we see in the readout for the first model. Two common ESS variations are Bulk-ESS, which assesses mixing in the central part of the distribution, and Tail-ESS, which focuses on the efficiency of sampling the distribution’s tails. Both inform us if our model accurately reflects the central tendency and credible intervals.

Figure 3. Summary statistics of the sampler highlighting ideal quantities for ESS bulk and tail.

In contrast, the readouts for the second model are very bad. Typically, we want to see readouts that are at least 1/10 of the total sample size. In this case, given each chain sampled 2000 observations, we should expect ESS readouts of at least 800 (from the total size of 8000 samples across 4 chains of 2000 samples each), which is not what we observe.

Figure 4. Summary statistics of the sampler demonstrating problematic ESS bulk and tail.

Visual Diagnostics

Apart from the numerical metrics, our understanding of sampler performance can be deepened through the use of diagnostic plots. The main ones are rank plots, trace plots, and pair plots.

Rank Plots

A rank plot helps us identify whether the different chains have explored all of the posterior distribution. If we once again think of the bee analogy, rank plots tell us which bees explored which parts of the house. Therefore, to evaluate whether the posterior was explored equally by all chains, we observe the shape of the rank plots produced by the sampler. Ideally, we want the distribution of all chains to look roughly uniform, like in the rank plots generated after sampling the first model. Each color below represents a chain (or bee):

Figure 5. Rank plots for parameters ‘m’ and ‘s’ across four MCMC chains. Each bar represents the distribution of rank values for one chain, with ideally uniform ranks indicating good mixing and proper convergence.

Under the hood, a rank plot is produced with a simple sequence of steps. First, we run the sampler and let it sample from the posterior of each parameter. In our case, we are sampling posteriors for parameters m and s of the first model. Then, parameter by parameter, we get all samples from all chains, put them together, and order them from smallest to largest. We then ask ourselves, for each sample, what was the chain where it came from? This will allow us to create plots like the ones we see above.

In contrast, bad rank plots are easy to spot. Unlike the previous example, the distributions from the second model, shown below, are not uniform. From the plots, what we interpret is that each chain, after beginning at different random locations, got stuck in a region and did not explore the entirety of the posterior. Consequently, we cannot make inferences from the results, as they are unreliable and not representative of the true posterior distribution. This would be equivalent to having four bees that started at different rooms of the house and got stuck somewhere during their exploration, never covering the entirety of the property.

Figure 6. Rank plots for parameters m, s_squared, and w across four MCMC chains. Each subplot shows the distribution of ranks by chain. There are noticeable deviations from uniformity (e.g., stair-step patterns or imbalances across chains) suggesting potential sampling issues.

KDE and Trace Plots

Similar to R-hat, trace plots help us assess the convergence of MCMC samples by visualizing how the algorithm explores the parameter space over time. PyMC provides two types of trace plots to diagnose mixing issues: Kernel Density Estimate (KDE) plots and iteration-based trace plots. Each of these serves a distinct purpose in evaluating whether the sampler has properly explored the target distribution.

The KDE plot (usually on the left) estimates the posterior density for each chain, where each line represents a separate chain. This allows us to check whether all chains have converged to the same distribution. If the KDEs overlap, it suggests that the chains are sampling from the same posterior and that mixing has occurred. On the other hand, the trace plot (usually on the right) visualizes how parameter values change over MCMC iterations (steps), with each line representing a different chain. A well-mixed sampler will produce trace plots that look noisy and random, with no clear structure or separation between chains.

Using the bee analogy, trace plots can be thought of as snapshots of the “features” of the house at different locations. If the sampler is working correctly, the KDEs in the left plot should align closely, showing that all bees (chains) have explored the house similarly. Meanwhile, the right plot should show highly variable traces that blend together, confirming that the chains are actively moving through the space rather than getting stuck in specific regions.

Figure 7. Density and trace plots for parameters m and s from the first model across four MCMC chains. The left panel shows kernel density estimates (KDE) of the marginal posterior distribution for each chain, indicating consistent central tendency and spread. The right panel displays the trace plot over iterations, with overlapping chains and no apparent divergences, suggesting good mixing and convergence.

However, if your sampler has poor mixing or convergence issues, you will see something like the figure below. In this case, the KDEs will not overlap, meaning that different chains have sampled from different distributions rather than a shared posterior. The trace plot will also show structured patterns instead of random noise, indicating that chains are stuck in different regions of the parameter space and failing to fully explore it.

Figure 8. KDE (left) and trace plots (right) for parameters m, s_squared, and w across MCMC chains for the second model. Multimodal distributions are visible for m and w, suggesting potential identifiability issues. Trace plots reveal that chains explore different modes with limited mixing, particularly for m, highlighting challenges in convergence and effective sampling.

By using trace plots alongside the other diagnostics, you can identify sampling issues and determine whether your MCMC algorithm is effectively exploring the posterior distribution.

Pair Plots

A third kind of plot that is often useful for diagnostic are pair plots. In models where we want to estimate the posterior distribution of multiple parameters, pair plots allow us to observe how different parameters are correlated. To understand how such plots are formed, think again about the bee analogy. If you imagine that we’ll create a plot with the width and length of the house, each “step” that the bees take can be represented by an (x, y) combination. Likewise, each parameter of the posterior is represented as a dimension, and we create scatter plots showing where the sampler walked using parameter values as coordinates. Here, we are plotting each unique pair (x, y), resulting in the scatter plot you see in the middle of the image below. The one-dimensional plots you see on the edges are the marginal distributions over each parameter, giving us additional information on the sampler’s behavior when exploring them.

Take a look at the pair plot from the first model.

Figure 9. Joint posterior distribution of parameters m and s, with marginal densities. The scatter plot shows a roughly symmetric, elliptical shape, suggesting a low correlation between m and s.

Each axis represents one of the two parameters whose posteriors we are estimating. For now, let’s focus on the scatter plot in the middle, which shows the parameter combinations sampled from the posterior. The fact we have a very even distribution means that, for any particular value of m, there was a range of values of s that were equally likely to be sampled. Additionally, we don’t see any correlation between the two parameters, which is usually good! There are cases when we would expect some correlation, such as when our model involves a regression line. However, in this instance, we have no reason to believe two parameters should be highly correlated, so the fact we don’t observe unusual behavior is positive news.

Now, take a look at the pair plots from the second model.

Figure 10. Pair plot of the joint posterior distributions for parameters m, s_squared, and w. The scatter plots reveal strong correlations between several parameters.

Given that this model has five parameters to be estimated, we naturally have a greater number of plots since we are analyzing them pair-wise. However, they look odd compared to the previous example. Namely, rather than having an even distribution of points, the samples here either seem to be divided across two regions or seem somewhat correlated. This is another way of visualizing what the rank plots have shown: the sampler did not explore the full posterior distribution. Below we isolated the top left plot, which contains the samples from m0 and m1. Unlike the plot from model 1, here we see that the value of one parameter greatly influences the value of the other. If we sampled m1 around 2.5, for example, m0 is likely to be sampled from a very narrow range around 1.5.

Figure 11. Joint posterior distribution of parameters m₀ and m₁, with marginal densities.

Certain shapes can be observed in problematic pair plots relatively frequently. Diagonal patterns, for example, indicate a high correlation between parameters. Banana shapes are often connected to parametrization issues, often being present in models with tight priors or constrained parameters. Funnel shapes might indicate hierarchical models with bad geometry. When we have two separate islands, like in the plot above, this can indicate that the posterior is bimodal AND that the chains haven’t mixed well. However, keep in mind that these shapes might indicate problems, but not necessarily do so. It’s up to the data scientist to examine the model and determine which behaviors are expected and which ones are not!

Some Fixing Techniques

When your diagnostics indicate sampling problems — whether concerning R-hat values, low ESS, unusual rank plots, separated trace plots, or strange parameter correlations in pair plots — several strategies can help you address the underlying issues. Sampling problems typically stem from the target posterior being too complex for the sampler to explore efficiently. Complex target distributions might have:

Multiple modes (peaks) that the sampler struggles to move between
Irregular shapes with narrow “corridors” connecting different regions
Areas of drastically different scales (like the “neck” of a funnel)
Heavy tails that are difficult to sample accurately

In the bee analogy, these complexities represent houses with unusual floor plans — disconnected rooms, extremely narrow hallways, or areas that change dramatically in size. Just as bees might get trapped in specific regions of such houses, MCMC chains can get stuck in certain areas of the posterior.

Figure 12. Examples of multimodal target distributions.

Figure 13. Examples of weirdly shaped distributions.

To help the sampler in its exploration, there are simple strategies we can use.

Strategy 1: Reparameterization

Reparameterization is particularly effective for hierarchical models and distributions with challenging geometries. It involves transforming your model’s parameters to make them easier to sample. Back to the bee analogy, imagine the bees are exploring a house with a peculiar layout: a spacious living room that connects to the kitchen through a very, very narrow hallway. One aspect we hadn’t mentioned before is that the bees have to fly in the same way through the entire house. That means that if we dictate the bees should use large “steps,” they will explore the living room very well but hit the walls in the hallway head-on. Likewise, if their steps are small, they will explore the narrow hallway well, but take forever to cover the entire living room. The difference in scales, which is natural to the house, makes the bees’ job more difficult.

A classic example that represents this scenario is Neal’s funnel, where the scale of one parameter depends on another:

\[
p(y, x) = \text{Normal}(y|0, 3) \times \prod_{n=1}^{9} \text{Normal}(x_n|0, e^{y/2})
\]

Figure 14. Log the marginal density of y and the first dimension of Neal’s funnel. The neck is where the sampler is struggling to sample from and the step size is required to be much smaller than the body. (Image source: Stan User’s Guide)

We can see that the scale of x is dependent on the value of y. To fix this problem, we can separate x and y as independent standard Normals and then transform these variables into the desired funnel distribution. Instead of sampling directly like this:

\[
\begin{align*}
y &\sim \text{Normal}(0, 3) \\
x &\sim \text{Normal}(0, e^{y/2})
\end{align*}
\]

You can reparameterize to sample from standard Normals first:

\[
y_{raw} \sim \text{Standard Normal}(0, 1) \\
x_{raw} \sim \text{Standard Normal}(0, 1) \\
\\
y = 3y_{raw} \\
x = e^{y/2} x_{raw}
\]

This technique separates the hierarchical parameters and makes sampling more efficient by eliminating the dependency between them.

Reparameterization is like redesigning the house such that instead of forcing the bees to find a single narrow hallway, we create a new layout where all passages have similar widths. This helps the bees use a consistent flying pattern throughout their exploration.

Strategy 2: Handling Heavy-tailed Distributions

Heavy-tailed distributions like Cauchy and Student-T present challenges for samplers and the ideal step size. Their tails require larger step sizes than their central regions (similar to very long hallways that require the bees to travel long distances), which creates a challenge:

Small step sizes lead to inefficient sampling in the tails
Large step sizes cause too many rejections in the center

Figure 15. Probability density functions for various Cauchy distributions illustrate the effects of changing the location parameter and scale parameter. (Image source: Wikipedia)

Reparameterization solutions include:

For Cauchy: Defining the variable as a transformation of a Uniform distribution using the Cauchy inverse CDF
For Student-T: Using a Gamma-Mixture representation

Strategy 3: Hyperparameter Tuning

Sometimes the solution lies in adjusting the sampler’s hyperparameters:

Increase total iterations: The simplest approach — give the sampler more time to explore.
Increase target acceptance rate (adapt_delta): Reduce divergent transitions (try 0.9 instead of the default 0.8 for complex models, for example).
Increase max_treedepth: Allow the sampler to take more steps per iteration.
Extend warmup/adaptation phase: Give the sampler more time to adapt to the posterior geometry.

Remember that while these adjustments may improve your diagnostic metrics, they often treat symptoms rather than underlying causes. The previous strategies (reparameterization and better proposal distributions) typically offer more fundamental solutions.

Strategy 4: Better Proposal Distributions

This solution is for function fitting processes, rather than sampling estimations of the posterior. It basically asks the question: “I’m currently here in this landscape. Where should I jump to next so that I explore the full landscape, or how do I know that the next jump is the jump I should make?” Thus, choosing a good distribution means making sure that the sampling process explores the full parameter space instead of just a specific region. A good proposal distribution should:

Have substantial probability mass where the target distribution does.
Allow the sampler to make jumps of the appropriate size.

One common choice of the proposal distribution is the Gaussian (Normal) distribution with mean μ and standard deviation σ — the scale of the distribution that we can tune to decide how far to jump from the current position to the next position. If we choose the scale for the proposal distribution to be too small, it might either take too long to explore the entire posterior or it will get stuck in a region and never explore the full distribution. But if the scale is too large, you might never get to explore some regions, jumping over them. It’s like playing ping-pong where we only reach the two edges but not the middle.

Improve Prior Specification

When all else fails, reconsider your model’s prior specifications. Vague or weakly informative priors (like uniformly distributed priors) can sometimes lead to sampling difficulties. More informative priors, when justified by domain knowledge, can help guide the sampler toward more reasonable regions of the parameter space. Sometimes, despite your best efforts, a model may remain challenging to sample effectively. In such cases, consider whether a simpler model might achieve similar inferential goals while being more computationally tractable. The best model is often not the most complex one, but the one that balances complexity with reliability. The table below shows the summary of fixing strategies for different issues.

Diagnostic Signal	Potential Issue	Recommended Fix
High R-hat	Poor mixing between chains	Increase iterations, adjust the step size
Low ESS	High autocorrelation	Reparameterization, increase adapt_delta
Non-uniform rank plots	Chains stuck in different regions	Better proposal distribution, start with multiple chains
Separated KDEs in trace plots	Chains exploring different distributions	Reparameterization
Funnel shapes in pair plots	Hierarchical model issues	Non-centered reparameterization
Disjoint clusters in pair plots	Multimodality with poor mixing	Adjusted distribution, simulated annealing

Conclusion

Assessing the quality of MCMC sampling is crucial for ensuring reliable inference. In this article, we explored key diagnostic metrics such as R-hat, ESS, rank plots, trace plots, and pair plots, discussing how each helps determine whether the sampler is performing properly.

If there’s one takeaway we want you to keep in mind it’s that you should always run diagnostics before drawing conclusions from your samples. No single metric provides a definitive answer — each serves as a tool that highlights potential issues rather than proving convergence. When problems arise, strategies such as reparameterization, hyperparameter tuning, and prior specification can help improve sampling efficiency.

By combining these diagnostics with thoughtful modeling decisions, you can ensure a more robust analysis, reducing the risk of misleading inferences due to poor sampling behavior.

References

B. Gilbert, Bob’s bees: the importance of using multiple bees (chains) to judge MCMC convergence (2018), Youtube

Chi-Feng, MCMC demo (n.d.), GitHub

D. Simpson, Maybe it’s time to let the old ways die; or We broke R-hat so now we have to fix it. (2019), Statistical Modeling, Causal Inference, and Social Science

M. Taboga, Markov Chain Monte Carlo (MCMC) methods (2021), Lectures on probability theory and mathematical Statistics. Kindle Direct Publishing.

T. Wiecki, MCMC Sampling for Dummies (2024), twecki.io
Stan User’s Guide, Reparametrization (n.d.), Stan Documentation

The post Are You Sure Your Posterior Makes Sense? appeared first on Towards Data Science.

]]>

Bayesian Sensor Calibration

Wed, 01 May 2024 05:37:45 +0000

With contributions from Moritz Berger.

Bayesian sensor calibration is an emerging technique combining statistical models and data to optimally calibrate sensors – a crucial engineering procedure. This tutorial provides the Python code to perform such calibration numerically using existing libraries with a minimal math background. As an example case study, we consider a magnetic field sensor whose sensitivity drifts with temperature.

Glossary. The bolded terms are defined in the International Vocabulary of Metrology (known as the "VIM definitions"). Only the first occurrence is in bold.

Code availability. An executable Jupyter notebook for the tutorial is available on Github. It can be accessed via nbviewer.

Introduction

CONTEXT. Physical sensors provide the primary inputs that enable systems to make sense of their environment. They measure physical quantities such as temperature, electric current, power, speed, or light intensity. A measurement result is an estimate of the true value of the measured quantity (the so-called measurand). Sensors are never perfect. Non-idealities, part-to-part variations, and random noise all contribute to the sensor error. Sensor calibration and the subsequent adjustment are critical steps to minimize sensor measurement uncertainty. The Bayesian approach provides a mathematical framework to represent uncertainty. In particular, how uncertainty is reduced by "smart" calibration combining prior knowledge about past samples and the new evidence provided by the calibration. The math for the exact analytical solution can be intimidating (Berger 2022), even in simple cases where a sensor response is modeled as a polynomial transfer function with noise. Luckily, Python libraries were developed to facilitate Bayesian statistical modeling. Hence, Bayesian models are increasingly accessible to engineers. While hands-on tutorials exist (Copley 2023; Watts 2020) and even textbooks (Davidson-Pilon 2015; Martin 2021), they lack sensor calibration examples.

OBJECTIVE. In this article, we aim to reproduce a simplified case inspired by (Berger 2002) and illustrated in the figure below. A sensor is intended to measure the current i flowing through a wire via the measurement of the magnetic field B which is directly proportional to the current. We focus on the magnetic sensor and consider the following non-idealities. (1) The temperature B is a parasitic influence quantity disturbing the measurement. (2) The sensor response varies from one sensor to another due to part-to-part manufacturing variations. (3) The sensed data is polluted by random errors. Using a Bayesian approach and the high-level PyMC Python library, we seek to calculate the optimum set of calibration parameters for a given calibration dataset.

(a) Electric current sensor application. (b) Functional view (Image by author).

Mathematical formulation

We assume that the magnetic sensor consists of a magnetic and temperature transducer, which can be modeled as polynomial transfer functions with coefficients varying from one sensor to the next according to normal probability distributions. The raw sensed data (called "indications" in the VIM), represented by the vector u, consists of the linearly sensed magnetic field S(T)⋅B and a dimensionless sensed quantity _VT indicative of the temperature. We use the specific form S(T)⋅B to highlight that the sensitivity S is influenced by the temperature. The parasitic influence of the temperature is illustrated by the plot in panel (a) below. Ideally, the sensitivity would be independent of the temperature and of _VT. However, there is a polynomial dependency. The case study being inspired from a real magnetic Hall sensor, the sensitivity can vary by ±40% from its value at room temperature in the temperature range [−40°C, +165°C]. In addition, due to part-to-part variations, there is a set of curves S vs _VT instead of just one curve. Mathematically, we want to identify the measurement function returning an accurate estimate of the true value of the magnetic field – like shown in panel (b). Conceptually, this is equivalent to inverting the sensor response model. This boils down to estimating the temperature-dependent sensitivity and using this estimate Ŝ to recover the field from the sensed field by a division.

(a) Raw responses of N=30 sensors. (b) Block diagram (image by author).

For our simplified case, we consider that S(T) and VT(T) are second-order polynomials. The polynomial coefficients are assumed to vary around their nominal values following normal distributions. An extra random noise term is added to both sensed signals. Physically, S is the sensitivity relative to its value at room temperature, and VT is a normalized voltage from a temperature sensor. This is representative of a large class of sensors, where the primary transducer is linear but temperature-dependent, and a supplementary temperature sensor is used to correct the parasitic dependence. And both transducers are noisy. We assume that a third-order polynomial in VT is a suitable candidate for the sensitivity estimate Ŝ:

_Ŝ = w_0 + w1⋅Δ_T + w2⋅Δ_T² + w3⋅Δ_T_³, where ΔT = T−25°C.

The weight vector w aggregates the 4 coefficients of the polynomial. These are the calibration parameters to be adjusted as a result of calibration.

Python formulation

We use the code convention introduced in (Close, 2021). We define a data dictionary dd to store the nominal value of the parameters. In addition, we define the probability density functions capturing the variability of the parameters. The sensor response is modeled as a transfer function, like the convention introduced in (Close 2021).

# Data dictionary: nominal parameters of the sensor response model
def load_dd():
  return Box({
    'S'  : {
      'TC'   :   [1, -5.26e-3, 15.34e-6],
      'noise':   0.12/100,  },
    'VT': {
      'TC':      [0, 1.16e-3, 2.78e-6],
      'noise':   0.1/100,}  
  })

# Probability density functions for the parameter variations
pdfs = {
    'S.TC': (norm(0,1.132e-2), norm(0,1.23e-4), norm(0,5.40e-7)),
    'VT.TC'   : (norm(0,7.66e-6), norm(0,4.38e-7))
}

# Sensor response model
def sensor_response_model(T, sensor_id=0, dd={}, delta={}):
    S=np.poly1d(np.flip((dd.S.TC+delta.loc[sensor_id]['S.TC'])))(T-25)
    S+=np.random.normal(0, dd.S.noise, size=len(T))

    VT = 10*np.poly1d(np.flip(dd.VT.TC+np.r_[0,delta.loc[sensor_id]['VT.TC'].values]))(T-25)
    VT+= np.random.normal(0, dd.VT.noise, size=len(T))

    return {'S': S, 'VT': VT}

We can then simulate a first set of N=30 sensors by drawing from the specified probability distributions, and generate synthetic data in df1 to test various calibration approaches via a build function build_sensors(ids=[..]).

df1,_=build_sensors_(ids=np.arange(30))

Synthetic data generated by the probabilistic sensor response model (Image by author).

Classic approaches

We first consider two well-known classic calibration approaches that don’t rely on the Bayesian framework.

Full regression

The first calibration approach is a brute-force approach. A comprehensive dataset is collected for each sensor, with more calibration points than unknowns. The calibration parameters w for each sensor (4 unknowns) are determined by a regression fit. Naturally, this approach provides the best results in terms of residual error. However, it is very costly in practice, as it requires a full characterization of each individual sensor. The following function performs the full calibration and store the weights as a list in the dataframe for convenience.

def full_calibration(df):
    W = df.groupby("id").apply(
        lambda g: ols("S ~ 1 + VT + I(VT**2)+ I(VT**3)", data=g).fit().params
    )
    W.columns = [f"w_{k}" for k, col in enumerate(W.columns)]
    df["w"] = df.apply(lambda X: W.loc[X.name[0]].values, axis=1)

df1, W=full_calibration(df1)

Blind calibration

A blind calibration represents the other extreme. In this approach, a first reference set of sensors is fully calibrated as above. The following sensors are not individually calibrated. Instead, the average calibration parameters w0 of the reference set is reused blindly.

w0 = W.mean().values

df2,_=build_sensors_(ids=[0])
def blind_calibration(df):
    return df.assign(w=[w0]*len(df))
df2 = blind_calibration(df2)

The following plots illustrate the residual sensitivity error Ŝ−S for the two above methods. Recall that the error before calibration reaches up to 40%. The green curves are the sensitivity error for the N=30 sensors from the reference set. Apart from a residual fourth-order error (unavoidable to the limited order of the sensitivity estimator), the fit is satisfactory (<2%). The red curve is the residual sensitivity error for a __ blindly calibrated sensor. Due to part-to-part variation, the average calibration parameters provide only an approximate fit, and the residual error is not satisfactory.

Residual sensitivity error for N=30 fully calibrated and N=1 blindly calibrated sensor. (Image by author).

Bayesian calibration

The Bayesian calibration is an interesting trade-off between the previous two extremes. A reference set of sensors is fully calibrated as above. The calibration parameters for this reference set constitute some prior knowledge. The average w0 and the covariance matrix Σ encode some relevant knowledge about the sensor response. The weights are not independent. Some combinations are more probable than others. Such knowledge should be used in a smart calibration. The coverage matrix can be calculated and plotted (for just two weights) using the Pandas and Seaborn libraries.

Cov0 = W.cov(ddof=len(W) - len(w0))
sns.jointplot(data=W.apply(pd.Series),x='w_1', y='w_2', kind='kde', fill=True, height=4)

Bivariate plot of the two weights w_1 and _w2 (Image by author).

The Bayesian framework enables us to capture this prior knowledge, and uses it in the calibration of the subsequent samples. We restart from the same blindly calibrated samples above. We simulate the case where just two calibration data points at 0°C and 100°C are collected for each new sensor, enriching our knowledge with new evidence. In practical industrial scenarios where hardware calibration is expensive, this calibration is cost-effective. A reduced reference set is fully characterized once for all to gather prior knowledge. The subsequent samples, possibly the rest of the volume production of this batch, are only characterized at a few points. In Bayesian terminology, this is called "inference", and the PyMC library provides high-level functions to perform inference. It is a computation-intensive process because the posterior distribution, which is obtained by applying the Bayes’ theorem combining the prior knowledge and the new evidence, can only be sampled. There is no analytical approximation of the obtained probability density function.

The calibration results are compared below, with the blue dot indicating the two calibration points used by the Bayesian approach. With just two extra points and by exploiting the prior knowledge acquired on the reference set, the Bayesian calibrated sensor exhibits an error hardly degraded compared to the expensive brute-force approach.

Comparison of the three calibration approaches.

Credible interval

In the Bayesian approach, all variables are characterized by uncertainty in. The parameters of the sensor model, the calibration parameters, but also the posterior predictions. We can then construct a ±1σ credible interval covering 68% of the synthetic observations generated by the model for the estimated sensitivity Ŝ. This plot captures the essence of the calibration and adjustment: the uncertainty has been reduced around the two calibration points at T=0°C and T=100°C. The residual uncertainty is due to noisy measurements.

Credible interval (Image by author).

Conclusion

This article presented a Python workflow for simulating Bayesian sensor calibration, and compared it against well-known classic approaches. The mathematical and Python formulations are valid for a wide class of sensors, enabling sensor design to explore various approaches. The workflow can be summarized as follows:

Model the sensor response in terms of transfer function and the parameters (nominal values and statistical variations). Generate corresponding synthetic raw sensed data for a batch of sensors.
Define the form of the measurement function from the raw sensed variables. Typically, this is a polynomial, and the calibration should determine the optimum coefficients w of this polynomial for each sensor.
Acquire some prior knowledge by fully characterizing a representative subset of sensors. Encode this knowledge in the form of average calibration parameters and covariance matrix.
Acquire limited new evidence in the form of a handful of calibration points specific to each sensor.
Perform Bayesian Inference merging this new sparse evidence with the prior knowledge to find the most likely calibration parameters for this new sensor numerically using PyMC.

In the frequent cases where sensor Calibration is a significant factor of the production cost, Bayesian calibration exhibits substantial business benefits. Consider a batch of 1’000 sensors. One can obtain representative prior knowledge with a full characterization from, say, only 30 sensors. And then for the other 970 sensors, use a few calibration points only. In a classical approach, these extra calibration points lead to an undetermined system of equations. In the Bayesian framework, the prior knowledge fills the gap.

References

(Berger 2022) M. Berger, C. Schott, and O. Paul, "Bayesian sensor calibration," IEEE Sens. J., 2022. https://doi.org/10.1109/JSEN.2022.3199485.

(Close, 2021): G. Close, "Signal chain analysis in python: a case study for hardware engineers," Towards Data Science, 22-Feb-2021. Available: https://towardsdatascience.com/signal-chain-analysis-in-python-84513fcf7db2.

(Copley 2023) C. Copley, "Navigating Bayesian Analysis with PyMC," Towards Data Science, Jun-2023. https://charlescopley.medium.com/navigating-bayesian-analysis-with-pymc-87683c91f3e4

(Davidson-Pilon 2015) C. Davidson-Pilon, "Bayesian Methods for Hackers: Probabilistic Programming and Bayesian Inference," Addison-Wesley Professional, 2015. https://www.amazon.com/Bayesian-Methods-Hackers-Probabilistic-Addison-Wesley/dp/0133902838

(Martin 2021) O. A. Martin, R. Kumar, and J. Lao, "Bayesian Modeling and Computation in Python," __ Chapman and Hall/CRC, 2021. https://www.amazon.com/Bayesian-Modeling-Computation-Chapman-Statistical/dp/036789436X

(Watts 2020) A. Watts, "PyMC3 and Bayesian inference for Parameter Uncertainty Quantification Towards Non-Linear Models: Part 2," Towards Data Science, Jun-2022. https://towardsdatascience.com/pymc3-and-bayesian-inference-for-parameter-uncertainty-quantification-towards-non-linear-models-a03c3303e6fa

The post Bayesian Sensor Calibration appeared first on Towards Data Science.

]]>

Bayesian Data Science: The What, Why, and How

Thu, 18 Apr 2024 02:30:33 +0000

Choosing between frequentist and Bayesian approaches is the great debate of the last century, with a recent surge in Bayesian adoption in the sciences.

Number of articles referring Bayesian statistics in sciencedirect.com (April 2024) – Graph by the author

What’s the difference?

The philosophical difference is actually quite subtle, where some propose that the great bayesian critic, Fisher, was himself a bayesian in some regard. While there are countless articles that delve into formulaic differences, what are the practical benefits? What does Bayesian analysis offer to the lay data scientist that the vast plethora of highly-adopted frequentist methods do not already? This article aims to give a practical introduction to the motivation, formulation, and application of Bayesian methods. Let’s dive in.

Prior Beliefs

While frequentists deal with describing the exact distributions of any data, the bayesian viewpoint is more subjective. Subjectivity and statistics?! Yes, it’s actually compatible.

Let’s start with something simple, like a coin flip. Suppose you flip a coin 10 times, and get heads 7 times. What is the probability of heads?

P(heads) = 7/10 (0.7)?

Obviously, here we are riddled with low sample size. In a Bayesian POV however, we are allowed to encode our beliefs directly, asserting that if the coin is fair, the chance of heads or tails must be equal i.e. 1/2. While in this example the choice seems pretty obvious, the debate is more nuanced when we get to more complex, less obvious phenomenon.

Yet, this simple example is a powerful starting point, highlighting both the greatest benefit and shortcoming of Bayesian analysis:

Benefit: Dealing with a lack of data. Suppose you are modeling spread of an infection in a country where data collection is scarce. Will you use the low amount of data to derive all your insights? Or would you want to factor-in commonly seen patterns from similar countries into your model i.e. informed prior beliefs. Although the choice is clear, it leads directly to the shortcoming.

Shortcoming: the prior belief is hard to formulate. For example, if the coin is not actually fair, it would be wrong to assume that P (heads) = 0.5, and there is almost no way to find true P (heads) without a long run experiment. In this case, assuming P (heads) = 0.5 would actually be detrimental to finding the truth. Yet every statistical model (frequentist or Bayesian) must make assumptions at some level, and the ‘statistical inferences’ in the human mind are actually a lot like Bayesian Inference i.e. constructing prior belief systems that factor into our decisions in every new situation. Additionally, formulating wrong prior beliefs is often not a death sentence from a modeling perspective either, if we can learn from enough data (more on this in later articles).

Bayes’ Rule

So what does all this look like mathematically? Bayes’ rule lays the groundwork. Let’s suppose we have a parameter θ that defines some model which could describe our data (eg. θ could represent the mean, variance, slope w.r.t covariate, etc.). Bayes’ rule states that

Thomas Bayes formulated the Bayes’ theorem in 1700’s, published posthumously. [Image via Wikimedia commons licensed under Creative Commons Attribution-Share Alike 4.0 International, unadapted]

*P (θ = t|data) ∝ P (data|θ = t) P (θ=t)**

In more simple words,

P (θ = t|data) represents the conditional probability that θ is equal to t, given our data (a.k.a the posterior).
Conversely, P (data|θ) represents the probability of observing our data, if θ = t (a.k.a the ‘likelihood‘).
Finally, P (θ=t) is simply the probability that θ takes the value t (the infamous ‘prior‘).

So what’s this mysterious t? It can take many possible values, depending on what θ means. In fact, you want to try a lot of values, and check the likelihood of your data for each. This is a key step, and you really really hope that you checked the best possible values for θ i.e. those which cover the maximum likelihood area of seeing your data (global minima, for those who care).

And that’s the crux of everything Bayesian inference does!

Form a prior belief for possible values of θ,
Scale it with the likelihood at each θ value, given the observed data, and
Return the computed result i.e. the posterior, which tells you the probability of each tested θ value.

Graphically, this looks something like:

Prior (left) scaled with the likelihood (middle) forms the posterior (right) (figures adapted from Andrew Gelmans Book). Here, θ encodes the east-west location coordinate of a plane. The prior belief is that the plane is more towards the east than west. The data challenges the prior and the posterior thus lies somehwere in the middle. [image using data generated by author]

Which highlights the next big advantages of Bayesian stats-

We have an idea of the entire shape of θ’s distribution (eg, how wide is the peak, how heavy are the tails, etc.) which can enable more robust inferences. Why? Simply because we can not only better understand but also quantify the uncertainty (as compared to a traditional point estimate with standard deviation).
Since the process is iterative, we can constantly update our beliefs (estimates) as more data flows into our model, making it much easier to build fully online models.

Easy enough! But not quite…

This process involves a lot of computations, where you have to calculate the likelihood for each possible value of θ. Okay, maybe this is easy if suppose θ lies in a small range like [0,1]. We can just use the brute-force grid method, testing values at discrete intervals (10, 0.1 intervals or 100, 0.01 intervals, or more… you get the idea) to map the entire space with the desired resolution.

But what if the space is huge, and god forbid additional parameters are involved, like in any real-life modeling scenario?

Now we have to test not only the possible parameter values but also all their possible combinations i.e. the solution space expands exponentially, rendering a grid search computationally infeasible. Luckily, physicists have worked on the problem of efficient sampling, and advanced algorithms exist today (eg. Metropolis-Hastings MCMC, Variational Inference) that are able to quickly explore high dimensional spaces of parameters and find convex points. You don’t have to code these complex algorithms yourself either, probabilistic computing languages like PyMC or STAN make the process highly streamlined and intuitive.

STAN

STAN is my favorite as it allows interfacing with more common Data Science languages like Python, R, Julia, MATLAB etc. aiding adoption. STAN relies on state-of-the-art Hamiltonian Monte Carlo sampling techniques that virtually guarantee reasonably-timed convergence for well specified models. In my next article, I will cover how to get started with STAN for simple as well as not-no-simple regression models, with a full python code walkthrough. I will also cover the full Bayesian modeling workflow, which involves model specification, fitting, visualization, comparison, and interpretation.

Follow & stay tuned!

The post Bayesian Data Science: The What, Why, and How appeared first on Towards Data Science.

]]>

Using Bayesian Modeling to Predict The Champions League

Tue, 20 Feb 2024 15:11:32 +0000

Sports Analytics

Photo by Anders Krøgh Jørgensen on Unsplash

Oh, the Champions League. Possibly the competition that attracts the most fans regardless of the team they support. It’s the best against the best. The show is almost guaranteed… And the outcome is almost impossible to predict.

But that shouldn’t stop us from trying.

I was the other day going through old college assessments and found one that inspired me to write this post, where we’ll use Bayesian Inference to create a model that tries to predict the next Champions League matches: the first leg in the Round of 16 (well, it can be used in any match from any round, to be honest).

The aim is to build a model through Bayesian Statistics applied to a real-case scenario that I find to be interesting and entertaining.

Whether you know about Bayesian Inference or not, this post is for you. Even if you already knew about everything I’m about to share, the final predictions will at least serve you to either congratulate or laugh at me after the first leg is played.

Here’s what we’ll go through today:

Bayesian Inference & Modeling
The Data
Building the Model
Validating the Model
Predictions
Conclusion, Improvements, and Future Steps

The full code is on my GitHub page which you’ll find in the resources section[1].

Bayesian Inference & Modeling

If you’re new to this, you’ll need a proper intro with the basic math fundamentals, which is what’s coming next. If you want to go straight to the point, feel free to skip this section.

As per Wikipedia, "Bayesian Inference is a method of statistical inference in which Bayes’ theorem is used to update the probability for a hypothesis as more evidence or information becomes available."[2]

That’s the formal definition, but there’s a lot of information here. I think it’s better if we dissect this sentence into different parts:

"Bayesian Inference is a method of statistical inference" – Statistical inference is the process of using data analysis and statistics to infer properties of a population, by hypothesis testing and deriving estimates.
"in which Bayes’ theorem is used to update the probability for a hypothesis" – Bayes’ theorem[3] describes the probability of an event, based on prior knowledge of conditions that might be related to the event.
"as more evidence or information becomes available" – Different from frequentist statistics, the data (here, evidence or information) is not the only factor we take into account, it is only used to update our prior beliefs.

To summarize: Bayesian Inference is the process of inferring properties from a certain population based on Bayes’ theorem by using the prior knowledge we possess (as a distribution) and the data to get the final distribution.

Lots of concepts, and many more still missing, but I won’t go in-depth today. However, we can’t ignore the Bayes’ theorem formula, as it’s the core of the whole process:

Bayes’ theorem – Image generated by the author

The formula itself is really simple and chances are you’ve already seen it before. If you haven’t, just take into account that P(A|B) is read as the probability of the event "A" happening given the event "B".

For example, we could read P(rain|clouds) as the probability of rain given that there are clouds in the sky.

Keeping up with the example, where A is "rain" and B is "clouds", the formula would be:

Bayes’ theorem using rain/cloud examples – Image by the author

Using this formula we could compute the probability of experiencing rain given that the sky is cloudy. For that, we’d have to know the probability of having a cloudy sky when it’s raining (P(clouds|rain)), the probability of rain at that given moment (P(rain)), and the probability of seeing clouds in the sky (P(clouds)).

But this equation can also be expressed in terms of posterior, likelihood, prior and evidence:

Posterior: P(rain | clouds) – is what we’re trying to infer.
Likelihood: P(clouds | rain) – is the probability of observing the data that has been observed (clouds in the sky) assuming that the data came from a specific scenario (rain).
Prior: P(rain) – The previous information we have about the event we’re interested in.
Evidence: P(clouds) – It’s the data itself.

Bayes’ theorem in other terms – Image by the author

The evidence itself is only used to normalize the density and make it a valid probability, but it’s just a constant. For that reason, it’s sometimes omitted because most of the time we’ll be interested in the posterior distribution only, not the density probability. And the final formula we’ll see is

Bayes’ theorem in proportional terms – image by the author

meaning that the posterior distribution is proportional to the likelihood times the prior.

Modeling, on its own, refers to the process of finding the statistical model that governs the behavior of any real process.

The Data

We’ve managed to survive the math, now it’s time to dig into the practical case and data. We’re trying to predict the winner of a given match so we’ll define θ as our hypothesis. We know it’s a number and it can either be 1 (the team won) or 0 (the team didn’t win).

So, we can assume that θ follows a Bernoulli distribution.[4] We don’t know the "p" parameter, and that’s what we’ll try to infer using Bayes’ theorem.

Note that θ isn’t linear. It’s, in fact, from the exponential family and, when this happens, we say we are dealing with a generalized linear model (GLM). In our case, where our variable follows a Bernoulli distribution, we’ll present the logistic regression.

Logistic regression is a model in which a certain variable can only be either 0 or 1, meaning success or failure where "p" is the success probability. So

Our variable’s distribution – Image by the author

where i = 1,…,n are all the matches we have times 2.

As the distribution isn’t linear, we need a link function that provides the relationship between the linear predictor and the mean of the distribution function. It depends on the distribution we’re dealing with but the link function for our case is the logit:

Link function for our Bernoulli variable – Image by the author

We can now provide a linear shape to this logit function:

Modelized p – Image by the author

And these "x" are the data we’re going to use. So Bernoulli’s variable has now become a sort of linear one, which is amazing.

Our model today will be based on the premise of simplicity. The data we’ll be using will only consist of three parameters: the number of goals the team scores, whether they play at home or not, and the average player ratings.

Keeping up with the simplicity, we’ll only be using the team’s data to predict its chances of winning, without caring about the rival. You’ll see why in the next section.

The catch is that we’re only going to use data from the current Champions League campaign (2023–24) – from the group phase. Obtaining open-source football data can be tricky nowadays so we’re going to use only a very limited set of matches to build our model, and we’ll compensate for that with Markov Chain Monte Carlo (MCMC) approximations.

Talking about what MCMC is, is outside of this post’s scope. But I’ll put a link in the resources section to an article that explains it incredibly well, named "A Gentle Introduction to Markov Chain Monte Carlo for Probability"[5].

Building The Model

Enough intro and theoretical comments. Let’s go straight to the code.

I had two options: Python or R. I chose the latter because I especially like the rjags library for the task we’re about to go through. However, you could easily do the same with Python either from scratch or using any other library you’d like.

JAGS[6] comes from Just Another Gibbs Sampler, and it’s a program for simulation from Bayesian hierarchical models using MCMC. Perfect for our scenario.

So what we’ll do first is build the model:

modelString = "
model{
   # Likelihood
   for(i in 1:n){
       y[i] ~ dbern(pi[i])
       logit(pi[i]) <- beta[1] + beta[2]*goals[i] + beta[3]*is_home[i] + 
                       beta[4]*avg_rating[i]
   }

   # Prior distributions
   for (j in 1:4) {
       beta[j] ~ dnorm(0.0, 0.0001)
   }
}"
writeLines( modelString , con="cl_model.txt" )

In the model, which we store as a string and then into a TXT file, what we do is define the likelihood and the prior distributions:

Likelihood: in a for loop that goes through all the data rows we have (192, which comes from 96 matches times 2 teams per match) and assigns the Bernoulli distribution to the predictor variable for that sample. We then compute the logit function which, as you’ll recall, is a linear equation using the three stats we’re using to predict.
Prior distributions: we don’t know anything about the linear parameters so we’re just making them follow a normal distribution with a standard deviation of 100 and mean 0.

What we do next is initialize the beta values using a normal distribution:

init_values <- function(){
  list(beta = rnorm(4, 0, 3))
}
init_values()

Initialized beta values – Image by the author

Time to import the data. If you’ve read my articles before, you know I’m a DuckDB fan[7] and there’s where I have all the data stored. As the SQL query itself isn’t relevant – only the data we extract – I’ll omit this step and show the data straightaway:

First 10 rows of data – Screenshot by the author

We’re seeing one row per team and game, with each team’s stats and whether they won or not. For simplicity, the avg_rating feature is rounded.

To create the JAGS model, we’ll need to provide the data. Not just the covariables used to predict but also some variables we’re using in the model like the outcome column y or the static number of observations n. To do so, we just create a list of values like this:

jags_data <- list(y = data$win,
                  goals = data$goals,
                  is_home = data$is_home,
                  avg_rating = data$avg_rating,
                  n = nrow(data))

We have our model defined, prepared the function to initialize values (and executed it) and we already have the data in the proper JAGS format. It’s now time to run the model and find the proper beta values.

JAGS uses different MCMC samplers to approximate the posterior distribution and automatically chooses the proper one, but it needs a prior "adaptation" period. Here’s how we do it:

jagsModel <- jags.model(file = "cl_model.txt",
                        data = jags_data,
                        inits = init_values,
                        n.chains = 3,
                        n.adapt = 1000)

This jags.model function needs the model file, the data we just prepared, and the function that initializes values. Then, we set the number of Markov chains to 3 and the number of iterations for adaptation to 1000.

It then shows some info relevant to the model and its nodes, but we don’t have to do anything with it.

To ensure convergence, we’ll apply what’s called "burn-in", which is only a way to say that we discard a certain number of iterations that we don’t know if fall within the posterior region or not. It’s as easy as running the next function:

update(jagsModel, n.iter=10000)

Lastly, we produce the inference by telling JAGS to generate MCMC samples (these samples will then be used to generate the posterior distribution). To do so, we use coda.samples() as follows:

samples <- coda.samples(jagsModel,
                        variable.names = c("beta"),
                        n.iter = 10000,
                        thin = 10)

We’re basically using the model as the first parameter, we then specify the parameters the values of which want to record (beta), and set the number of steps for each chain to 10.000.

The last parameter, thin, is crucial here. As we’re working with Markov Chains, we expect to have autocorrelated samples. For that reason, it’s common to use thinning to only keep 1 every X iterations for our final samples. In this case, I’m using X=10.

Now run the code and let it produce the samples.

Validating the Model

We’re almost there!

We’ve already got our samples. Let’s now check if everything’s fine. To do so we’ll examine the chains. Do you remember that we decided to use 3 and not 1? It’s only for pure comparison purposes.

We’re going to use here an R module created by Kruschke, J. K. (2015) called DBDA2E-utilities.R[8]. From it, we’ll first take advantage of the diagMCMC() function by passing the samples as the first parameter and then the parameter name as the second. We’ll try, for example, with the second one (beta[2]):

diagMCMC(codaObject = samples, parName = "beta[2]")

Chain examinations – Screenshot by the author

This is what we see in each of these graphs:

Top-left: Trace plot to see if the chains converge and none is left orphan.
Bottom-left: We see how the Gelman-Rubin statistic evolves with each iteration. If you’re not used to it, just take into account that we want it as close to 1 as possible.
Top-right: Autocorrelation is shown here and, as we want as little of it as possible, we want to see the lines going down as much as possible.
Bottom-right: Density plot to see if the three chains are well superimposed or not.

We’re now confident that our chains look fine. Let’s see the actual posterior distributions, shall we?

From the same file module we just used, we’ll now take advantage of the plotPost() function:

plotPost( samples[,"beta[2]"])

plotPost() output – Screenshot by the author

For now, everything looks reasonably fine. We’re ready to start predicting.

Predictions

Suppose we want to predict Napoli vs Barcelona. To compute the distributions of each team’s winning probabilities, we would need to know in advance the match’s predicted goals and the predicted average player ratings. There’s no way to know these before the match gets played, but we could use two strategies:

Build different models to predict these variables (neural networks, for example).
Use the last-X matches average from these stats.

Today, we’ll go with the second one, and we’ll use the 5-game average to predict the winning chance distributions. This adds value to our predictions because it takes into account the team’s current form.

This is the DF we’re now going to use, which now contains data from all competitions, not just the Champions League:

Random sample of averaged data – Screenshot by the author

As our model predicts team by team, we’ll have to perform two predictions and overlap both histograms into one single plot. Let’s start with a real example: RB Leipzig vs Real Madrid.

I’m finishing this post as of February 15th and this game was just played two days ago, so using it as an example is good to see if the predictions are accurate or not. To be clear: data from yesterday’s matches isn’t used for these predictions.

We need to filter the data from this DF and get each team’s:

# Home team data
home_team = 'RBL'
home_pred_home <- 1
goals_pred_home <- avg_data[avg_data$team == home_team, 'goals_avg']
avg_rating_pred_home <- avg_data[avg_data$team == home_team, 'rating_avg']

# Away team data
away_team = 'Real Madrid'
home_pred_away <- 0 
goals_pred_away <- avg_data[avg_data$team == away_team, 'goals_avg']
avg_rating_pred_away <- avg_data[avg_data$team == away_team, 'rating_avg']

Now that we have valid values for each variable, we need to run the predictions. Recall the logit function allowed us to convert this problem into a linear one, and we found the coefficients through JAGS. Now it’s time to use these and the averages to get the linear predictor values:

predictor_pred_home <- samples[[1]][,1] +
  samples[[1]][,2]*goals_pred_home +
  samples[[1]][,3]*home_pred_home  +
  samples[[1]][,4]*avg_rating_pred_home

predictor_pred_away <- samples[[1]][,1] +
  samples[[1]][,2]*goals_pred_away +
  samples[[1]][,3]*home_pred_away  +
  samples[[1]][,4]*avg_rating_pred_away

And finally, we compute the inverse of the logit function to find the final pi estimate:

pi_pred_home <- as.numeric(exp(predictor_pred_home)/(1+exp(predictor_pred_home)))
pi_pred_away <- as.numeric(exp(predictor_pred_away)/(1+exp(predictor_pred_away)))

To grasp the full value of these predictions, we’ll now plot them into a histogram:

preds <- data.frame(pi_pred_home, pi_pred_away)
names(preds) <- c(home_team, away_team)

ggplot(preds) +
  geom_histogram(aes(x = pi_pred_home, y = ..density.., 
                     color = home_team, fill=home_team),
                 bins = 30, alpha=0.5) +
  geom_histogram(aes(x = pi_pred_away, y = ..density..,     
                     color = away_team, fill=away_team),
                 bins = 30, alpha=0.5) +
  theme_light() +
  xlim(c(0,1)) +
  xlab(expression(pi)) +
  theme(axis.text = element_text(size=15),
        axis.title = element_text(size=16))

And the result:

Leipzig vs Madrid predictions – Screenshot by the author

How do we interpret this? The most direct way to do it is by using the median of both teams’ predictions and comparing them both relatively:

For RB Leipzig, the median is at 37.07%.
For Real Madrid, the median is at 58.84%.

From these numbers, we could conclude that both teams have been doing quite well recently but Madrid seems to be better though not by much. So the model is inclined towards Madrid but without much certainty.

Taking into account that the match ended 0–1 yesterday, we can be extremely proud of our model.

Copenhagen vs City (1–3), Lazio vs Bayern (0–1), and PSG vs Real Sociedad (2–0) have already been played, so we can use these to assess our model’s accuracy as well:

Already-played games predictions – Screenshot by the author

This is impressive. The model not only predicted these games with 100% accuracy, but it was able to predict Lazio’s higher winning chances vs Bayern even if it was the underdog and the odds were clearly against them.

Almost everyone would have betted for Bayern as a favorite, even myself, but the model didn’t and Lazio ended up winning 1–0.

So far, 4/4 correctly predicted. So let’s predict the 4 remaining matches being played next week!

Next week’s predictions – Screenshot by the author

These look reasonable again:

Inter vs Atlético de Madrid: Inter is the clear favorite (personally not so sure about it).
PSV vs Dortmund: No clear favorite but Dortmund seems to be in better form and its winning chances seem higher… But a draw wouldn’t be unexpected.
Porto vs Arsenal: Arsenal is, without a doubt, the model’s favorite. We see Arsenal’s distribution being thin and tall, meaning the confidence in that prediction is higher, while Porto’s is quite wide, meaning the model is not so sure about what it’s predicting and the true value lies within 25% and a 70% chance of winning for the Portuguese team.
Napoli vs Barcelona: Similar to the case above, the model doesn’t see Napoli winning the game, while with Barça it can’t decide with confidence its chances of winning, being within 50% and 80% (being myself a Barça fan, that’s too high in my opinion)

Now we just have to wait, enjoy the matches, and come back to see how many predictions our Bayesian model got right!

Conclusions, Future Steps, and Improvements

What we built here was rather simple… Yet we can say the predictions were reasonable. More than that, probably, having seen that it was accurate on all played matches already, being able to predict one underdog’s win.

However, this model is too simple to expect great predictions from it. For example, the Napoli vs Barcelona prediction favors Barça because the model doesn’t take into account goals conceded and that’s the team’s major flaw this season.

That’s why I wanted to dedicate a final section to suggest just a few potential improvements to turn this model into a high-class one:

Expand the feature set: a proper feature analysis should be performed to assess which features we’ve left behind would be great parameters for our model. For example: win%, possession, shots, tackles, xG…
Include rival data: the current model predicts the team’s winning distribution based on the team’s stats only, without taking into account the rivals they face. If we added the other team’s info, the model would be more complete.
Use more data: Even if we used MCMC, another solution would have been to use a lot more data from past CL seasons to create the model and make it (potentially) more robust.
Create a machine learning model to predict the team’s features instead of using simple averages.

Many others could be added to this list. Feel free to implement these on your own!

Thanks for reading the post! 

I really hope you enjoyed it and found it insightful. There's a lot more to 
come, especially more AI-based posts I'm preparing.

Follow me and subscribe to my mail list for more 
content like this one, it helps a lot!

@polmarin

Resources

[1] Pol Marín (2024). Bayesian Inference in Champions League. https://github.com/polmarin/bayesian_inference_champions_league

[2] Wikipedia contributors. (2024, January 31). Bayesian inference. In Wikipedia, The Free Encyclopedia. https://en.wikipedia.org/w/index.php?title=Bayesian_inference&oldid=1201216717

[3] Wikipedia contributors. (2024, January 12). Bayes’ theorem. In Wikipedia, The Free Encyclopedia. https://en.wikipedia.org/w/index.php?title=Bayes%27_theorem&oldid=1195202672

[4] Wikipedia contributors. (2024, January 25). Bernoulli distribution. In Wikipedia, The Free Encyclopedia. https://en.wikipedia.org/w/index.php?title=Bernoulli_distribution&oldid=1199096378

[5] Jason Brownlee (2019, September 25). A Gentle Introduction to Markov Chain Monte Carlo for Probability. In Machine Learning Mastery. https://machinelearningmastery.com/markov-chain-monte-carlo-for-probability/

[6] JAGS – Just Another Gibbs Sampler. https://mcmc-jags.sourceforge.io/

[7] Pol Marín (2023, March 16). Forget about SQLite, Use DuckDB Instead – And Thank Me Later. In Medium. https://towardsdatascience.com/forget-about-sqlite-use-duckdb-instead-and-thank-me-later-df76ee9bb777

[8] Kruschke J. K. (2015). GitHub –https://github.com/kyusque/DBDA2E-utilities/

The post Using Bayesian Modeling to Predict The Champions League appeared first on Towards Data Science.

]]>

How Reliable Is a Ratio?

Fri, 08 Dec 2023 05:56:07 +0000

Introduction

One of my references in the Data Science field is Julia Silge. On her Tidy Tuesday videos she always makes a code-along type of video teaching/ showing a given technique, helping other analysts to upskill and incorporate that to their repertoire.

Last Tuesday, the topic was Empirical Bayes (her blog post), which caught my attention.

But, what is that?

Empirical Bayes

Empirical Bayes is a statistical method used when we work with ratios like [success]/[total tries]. When we are working with such variables, many are the times when we face a 1/2 success, which translates to a 50% success percentage, or 3/4 (75%), 0/1 (0%).

Those extreme percentages do not represent the long term reality because there were so little tries that it makes it very hard to tell if there is a trend there, and most times these cases are just ignored or deleted. It takes more tries to tell what the real success rate is, like 30/60, 500/100, or whatever makes sense for a business.

Using Empirical Bayes, though, we are able to use the current data distribution to calculate an estimate for its own data in earlier or later stages, as we will see next in this post.

We use the data distribution to estimate earlier and later stages of each observation’s ratio.

Analysis

Let’s jumps to the analysis. The steps to follow are:

Load the data
Define success and calculate the success ratio
Determine the distribution’s parameters
Calculate Bayes estimates
Calculate the Credible Interval

Let’s move on.

Imports

# Imports
import pandas as pd
import numpy as np
import scipy.stats as scs
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from distfit import distfit

Dataset

The dataset to be used in this post is a log of customers and their quantities of e-commerce orders and brick and mortar orders. It was generated based on real numbers, but all the dataset was reduced, randomized and modified in proportions to become anonymized to be used publicly.

The variables are:

customer: Customer ID
_brickmortar: How many transactions have happened in physical store .
ecomm: How many e-commerce transactions have happened online.

Success Ratio

Suppose our company is willing to increase their online penetration. Therefore, for this analysis, our success is defined as when the customer purchases online. The customer bought online [success], bought brick and mortar [not success].

This way, our metric can be a simple ecommerce transactions over the total _[ecomm]/[ecomm + brickmortar]. It will give us a ratio. Our question now becomes: how different are these two customers: 808 vs. 672.

808 has a ratio of 1/11 (9%) and 672 has a ratio of 1/2 (50%). Which one is better? Where should I invest more time for growth? Well, customer 672 has 50% of penetration, but it’s just 2 transactions total with us. Customer 808 has a longer history with the company, but has just started buying online.

Let’s calculate all the ratios. Here, we are going to discard the customers that never bought online, simply because it is not possible to calculate an estimate out of that.

# Creating success ratio
df2c = (
    df
    .query('ecomm > 0 & brick_mortar > 0')
    .assign(total_trx = df['brick_mortar'] + df['ecomm'],
            ratio = df['ecomm']/(df['brick_mortar'] + df['ecomm'])  )         
)

display(df2c)

And we have the result.

Next, let’s find out the distribution parameters.

Find Out the Distribution Parameters

The next step here is to calculate the parameters of the distribution of our ratios. But before we jump into that, let’s build our intuition on why we even need to perform this step.

Well, the Thomas Bayes theorem says that probability is conditional. So, **** the likelihood of something occurring should be based on a previous outcome having occurred in similar circumstances.

If something happened 10 times, it is more likely to happen an 11th time than something that failed to happen 10 times to happen on the 11th try.

We will use the beta distribution to represent our prior expectations, or what has been happening so far. It is like looking at our data and understanding the pattern to be able to give a more precise estimate for any data point in any stage, considering it would follow that same distribution.

To know which α and β makes the beta distribution to fit our data, we will use the module distfit in Python.

Looking at the distribution, we can just plot a histogram.

# Looking at the distribution
px.histogram(df2c,
             x='ratio', nbins=20,
             template="simple_white", width = 1000,
             title='Histogram of the Ratio Brick & Mortar vs e-Comm')

Now, let’s see which parameters make the beta distribution fit the data.

# Our distribution
X = df2c['ratio']

# Alternatively limit the search for only a few theoretical distributions.
dfit = distfit(method='parametric', todf=True, distr=['beta'])

# Fit model on input data X.
dfit.fit_transform(X)

# Print the bet model results.
dfit.model
--------------
[OUT]
[distfit] >INFO> fit
[distfit] >INFO> transform
[distfit] >INFO> [beta] [0.11 sec] [RSS: 0.823271] [loc=0.011 scale=0.941]
[distfit] >INFO> Compute confidence intervals [parametric]
{'name': 'beta',
 'score': 0.8232713059833795,
 'loc': 0.011363636363636362,
 'scale': 0.9411483584238516,
 >>>'arg': (0.850939343634336, 1.4553354599535102),<<<
 'params': (0.850939343634336,
  1.4553354599535102,
  0.011363636363636362,
  0.9411483584238516),
 'model': <scipy.stats._distn_infrastructure.rv_continuous_frozen at 0x7838c17ad570>,
 'bootstrap_score': 0,
 'bootstrap_pass': None,
 'color': '#e41a1c',
 'CII_min_alpha': 0.030238213140192628,
 'CII_max_alpha': 0.8158034848017729}

I marked this line >>>arg: (0.850939343634336, 1.4553354599535102),<<< , that are the α and β we need.

In case we want to see the fit, use this code, where we are just creating a figure with 3 subplots and adding the Probability Density Function, the Cumulative Density Function and the QQ Plot.

import matplotlib.pyplot as plt
# Create subplot
fig, ax = plt.subplots(1,3, figsize=(18, 7))
# Plot PDF with histogram
dfit.plot(chart='PDF', ax=ax[0])
# Plot the CDF
dfit.plot(chart='CDF', ax=ax[1]);
dfit.qqplot(X, ax=ax[2])

Ok. Time to calculate our estimates.

Bayes Estimates

To calculate the estimates, we need to set the alpha (a) and beta(b) as just discovered.

# Alpha and Beta values
a,b = 0.850939343634336, 1.4553354599535102

With the estimated parameters of the Beta Distribution, let’s model our ratio of e-Commerce transactions with Empirical Bayes. We’ll start with our overall prior, and update based on the individual evidence. The calculation is simply adding a to the number of successes (e-Comm orders) , and a+b to the total number of orders.

# Calculating Bayes Estimates
df2c = (df2c
        .assign(emp_bayes = (df2c['ecomm'] + a)/(df2c['total_trx'] + a + b) )
        )

Nice! Now we can already plot the simple ratio against the estimated ratio and see how the model is working.

# Scatterplot
fig = sns.scatterplot(data = df2c,
                 x= 'ratio', y= 'emp_bayes',
                 s=50)

# Slope line
plt.plot([0,1], [0,1], linestyle='--', color='red');

Which results in the graphic.

Here, the more the dots get close to the red line, the more reliable that ratio is. It means that the simple ratio [success]/[total] is very close to the Bayes estimate, which already takes in account the prior probability.

Look at some numbers in the table.

Looking at the table, we notice that the higher the number of e-commerce transactions and total transactions, the closer the Bayes estimate is from the real ratio. This means that those 1/2 cases will be really off the mark, therefore, much less reliable as a trend.

We can also calculate the confidence interval for the estimates. That’s what we will do next.

Confidence Interval

The confidence interval is an extra step to make our analysis even more complete. We can give the decision maker a range of trust.

First, let’s calculate the alpha 1 and beta 1 for each observation, which are the posterior shape parameters for the beta distribution, after the prior has been updated.

# Calculate a1 and b1
df2 = (df2c
       .assign(a1 = df2c.ecomm + a,
               b1 = df2c.total_trx - df2c.ecomm + b,)
)

Then, the next task can be performed with scipy.stats.beta module, using the method interval. The output for a 95% confidence interval is a tuple with two numbers, where the index [0] is the low boundary and the [1] is the high.

# Calculating the CI and add to the dataset
df2 = (df2
          .assign( low = scs.beta.interval(.95, df2.a1, df2.b1)[0],
                   high = scs.beta.interval(.95, df2.a1, df2.b1)[1] )
          .sort_values('total_trx', ascending=False)
           )

And, finally, plotting the interval by observation. We create a figure, then a scatterplot for the middle point and two bar graphics, one for the low range, which is white (so it disappears on white background) and the other one for the high range.

# customer ID to string for better visualization
df2['customer'] = df2.customer.astype('string')

# Number of customers to plot
n_cust = 100

# Plot
plt.figure(figsize=(25,7))
plt.scatter(df2.customer[:n_cust], df2.emp_bayes[:n_cust], color='black', s=100 )
plt.bar(df2.customer[:n_cust], df2.low[:n_cust], label='low', color='white', zorder=1 )
plt.bar(df2.customer[:n_cust], df2.high[:n_cust], label='high', zorder=0 )
plt.legend();

Here is the result.

This graphic is ordered by decreasing quantity of e-comm orders. From more (left) to less (right), we notice that the size of the confidence interval only gets bigger, showing that when we have too little number of e-comm transactions, it is really difficult to predict a reliable estimate.

Coming back to one of our questions: should I invest more time in growing e-comm for customer 672 or 808, here are the intervals.

The customer 672 has a much larger CI, so it is less reliable. We should spend more time developing customer 808, which aligns with the history of relationship with the brand.

Before You Go

Well, this tool shows itself as powerful. We see that it brought good results for our comparison case here. Many times, we will just discard customers with little transactions. But many other times, that may not be possible. Furthermore, we may want to in fact analyze the tiny clients. Or yet, it will help us comparing two clients that are ordering a lot too, looking at their potential, based on the CI.

Empirical Bayes is a good idea for larger data, it is an extra variable for other models, maybe.

Well, I bet you are probably already thinking about the utility of this in your job. So, we end here.

If you liked this content, follow my blog, subscribe to my page and get the posts once they’re published.

Gustavo Santos – Medium

Also, find me on LinkedIn.

Here is the complete code, for reference.

Studying/Python/statistics/Empirical_Bayes.ipynb at master · gurezende/Studying

Reference

Empirical Bayes for #TidyTuesday Doctor Who episodes | Julia Silge

Understanding empirical Bayes estimation (using baseball statistics)

Understanding credible intervals (using baseball statistics)

What is the intuition behind beta distribution?

scipy.stats.beta – SciPy v1.11.4 Manual

The post How Reliable Is a Ratio? appeared first on Towards Data Science.

]]>

Chat with Your Dataset using Bayesian Inferences.

Mon, 13 Nov 2023 04:53:03 +0000

The ability to ask questions to your data set has always been an intriguing prospect. You will be surprised how easy it is to learn a local Bayesian model that can be used to interrogate your data set.

Photo by Vadim Bogulov on Unsplash

With the rise of chatGPT-like models, it has become accessible for a broader audience to analyze your own data set and, so to speak, "ask questions". Although this is great, such an approach has also disadvantages when using it as an analytical step in automated pipelines. This is especially the case when the outcome of models can have a significant impact. To maintain control and ensure results are accurate we can also use Bayesian inferences to talk to our data set. In this blog, we will go through the steps on how to learn a Bayesian model and apply do-calculus on the data science salary data set. I will demonstrate how to create a model that allows you to "ask questions" to your data set and maintain control. You will be surprised by the ease of creating such a model using the bnlearn library.

Introduction

Extracting valuable insights from data sets is an ongoing challenge for data scientists and analysts. ChatGPT-like models have made it easier to interactively analyze data sets but at the same time, it can become less transparent and even unknown why choices are made. Relying on such black-box approaches is far from ideal in automated analytical pipelines. Creating transparent models is especially important when the outcome of a model is impactful on the actions that are taken.

The ability to communicate effectively with data sets has always been an intriguing prospect for researchers and practitioners alike.

In the next sections, I will first introduce the bnlearn library [1] on how to learn causal networks. Then I will demonstrate how to learn causal networks using a mixed data set, and how to apply do-calculus to effectively query the data set. Let’s see how Bayesian inference can help us to interact with our data sets!

If you find this article helpful, you are welcome to follow me because I write more about Bayesian learning. If you are thinking of taking a Medium membership, you can support my work a bit by using my referral link. It is the same price as a coffee but this allows you to read unlimited articles every month.

The Bnlearn library

Bnlearn is a powerful Python package that provides a comprehensive set of functions for causal analysis using Bayesian Networks. It can handle both discrete, mixed, and continuous data sets, and offers a wide range of user-friendly functionalities for causal learning, including structure learning, parameter learning, and making inferences [1–3]. Before we can make inferences, we need to understand structure learning and parameter learning because it relies on both learnings.

Learning the causal structure of a data set is one of the great features of bnlearn. Structure learning eliminates the need for prior knowledge or assumptions about the underlying relationships between variables. There are three approaches in bnlearn to learn a causal model and capture the dependencies between variables. Structure learning will result in a so-called Directed Acyclic Graph or DAG). Although all three techniques will result in a causal DAG, some can handle a large number of features while others have higher accuracy. Read more details regarding structure learning in the underneath blog.

Score-based structure learning: Using scoring functions BIC, BDeu, k2, bds, aic, in combination with search strategies such as exhaustivesearch, hillclimbsearch, chow-liu, Tree-augmented Naive Bayes (TAN), NaiveBayesian.
Constraint-based structure learning (PC): Using statistics such chi-square test to test for edge strengths prior the modeling.
Hybrid structure learning: (the combination of both techniques)
Score-based, Constraint-based, and Hybrid structure learning. Although all three techniques will result in a causal DAG, some can handle a large number of features while others have higher accuracy. Read more details regarding structure learning in the underneath blog [2].

A Step-by-Step Guide in detecting causal relationships using Bayesian Structure Learning in Python.

Parameter learning is the second important part of Bayesian network analysis, and bnlearn excels in this area as well. By leveraging a set of data samples and a (pre-determined) DAG we can estimate the Conditional Probability Distributions or Tables (CPDs or CPTs). For more details about parameter learning, I recommend the following blog:

A step-by-step guide in designing knowledge-driven models using Bayesian theorem.

Bnlearn also provided a plethora of functions and helper utilities to assist users throughout the analysis process. These include data set transformation functions, topological ordering derivation, graph comparison tools, insightful interactive plotting capabilities, and more. The bnlearn library supports loading bif files, converting directed graphs to undirected ones, and performing statistical tests for assessing independence among variables. In case you want to see how bnlearn performs compared to other causal libraries, this blog is for you:

The Power of Bayesian Causal Inference: A Comparative Analysis of Libraries to Reveal Hidden…

In the next section, we will jump into making inferences using do-calculus with hands-on examples. This allows us to ask questions to our data set. As mentioned earlier, structure learning and parameter learning form the basis.

Interrogating data sets requires making inferences using do-calculus.

When we make inferences using do-calculus, it basically means that we can query the data set and "ask questions" to the data. To do this, we need two main ingredients: the DAG and the CPTs that are assigned to each node in the graph. The CPTs contain the probabilities of each variable and capture the causal relationships given to its parents. Let’s move on and create an example where we can see how it really works.

Application with the Data Science Salary Dataset

For demonstration, we will use the data science salary data set that is derived from ai-jobs.net [5]. The salary data set is collected worldwide and contains 11 features for 4134 samples. If we load the data, we can explore the columns and set features as continuous or category. Note that the model complexity increases with the number of categories which means that more data and computation time is required to determine a causal DAG.

# Install datazets.
!pip install datazets

# Import library
import datazets as dz
# Get the Data Science salary data set
df = dz.get('ds_salaries')
# The features are as following
df.columns
# 'work_year'          > The year the salary was paid.
# 'experience_level'   > The experience level in the job during the year.
# 'employment_type'    > Type of employment: Part-time, full time, contract or freelance.
# 'job_title'          > Name of the role.
# 'employee_residence' > Primary country of residence.
# 'remote_ratio'       > Remote work: less than 20%, partially, more than 80%
# 'company_location'   > Country of the employer's main office.
# 'company_size'       > Average number of people that worked for the company during the year.
# 'salary'             > Total gross salary amount paid.
# 'salary_currency'    > Currency of the salary paid (ISO 4217 code).
# 'salary_in_usd'      > Converted salary in USD.

Complexity is a major limitation

When features contain many categories, the complexity grows exponentially with the number of parent nodes associated with that table. In other words, when you increase the number of categories, it requires a lot more data to gain reliable results. Think about it like this: when you split the data into categories, the number of samples within a single category will become smaller after each split. The low number of samples per category directly affects the statistical power. In our example, we have a feature job_title that contains 99 unique titles for which 14 job titles (such as data scientists) contain 25 samples or more. The remaining 85 job titles are either unique or seen only a few times. To make sure this feature is not removed by the model because lack of statistical power, we need to aggregate some of the job titles. In the code section below we will aggregate job titles into 7 main categories. This results in categories that have enough samples for Bayesian modeling.

# Group similar job titles
titles = [['data scientist', 'data science', 'research', 'applied', 'specialist', 'ai', 'machine learning'],
          ['engineer', 'etl'],
          ['analyst', 'bi', 'business', 'product', 'modeler', 'analytics'],
          ['manager', 'head', 'director'],
          ['architect', 'cloud', 'aws'],
          ['lead/principal', 'lead', 'principal'],
          ]

# Aggregate job titles
job_title = df['job_title'].str.lower().copy()
df['job_title'] = 'Other'
# Store the new names
for t in titles:
    for name in t:
        df['job_title'][list(map(lambda x: name in x, job_title))]=t[0]
print(df['job_title'].value_counts())
# engineer          1654
# data scientist    1238
# analyst            902
# manager            158
# architect          118
# lead/principal      55
# Other                9
# Name: job_title, dtype: int64

The next pre-processing step is to rename some of the feature names. In addition, we will also add a new feature that describes whether the company was located in USA or Europe, and remove some redundant variables, such as salary_currency and salary.

# Rename catagorical variables for better understanding
df['experience_level'] = df['experience_level'].replace({'EN': 'Entry-level', 'MI': 'Junior Mid-level', 'SE': 'Intermediate Senior-level', 'EX': 'Expert Executive-level / Director'}, regex=True)
df['employment_type'] = df['employment_type'].replace({'PT': 'Part-time', 'FT': 'Full-time', 'CT': 'Contract', 'FL': 'Freelance'}, regex=True)
df['company_size'] = df['company_size'].replace({'S': 'Small (less than 50)', 'M': 'Medium (50 to 250)', 'L': 'Large (>250)'}, regex=True)
df['remote_ratio'] = df['remote_ratio'].replace({0: 'No remote', 50: 'Partially remote', 100: '>80% remote'}, regex=True)

import numpy as np

# Add new feature
df['country'] = 'USA'
countries_europe = ['SM', 'DE', 'GB', 'ES', 'FR', 'RU', 'IT', 'NL', 'CH', 'CF', 'FI', 'UA', 'IE', 'GR', 'MK', 'RO', 'AL', 'LT', 'BA', 'LV', 'EE', 'AM', 'HR', 'SI', 'PT', 'HU', 'AT', 'SK', 'CZ', 'DK', 'BE', 'MD', 'MT']
df['country'][np.isin(df['company_location'], countries_europe)]='europe'
# Remove redundant variables
salary_in_usd = df['salary_in_usd']
#df.drop(labels=['salary_currency', 'salary'], inplace=True, axis=1)

As a final step, we need to discretize salary_in_usd which can be done manually or using the discretizer function in bnlearn. For demonstration purposes, let’s do both. For the latter case, we assume that salary depends on experience_level and on the country. Read more in this blog [6] for more details. Based on these input variables, the salary is then partitioned into bins (see code section below).

# Discretize the salary feature.
discretize_method='manual'

import bnlearn as bn

# Discretize Manually
if discretize_method=='manual':
    # Set salary
    df['salary_in_usd'] = None
    df['salary_in_usd'].loc[salary_in_usd<80000]='<80K'
    df['salary_in_usd'].loc[np.logical_and(salary_in_usd>=80000, salary_in_usd<100000)]='80-100K'
    df['salary_in_usd'].loc[np.logical_and(salary_in_usd>=100000, salary_in_usd<160000)]='100-160K'
    df['salary_in_usd'].loc[np.logical_and(salary_in_usd>=160000, salary_in_usd<250000)]='160-250K'
    df['salary_in_usd'].loc[salary_in_usd>=250000]='>250K'
else:
    # Discretize automatically but with prior knowledge.
    tmpdf = df[['experience_level', 'salary_in_usd', 'country']]
    # Create edges
    edges = [('experience_level', 'salary_in_usd'), ('country', 'salary_in_usd')]
    # Create DAG based on edges
    DAG = bn.make_DAG(edges)
    bn.plot(DAG)
    # Discretize the continous columns
    df_disc = bn.discretize(tmpdf, edges, ["salary_in_usd"], max_iterations=1)
    # Store
    df['salary_in_usd'] = df_disc['salary_in_usd']
    # Print
    print(df['salary_in_usd'].value_counts())

The Final DataFrame

The final data frame has 10 features with 4134 samples. Each feature is a categorical feature with two or multiple states. This data frame is going to be the input to learn the structure and determine the causal DAG.

# work_year           experience_level  ... country salary_in_usd
# 0          2023           Junior Mid-level  ...     USA         >250K
# 1          2023  Intermediate Senior-level  ...     USA      160-250K
# 2          2023  Intermediate Senior-level  ...     USA      100-160K
# 3          2023  Intermediate Senior-level  ...     USA      160-250K
# 4          2023  Intermediate Senior-level  ...     USA      100-160K
#     ...                        ...  ...     ...           ...
# 4129       2020  Intermediate Senior-level  ...     USA         >250K
# 4130       2021           Junior Mid-level  ...     USA      100-160K
# 4131       2020                Entry-level  ...     USA      100-160K
# 4132       2020                Entry-level  ...     USA      100-160K
# 4133       2021  Intermediate Senior-level  ...     USA       60-100K
#
# [4134 rows x 10 columns]

Bayesian Structure Learning to estimate the DAG.

At this point, we have pre-processed the data set and we are ready to learn the causal structure. There are six algorithms implemented in bnlearn that can help with this task. We need to choose a method for which we do not need to have a target variable, and it needs to handle many categories. The available search strategies are:

The hillclimbsearch algorithm is a heuristic search method. It starts with an empty network and iteratively adds or removes edges based on a scoring metric. The algorithm explores different network structures and selects the one with the highest score.
The exhaustivesearch performs an exhaustive search over all possible network structures to find the optimal Bayesian network. It evaluates and scores each structure based on a specified scoring metric. While this method guarantees finding the best network structure, it can be computationally expensive for large networks due to the exponential growth of possibilities.
The constraintsearch incorporates user-specified constraints or expert knowledge into the structure learning process of a Bayesian network. It uses these constraints to guide the search and restrict the space of possible network structures, ensuring that the learned network adheres to the specified constraints.
The chow-liu algorithm is a method for learning the structure of a tree-structured Bayesian network. It calculates the mutual information between each pair of variables and constructs a tree by greedily selecting the edges that maximize the total mutual information of the network. This algorithm is efficient and widely used for learning the structure of discrete Bayesian networks but requires setting a root node.
The naivebayes algorithm assumes that all features in a dataset are conditionally independent given the class variable. It learns the conditional probability distribution of each feature given the class and uses Bayes theorem to calculate the posterior probability of the class given the features. Despite its naive assumption, this algorithm is often used in classification tasks and can be efficient for large datasets.
The TAN (Tree-Augmented Naive Bayes) algorithm is an extension of the naive Bayes algorithm that allows for dependencies among the features, given the class variable. It learns a tree structure that connects the features and uses this structure to model the conditional dependencies. TAN combines the simplicity of naive Bayes with some modeling power, making it a popular choice for classification tasks with correlated features. This method requires setting a class node.

The scoring types BIC, K2, BDS, AIC, and BDEU are used to evaluate and compare different network structures. As an example, BIC balances the model complexity and data fit, while the others consider different types of prior probabilities. In addition, the independence test prunes the spurious edges from the model. In our use case, I will use the hillclimbsearch method with scoring type BIC for structure learning. We will not define a target value but let the bnlearn decide the entire causal structure of the data itself.

# Structure learning
model = bn.structure_learning.fit(df, methodtype='hc', scoretype='bic')

# independence test
model = bn.independence_test(model, df, prune=False)
# Parameter learning to learn the CPTs. This step is required to make inferences.
model = bn.parameter_learning.fit(model, df, methodtype="bayes")
# Plot
bn.plot(model, title='Salary data set')
bn.plot(model, interactive=True, title='method=tan and score=bic')

Figure 1. After structure learning, the following causal DAG is learned.

Figure 2. An interactive plot of the causal DAG.

Chat With Your Data Set.

With the learned DAG (Figures 1 and 2), we can estimate the conditional probability distributions (CPTs, see code section below), and make inferences using do-calculus. Let’s start asking questions. Note that the results can (slightly) change based on the stochastic components in the model.

Question 1.

What is the probability of a job title given that you work in a large comany? _P(job_title | company_size=Large (>250))_

After running the code section below we can see that engineer scientist is the most likely outcome (P=0.34) followed by data scientist (P=0.26).

query = bn.inference.fit(model, variables=['job_title'],
                         evidence={'company_size': 'Large (>250)'})

# +----+----------------+-----------+
# |    | job_title      |         p |
# +====+================+===========+
# |  0 | Other          | 0.031616  |
# +----+----------------+-----------+
# |  1 | analyst        | 0.209212  |
# +----+----------------+-----------+
# |  2 | architect      | 0.0510425 |
# +----+----------------+-----------+
# |  3 | data scientist | 0.265006  |
# +----+----------------+-----------+
# |  4 | engineer       | 0.343216  |
# +----+----------------+-----------+
# |  5 | lead/principal | 0.0407967 |
# +----+----------------+-----------+
# |  6 | manager        | 0.0591106 |
# +----+----------------+-----------+

Question 2.

What is the probability of a salary range given a full time employment type, partially remote work, have the data science function at entry level and live in germany (DE)?

In the results below we can see our five salary categories for which the strongest posterior probability P=0.7 is a salary of <80K under these conditions. Note that other salaries also occur but they happen less frequently.

By changing the variables and evidence we can ask all kinds of questions. For example, we can now change experience level, residence, job title, etc, and determine how the probabilities are changing.

query = bn.inference.fit(model,
                         variables=['salary_in_usd'],
                         evidence={'employment_type': 'Full-time',
                                   'remote_ratio': 'Partially remote',
                                   'job_title': 'data scientist',
                                   'employee_residence': 'DE',
                                   'experience_level': 'Entry-level'})

# +----+-----------------+-----------+
# |    | salary_in_usd   |         p |
# +====+=================+===========+
# |  0 | 100-160K        | 0.0664068 |
# +----+-----------------+-----------+
# |  1 | 160-250K        | 0.0424349 |
# +----+-----------------+-----------+
# |  2 | 80-100K         | 0.117463  |
# +----+-----------------+-----------+
# |  3 | <80K            | 0.707087  |
# +----+-----------------+-----------+
# |  4 | >250K           | 0.0666078 |
# +----+-----------------+-----------+

Final words.

In this blog, we learned how to create a Bayesian model and how to ask questions to a mixed data set using inferences with do-calculus. With the use of bnlearn it becomes straightforward to setup such models and the models offer understandable and explainable results that can be easily embedded in data science pipelines.

Be Safe. Stay Frosty.

Cheers E.

Software

Let’s connect!

References

Taskesen, E. (2020). Learning Bayesian Networks with the bnlearn Python Package. (Version 0.3.22) [Computer software].
Taskesen E, A Step-by-Step Guide in detecting causal relationships using Bayesian Structure Learning in Python, Medium, 2021
Taskesen E, A step-by-step guide in designing knowledge-driven models using Bayesian theorem, Medium, 2021
Taskesen, E. (2020). The Power of Bayesian Causal Inference: A Comparative Analysis of Libraries to Reveal Hidden Causality in Your Dataset, Medium 2023.
https://ai-jobs.net/salaries/download/ (CC0: Public Domain)
Kay H. et al, Inferring causal impact using Bayesian structural time-series models, 2015, Annals of Applied Statistics (247–274, vol9)
Taskesen, E (2023), Create and Explore the Landscape of Roles and Salaries in Data Science. Medium.

The post Chat with Your Dataset using Bayesian Inferences. appeared first on Towards Data Science.

]]>

Beta Distributions: A Cornerstone of Bayesian Calibration

Sat, 28 Oct 2023 15:06:14 +0000

Photo by Google DeepMind on Unsplash

Hi there!

Distributions may not seem like a complex concept at first glance, but they are incredibly powerful and fundamental in the world of data analysis and statistics. Think about it this way: if you were to gather 50 shirts in various sizes and colors, you would have created a color distribution, a size distribution, and perhaps even a "how much does this shirt annoy you" distribution (jokingly, of course). The point is that as long as you have a category to measure, there’s a distribution waiting to be explored.

So, what exactly is a distribution? It’s essentially a way to show how a category spreads across a scale of probabilities or likelihoods. You can figure this out either from the data you have or from what you know about a particular topic. You’ve probably heard of terms like the normal distribution, skewed distribution, long-tailed distribution, and so on – each of these describes how data points are shaped.

Today I wanted to touch on the Beta Distribution and specifically its application in Bayesian Calibration. Bayesian Calibration is an approach that updates Bayesian inference with new data to find the best-fitting values for a model’s parameters. It considers both the prior information available about these parameters and the likelihood of the observed data given those parameters.

Before we dive into Bayesian Calibration with the Beta Distribution, let’s cover some technical details. Once we have those basics down, we’ll explore the Bayesian Calibration with Beta Distributions with an intriguing scenario.

Beta Distribution

The beta distribution, denoted as Beta(α, β), is a probability distribution characterized by two parameters. Its probability density function (pdf) is expressed as follows:

Image by Author

In this equation, both α and β represent the hyperparameters, and it’s important to note that they must always be greater than 0. Additionally, for the purpose of this article, we will focus on integer values exclusively.

Before we begin, let’s add a visual aid to see a colorful assortment of beta distribution PDFs with α and β ranging from 0.5 all the way to 50.

Image by Author

Now that we have a good idea of what a beta distribution looks like, let’s jump into a scenario.

Our Scenario

Our fictional company MM Manufacturing, is renowned for producing precision weights. Their system is top-notch, ensuring near-perfect calibration for the majority of their products. However, in recent times, an unusual issue has cropped up – a surge in customer complaints about weights that fall short of perfection. In response, MM Manufacturing introduced an additional layer of human verification to guarantee that every weight dispatched to customers is flawless.

But in order to analyze the trend in actual production weights, they tasked their Data Science team to analyze the likelihood of encountering these irregular weights and, more importantly, to monitor the distribution of such occurrences, in order to gain insight into the path to improved performance. Fortunately, all weights get their exact values recorded on a conveyor belt.

The Data Science team’s approach is rooted in Bayesian calibration. Every month, they update a beta distribution probability density function (pdf) to assess the anomaly in weights and how it has fared over time. To do this, they use data from the conveyor belt, serving as observations to ultimately determine the posterior. They also need to establish a prior, which can be based on historical data, domain knowledge, or a non-specific, uninformative distribution.

For the alpha (α) and beta (β) values in the beta distribution for the observable data or likelihood estimation, they opt for the following strategy:

α = Number of correctly calibrated weights + 1 (ensuring α > 0)

β = Number of incorrectly calibrated weights + 1 (ensuring β > 0)

As for their choice of prior, they initially select a uninformative one, represented by Beta(1,1) (uniform distribution as shown below), which minimizes influence of the prior on the posterior and place the primary reliance on observational data.

Image by Author

It maybe worthwhile to deviate a little bit to note the role of prior in this context

The Role of the Prior

In the realm of Bayesian inference, the prior is where you can incorporate your perspective – well, not quite your opinion, but rather your informed perspective alongside previous observations. It comes in various forms, ranging from highly informative to completely uninformative, and it plays a crucial role in shaping the posterior distribution.

In our Bayesian calibration process, the posterior distribution is proportionally influenced by both the likelihood and the prior.

Posterior Distribution ∝ Likelihood × Prior

Furthermore, the Beta distribution serves as a conjugate prior in Bayesian inference for many distribution. This means that if you’re dealing with distributions like Bernoulli, binomial, negative binomial and geometric, for the likelihood function then the resulting posterior will also belong to the Beta distribution family. In our case, the situation with anomalous weights the likelihood distribution is based on a success failure scenario much like a binomial distribution.

Now, let’s explore the options for the prior, considering the nature of the distribution:

Uninformative Prior

An uninformative prior has the least impact on the posterior, making it suitable when you possess minimal information about how the distribution should appear. In the Beta Distribution, examples of uninformative priors can include:

Beta(0.5, 0.5) or Jeffreys prior
Beta(1, 1) or the uniform prior.

This choice is ideal when you want the likelihood to be the dominant factor without substantial influence from the prior.

Mildly Informative Prior

Mildly informative priors convey some information to the posterior. In the Beta Distribution, options for mildly informative priors can be Beta(2, 2) and Beta(1, 2).

These priors provide a gentle nudge to the posterior based on partial knowledge.

Informative Prior

When you possess substantial information about the distribution and wish to make slight adjustments based on new observations, informative priors come into play. In the Beta Distribution context, informative priors could take the form of Beta(10, 10) and Beta(20, 2) and many values on the larger end. These priors carry more weight in shaping the posterior.

With a better understanding of the different types of priors and their roles, let’s return to our specific scenario of mapping the anomalous weights by MM Manufacturing into an observable posterior distribution

Python Implementation

So let’s do a little bit of Anomaly Detection using the Beta Distribution prior and bayesian calibration just to make the concept clearer.

First of all, to simulate the weights produced by the conveyor belt, we’ll generated synthetic data with 500 data points twice for the two scenarios below.

Scenario 1: Bayesian Calibration the first time

For the first time calibration, we use an uninformative prior denoted as Beta(1,1). We define the likelihood Beta(α , β) where α, β are:

α = correctly calibrated weights + 1 (since alpha should be > 0)

β = incorrectly calibrated weights + 1 (again for no events beta > 0)

We also generate our synthetic data where weight is considered correctly calibrated if the value is between 4.85 and 5.15, inclusive for a 5 pound weight and it’s incorrectly calibrated if weight lies outside these values.

We initially generate data with 10% anomalous values.

import random
import matplotlib.pyplot as plt
from scipy.stats import beta

# Simulated data: 500 observations with 90% normal weight and 10% anomalous weights
np.random.seed(42)
normal_instances =  [random.uniform(4.85, 5.15) for i in range(450)]
anomalous_instances_1 =  [random.uniform(3, 4.85) for i in range(25)]
anomalous_instances_2 =  [random.uniform(5.15, 6) for i in range(25)]

data = np.concatenate((normal_instances, anomalous_instances_1, anomalous_instances_2))

# Initial prior belief using a Beta distribution (uninformative uniform prior)
prior_alpha = 1
prior_beta = 1

# Beta Distribution as inferred by Observed data 
likelihood_alpha = len(data[(data >= 4.85) & (data <= 5.15)]) + 1
likelihood_beta = len(data) - likelihood_alpha + 1

# Calculate posterior parameters based on observed data and prior
posterior_alpha = prior_alpha + likelihood_alpha
posterior_beta = prior_beta + likelihood_beta

# Plot the prior, likelihood and posterior Beta distributions
x = np.linspace(0, 1, 1000)
prior_distribution = beta.pdf(x, prior_alpha, prior_beta)
likelihood_distribution = beta.pdf(x, likelihood_alpha, likelihood_beta)
posterior_distribution = beta.pdf(x, posterior_alpha, posterior_beta)

plt.plot(x, prior_distribution, label='Prior Distribution')
plt.plot(x, likelihood_distribution, label='Likelihood Distribution')
plt.plot(x, posterior_distribution, label='Posterior Distribution')

plt.title('Bayesian Calibration for Anomalous Weight Detection')
plt.xlabel('Anomaly Probability')
plt.ylabel('Probability Density')
plt.legend()
plt.show()

As intended, our posterior is almost exactly like the likelihood so this wasn’t much of calibration. This also shows the impact of the uniform prior on the posterior.

Image by Author

The next month we have more data and now our prior is the posterior for previous month, similarly we could have had some information of the internal system and adjusted the prior accordingly.

Scenario 2: Bayesian Calibration update

Assuming the MM Manufacturing paid attention and made some changes to the system, now only 6% of the weights are anomalous. Now we have a more informative prior given the posterior from our previous data.

# Simulated data: 500 observations with 94% normal weight and 6% anomalous weights
np.random.seed(42)
normal_instances =  [random.uniform(4.85, 5.15) for i in range(470)]
anomalous_instances_1 =  [random.uniform(3, 4.85) for i in range(15)]
anomalous_instances_2 =  [random.uniform(5.15, 6) for i in range(15)]

data = np.concatenate((normal_instances, anomalous_instances_1, anomalous_instances_2))

# Initial prior belief about normal behavior using a Beta distribution
prior_alpha = posterior_alpha
prior_beta = posterior_beta

# Beta Distribution as inferred by Observed data 
likelihood_alpha = len(data[(data >= 4.85) & (data <= 5.15)]) + 1
likelihood_beta = len(data) - likelihood_alpha + 1

# Calculate posterior parameters based on observed data and prior
posterior_alpha = prior_alpha + likelihood_alpha
posterior_beta = prior_beta + likelihood_beta

# Plot the prior, likelihood and posterior Beta distributions
x = np.linspace(0, 1, 1000)
prior_distribution = beta.pdf(x, prior_alpha, prior_beta)
likelihood_distribution = beta.pdf(x, likelihood_alpha, likelihood_beta)
posterior_distribution = beta.pdf(x, posterior_alpha, posterior_beta)

plt.plot(x, prior_distribution, label='Prior Distribution')
plt.plot(x, likelihood_distribution, label='Likelihood Distribution')
plt.plot(x, posterior_distribution, label='Posterior Distribution')

plt.title('Bayesian Calibration for Anomalous Weight Detection')
plt.xlabel('Anomaly Probability')
plt.ylabel('Probability Density')
plt.legend()
plt.show()

This time we see the impact of prior on the posterior and how much more defined the distribution is. The relation between prior, posterior and likelihood is much more clearly visible here.

Image by Author

Considering the two scenarios described above, it becomes evident that a variety of these outcomes can be leveraged to acquire insights into system performance, make comparisons to observe enhancements, and improve data calibration across a broad temporal spectrum.

The Beta Distribution’s appeal lies in its adaptability and versatility in defining various distributions, while Bayesian Calibration’s strength lies in its ability to flexibly embrace and integrate intricate model parameters.

Let’s talk about some other applications.

Other Applications

No discussion about the Beta Distribution would be complete without recognizing its wide-ranging uses. It’s not just used in the realm of Bayesian Inference and calibration, like we saw in the success-failure scenario earlier. The Beta Distribution also plays a crucial role in A/B testing, where it helps model the conversion rates of different web page or web app versions – a scenario similar to success and failure, just in a different context.

Furthermore the Beta Distribution can also come into play in risk analysis, where a probabilistic approach is highly informative for estimating the probability of success of a project.

Wrapping Up

In conclusion, the Beta Distribution, especially when applied in the context of Bayesian calibration, is an exceptionally valuable and elegant concept. It excels at handling the intricacies of a model while offering an intuitive approach to decision-making. Moreover, its relevance extends to a wide array of applications across various industries, where it plays a pivotal role in gaining valuable insights into the performance of the systems undergoing calibration.

The Beta Distribution is not just a theoretical concept; it’s a practical and valuable asset in the data scientist’s toolkit. Understanding its applications and adaptability opens doors to insights that can enhance decision-making and improve system performance. As you delve deeper into data science, remember the Beta Distribution’s significance in handling complex models and gaining valuable insights across various industries.

A cool way to visualize beta distribution

Beta Distribution – MIT Mathlets

Don’t forget to read some of my other intriguing articles!

P-Values: Understanding Statistical Significance in Plain Language

Exploring Counterfactual Insights: From Correlation to Causation in Data Analysis

Feel free to share your thoughts in the comments.

The post Beta Distributions: A Cornerstone of Bayesian Calibration appeared first on Towards Data Science.

]]>

The two envelopes problem

Fri, 08 Sep 2023 16:51:54 +0000

The Paradox Series #1

The Two Envelopes Problem

The two envelopes problem, leading to paradoxical and inconsistent decisions, appears when using an intuitive but wrong Bayesian probability estimation to determine the best course of action. Correcting for the mathematical mistake is simple, but there is more to it: first, by modifying the problem very slightly, we can make it undecidable – an example of language ambiguity as opposed to mathematical formalism; second, when comparing several possible solutions, we can observe how time is emerging in the mathematical world, theoretically allowing us to test causal hypotheses.

(TEP)

Imagine I show you two seemingly identical envelopes on a table, telling you (without lying) that both contain money, one twice as much as the other, and I propose you to take one of them and keep the money in it for yourself.

Image generated by Midjourney

Once you have chosen one, and before you open it, I ask you if you want to modify your choice and rather take the other envelope.

What would you do?

You would probably tell me that it would be useless to switch envelopes, as the situation is the same whatever envelope you choose. However, you should note that you have chosen an unknown amount of money x, and the amount y in the other envelope can be 2x or x/2 with equal probability, meaning the expected amount y is 2x (1/2) + x/2 (1/2) = 5x/4, which is greater than x. So maybe you should switch nevertheless?

Obviously you could also compute the expected amount x based on y, and because there is half a chance x is either 2y or y/2, you would find that the expected amount x is 5y/4, which is greater than y.

So what is wrong with this computation? Which envelope is more likely to contain more than the other, if any?

The mathematical flaw in the reasoning

We can arbitrarily label one envelope "X" and the other "Y". Let us now properly compute the conditional expectation of the amount in the envelope X when we know the amount y is in the Y envelope.

The expectation of the amount in X given the observed amount y in Y, noted E[X|Y = y], obviously depends on the specific amount y observed: even if over all possible values for y, the amount x in X can be either y/2 or 2y with a probability of 1/2 each time, it does not mean that this will be the case for specific values of y. For example, if y is "very small" (in a sense that will be clarified later), there is more chance that x is bigger than y, and if y is "very big", there is more chance that x is smaller than y: over all possible values for y, probabilities can be balanced so that X is half the time half Y, and half the time double Y, but it does not mean that P(X = y/2|Y = y) = 1/2 and P(X = 2y|Y = y) = 1/2, only that P(X = Y/2) = P(X = 2Y) = 1/2.

So we shall try to properly compute E[X|Y = y], but first we need to clarify the process that led us to have these two envelopes on the table, with labels "X" and "Y". Let us assume that we filled a first envelope with a random amount U, and a second envelope with an amount 2U. Then we shuffled them, and randomly named one of the envelopes X, while we named the other Y. We could represent this naming process as follows: we draw a binary number Z (half a chance of being 0 or 1). If Z = 0, X is the envelope with U in it, otherwise (if Z = 1) the envelope with the amount 2U.

Now we can see that for the exterior observer who is being asked to choose but has no idea of what random numbers were picked for U and Z, the amounts in the envelopes look like this:

We can verify that P(X = 2Y) = P(U + ZU = 4U – 2ZU) = P(3U – 3ZU = 0) = P(U=ZU) = P(Z = 1) = 1/2 (and it would be the same for P(X = Y/2)).

We still have to compute E[U|Y = y], and for this we need to know P(U=u|Y=y) that is (from Bayes’ theorem) proportional to P(Y=y|U=u)P(U=u).

To compute P(Y = y|U) we recall that Y is either U or 2U, meaning that the value u taken by U is either y or y/2:

when y is not u or u/2, there is no chance that Y = y: P(Y = y|U = u) = 0
when y is u, there is half a chance (Z = 1) that Y = y: P(Y = y|U = u) = 1/2
when y is u/2, there is half a chance (Z = 0) that Y = y: P(Y = y|U = u) = 1/2

With the mathematical formalism:

where:

All this summarizes as:

Then we have to know P(U = u). We can only make an assumption, e.g. that U is exponentially distributed on positive real numbers (with parameter λ>0):

In the end, P(U = u|Y = y) is proportional to:

In other words:

Now we have all we need to compute E[X|Y = y] = 3E[U|Y = y] – y, which is equal to:

Summarizing, we now know that:

This is quite different from the initial 5y/4 !

The expectation for x is (strictly) greater than y if and only if:

or said otherwise if and only if:

(which is twice the median of the exponential distribution of parameter λ from which the amounts are drawn).

So we can better understand the error in our previous reasoning. While it is true, by design, that X is half the time twice the amount y and half the time half this same amount when averaging over all possible values y, it is also true that for a specific value of y the probabilities are not half and half: if y is "big" compared to what is expected from the way the values U were picked, there is more probability that the envelope X contains a smaller amount, and if y is "small" there is on the contrary more chances for the envelope X to contain a bigger amount. Here the frontier between "big" and "small" is simply twice the median of the exponential distribution.

The choice of X or Y is symmetric, as E[Y|X = x] = E[3U – X|X = x] = 3E[U|X=x] – x and from here all previous computations still apply, mutatis mutandis.

It seems that the paradox is solved, but I claim that in reality the two envelopes problem can be undecidable, meaning that we cannot really know if the problem is symmetric, or if we should prefer one envelope to the other.

An undecidable problem

Let us now assume that on the table lie two envelopes seemingly identical except that they have already been labelled "X" and "Y". We are now told that the envelope X contains half the amount in Y or double this amount with half a chance for each possibility. By symmetry, the same applies to the envelope Y. You are now asked to choose one envelope: which one should you choose?

Based on the previous example, it seems obvious that we can choose indifferently one or the other. However, this is wrong! It all depends on our hypotheses, or said otherwise it all depends on the (statistical) representation of the problem.

Here, the fact that the envelopes are already labelled when we are given to choose one is key. What was the process to choose the amounts in the envelopes and label them? If they were randomly labelled like in the previous studied example, I would agree that choosing one or the other is statistically equivalent.

But let us imagine that the amount for X is chosen from an exponential distribution on positive real numbers (with parameter λ>0) similarly to what was done for U in the previous example. Then the amount for the envelope Y is simply randomly chosen as half or double the amount in Y (with uniform probabilities): Y = HX where H takes the values 1/2 or 2 with half a chance each time (H is independent from X).

Now let us compute the cumulative distribution of values for Y: P(Y < y) = P(HX < y) = P(HX < y |H = 1/2) P(H = 1/2) + P(HX < y |H = 2) P(H = 2)

= P(X/2 < y) (1/2) + P(2X < y) (1/2) = (1/2) P(X < 2y) + (1/2) P(X < y/2)

= (1/2) F(2y) + (1/2) F(y/2) where F is the cumulative distribution function of X (exponential distribution)

for non-negative values of y.

Differentiating to get the probability density for Y = y, we get:

This is the average of two probability density functions of exponential distributions, one of parameter λ/2 and the other of parameter 2λ, meaning that the average value in the envelope Y is the average of the averages 2/λ and 1/(2λ):

This is more than the average value of X, the mean of an exponential random variable of parameter λ being 1/λ (for those interested only in the computation of the expectation, E[Y] = E[HX] = E[H] E[X] as H and X are independent, and so E[Y] = [(1/2)(1/2) + 2(1/2)] E[X] = (5/4)E[X]).

The conclusion is that in this case, and if we care only about the mean to take a decision, then we should systematically choose the envelope Y.

However, we could also assume that instead of having Y = HX, we have X=HY, the amount in Y being drawn from an exponential distribution of parameter λ, and in that case we should rather choose the envelope X.

We do not know enough about the process that generated the two envelopes on the table to be able to decide with no additional assumption what envelope we should choose.

Is that all there is to say? No, the most interesting is still to come. We can see from what we did to that point that the physical processes to generate the situation with the envelopes have to be modeled with random variables.

But in physical processes, there is time: for example, we choose an amount for X, and then we deduce from it the amount to be put in Y, or the reverse; and the statistical model is able to reproduce it, with different conclusions if the amount of X is chosen before the amount of Y, or after. In other words, our statistical models are able to reproduce mathematically the physical reality of time.

The emergence of time and causality from randomness

It is often said that mathematics can only prove correlation, not causation. In that regard, Causality analysis in econometrics is no more than a correlation analysis as far as mathematics are involved. It is the human mind that decides that an event is the consequence of another one based on correlation between both events and based on time: the event coming after the first one can be only the consequence, not the cause.

Because time is not a mathematical concept but a physical one, mathematics seem to be helpless to establish causal relationships independently from any human input about what phenomenon happened first (thus being characterized as the cause) and what phenomenon happened second (thus being characterized as the consequence). But is it really the case? The concept of time originates in the concept of irreversibility: when an object moves from left to right, it is not a change due to time because the object can move back to its original location; when an object is aging, it is a change due to the passage of time because the process is irreversible. Time is the irreversible change in the states of the world.

In physics, irreversibility is viewed as a consequence of an increase in disorder, formally called entropy: it is because the molecules composing an object are getting more disordered that the object will never be able to revert to its initial state, and so the changes are not only viewed as happening in time, but because of the time. While changes in states are sufficient to say that time goes by, the physical irreversibility causes time to flow only in one direction, allowing us to distinguish causes and consequences.

Without entering too much into the details, only the macro-state of an aging object is not reversible: at a microscopic level, from the viewpoint of theoretical physics, molecules and particles can reorder themselves in a way similar to a past state. Thus, physical irreversibility could not simply be modeled by a non-invertible mathematical function, as this characteristic would be absent. Instead, random variables are macroscopically non-invertible but microscopically invertible: e.g. if Y = HX, it does not mean that X = Y/H (irreversibility from a macroscopic point of view), however for any values y, h and x taken by Y, H and X, y = hx and x = y/h (reversibility from a microscopic point of view). The two envelopes Paradox is particularly confusing because in its formulation everything seems symmetrical (if x is half or twice y, it implies that y is twice or half x), while this is only true at a "microscopic" level.

But how does the link between physical entropy and random variables could help in the study of causality?

Let us consider again the last example with two pre-labelled envelopes X and Y and let us assume we know that either Y = HX or X = HY, meaning that either Y is the consequence of X or vice versa. We can test each hypothesis by taking a large number of observations of X and Y, in order to identify the probability densities of these two random variables and one will have a "more entropic" density ("more entropic" under some specific mathematical relationship to be tested) as it will be based on the density of the other random variable, but "disordered" by the random variable H (whose density is assumed to be known).

Let us now consider more usual problems. Often linear regressions are performed to quantify a causal relationship between several variables. For instance, Y = αX where we assume Y is the consequence of X, and we want to quantify the causality coefficient α. However, it does not prove in any way a causal relationship from X to Y, it only allows to quantify the assumed relationship between X and Y if the assumption is true.

With such a simple example where Y is assumed to be equal to αX, it is not possible to identify mathematically a causal relationship, because it is equivalent to say that X = Y/α. However, if the coefficient α is considered to be one historic value of the more general process A, it is possible to compare the distributions of Y, A and X and see which one is more plausible of Y = AX or X = Y/A. Another example would be the study of a relationship Z = X + Y (Z is caused by X and Y), to be compared to other possibilities such that Y = Z – X (Y is caused by X and Z): a comparison of the distributions of X, Y and Z would provide an answer to the causality problem.

While such considerations are very theoretical and would not prove themselves directly useful in real life, where properly estimating the distributions of the random variables could be costly, complicated or unfeasible, it is possible to imagine using aggregates to perform a causality analysis. For example, in the case where we have to choose between Y = HX and X = HY, we have seen that in the first case E[Y] > E[X] and that in the second case E[X] > E[Y]. In case of linear relationships, we could have to test between X = Y + Z, Y = X – Z and Z = X – Y, but then the expectations are not useful (except if we take the exponential, e.g. exp(X)=exp(Y)exp(Z)), as E[X] is equal to E[Y] + E[Z] in every case, but the relationship Var(X) = Var(Y) + Var(Z) + 2Cov(Y, Z) would be true only in the first one.

Such techniques could provide useful indications about causal relationships, and help in testing hypotheses. But even more importantly, is it not beautiful that the physical time of our world emerges in the mathematical world from the concept of randomness?

Conclusion

Starting by analyzing a well-known statistical "paradox", the Two Envelopes Problem, we recognized that the paradox emerged not only because of a mathematical flaw in the naïve solution of the problem, but also because there is some ambiguity in the human language that made it look like two distinct functions of random variables (HX and X/H) were equivalent.

Digging further, it appeared that equations involving random variables, while impossible to "reverse" in the general case (macroscopic view), were "reversible" when considering instead realizations of the random variables (microscopic view).

This was then the occasion to propose an analogy between the sample space Ω of the random variables and the phase space of physical systems, leading subsequently to observe the emergence of "physical entropy" in the statistical world and thus of irreversibility and time.

Finally, after time emerged from our obscure computations, we were able to draw conclusions about ways to test causality hypotheses that go beyond simple correlation analyses. All this from two envelopes!

The post The two envelopes problem appeared first on Towards Data Science.

]]>

Understanding Bayesian Marketing Mix Modeling: A Deep Dive into Prior Specifications

Sat, 24 Jun 2023 06:19:52 +0000

Photo by Pawel Czerwinski on Unsplash

Bayesian marketing mix modeling has been receiving more and more attention, especially with the recent releases of open source tools like LightweightMMM (Google) or PyMC Marketing (PyMC Labs). Although these frameworks simplify the complexities of Bayesian modeling, it is still crucial for the user to have an understanding of fundamental Bayesian concepts and be able to understand the model specification.

In this article, I take Google’s Lightweightmmm as a practical example and show the intuition and meaning of the prior specifications of this framework. I demonstrate the simulation of prior samples using Python and the scipy library.

Data

I use the data made available by Robyn under MIT Licence.

The dataset consists of 208 weeks of revenue (from 2015–11–23 to 2019–11–11) having:

5 media spend channels: tv_S, ooh_S, print_S, facebook_S, search_S
2 media channels that have also the exposure information (Impression, Clicks): facebook_I, search_clicks_P
Organic media without spend: newsletter
Control variables: events, holidays, competitor sales (competitor_sales_B)

LightweightMMM Model Specification

The specification of the LightweightMMM model is defined as follows:

LMMM Model Specification (image by the author)

This specification represents an additive linear regression model that explains the value of a response (target variable) at a specific time point t.

Let’s break down each component in the equation:

α: This component represents the intercept or the baseline value of the response. It is the expected value of the response when all other factors are zero.
trend: This component captures the increasing or decreasing trend of the response over time.
seasonality: This component represents periodic fluctuations in the response.
_media_channels_: This component accounts for the influence of media channels (tv, radio, online ads) on the response.
_other_factors_: This component encompasses any other variables that have influence on the response such as weather, economic indicators or competitor activities.

Below, I go through each of the components in detail and explain how to interpret the prior specifications. As a reminder, a prior distribution is an assumed distribution of some parameter without any knowledge of the underlying data.

Intercept

Intercept prior specification (image by the author)

The intercept is defined to follow a half-normal distribution with a standard deviation of 2. A half-normal distribution is a continuous probability distribution that resembles a normal distribution but is restricted to positive values only. The distribution is characterized by a single parameter, the standard deviation (scale). Half-normal distribution implies that the intercept can get only positive values.

The following code generates samples from the prior distribution of the intercept and visualizes the probability density function (PDF) for a half-normal distribution with a scale of 2. For visualizations of other components, please refer to the accompanying source code in the Github repo.

from scipy import stats

scale = 2
halfnormal_dist = stats.halfnorm(scale=scale)
samples = halfnormal_dist.rvs(size=1000)

plt.figure(figsize=(20, 6))
sns.histplot(samples, bins=50, kde=False, stat='density', alpha=0.5)
sns.lineplot(x=np.linspace(0, 6, 100), 
      y=halfnormal_dist.pdf(np.linspace(0, 6, 100)), color='r')

plt.title(f"Half-Normal Distribution with scale={scale}")
plt.xlabel('x')
plt.ylabel('P(X=x)')
plt.show()

Half Normal Distribution (image by the author)

Trend

Trend specification (image by the author)

The trend is defined as a power-law relationship between time t and the trend value. The parameter μ represents the amplitude or magnitude of the trend, while k controls the steepness or curvature of the trend.

The parameter μ is drawn from a normal distribution with a mean of 0 and a standard deviation of 1. This implies that μ follows a standard normal distribution, centered around 0, with standard deviation of 1. The normal distribution allows for positive and negative values of μ, representing upward or downward trends, respectively.

The parameter k is drawn from a uniform distribution between 0.5 and 1.5. The uniform distribution ensures that k takes values that result in a reasonable and meaningful curvature for the trend.

The plot below depicts separate components obtained from the prior distributions: a sample of the intercept and trend, each represented individually.

Trend and Intercept (image by the author)

Seasonality

Seasonality specification (image by the author)

Each component γ is drawn from a normal distribution with a mean of 0 and a standard deviation of 1.

By combining the cosine and sine functions with different γ, cyclic patterns can modeled to capture the seasonality present in the data. The cosine and sine functions represent the oscillating behavior observed over the period of 52 units (weeks).

The plot below illustrates a sample of the seasonality, intercept and trend obtained from the prior distributions.

Seasonality, Trend and Intercept (image by the author)

Other factors (control variables)

Other Factors specification (image by the author)

Each factor coefficient λ is drawn from a normal distribution with a mean of 0 and a standard deviation of 1, which means that λ can take positive or negative values, representing the direction and magnitude of the influence each factor has on the outcome.

The plot below depicts separate components obtained from the prior distributions: a sample of the intercept, trend, seasonality and control variables (_competitor_salesB, newsletter, holidays and events) each represented individually.

Other factors (combined) (image by the author)

Media Channels

Media Channels prior specification (image by the author)

The distribution for β coefficient of a media channel m is specified as a half-normal distribution, where the standard deviation parameter v is determined by the sum of the total cost associated with media channel m. The total cost reflects the investment or resources allocated to that particular media channel.

Media Transformations

Adstock and Hill Saturation Specification (image by the author)

In these equations, we are modeling the media channels’ behavior using a series of transformations, such as adstock and Hill saturation.

Modeling Marketing Mix using PyMC3

The variable media channels represents the transformed media channels at time point t. It is obtained by applying a transformation to the raw media channel value x. The Hill transformation is controlled by the parameters K a half saturation point (0 < k ≤ 1), and shape S controlling the steepness of the curve (s > 0).

The variable x∗ represents the transformed media channels value at time t after undergoing the adstock transformation. It is calculated by adding the current raw media channel value to the product of the previous transformed value and the adstock decay parameter λ.

Parameters K and S follow gamma distributions with shape and scale parameters both set to 1, while λ follows a beta distribution with shape parameters 2 and 1.

The probability density function of the Hill Saturation parameters K and S are illustrated in the plot below:

shape = 1
scale = 1

gamma_dist = stats.gamma(a=shape, scale=scale)
samples = gamma_dist.rvs(size=1000)

plt.figure(figsize=(20, 6))
sns.histplot(samples, bins=50, kde=False, stat='density', alpha=0.5)
sns.lineplot(x=np.linspace(0, 6, 100), y=gamma_dist.pdf(np.linspace(0, 6, 100)), color='r')

plt.title(f"Gamma Distribution for $K_m$ and $S_m$ with shape={shape} and scale={scale}")
plt.xlabel('x')
plt.ylabel('P(X=x)')

# Show the plot
plt.show()python

Gamma distribution (image by the author)

The probability density function of the adstock parameter λ is shown in the plot below:

Beta distribution (image by the author)

A Note on the specification of the adstock parameter λ:

The probability density function of the Beta(α = 2, β = 1) distribution exhibits a positive trend, indicating that higher values have a higher probability density. In media analysis, different industries and media activities may demonstrate varying decay rates, with most media channels typically exhibiting small decay rates. For instance, Robyn suggests the following ranges of λ decay for common media channels: TV (0.3–0.8), OOH/Print/Radio (0.1–0.4), and digital (0–0.3).

In the context of the Beta(α = 2, β = 1) distribution, higher probabilities are assigned to λ values closer to 1, while lower probabilities are assigned to values closer to 0. Consequently, outcomes or observations near the upper end of the interval [0, 1] are more likely to occur compared to outcomes near the lower end.

Alternatively, in the Bayesian Methods for Media Mix Modeling with Carryover and Shape Effects, the decay parameter is defined as Beta(α = 3, β = 3), whose probability density function is illustrated below. This distribution is symmetric around 0.5, indicating an equal likelihood of observing outcomes at both extremes and near the center of the interval [0, 1].

Beta(3,3) (image by the author)

The plot below depicts separate components obtained from the prior distributions: a sample of the intercept, trend, seasonality, control variables and media channels, each represented individually.

All model components (image by the author)

Combining all components

As mentioned earlier, LightweightMMM models an additive linear regression by combining various components such as intercept, trend, seasonality, media channels, and other factors sampled from their prior distributions to obtain the predictive response. The plot below visualizes the true response and the expected response sampled from the prior predictive distribution.

Visualizing a single sample against the true response value allows us to observe how the model’s prediction compares to the actual outcome for a specific set of parameter values. It can provide an intuitive understanding of how the model performs in that particular instance.

Revenue: True vs. Prior (image by the author)

Prior predictive check

In order get more robust insights, it is generally recommended to sample multiple times from the prior predictive distribution and measure the uncertainty. The prior predictive check helps assess the adequacy of the chosen model and evaluate whether the model’s predictions align with our expectations, before observing any actual data.

The plot depicted below visualizes the prior predictive distribution by showing the expected revenue (mean) at each point, along with measures of uncertainty. We can see that the true revenue falls within the range of the standard deviation, indicating that the model specification is suitable for the observed data.

Prior predictive check (image by the author)

Conclusion

Bayesian Marketing Mix Modeling may take considerable time to master. I hope that this article helped you to enhance your understanding of prior distributions and Bayesian marketing model specifications.

The complete code can be downloaded from my Github repo

Thanks for reading!

The post Understanding Bayesian Marketing Mix Modeling: A Deep Dive into Prior Specifications appeared first on Towards Data Science.

]]>

Alternatives to the p-value Criterion for Statistical Significance (with R code)

Sun, 12 Mar 2023 06:58:32 +0000

Photo by Rommel Davila on Unsplash

In establishing statistical significance, the p-value criterion is almost universally used. The criterion is to reject the null hypothesis (H0) in favour of the alternative (H1), when the p-value is less than the Level Of Significance (α). The conventional values for this decision threshold include 0.05, 0.10, and 0.01.

By definition, the p-value measures how compatible the sample information is with H0: i.e., P(D|H0), the probability or likelihood of data (D) under H0. However, as made clear from the statements of the American Statistical Association (Wasserstein and Lazar, 2016), the p-value criterion as a decision rule has a number of serious deficiencies. The main deficiencies include

the p-value is a decreasing function of sample size;
the criterion completely ignores P(D|H1), the compatibility of data with H1; and
the conventional values of α (such as 0.05) are arbitrary with little scientific justification.

One of the consequences is that the p-value criterion frequently rejects H0 when it is violated by a practically negligible margin. This is especially so when the sample size is large or massive. This situation occurs because, while the p-value is a decreasing function of sample size, its threshold (α) is fixed and does not decrease with sample size. On this point, Wasserstein and Lazar (2016) strongly recommend that the p-value be supplemented or even replaced with other alternatives.

In this post, I introduce a range of simple, but more sensible, alternatives to the p-value criterion which can overcome the above-mentioned deficiencies. They can be classified into three categories:

Balancing P(D|H0) and P(D|H1) (Bayesian method);
Adjusting the level of significance (α); and
Adjusting the p-value.

These alternatives are simple to compute, and can provide more sensible inferential outcomes than those solely based on the p-value criterion, which will be demonstrated using an application with R codes.

Background

Consider a linear regression model

Y = β0 + β1 X1 + … + βk Xk + u,

where Y is the dependent variable, X’s are independent variables, and u is a random error term following a normal distribution with zero mean and fixed variance. We consider testing for

H0: β1 = … = βq = 0,

against H1 that H0 does not hold (q ≤ k). A simple example is H0: β1 = 0; H1: β1 ≠ 0, where q =1.

Borrowing from the Bayesian statistical inference, we define the following probabilities:

Prob(H0|D): posterior probability for H0, which is the probability or likelihood of H0 after the researcher observes the data D;

Prob(H1|D) ≡ 1 – Prob(H0|D): posterior probability for H1;

Prob(D|H0): (marginal) likelihood of data under H0;

Prob(D|H1): (marginal) likelihood of data under H1;

P(H0): prior probability for H0, representing the researcher’s belief about H0 before she observes the data;

P(H1) = 1- P(H0): prior probability for H1.

These probabilities are related (by Bayes rule) as

The main components are as follows:

P10: the posterior odds ratio for H1 over H0, the ratio of the posterior probability of H1 to that of H0;

B10 ≡ P(D|H1)/P(D|H0) called the Bayes factor, the ratio of the (marginal) likelihood under H1 to that of H0;

P(H1)/P(H0): prior odds ratio.

Note that the posterior odds ratio is the Bayes factor multiplied by the prior odds ratio, and that that P10 = B10 if Prob(H0) = Prob(H1) = 0.5.

The decision rule is, if P10 > 0, the evidence favours H1 over H0. This means that, after the researcher observes the data, she favours H1 if P(H1|D) > P(H0|D), i.e., if the posterior probability of H1 is higher than that of H0.

For B10, the decision rule proposed by Kass and Raftery (1995) is given below:

Image created by the author

For example, if B10 = 3, then P(D|H1) = 3 × P(D|H0), which means that the data is compatible with H1 three times more than it is compatible with H0. **** Note that the Bayes factor is sometimes expressed as 2log(B10), where log() is the natural logarithm, in the same scale as the likelihood ratio test statistic.

Formulae for the alternatives

Bayes factor

Wagenmakers (2007) provides a simple approximation formula for the Bayes factor given by

2log(B10) = BIC(H0) – BIC(H1),

where BIC(Hi) denotes the value of the Bayesian information criterion under Hi (i = 0, 1).

Posterior probabilities

Zellner and Siow (1979) provide a formula for P10 given by

where F is the F-test statistic for H0, Γ() is the gamma function, v1 = n-k0-k1–1, n is the sample size, k0 is the number of parameters restricted under H0; and k1 is the number of parameters unrestricted under H0 (k = k0+k1).

Startz (2014) provides a formula for P(H0|D), posterior probability for H0, to test for H0: βi = 0:

where t is the t-statistic for H0: βi = 0, ϕ() is the standard normal density function, and s is the standard error estimator for the estimation of βi.

Adjustment to the p-value

Good (1988) proposes the following adjustment to the p-value:

where p is the p-value for H0: βi = 0. The rule is obtained by considering the convergence rate of the Bayes factor against a sharp null hypothesis. The adjusted p-value (_p_1) increases with sample size n.

Harvey (2017) proposes what is called the Bayesianized p-value

where PR ≡ P(H0)/P(H1) and MBF = exp(-0.5t²) is the minimum Bayes factor while t is the t-statistic.

Significance level adjustment

Perez and Perichhi (2014) propose an adaptive rule for the level of significance derived by reconciling the Bayesian inferential method and likelihood ratio principle, which is written as follows:

where q is number of parameters under H0, α is the initial level of significance such as 0.05, and _χ_²(α,q) is the α-level critical value from the chi-square distribution with q degrees of freedom. In short, the rule adjusts the level of significance as a decreasing function of sample size n.

Application with R Codes

In this section, we apply the above alternative measures to a regression with a large sample size, and examine how the inferential results are different from those obtained solely based on the p-value criterion. The R codes for the calculation of these measures are also provided.

Kamstra et al. (2003) examine the effect of depression linked with seasonal affective disorder on stock return. They claim that the length of sunlight can systematically affect the variation in stock return. They estimate the regression model of the following form:

Image created by the author

where R is the stock return in percentage on day t; M is a dummy variable for Monday; T is a dummy variable for the last trading day or the first five trading days of the tax year; A is a dummy variable for autumn days; C is cloud cover, P is precipitation; G is temperature, and S measures the length of sunlights.

They argue that, with a longer sunlight, investors are in a better mood, and they tend to buy more stocks which will increase the stock price and return. Based on this, their null and alternative hypotheses are

H0: γ3 = 0; H1: γ3 ≠ 0.

Their regression results are replicated using the U.S. stock market data, daily from Jan 1965 to April 1996 (7886 observations). The data range is limited by the cloud cover data which is available only from 1965 to 1996. The full results with further details are available from Kim (2022).

Image created by the author

The above table presents a summary of the regression results under H0 and H1. The null hypothesis H0: γ3 = 0 is rejected at the 5% level of significance, with the coefficient estimate of 0.033, t-statistic of 2.31, and p-value of 0.027. Hence, based on the p-value criterion, the length of sunlight affects the stock return with Statistical Significance: the stock return is expected to increase by 0.033% in response to a 1-unit increase in the length of sunlight.

While this is evidence against the implications of stock market efficiency, it may be argued that whether this effect is large enough to be practically important is questionable.

The values of the alternative measures and the corresponding decisions are given below:

Image created by the author

Note that P10 and p2 are calculated under the assumption that P(H0)=P(H1), which means that the researcher is impartial between H0 and H1 a priori. It is clear from the results in the above table that all of the alternatives to the p-value criterion strongly favours H0 over H1 or cannot reject H0 at the 5% level of significance. Harvey’s (2017) Bayesianized p-value that indicates rejection of H0 at the 10% level of significance.

Hence, we may conclude that the results of Kamstra et al. (2003), based solely on the p-value criterion, are not so convincing under the alternative decision rules. Given the questionable effect size and nearly negligible goodness-of-fit of the model (R² = 0.056), the decisions based on these alternatives seem more sensible.

The R code below shows the calculation of these alternatives (the full code and data are available from the author on request):

# Regression under H1
Reg1 = lm(ret.g ~ ret.g1+ret.g2+SAD+Mon+Tax+FALL+cloud+prep+temp,data=dat)
print(summary(Reg1))
# Regression under H0
Reg0 = lm(ret.g ~ ret.g1+ret.g2+Mon+FALL+Tax+cloud+prep+temp, data=dat)
print(summary(Reg0))

# 2log(B10): Wagenmakers (2007)
print(BIC(Reg0)-BIC(Reg1))

# PH0: Startz (2014)
T=length(ret.g); se=0.014; t=2.314
c=sqrt(2*3.14*T*se^2); 
Ph0=dnorm(t)/(dnorm(t) + se/c)
print(Ph0)

# p-valeu adjustment: Good (1988) 
p=0.0207
P_adjusted = min(c(0.5,p*sqrt(T/100))) 
print(P_adjusted)

# Bayesianized p-value: Harvey (2017)
t=2.314; p=0.0207
MBF=exp(-0.5*t^2)
p.Bayes=MBF/(1+MBF)
print(p.Bayes)

# P10: Zellner and Siow (1979)
t=2.314
f=t^2; k0=1; k1=8; v1 = T-k0-k1- 1
P1 =pi^(0.5)/gamma((k0+1)/2)
P2=(0.5*v1)^(0.5*k0)
P3=(1+(k0/v1)*f)^(0.5*(v1-1))
P10=(P1*P2/P3)^(-1)
print(P10)

# Adaptive Level of Significance: Perez and Perichhi (2014)
n=T;alpha=0.05
q = 1                           # Number of Parameters under H0
adapt1 = ( qchisq(p=1-alpha,df=q) + q*log(n) )^(0.5*q-1)
adapt2 = 2^(0.5*q-1) * n^(0.5*q) * gamma(0.5*q)
adapt3 = exp(-0.5*qchisq(p=1-alpha,df=q))
alphas=adapt1*adapt3/adapt2
print(alphas)

Conclusion

The p-value criterion has a number of deficiencies. Sole reliance on this decision rule has generated serious problems in scientific research, including accumulation of wrong stylized facts, research integrity, and research credibility: see the statements of the American Statistical Association (Wasserstein and Lazar, 2016).

This post presents several alternatives to the p-value criterion for statistical evidence. A balanced and informed statistical decision can be made by considering the information from a range of alternatives. Mindless use of a single decision rule can provide misleading decisions, which can be highly costly and consequential. These alternatives are simple to calculate and can supplement the p-value criterion for better and more informed decisions.

Please Follow Me for more engaging posts!

The post Alternatives to the p-value Criterion for Statistical Significance (with R code) appeared first on Towards Data Science.

]]>