2023-05-03

In hypothesis testing based on Bayesian statistics, there is a method that uses a quantity called the Bayes factor. When Bayesian methods are proposed as a way to overcome the problems of frequentist testing, I believe they often refer to techniques like the Bayes factor. In my work, frequentist testing has mostly been sufficient, so I have not dealt with Bayes factors much. However, I recently studied the topic and would like to provide a brief summary.

To understand hypothesis testing using Bayes factors, let’s compare it with classical frequentist methods and with methods based on Bayesian credible intervals. An analytically tractable example makes this easier to follow, so let’s consider a simple coin toss experiment. Suppose we toss a coin $N$ times and obtain heads $m$ times, and we want to determine whether the coin is unbiased. Let the probability of getting heads be $\mu$. We then have to choose between:

- Null hypothesis ($H_0$): $\mu = 0.5$
- Alternative hypothesis ($H_1$): $\mu \neq 0.5$

First, let’s briefly summarize hypothesis testing using frequentist methods. In frequentist statistics, we find the probability distribution of the test statistic under $H_0$ and calculate the probability of obtaining a value at least as extreme as the one observed (the *p-value*). For the coin toss experiment, we can perform a binomial test or a chi-squared test. If the p-value is smaller than a predetermined threshold (the significance level), we reject $H_0$, by an argument similar to proof by contradiction.
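As a rough illustration of the binomial test for the coin toss, here is a minimal sketch using only the Python standard library. The function names and the data ($N = 100$, $m = 60$) are hypothetical, and the two-sided p-value is defined here as the total probability, under $H_0$, of all outcomes that are no more likely than the observed one. This is one common convention, not the only one.

```python
from math import comb


def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k heads in n tosses when P(heads) = p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binom_test_two_sided(m: int, n: int, p0: float = 0.5) -> float:
    """Exact two-sided p-value: total probability, under H0, of every
    outcome that is no more likely than the observed count m."""
    observed = binom_pmf(m, n, p0)
    probs = [binom_pmf(k, n, p0) for k in range(n + 1)]
    # The small relative tolerance treats floating-point near-ties as ties.
    tail = sum(pk for pk in probs if pk <= observed * (1 + 1e-7))
    return min(1.0, tail)


if __name__ == "__main__":
    N, m = 100, 60  # hypothetical data, not from the article
    print(f"p-value = {binom_test_two_sided(m, N):.4f}")
```

If SciPy is available, `scipy.stats.binomtest` would be the more practical choice. The hand-written version is only meant to make the definition of the p-value explicit.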

Some characteristics of frequentist hypothesis testing are:

- Typically, it deals with cases where the distribution of the statistic can be analytically calculated. Even when this is not the case, if there are a large number of samples, it can sometimes be applied through the central limit theorem. As a result, analyses can often be performed with light numerical calculations.
- While it is possible to reject the null hypothesis, it is not possible to accept it. Frequentist testing has a structure similar to proof by contradiction, so we can conclude that “since there is a contradiction, the assumption is false,” but we cannot actively claim that the null hypothesis is true.
- (We won’t go into detail here, but) it violates the likelihood principle.

Next, let’s look at hypothesis testing using Bayesian credible intervals, one of the hypothesis testing methods based on Bayesian statistics. This method applies the results of parameter estimation in a statistical model to hypothesis testing.

Here, let’s model the coin toss experiment using a binomial distribution. We set the prior of the probability of getting heads, $\mu$, to a beta distribution: $$ \begin{aligned} m &\sim \mathrm{B}(N, \mu), \\ \mu &\sim \mathrm{Beta}(a, b). \end{aligned} $$

Here, $a$ and $b$ are hyperparameters. As is well known, the beta distribution is a conjugate prior for the binomial distribution, so the posterior can be calculated analytically: $$ \mu \mid m \sim \mathrm{Beta}(m+a, N-m+b). $$

When using this for hypothesis testing, we calculate the interval with a high probability of containing $\mu$ in the posterior distribution (a credible interval), and check whether the null hypothesis is included in that interval. In this example, the null hypothesis is $\mu=0.5$, and if this value is outside the credible interval, we can reject the null hypothesis. Conversely, if it is inside the interval, we cannot say that the data actively supports the alternative hypothesis.
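The procedure can be sketched as follows. This is only an illustration under my own assumptions: an equal-tailed 95% interval, a uniform prior ($a = b = 1$), hypothetical data, and a simple midpoint-rule integration of the Beta posterior so that only the standard library is needed. The function names are made up for this example.

```python
from math import exp, lgamma, log


def beta_log_pdf(x: float, a: float, b: float) -> float:
    """Log density of Beta(a, b) at x, for 0 < x < 1."""
    return (lgamma(a + b) - lgamma(a) - lgamma(b)
            + (a - 1) * log(x) + (b - 1) * log(1 - x))


def credible_interval(m, n, a=1.0, b=1.0, level=0.95, grid=20_000):
    """Equal-tailed credible interval of the posterior Beta(m + a, n - m + b),
    obtained by accumulating the density on a midpoint grid over (0, 1)."""
    post_a, post_b = m + a, n - m + b
    xs = [(i + 0.5) / grid for i in range(grid)]  # midpoints avoid x = 0 and x = 1
    weights = [exp(beta_log_pdf(x, post_a, post_b)) / grid for x in xs]
    total = sum(weights)  # close to 1; used to normalize the numerical CDF
    lo_target = (1 - level) / 2
    hi_target = 1 - lo_target
    cum, lower, upper = 0.0, None, None
    for x, w in zip(xs, weights):
        cum += w / total
        if lower is None and cum >= lo_target:
            lower = x
        if cum >= hi_target:
            upper = x
            break
    return lower, upper


if __name__ == "__main__":
    N, m = 100, 70  # hypothetical data, not from the article
    lower, upper = credible_interval(m, N)
    verdict = "reject H0" if not (lower <= 0.5 <= upper) else "cannot reject H0"
    print(f"95% credible interval: ({lower:.3f}, {upper:.3f}) -> {verdict}")
```

With a library such as SciPy, the same interval could be read off the Beta quantile function directly. The grid version just keeps the example dependency-free.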

The method using credible intervals has the following characteristics:

- We can calculate the probability distribution of $\mu$. Thus, the obtained credible interval can be straightforwardly interpreted as the “interval containing the parameter with a certain probability”. In frequentist statistics, $\mu$ was not a random variable, so its distribution could not be calculated. Although there is a similar concept of confidence intervals in frequentist statistics, caution is required in interpreting them since $\mu$ is not a random variable.
- Generally, as the sample size increases, the influence of the prior distribution decreases. In the example above, if $m, N-m \gg a, b$, the result becomes insensitive to the choice of $a$ and $b$.
- To obtain the posterior distribution, heavy numerical computations such as MCMC are generally required.
- It does not treat the null hypothesis and alternative hypothesis equally. In the example above, the null hypothesis is a single point, $\mu=0.5$, but the probability of $\mu$ taking this value is always zero under the posterior distribution $\mathrm{Beta}(m+a, N-m+b)$. Therefore, it seems unsuitable for cases where the focus is on whether the null hypothesis can be accepted or not. This point is discussed in detail in Rouder, Haaf, and Vandekerckhove (2018).
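The point about the diminishing influence of the prior can be checked with a few lines of Python. This sketch compares the posterior mean $(m+a)/(N+a+b)$ under two illustrative priors that I picked arbitrarily, $\mathrm{Beta}(1, 1)$ and $\mathrm{Beta}(20, 20)$, for hypothetical data in which 60% of the tosses are heads.

```python
def posterior_mean(m: int, n: int, a: float, b: float) -> float:
    """Mean of the posterior Beta(m + a, n - m + b)."""
    return (m + a) / (n + a + b)


def prior_gap(n: int) -> float:
    """Difference between the posterior means under two illustrative priors
    when 60% of the n tosses come up heads."""
    m = 6 * n // 10
    flat = posterior_mean(m, n, 1.0, 1.0)      # Beta(1, 1): uniform prior
    peaked = posterior_mean(m, n, 20.0, 20.0)  # Beta(20, 20): concentrated near 0.5
    return abs(flat - peaked)


if __name__ == "__main__":
    for n in (10, 100, 10_000):
        print(f"N = {n:>6}: gap between posterior means = {prior_gap(n):.4f}")
```

The gap shrinks as $N$ grows, which is the sense in which the estimate becomes insensitive to $a$ and $b$.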

Although the method using credible intervals involves Bayesian statistics, it essentially involves parameter estimation of probability distributions. While there are some advantages, such as the interval estimates obtained being easier to interpret compared to frequentist approaches, the method still isn’t suitable for actively adopting null hypotheses.

On the other hand, the Bayes factor is a method that utilizes the model comparison concept of Bayesian statistics and is capable of overcoming this difficulty.

What we truly want to evaluate is the posterior probability of hypothesis $H$ given data $\mathcal{D}$, represented as $p(H|\mathcal{D})$. In Bayesian statistics, hypotheses must be represented as statistical models. Let’s denote this as $\mathcal{M}$. Using Bayes’ theorem, the posterior probability of model $\mathcal{M}$ can be written as: $$ p(\mathcal{M}|\mathcal{D}) = \frac{p(\mathcal{D}|\mathcal{M})p(\mathcal{M})}{p(\mathcal{D})} $$ Taking the ratio of this quantity between the null hypothesis and the alternative hypothesis gives us:

$$ \frac{p(\mathcal{M}_1|\mathcal{D})}{p(\mathcal{M}_0|\mathcal{D})} = \frac{p(\mathcal{D}|\mathcal{M}_1)}{p(\mathcal{D}|\mathcal{M}_0)} \times \frac{p(\mathcal{M}_1)}{p(\mathcal{M}_0)} $$ The first factor on the right-hand side, $$ BF_{10} = \frac{p(\mathcal{D}|\mathcal{M}_1)}{p(\mathcal{D}|\mathcal{M}_0)}, $$ is called the Bayes factor and is the ratio of the marginal likelihoods $p(\mathcal{D}|\mathcal{M})$ of the two models. The second factor, $p(\mathcal{M}_1)/p(\mathcal{M}_0)$, is the prior odds, that is, the ratio of our prior beliefs in the two models. Once the Bayes factor has been calculated, we can therefore update the prior odds to the posterior odds based on the data.

In particular, setting the prior odds to 1 allows us to test hypotheses based on how far the Bayes factor deviates from 1. If the Bayes factor is substantially larger than 1, the alternative hypothesis is accepted, and if it is substantially smaller than 1, the null hypothesis is accepted. While the previous methods are somewhat heuristic, the Bayes factor evaluates the posterior probability of a model directly in accordance with the laws of probability, which makes it theoretically straightforward.
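As a small worked example (a sketch that assumes $\mathcal{M}_0$ and $\mathcal{M}_1$ are the only two candidates, so that their posterior probabilities sum to 1), the posterior odds can be converted back into posterior model probabilities when the prior odds are 1: $$ p(\mathcal{M}_1|\mathcal{D}) = \frac{BF_{10}}{1 + BF_{10}}, \qquad p(\mathcal{M}_0|\mathcal{D}) = \frac{1}{1 + BF_{10}}. $$ For instance, $BF_{10} = 3$ corresponds to $p(\mathcal{M}_1|\mathcal{D}) = 0.75$, while $BF_{10} = 1/3$ corresponds to $p(\mathcal{M}_1|\mathcal{D}) = 0.25$.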

As with any test, the threshold on the Bayes factor for accepting or rejecting a hypothesis has to be chosen to suit each field or situation. For example, some commonly used criteria are summarized in Kass and Raftery (1995).
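To give a feel for what such criteria look like, the sketch below encodes the scale as I recall it from Kass and Raftery (1995), which grades evidence by $2\ln BF$ with cut points at 2, 6, and 10. Please check the thresholds and wording against the paper before relying on them. The function name and the symmetric treatment of $BF_{10} < 1$ are my own choices.

```python
from math import log


def evidence_label(bf10: float) -> tuple[str, str]:
    """Return (favored model, grade) on a 2*ln(BF) scale.

    Cut points (2, 6, 10) follow my reading of Kass and Raftery (1995);
    treat them as an assumption and verify against the paper.
    """
    if bf10 <= 0:
        raise ValueError("A Bayes factor must be positive.")
    favored = "M1" if bf10 >= 1 else "M0"
    strength = abs(2 * log(bf10))  # for BF10 < 1 this equals 2*ln(BF01)
    if strength < 2:
        grade = "not worth more than a bare mention"
    elif strength < 6:
        grade = "positive"
    elif strength < 10:
        grade = "strong"
    else:
        grade = "very strong"
    return favored, grade


if __name__ == "__main__":
    for bf in (0.001, 0.5, 1.5, 10.0, 200.0):
        print(bf, evidence_label(bf))
```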

The crux of the Bayes factor lies in the marginal likelihood. Let’s use the coin toss example to calculate it. Just like in the case of credible intervals, if we model the coin toss with a binomial distribution, under $\mathcal{M}_0$ with $\mu=1/2$, we get: $$ p(\mathcal{D}|\mathcal{M}_0) = \binom{N}{m}2^{-N} $$

On the other hand, the case of $\mathcal{M}_1$ is a bit more complicated. Since $\mu$ is a continuous quantity that is not fixed under $\mathcal{M}_1$, we must integrate over it: $$ p(\mathcal{D}|\mathcal{M}_1) = \int_0^1d\mu\ p(\mathcal{D}|\mathcal{M}_1, \mu)p(\mu|\mathcal{M}_1). $$

Let’s adopt the same binomial distribution as before for the likelihood $p(\mathcal{D}|\mathcal{M}_1, \mu)$: $$ p(\mathcal{D}|\mathcal{M}_1, \mu) = \binom{N}{m}\mu^m(1-\mu)^{N-m}, $$ and assume a Beta distribution $\mathrm{Beta}(a, b)$ for the prior distribution of $\mu$, $p(\mu|\mathcal{M}_1)$. The density function of the Beta distribution is: $$ p(\mu|\mathcal{M}_1) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\mu^{a-1}(1-\mu)^{b-1} $$ With this, the marginal likelihood can be analytically calculated as: $$ p(\mathcal{D}|\mathcal{M}_1) = \frac{N!}{m!(N-m)!}\cdot\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\cdot\frac{\Gamma(m+a)\Gamma(N-m+b)}{\Gamma(a+b+N)} $$ In particular, when $m, N-m \gg a, b$, we can obtain: $$ \ln p(\mathcal{D}|\mathcal{M}_1) \simeq \ln\left(\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\right) +(a-1)\ln m+(b-1)\ln(N-m)-(a+b-1)\ln N. $$
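The closed-form expressions above are easy to evaluate numerically. The following is a minimal sketch using `math.lgamma` from the standard library, working in log space to avoid overflow. The function names and the data are hypothetical. It also includes the large-sample approximation so that it can be compared with the exact value.

```python
from math import exp, lgamma, log


def log_binom_coeff(n: int, m: int) -> float:
    """ln of the binomial coefficient C(n, m)."""
    return lgamma(n + 1) - lgamma(m + 1) - lgamma(n - m + 1)


def log_marginal_m0(m: int, n: int) -> float:
    """ln p(D | M0) with mu fixed at 1/2."""
    return log_binom_coeff(n, m) - n * log(2)


def log_marginal_m1(m: int, n: int, a: float = 1.0, b: float = 1.0) -> float:
    """ln p(D | M1) with a Beta(a, b) prior on mu (exact closed form)."""
    return (log_binom_coeff(n, m)
            + lgamma(a + b) - lgamma(a) - lgamma(b)
            + lgamma(m + a) + lgamma(n - m + b) - lgamma(a + b + n))


def log_marginal_m1_approx(m: int, n: int, a: float, b: float) -> float:
    """Large-sample approximation, valid when m and n - m are much larger than a, b."""
    return (lgamma(a + b) - lgamma(a) - lgamma(b)
            + (a - 1) * log(m) + (b - 1) * log(n - m) - (a + b - 1) * log(n))


def bayes_factor_10(m: int, n: int, a: float = 1.0, b: float = 1.0) -> float:
    """BF10 = p(D | M1) / p(D | M0)."""
    return exp(log_marginal_m1(m, n, a, b) - log_marginal_m0(m, n))


if __name__ == "__main__":
    N, m = 100, 60  # hypothetical data, not from the article
    print(f"BF10 with a uniform prior: {bayes_factor_10(m, N):.3f}")
    print("exact :", log_marginal_m1(6000, 10_000, 2.0, 3.0))
    print("approx:", log_marginal_m1_approx(6000, 10_000, 2.0, 3.0))
```

A handy sanity check is the uniform prior $a = b = 1$, for which the marginal likelihood under $\mathcal{M}_1$ reduces to $1/(N+1)$ regardless of $m$.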

As can be seen from this expression, the marginal likelihood depends on the prior distribution of $\mu$, $p(\mu|\mathcal{M}_1)$. In particular, the influence of the prior distribution persists even as the sample size increases. Of course, we are free to choose a prior distribution other than the Beta distribution, which would provide even more degrees of freedom than what this expression represents.

In the case of Bayesian credible intervals, the influence of the prior distribution tended to diminish as the sample size increased. The marginal likelihood, on the other hand, is obtained by multiplying the likelihood by the prior distribution and integrating, so its value will be small if the prior distribution is either too narrow or too wide.[^1] Therefore, for example, using an overly uninformative distribution can result in a smaller marginal likelihood for the alternative hypothesis and thus greater support for the null hypothesis.
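To see this sensitivity concretely, the sketch below evaluates $BF_{10}$ for the same hypothetical data ($N = 100$, $m = 60$) under symmetric priors $\mathrm{Beta}(a, a)$ of very different widths. The particular values of $a$ are arbitrary choices of mine: a small $a$ pushes the prior mass toward $\mu = 0$ and $\mu = 1$, while a large $a$ concentrates it tightly around $\mu = 0.5$.

```python
from math import exp, lgamma, log


def bayes_factor_10(m: int, n: int, a: float, b: float) -> float:
    """BF10 for the coin toss with a Beta(a, b) prior under M1.

    The binomial coefficient appears in both marginal likelihoods
    and cancels in the ratio, so it is omitted here.
    """
    log_m1 = (lgamma(a + b) - lgamma(a) - lgamma(b)
              + lgamma(m + a) + lgamma(n - m + b) - lgamma(a + b + n))
    log_m0 = -n * log(2)
    return exp(log_m1 - log_m0)


if __name__ == "__main__":
    N, m = 100, 60  # hypothetical data, not from the article
    for a in (0.01, 1.0, 10.0, 1000.0):
        print(f"prior Beta({a}, {a}): BF10 = {bayes_factor_10(m, N, a, a):.3f}")
```

In this setup the very dispersed prior yields a Bayes factor far below 1, a moderately concentrated prior yields the largest value, and an extremely concentrated prior makes $\mathcal{M}_1$ nearly indistinguishable from $\mathcal{M}_0$, so the Bayes factor returns toward 1. The same data can thus lead to different conclusions depending on the prior.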

Because of this property, when using Bayes factors for hypothesis testing, it is important to utilize domain knowledge in the design of prior distributions or to use distributions that have been agreed upon within the community. Additionally, sensitivity analysis regarding the choice of prior distributions is necessary. While there are default priors designed to alleviate this burden to some extent, there is criticism against blindly using them. There are various discussions about prior distributions on Reddit, and in a more practical setting, the importance of constructing prior distributions using past experimental results has been advocated in Bing’s A/B testing case study.

Apart from sensitivity to prior distributions, let’s summarize some features of Bayes factors:

- With Bayes factors, it is possible to accept the null hypothesis. This is because both the null hypothesis and the alternative hypothesis are treated equally.
- Similar to the case with credible intervals, in general, it is not possible to perform the integrations analytically, so some numerical calculations are required. In R, there is a package called BayesFactor for this purpose.
- It allows for the comparison of any two hypotheses. In the case of frequentist methods or Bayesian credible intervals, the null hypothesis and the alternative hypothesis had the same binomial distribution for coin tosses, and the value of the parameter $\mu$ was examined. With Bayes factors, it is possible to compare models with completely different distributions.

The above content is summarized in the table below:

| Method | Accepting Null Hypothesis | Comparison of Any Hypotheses | Dependence on Prior | Computational Cost |
|---|---|---|---|---|
| Frequentist | Not possible | Not possible | - | Low |
| Bayesian Credible Intervals | Not possible | Not possible | Low | Generally high |
| Bayes Factors | Possible | Possible | High | Generally high |

Personally, considering the difficulty of calculations and sensitivity analysis, my impression is that the use of Bayes factors in practice seems to be limited to cases where analysis is not possible with other methods, such as wanting to accept the null hypothesis.

For typical RCT-based evaluations of effectiveness, I think it is easier to use frequentist hypothesis testing after properly designing the effect size and sample size.

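References:

- Rouder, Haaf, and Vandekerckhove (2018), Bayesian inference for psychology, part IV: parameter estimation and Bayes factors
- Chapter 15 Bayes factors | An Introduction to Bayesian Data Analysis for Cognitive Science
- Bayes factor
- I compared how much the conclusions change depending on the hypothesis testing method (仮説検定の手法ごとに結論がどれほど変わるか比べてみた) | Blogicoffee
- Hypothesis testing in Bayesian statistics (ベイズ統計の仮説検定) - Qiita
- Evaluating psychological hypotheses and models with Bayes factors (ベイズファクターによる心理学的仮説・モデルの評価)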

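[^1]: For example, there is an intuitive explanation of why the marginal likelihood becomes small when the prior is too narrow or too wide in Chapter 3 of PRML (Bishop, Pattern Recognition and Machine Learning).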