This is an introduction to Bayes’ Theorem: its definition, use, and importance. The term introduction is used because there is much more that can, should, and will be discussed relative to Bayes’ Theorem. The intent here is to lay a foundational understanding so that more in-depth topics, such as Bayesian Inference, can be covered in subsequent posts. Bayesian Inference (and all that goes with it, such as Markov Chain Monte Carlo, or MCMC) is quite fascinating and certainly worth further study. It will, however, only be briefly discussed here.
The following sections are presented herein:
This section is not a comprehensive review of probability, but will provide a brief overview of probability sufficient to understand Bayes’ Theorem.
To understand probability, some definitions are needed first:
There are three definitions of probability: classical, empirical, and subjective. Given an event \(A\), the probability of \(A\), \(P(A)\), is defined as follows:

Classical:

\[ P(A) = \frac{\text{Number of outcomes that satisfy the event}}{\text{Total number of outcomes in sample space}} \]

Empirical:

\[ P(A) = \frac{\text{Number of times the event A occurs in repeated trials}}{\text{Total number of trials in a random experiment}} \]
The following briefly highlights key fundamental probability rules (using the term rule loosely) and definitions:
Probability of an Event: The probability of an event must be between zero and one, inclusive. That is, for an event \(A\), \(0 \leq P(A) \leq 1\). A probability of 1 implies the event happens with absolute certainty. A probability of 0 implies the event is an impossibility.
Sum of Probabilities: The sum of the probabilities of all possible outcomes must equal 1.
Complement: For a given event, \(A\), the complement of \(A\) consists of those outcomes in the sample space that are not outcomes in \(A\). There are many different ways to denote the complement of \(A\); the notation \(\text{~}A\) will be used here. Thus, \(P(\text{~}A) = 1 - P(A)\).
Intersection: Given two events, \(A\) and \(B\), the intersection of \(A\) and \(B\), denoted as \(A \cap B\), is the combination of all outcomes that are elements in \(A\) and \(B\) (i.e., in both).
Joint Probability: Given two events, \(A\) and \(B\), the probability of \(A\) and \(B\) is defined as \(P(A \cap B)\). Note that joint probabilities are sometimes referred to as conjoint probabilities.
Disjoint Events: Given two events, \(A\) and \(B\), these events are disjoint, or mutually exclusive, if they have no outcomes in common: \(A \cap B = \emptyset\). The joint probability of disjoint events is 0: \(P(A \cap B) = 0\).
Union: Given two events, \(A\) and \(B\), the union of \(A\) and \(B\), denoted as \(A \cup B\), is the combination of all outcomes that are elements in \(A\) or \(B\).
Addition Rule: The probability of \(A \cup B\) is: \(P(A \cup B) = P(A) + P(B) − P(A \cap B)\). If \(A\) and \(B\) are disjoint, then \(P(A \cup B) = P(A) + P(B)\).
Independence: Two events, \(A\) and \(B\), are independent if the occurrence of one event does not affect the occurrence of the other. In probabilistic terms, the two events are independent if \(P(A)\) is the same whether or not \(B\) occurs, and vice versa; that is, \(P(A|B) = P(A)\).
Multiplication Rule: If two events, \(A\) and \(B\), are independent then the probability of both occurring is the product of the two individual probabilities. That is: \(P(A \cap B) = P(A) \cdot P(B)\). This rule scales to \(n\) independent events.
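The classical definition and the multiplication rule can be checked by enumerating a small sample space. The following Python sketch (two fair dice, chosen purely for illustration) verifies that two events on independent dice satisfy \(P(A \cap B) = P(A) \cdot P(B)\):

```python
from fractions import Fraction
from itertools import product

# Sample space for two fair six-sided dice (a hypothetical illustration).
space = list(product(range(1, 7), repeat=2))

def prob(event):
    # Classical definition: outcomes satisfying the event / total outcomes.
    return Fraction(sum(1 for o in space if event(o)), len(space))

A = lambda o: o[0] % 2 == 0   # event A: first die is even
B = lambda o: o[1] > 4        # event B: second die shows 5 or 6

p_a = prob(A)                            # 1/2
p_b = prob(B)                            # 1/3
p_both = prob(lambda o: A(o) and B(o))   # 1/6

# Multiplication rule for independent events: P(A ∩ B) = P(A) · P(B)
assert p_both == p_a * p_b
```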
Conditional Probability: For two events, \(A\) and \(B\), the conditional probability of \(A\) given \(B\) is the probability that the event \(A\) occurs given that the event \(B\) has already occurred. The notation for conditional probability is: \(P(A|B)\) (the probability of \(A\) given \(B\)).
The conditional probability \(P(A|B)\) is defined as:
\[ P(A|B) = \frac{P(A \cap B)}{P(B)} \]
whereas the conditional probability \(P(B|A)\) is defined as:
\[ P(B|A) = \frac{P(A \cap B)}{P(A)} \]
and, in general, \(P(A|B) \ne P(B|A)\). So, the question becomes: are these conditional probabilities, \(P(A|B)\) and \(P(B|A)\), related? A careful (or perhaps not so careful) study of the above equations shows that a common term in both is \(P(A \cap B)\). If one were to solve for \(P(A \cap B)\) in the \(P(B|A)\) equation:
\[ \begin{align} P(B|A) &= \frac{P(A \cap B)}{P(A)} \\ \\ P(A \cap B) &= P(B|A)P(A) \end{align} \]
and substituting into the \(P(A|B)\) equation yields:
\[ \begin{align} P(A|B) &= \frac{P(A \cap B)}{P(B)} \\ \\ P(A|B) &= \frac{P(B|A)P(A)}{P(B)} \end{align} \]
The conditional probability \(P(A|B)\) is now expressed in terms of the conditional probability \(P(B|A)\).
Repeating this idea, but now solving for \(P(A \cap B)\) in the \(P(A|B)\) equation and then making the appropriate substitution yields:
\[ \begin{align} P(A|B) &= \frac{P(A \cap B)}{P(B)} \\ \\ P(A \cap B) &= P(A|B)P(B) \\ \\ P(B|A) &= \frac{P(A \cap B)}{P(A)} \\ \\ P(B|A) &= \frac{P(A|B)P(B)}{P(A)} \end{align} \]
The conditional probability \(P(B|A)\) is now expressed in terms of the conditional probability \(P(A|B)\). Of course, \(P(B|A)\) could have been directly determined algebraically from \(P(A|B)\):
\[ \begin{align} P(A|B) &= \frac{P(B|A)P(A)}{P(B)} \\ \\ P(B|A)P(A) &= P(A|B)P(B) \\ \\ P(B|A) &= \frac{P(A|B)P(B)}{P(A)} \end{align} \]
This expression of the relationship between conditional probabilities is Bayes’ Theorem. The general definition, thus, is:
\[ P(A|B) = \frac{P(B|A)P(A)}{P(B)} \]
Bayes’ Theorem describes the fixed relationship between \(P(A)\), \(P(B)\), \(P(A|B)\), and \(P(B|A)\). It is sometimes known as the flipping formula, for it allows the direct computation of one conditional probability from the other - that is, if one conditional probability is known, Bayes’ Theorem provides a mechanism to flip the direction of the condition. This ability to flip conditional probabilities is a key property, and use, of Bayes’ Theorem.
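As a quick numerical sanity check, the flip can be verified in a few lines of Python (the joint and marginal probabilities below are hypothetical, chosen only for illustration):

```python
# Hypothetical probabilities, chosen only for illustration.
p_a_and_b = 0.12   # P(A ∩ B)
p_a = 0.30         # P(A)
p_b = 0.40         # P(B)

p_a_given_b = p_a_and_b / p_b   # P(A|B) = 0.30
p_b_given_a = p_a_and_b / p_a   # P(B|A) = 0.40

# Flip P(B|A) into P(A|B) using Bayes' Theorem.
flipped = p_b_given_a * p_a / p_b
assert abs(flipped - p_a_given_b) < 1e-9
```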
There are other ways to express Bayes’ Theorem as will be explored in the Breast Cancer and the Bayesian Inference Introduction examples below.
A very useful tool is the creation of a conjoint table with joint and marginal probabilities. Consider the (related) events \(A_{1}\) and \(A_{2}\) and the (related) events \(B_{1}\) and \(B_{2}\) organized in a 2 x 2 table as shown here (note: a 2 x 2 table is used for simplicity, but the idea scales, as will be seen in the Breast Cancer example below):
Each cell contains the corresponding empirical data relative to the cell. The rows are summed, as are the columns, with a grand total being the sum over all rows (or, equivalently, all columns). Each cell can then be expressed as a probability as shown here:
Each cell within the body of the table is the intersection of the two events represented by that cell. Hence, each cell represents the joint probability. The probabilities for each row and column are the marginal probabilities for they are in the margin of the table. A conjoint table with joint and marginal probabilities, then, is as shown here:
Note that the marginal probability \(P(B_{1})\), for example, is the sum of the joint probabilities in that row:
\[ P(B_{1}) = P(A_{1} \cap B_{1}) + P(A_{2} \cap B_{1}) \]
Similar sums exist for the other marginal probabilities.
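For a concrete (hypothetical) illustration, the marginal sums can be computed directly from a 2 x 2 table of joint probabilities:

```python
# Hypothetical 2 x 2 joint-probability table:
# rows correspond to B1, B2; columns to A1, A2.
joint = [[0.10, 0.25],
         [0.30, 0.35]]

# The marginal of each B-row is the sum of joint probabilities across that row:
p_b1 = sum(joint[0])            # P(B1) = P(A1 ∩ B1) + P(A2 ∩ B1)
p_b2 = sum(joint[1])

# The marginal of each A-column is the sum down that column:
p_a1 = joint[0][0] + joint[1][0]
p_a2 = joint[0][1] + joint[1][1]

# All marginals on each axis must sum to 1 (sum of probabilities rule).
assert abs(p_b1 + p_b2 - 1.0) < 1e-9
assert abs(p_a1 + p_a2 - 1.0) < 1e-9
```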
Some points to note:
As an example of a conjoint table with joint and marginal probabilities and its usefulness, consider the marital status of individuals in the US 15 years of age or older (US Census data, 2013). The following table shows the breakdown of males and females for each of the five marital status categories: never been married, married, widowed, divorced, or separated.
Having this empirical data thus represented, the probabilities of each cell can be computed:
Knowing that
a conjoint table with joint and marginal probabilities can be produced as follows:
A conjoint table with joint and marginal probabilities is quite useful in computing conditional probabilities:
Given a person who was selected at random was male, what is the probability that he has never been married? \[ \begin{align} P(\text{Never | Male}) &= \frac{P(\text{Male} \cap \text{Never})}{P(\text{Male})} \\ \\ &= \frac{0.16554}{0.48747} \\ \\ &= 0.33959 \end{align} \]
Given a person who was selected at random was male, what is the probability that he is divorced? \[ \begin{align} P(\text{Divorced | Male}) &= \frac{P(\text{Male} \cap \text{Divorced})}{P(\text{Male})} \\ \\ &= \frac{0.04377}{0.48747} \\ \\ &= 0.08980 \end{align} \]
Given a person who was selected at random was female, what is the probability that she is widowed? \[ \begin{align} P(\text{Widowed | Female}) &= \frac{P(\text{Female} \cap \text{Widowed})}{P(\text{Female})} \\ \\ &= \frac{0.04457}{0.51253} \\ \\ &= 0.08696 \end{align} \]
Next consider a slightly different question from the first one posed above:
\[ \begin{align} P(\text{Male | Never}) &= \frac{P(\text{Male} \cap \text{Never})}{P(\text{Never})} \\ \\ &= \frac{0.16554}{0.31276} \\ \\ &= 0.52929 \end{align} \]
This is not the same as the probability that the person has never been married given that the person is male. It is the flipped conditional probability! To see Bayes’ Theorem in action, assume that all that was known was the conditional probability \(P(\text{Never | Male})\), but there was interest in knowing the conditional probability \(P(\text{Male | Never})\). Applying Bayes’ Theorem:
\[ \begin{align} P(\text{Male | Never}) &= \frac{P(\text{Never | Male})P(\text{Male})}{P(\text{Never})} \\ \\ &= \frac{(0.33959)(0.48747)}{0.31276} \\ \\ &= 0.52929 \end{align} \]
This value of \(P(\text{Male | Never}) = 0.52929\) matches that which was computed from the conjoint table with joint and marginal probabilities, as it should! For recall that Bayes’ Theorem established the relationship between \(P(A|B)\) and \(P(B|A)\).
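These computations can also be scripted. The following Python sketch uses the rounded probabilities from the table above and confirms that the direct table lookup and the Bayes’ Theorem flip agree (small differences in the last digits are expected from rounding):

```python
# Rounded probabilities from the conjoint table above.
p_male_and_never = 0.16554   # P(Male ∩ Never)
p_male = 0.48747             # P(Male)
p_never = 0.31276            # P(Never)

# Direct table computation of P(Male | Never).
direct = p_male_and_never / p_never

# The same value via Bayes' Theorem, flipping P(Never | Male).
p_never_given_male = p_male_and_never / p_male   # P(Never | Male)
via_bayes = p_never_given_male * p_male / p_never

assert abs(direct - via_bayes) < 1e-9
```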
Not to minimize breast cancer in any way, but the following is a classic example and is worth studying. Given the following data:
The problem statement is:
The first way to go about answering this question is to create the beginnings of a conjoint table with joint and marginal probabilities:
and begin filling in with the given information.
From this information, the remaining cells of the table are easily computed:
Having the cells filled in the probabilities can be easily computed, thus producing the conjoint table with joint and marginal probabilities:
From the problem statement, the conditional probability \(P(\text{Cancer | Test Pos})\) is desired and using the table:
\[ \begin{align} P(\text{Cancer | Test Pos}) &= \frac{P(\text{Cancer} \cap \text{Test Pos})}{P(\text{Test Pos})} \\ \\ &= \frac{0.016}{0.114} \\ \\ &= 0.14035 \end{align} \]
So, roughly \(14\%\) of the women of this age group who participate in a routine screening and receive positive mammograms actually have breast cancer.
Let’s look at another view of this using Bayes’ Theorem. Recognize that
\[ \begin{align} P(\text{Cancer | Test Pos}) &= \frac{P(\text{Test Pos | Cancer})P(\text{Cancer})}{P(\text{Test Pos})} \\ \\ &= \frac{\left(\frac{0.016}{0.02}\right)(0.02)}{0.114} \\ \\ &= \frac{0.8(0.02)}{0.114} \\ \\ &= 0.14035 \end{align} \]
This value of \(P(\text{Cancer | Test Pos}) = 0.14035\) matches that which was computed from the conjoint table with joint and marginal probabilities, as it should!
In this particular case, the information provided was the probability of a positive test result given the existence of cancer. What was really asked for was the probability of cancer given the positive test result. Bayes’ Theorem provided the means to flip the conditional probabilities, showing, again, that Bayes’ Theorem establishes the relationship between \(P(A|B)\) and \(P(B|A)\)! It is often the case that one conditional probability is easier to determine, or estimate, than the other. Bayes’ Theorem allows one to take full advantage of this!
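The numbers above can be reproduced from the underlying screening counts used in this example (10,000 women, 200 with cancer, 160 true positives, 980 false positives):

```python
# Screening cohort counts behind the example's probabilities.
total = 10_000     # women screened
true_pos = 160     # have cancer and test positive
false_pos = 980    # do not have cancer but test positive

p_cancer_and_pos = true_pos / total        # P(Cancer ∩ Test Pos) = 0.016
p_pos = (true_pos + false_pos) / total     # P(Test Pos) = 0.114

# Conditional probability of cancer given a positive test.
p_cancer_given_pos = p_cancer_and_pos / p_pos
assert round(p_cancer_given_pos, 5) == 0.14035
```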
Of note is that there are other ways of expressing Bayes’ Theorem. Though the form above is quite useful in the computation of conditional probabilities, these other forms can lead to other powerful uses, such as inference.
Given an event \(A\), there is also its complement, \(\text{~}A\); likewise for an event \(B\). The corresponding conjoint table with joint and marginal probabilities is:
Bayes’ Theorem can be expressed in a slightly different form:
\[ \begin{align} P(A|B) &= \frac{P(B|A)P(A)}{P(B)} \\ \\ &= \frac{P(B|A)P(A)}{P(A \cap B) + P(\text{~}A \cap B)} \\ \\ &= \frac{P(B|A)P(A)}{P(B|A)P(A) + P(B|\text{~}A)P(\text{~}A)} \end{align} \]
This form of Bayes’ Theorem is key for use in Bayesian Inference. (Note that in the general form of Bayes’ Theorem, \(P(B)\) - the probability of the event being conditioned on - is often unknown or difficult to acquire. The expansion into \(P(B|A)P(A) + P(B|\text{~}A)P(\text{~}A)\), however, is often computable. The example here showed its derivation from the logic of the table, but it is a direct consequence of the Law of Total Probability.)
Applying this to our above breast cancer example, define the events \(A\) and \(B\):
To answer the question, “How many of the women of this age group who participate in a routine screening and receive positive mammograms actually have breast cancer?”, we need to find \(P(A|B)\). From the information given, we can determine:
\[ P(A) = \frac{200}{10000} = 0.02 \]
\[ P(\text{~}A) = (1 - 0.02) = 0.98 \]
\[ P(B|A) = \frac{160}{200} = 0.8 \]
\[ P(B|\text{~}A) = \frac{980}{9800} = 0.1 \]
Hence,
\[ \begin{align} P(A|B) &= \frac{P(B|A)P(A)}{P(B|A)P(A) + P(B|\text{~}A)P(\text{~}A)} \\ \\ &= \frac{0.8(0.02)}{0.8(0.02) + 0.1(0.98)} \\ \\ &= \frac{0.016}{0.016 + 0.098} \\ \\ &= 0.14035 \end{align} \]
In Bayesian Inference terms, there are two competing hypotheses: \(A\) and \(\text{~}A\) (this discussion is limited to two). To each of these, probabilities are assigned, \(P(A)\) and \(P(\text{~}A)\). These are known as the prior probabilities. Then new data, \(B\), is collected (or observed). From this, the likelihood of observing the data under each of the two hypotheses, \(P(B|A)\) and \(P(B|\text{~}A)\), must be computed (this is the hardest part). Finally, Bayes’ Theorem is used to update the probability of the hypothesis of interest given the data, known as the posterior probability.
Let’s revisit:
\[ P(A|B) = \frac{P(B|A)P(A)}{P(B|A)P(A) + P(B|\text{~}A)P(\text{~}A)} \]
or
\[ \text{Posterior Probability of hypothesis A given data B equals} \]
\[ \frac{\text{(Likelihood of data B given hyp A)(Prior probability of hyp A)}}{\text{(Likelihood of data B given hyp A)(Prior probability of hyp A)} + \text{(Likelihood of data B given hyp ~A)(Prior probability of hyp ~A)}} \]
So,
\[ P(B|A) = \text{Likelihood of data B given hyp A} = 0.8 \]
\[ P(A) = \text{Prior probability of hyp A} = 0.02 \]
\[ P(B|\text{~}A) = \text{Likelihood of data B given hyp ~A} = 0.1 \]
\[ P(\text{~}A) = \text{Prior probability of hyp ~A} = 0.98 \]
\[ P(A|B) = \text{Posterior probability of hypothesis A given data B} = 0.14035 \]
This interpretation is how a subjective belief can rationally change to account for new data (observations).
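This update can be captured in a small helper function. The sketch below (the function name is my own) recomputes the breast cancer posterior from its prior and likelihoods, using the total-probability expansion of the evidence term:

```python
def posterior(prior_a, lik_b_given_a, lik_b_given_not_a):
    # Posterior P(A|B) via Bayes' Theorem, with the Law of Total
    # Probability expanding the evidence term P(B).
    prior_not_a = 1 - prior_a
    evidence = lik_b_given_a * prior_a + lik_b_given_not_a * prior_not_a
    return lik_b_given_a * prior_a / evidence

# Prior and likelihoods from the breast cancer example above.
p = posterior(prior_a=0.02, lik_b_given_a=0.8, lik_b_given_not_a=0.1)
assert round(p, 5) == 0.14035
```

Collecting more data simply repeats the update: yesterday’s posterior becomes today’s prior.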
A lot more will be said about this in a subsequent post(s), inclusive of MCMC (Markov Chain Monte Carlo).
Unfortunately, there is not always a clear understanding of conditional probabilities (and good statistical reasoning). In some cases, that misunderstanding can arguably be intentional, to achieve some higher agenda. Regardless, this misunderstanding can, and often does, lead to some very unfortunate consequences, especially when encountered in courts of law. For this reason, the term Prosecutor’s Fallacy has come to be the general term for this type of situation (there are other, less specific, terms like inversion of the conditional).
In general, the Prosecutor’s Fallacy is the belief that \(P(A|B) = P(B|A)\).
To see the prosecutor’s fallacy in action, consider the following commonly used hypothetical example, where a defendant is on trial for some crime based on this person matching the collected evidence. Define the following events:
Events:
The relevant conditional probabilities are:
Assumptions:
Givens:
Prosecutor Action:
Trial Verdict:
Problem Statement:
What the jury really needed to hear, and understand, was the probability that the collected evidence could be attributed to an innocent defendant. That is, \(P(I|E)\). Enter Bayes’ Theorem, which states:
\[ \begin{align} P(I|E) &= \frac{P(E|I)P(I)}{P(E)} \\ \\ &= \frac{P(E|I)P(I)}{P(E|I)P(I) + P(E|\text{~}I)P(\text{~}I)} \end{align} \]
From the information presented, the probabilities needed are:
Hence, \(P(I|E)\) can now be computed using Bayes’ Theorem:
\[ \begin{align} P(I|E) &= \frac{P(E|I)P(I)}{P(E)} \\ \\ &= \frac{P(E|I)P(I)}{P(E|I)P(I) + P(E|\text{~}I)P(\text{~}I)} \\ \\ &= \frac{0.00001(0.999998)}{0.00001(0.999998) + 1(0.000002)} \\ \\ &= \frac{0.00001}{0.00001 + 0.000002} \\ \\ &= \frac{0.00001}{0.000012} \\ \\ P(I|E) &= 0.83333 \end{align} \]
Thus, there is an (approximate) 83% (or 5 out of 6) chance that a person who matched the evidence is innocent! This implies an (approximate) 17% (or 1 out of 6) chance of guilt, which is a far cry from the 1 out of 100,000 chance alleged by the prosecutor!
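The jury-relevant computation can be reproduced in a few lines of Python, using the probabilities given above:

```python
# Probabilities from the hypothetical trial example above.
p_e_given_i = 0.00001     # P(E|I): an innocent person matches the evidence
p_i = 0.999998            # P(I): prior probability of innocence
p_e_given_not_i = 1.0     # P(E|~I): the guilty person matches with certainty
p_not_i = 0.000002        # P(~I): prior probability of guilt

# Bayes' Theorem with the evidence term expanded by total probability.
p_i_given_e = (p_e_given_i * p_i) / (
    p_e_given_i * p_i + p_e_given_not_i * p_not_i)

assert round(p_i_given_e, 4) == 0.8333
```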
The prosecutor’s fallacy (understood as a misuse, or misunderstanding, of conditional probabilities) is a very real phenomenon, not only in the judicial system but in many other disciplines as well - basically anywhere there is evidence and that evidence is applied to some defendant (the terms evidence and defendant are relative to the context of the problem). Avoidance of this misuse (more often, misunderstanding) can be realized by ensuring that the sought-after probability is answering the right question(s). That is, it is important to see how the evidence applies to the defendant, not the other way around.
Any introduction to Bayes’ Theorem would be quite remiss if at least some brief information on Bayes himself was not given.
Thomas Bayes (1701-1761) was an English mathematician and Nonconformist theologian (Nonconformists were Protestant Christians who separated from the Anglican Church). He was (apparently) the first to define a way to use probability inductively and to establish a mathematical basis for probabilistic inference. That is, he defined a way to calculate the probability that an event will occur in future trials based on the frequency of its occurrence in prior trials.
It wasn’t until the late 1740s that he did the work that now exists as a theorem bearing his name. Bayes did not publish his work, either by choice or by death. It is thought that Bayes may not have seen any immediate practical value to his work relative to the times in which he lived. After his death in 1761, his friend Richard Price (1723-1791) discovered Bayes’ work and had it published as “An Essay towards solving a problem in the doctrine of chances” in the Philosophical Transactions of the Royal Society of London 53 (1763). For anyone interested, the original published article can be read by visiting the Royal Society Publishing organization here: An Essay towards solving a problem in the doctrine of chances.
Bayes’ work was largely ignored. One reason, perhaps, was that the frequentist view of statistics was the predominant view of the statistical community; the Bayesian approach was subjective and thought to be unscientific. Another potential reason was that the Bayesian approach involved computations that were not readily done. With recent advances in computing technology, the Bayesian approach has become more widely understood and accepted. Indeed, the Bayesian approach is used to forecast weather, detect forgeries, and identify spam emails, among many other applications.
To explore the use of the theorem that would not die, I highly recommend reading the excellent book, “The Theory That Would Not Die: How Bayes’ Rule Cracked the Enigma Code, Hunted Down Russian Submarines & Emerged Triumphant from Two Centuries of Controversy”, by Sharon Bertsch McGrayne (Yale University Press, 2011). It is a great and fascinating read!
For a very thorough biography of Thomas Bayes, I refer the reader to the paper, The Reverend Thomas Bayes, FRS: A Biography to Celebrate the Tercentenary of His Birth, by D. R. Bellhouse.
Contained herein was a very brief introduction to Bayes’ Theorem:
\[ P(A|B) = \frac{P(B|A)P(A)}{P(B)}. \]
Hopefully insight was gleaned into not only understanding the theorem but an appreciation of its use and applicability. Repeating from above, to explore the use of the theorem that would not die, I highly recommend reading the excellent book, “The Theory That Would Not Die: How Bayes’ Rule Cracked the Enigma Code, Hunted Down Russian Submarines & Emerged Triumphant from Two Centuries of Controversy”, by Sharon Bertsch McGrayne (Yale University Press, 2011). It is a great and fascinating read!
There is so much more to delve into. Discussions of prior and posterior probabilities and of likelihood functions are just a few of the topics. In addition, Markov Chain Monte Carlo (MCMC) techniques need presentation. These all will be discussed in subsequent posts (MC Integration and MCMC are currently in development).