If Google were created from scratch today, much of it would be learned, not coded.
—Jeff Dean, Google Senior Fellow, Systems and Infrastructure Group
Machine learning, or ML, is all the rage today, and there are good reasons for that. Models created by machine-learning algorithms for problems such as spam filtering, speech and image recognition, language translation, and text understanding have many advantages over code written by human developers. Machine learning, however, is not as magical as it sounds at first. In fact, it is rather analogous to how human developers create code using test-driven development.4 Given a training set of input-output pairs {(a,b)|a ∈ A, b∈B}, guess a function f ∈ A → B that passes all the given tests but also generalizes to unspecified input values.
A big difference between human-written code and learned models is that the latter are usually not represented by text and hence are not understandable by human developers or manipulable by existing tools. The consequence is that none of the traditional software engineering techniques for conventional programs such as code reviews, source control, and debugging are applicable anymore. Since incomprehensibility is not unique to learned code, these aspects are not of concern here.
A more interesting divergence between machines and humans is that machines are less arrogant than humans, and they acknowledge uncertainty in their code by returning a probability distribution or confidence interval of possible answers f ∈ A → ℙ(B)
instead of claiming to know the precise result for every input. For example, a learned image-recognition function by a major cloud provider will predict with 95% certainty that I have hair, but is less confident about whether or not I am professional (Figure 1).
The implication of incorporating learned models in human-written code is that you cannot get around the fact that the building blocks from which humans compose applications are fundamentally probabilistic. This is a challenge for main-stream programming languages, which all assume that computations are precise and deterministic. Fortunately, the 18th-century Presbyterian minister Thomas Bayes anticipated the need for dealing with uncertainty and formulated Bayes' rule:6
ℙ(A|B)*ℙ(B) = ℙ(A&B) = ℙ(B|A)*ℙ
(A)
As it turns out, Bayes' rule is exactly what the doctor ordered when it comes to bridging the gap between ML and contemporary programming languages.
Many of the mathematical explanations of Bayes' rule are deeply confusing for the working computer scientist, but, remarkably, when interpreted from a functional programming point of view, Bayes' rule is a theorem about composability and invertibility of monadic functions. Let's break down Bayes' rule piece by piece and rebuild it slowly based on developer intuition.
First let's explore what probability distributions ℙ(A)
are. The Wikipedia definition, "a probability distribution is a mathematical description of a random phenomenon in terms of the probabilities of events," is rather confusing from a developer perspective. If you click around for a bit, however, it turns out that a discrete distribution is just a generic list of pairs of values and probabilities ℙ(A)=[A ↦ ℝ]
such that the probabilities add up to 1. This is the Bayesian representation for distributions. Isomorphically, you can use the frequentist representation of distributions as infinite lists of type dist ∈ [A]
, as n gets larger, sampling from the collection and counting the frequencies of each element from a in dist.Take(n) group by a into g select g.Key a g.Sum ()/n
approximates the Bayesian representation of the distribution. When converting from the Bayesian to the frequentist implementation, the probabilities do not to have to add up to 1, and the sampling process will ensure the ratios are properly normalized.
Like true mathematicians, we will silently switch between these two representations of distributions whenever convenient. Unlike mathematicians, however, to keep things simple we will not consider continuous distributions. We want our distribution to hold generically any type A
, and most of the types we deal with in code are discrete and not "measurable" or real number-like.
Because the values we care about are usually not even comparable, we also will avoid cumulative distributions. One reason that mathematicians like standard continuous distributions—such as Gaussian, beta, binomial, and uniform—is because of their nice algebraic properties, called conjugate priors.2 For example, uniform prior combined with a binomial likelihood results in a beta posterior. This makes 18th- and 19th-century probabilistic computations using pencil and paper feasible, but that is not necessary now that there are powerful computers that can run millions of simulations per second.
In programming examples, distributions typically come from outside data as discrete frequentist collections of data with an unknown distribution, or they are defined explicitly as a Bayesian representation by enumerating a list of value/probability pairs. For example, here is the distribution of weight of adults in the United States, according to the Centers for Disease Control (CDC):
CDC ∈ ℙ (Weight)
CDC = [obese ↦ 0.4, skinny ↦ 0.6]
Efficiently sampling from composed distributions is, indeed, rocket science. Just like database query optimizers, advanced sampling methods leverage properties of the leaf distributions and the structure of the query9 or program3 that computes the distribution. It leverages deep and complex mathematical techniques such as importance sampling, Metropolis-Hastings, Markov Chain Monte Carlo, and Gibbs sampling that are far outside the scope of this article but are important for making real-world computations over probability distributions feasible. As Bayesian analysis consultant John D. Cook remarked "... Bayesian statistics goes back over 200 years, but it did not take off until the 1980s because that's when people discovered practical numerical methods for computing with it ..."
To illustrate the sophistication involved in efficiently sampling known discrete distributions, imagine converting the example distribution CDC into a frequentist representation.8
Perhaps the most obvious method stacks the columns for skinny and obese on top of each other and draws one random number—say, p
—between 0 and 1 and then checks if p ≤ 0.4
yields obese
, and otherwise yields skinny
. In general, this search is linear in the number values in the distribution, but using tricks like binary search tree can speed things up. Mathematicians call this the inverse transfer method.
Another way is first to select a random integer—say, weight
—to select between obese
and skinny
, and then choose a random double between 0 and 1—say, p—and if CDC[weight] ≤ p
, then yield weight
, as shown in Figure 2. Mathematicians call this algorithm rejection sampling, and as the histogram shows, half of the attempts to sample a value from the distribution will fail (the pink part). This can be improved by picking a tighter envelope distribution, like that in the second histogram, but that still rejects two out of 12 samples.
Efficiently sampling from composed distributions is, indeed, rocket science.
The last method pads the lower probabilities by borrowing from the higher probabilities. Amazingly, it is always possible to do this in a way such that every column represents the probabilities for, at most, two values, so we need only one comparison to pick the right value. This comparison can be implemented using a second index table, and hence mathematicians call this sampling algorithm the alias method.
Now that we have explained probability distributions ℙ(A)
, let's examine conditional probability distributions ℙ(B|A)
, which, according to Wikipedia, are "a measure of the probability of an event given that (by assumption, presumption, assertion, or evidence) another event has occurred." To developer ears that sounds exactly like a function A→ℙ(B)
that returns a distribution, just like a learned model. The remainder of this article uses the notations ℙ(B|A)
and A→P(B)
interchangeably.
Going back to the example, we have the following model Doctor ∈ ℙ(Food|Weight)
of food preference, given weight, that we could have obtained by asking patients what kind of food they consume:
Doctor ∈ ℙ(Food|Weight)
= Weight → ℙ(Food)
Doctor(obese)
= [burger↦0.9, celery↦0.1]
Doctor(skinny)
= [burger↦0.3 celery↦0.7]
As argued in the introduction, these probabilistic functions, such as ℙ(Object|Image
), ℙ(Text|Audio)
, ℙ(Spam|Text
), and so on, increasingly are the result of training some ML algorithm or neural net, instead of being coded by expensive and flaky human developers.
Now that you know that conditional probabilities are probabilistic functions, things are starting to get interesting, since this means that multiplication (*) used in Bayes' rule is an operator that applies a probabilistic function to a probability distribution as a parameter—that is, it has the following type signature:
ℙ (B|A)*ℙ (A) ∈ ℙ (A&B)
Using the Bayesian representation of distributions, you can implement a probabilistic function application likelihood*prior
where likelihood∈ℙ(B|A)
and prior∈ℙ(A)
, using the following correlated query:
likelihood*prior =
from a↦p in prior
from b↦q in likelihood(a)
select (a,b)↦p*q
Applying this definition to compute the result of Doctor*CDC
, we obtain the table shown in Figure 3 for the joint probability distribution ℙ(Food&Weight
).
Because the distributions for ℙ(Weight
) and ℙ(Food
) appear in the margins of this table, mathematicians call them marginal probabilities, and similarly the process of summing up the columns/rows is called marginalization. When computing a joint distribution using (*), mathematicians often use the name likelihood for the function and prior for the argument.
The beauty of the frequentist representation is that there is no need for multiplying probabilities. Sampling ensures the underlying ratio of occurrence of values in the result will automatically reflect the proper product of values from the prior and likelihood. For example, if we implement the prior CDC by an infinite collection with odds obese:skinny = 4:6
, and the result of Doctor(skinny) by an infinite collection with odds burger:celery = 3:7
, and, respectively, that of Doctor(obese)
by a collection with odds burger:celery = 9:1
, then sampling from the infinite collection Doctor*CDC
, which results from applying the prior to the likelihood, will have a ratio:
(obese:burger):
(obese,celery):(skinny,burger):
(skinnny:celery) = 36:4:18:24.
The keen reader will note that (*) is a slight variation of the well-known monadic bind operator, which, depending on your favorite programming language, is known under the names (>>=), SelectMany
, or flatMap
. Indeed, probability distributions form a monad. Mathematicians call it the Giry monad, but Reverend Bayes beat everyone to it by nearly two centuries.
Note that as formulated, Bayes' rule has a type error that went unnoticed for centuries. The left-hand side returns a distribution of pairs ℙ(A&B)
, while the right-hand side returns a distribution of pairs ℙ(B&A)
. Not a big deal for mathematicians since & is commutative. For brevity we'll be sloppy about this as well. Since we often want to convert from ℙ(A&B)
to ℙ(A)
or ℙ(B)
by dropping one side of the pair, we prefer the C#-variant of SelectMany
that takes a combiner function A⊕B∈C
to post-process the pair of samples from the prior and likelihood:
likelihood*prior =
from a↦p in prior
from b↦q in likelihood (a)
select a⊕b↦p*q
Now that we know that (*) is monadic bind, we can start using syntactic sugar such as LINQ queries or for/monad comprehensions. All that is really saying is that it is safe to drop the explicit tracking of probabilities from any query written over distributions (that is, the code on the left in Figure 4 is simply sugar for the code on the right, which itself can be alternatively implemented with the frequentist approach using sampling).
Another way of saying this is that we can use query comprehensions as a DSL (domain-specific language) for specifying probabilistic functions. This opens the road to explore other standard query operators besides application that can work over distributions and that can be added to our repertoire. The first one that comes to mind is filtering, or projection as the mathematicians prefer to call it.
Given a predicate (A→
No entries found