Machine learning is advancing rapidly, and ever growing data is at the center of this evolution. But what do you do if you don’t have a massive dataset to start with? This is the second of two talks we held at a machine learning meetup jointly hosted by Telia and Squeed, based on real projects where data was limited. In this talk the topic was Bayesian belief networks, a type of statistical model that can be used for highly data-efficient learning.

#### Bayesian Belief Networks

A Bayesian belief network is a statistical model over variables $\{A, B, C…\}$ and their conditional probability distributions (CPDs) that can be represented as a directed acyclic graph. Some of the strengths of Bayesian networks are:

- They can be used initially without any data
- They can use data efficiently for learning
- They are interpretable

On the other hand, they typically require a domain expert to estimate the initial structure and parameters of the network, and there is no one-size-fits-all way to build them and run inference. The strong prior provided by the initial structure and parameters in practice means that the distributions that the model can learn during training is limited, compared to something like a neural network. It is this assumption of what the desired distribution looks like that allows it to use data so efficiently, but the downside is that if this does not correspond well to the true distribution, we may fail to properly model it at all. As always, no information at all is better than false information.

The classic example of a Bayesian network is the *Student network*, which is a model over a student taking a course. The variables in this model is the student’s SAT score $S$, their intelligence $I$, the difficulty of the course $D$, the grade the student receives $G$, and whether or not they get a recommendation letter $L$.

For each variable, we have a conditional probability distribution over its states given the states of its parents. The probability for a given state of the network is given by

$P(S, I, D, G, L) = P(L|G) \times P(G|D,I) \times P(D) \times P(S|I) \times P(I)$

Given some evidence, that is some observed variables, we can infere the distributions over the other variables. The beauty lies in the fact that we can use Bayes theorem to follow the causal dependencies backwards as well as forward. Let us say for example that we observed that the student scored well on their SATs. We can now update our belief over the probability that the student is intelligent from the prior belief of 30% to a new estimate of 87%. This new estimate will in turn affect our belief of the grade the student will achieve, which influences the probability of receiving a letter. Hence, knowing that the student scored well on their SATs leads us to believe there is a higher probability that they will receive a letter.

#### Inference

In the example above, where the students SAT score had been observed, I stated that our belief $P = 0.87$. Where did this number come from? More generally, how do we infer marginal probabilities for the unobserved nodes given some evidence? There exist both exact and approximate algorithms to solve this task, and which one is the most suitable depends on the structure of the network as well as other practical constraints such as acceptable runtime, etc. Variable elimination is an exact algorithm that can be used to answer questions about the marginal distribution of a single variable, by iteratively applying a sum-product operation to incorporate the belief of a variable into the beliefs of its neighbours until only the variable of interest remains. Belief propagation is a message passing algorithm that can be seen as a dynamic programming variation of variable elimination, which reuses common terms for when we want the marginals of all the variables. If our network is tree-structured, the belief propagation algorithm is exact and takes two iterations of propagation. Otherwise we can either accept that the algorithm is now approximate and run it until it converges (which might never happen or give an incorrect result), or we turn our initial Bayesian network into a clique tree which is tree-structured, but might have large cliques where the sum-product calculations need to sum over a large number of states.

We can also use general methods for estimating probability distributions such as Monte-Carlo Markov Chains (MCMC) or Variational Inference. If we turn to MCMC, one difficulty will be the highly localized and often high-dimensional nature of our probability distribution, where methods such as the No U-turn sampler may be helpful.

#### Scaling Bayesian networks

Most Bayesian networks of real interest are much larger than the student network. Once we get to variables that have a high number of parents, the conditional probability table – which is spanned by the cartesian product of the state spaces of the variable and all its parents – quickly become prohibitively large. A binary variable with four parents that are also binary already has a conditional probability table with $2^5$ entries. Asking an expert to provide $2^5$ likelihoods for one variable is already slow, and once we start asking for $2^{16}$ we might as well go home! Luckily, there are often structure in these distributions that we can exploit. A common example is a binary variable with several parents, where one of its parents is sufficient to cause itself. In this case we can use a noisy-or approximation, where we only give the probabilities for each parent that they cause the variable and then use as conditional probability the probability that none of the parents is causing the variable.

#### Continuous variables

In the example above, all variables had discrete states. What if a variable naturally is continuous, such as time or distance? There is nothing in the formulation of our Bayesian network that prohibits the use of general functions to describe our priors and CPDs. However, there are practical implications which complicates their use. First of all, we need some way to parametrize our variables. If they all belong to the same parametric family this is not an issue, but in practice it is often difficult to find a parametric family that fits all of our variables, especially since they need to remain within this family under the multiplication and marginalization operations used during inference. Secondly, we exchange the summation to integration during marginalization, which often has no closed form. One alternative that we use in practice is to simply discretize the continuous variables. This approach is dependent on the function being smooth enough that a discretization is a good approximation.

#### Learning

The title mentions machine learning, but so far all we have is a – albeit sophisticated – statistical model. Although we can use this model as is, another point of view is that what we have is a prior over our final model, to be completed by data. It is possible both to learn the priors and conditional probabilities given the network structure, and to learn the structure itself. Typically, we would assume that the expert is quite good at knowing the structure of the network, and perhaps not as good at providing parameters (probabilities) for the priors and CPDs. If the expert, rather than giving point estimates for the priors and conditional probabilities, give distributions over these parameters $\theta$ we can use a Bayesian treatment of the parameters too:

$P(\theta | D) = \frac{P(D | \theta) \times P(\theta)}{P(D)}$

In this way we will preferentially update (in order to increase the likelihood of the observed data) the parameters which the expert considered uncertain. Because of the constraints that our initial structure and parameter estimates provide, the model can use data very efficiently for learning.

We can also learn about the structure of the network. Similarly to how boosted trees are grown, we attach a cost function to the structure of the network which encourages a small sparsely connected network. We can then add nodes or edges that provide an increase in the likelihood of the data that offsets the increase in the cost function, and similarly prune existing nodes and edges that do not.

#### Conclusion

Bayesian belief networks is a class of highly data efficient and interpretable models for domains with causal relationships between variables. The tradeoff is a dependency on good prior knowledge and often problem-specific adaptions and simplifications. For the problems where their strengths shine however, belief networks are well worth their trouble.