Probabilistic programming with Pyro and conclusive proof that Poland was robbed at the 2018 Chess Olympiad
Yes. I know what you’re thinking. “Chess is the most important sport on the planet. Perhaps the most important activity there is. If you claim that Poland played better than what their fourth place indicates, you’d better have proof. You’d. Better. Get. This. Right.”
This will be a long post, wherein I
- Explain “Why Pyro”
- Describe chess and the Olympiad in excruciating detail
- Finally get to the code part, where we model the result of a chess tournament
- Solve the model in various ways
- Download Chess Olympiad data and calculate the strongest teams
- Draw conclusions!
Pyro
To get this right, I’d like to use probabilistic programming and Pyro.
At Tenfifty, we like Pytorch. Compared to Tensorflow, the eager execution feels much more like Python programming. When I have done probabilistic programming in the past, I have generally used PyMC3, which is nice enough. However, being built on top of Theano, it suffers from the same problem as Tensorflow: the compiling and lazy execution can lead to strange error messages and a suboptimal REPL experience. Also, Theano is no longer being developed.
Enter Pyro. Pyro is a new probabilistic programming library, built on top of Pytorch. It is still in alpha, but seems to work well. Pytorch was recently released in a 1.0 preview, which led me to do this experiment in Pytorch 1.0 (preview) and the branch of Pyro that supports Pytorch 1.0.
I learned Pyro as I worked on this, so I may have gotten some details wrong. Probably not, though.
Chess background
So, what will we use Pyro for? Recently, the 2018 Chess Olympiad was concluded. It was generally exciting, as chess tends to be, but it led to the type of discussions that seem to be the point of sports – “Did the right team win?”, “Is the system fair?”, “Could this perhaps be ruined with enough math and programming?”.
You see, the Chess Olympiad is played as a team sport. Four players from one country play four players from another country. Four games in total and you sum up the results (draw counts as 0.5). 184 teams entered the Open Section (there is also a Women’s Section, but since everyone is eligible to play in the Open, I will focus on that one), almost all of them from separate countries. They play 11 rounds. Each round, countries get paired in matches against other countries with the same score so far. This system is used in many sports and is called a Swiss pairing. The devil is in the details, however, and this year the pairing was sort of odd.
After the tournament had concluded, there were three teams with the same number of wins. Who was the tournament winner? This is decided via a complex tiebreak calculation, using the results of the teams that each of the three contenders had played, to see who faced the toughest opposition. My first question, then, is: was this procedure fair?
The second question regards Poland. Poland was the tournament surprise. For nine rounds they played the best chess and were the tournament leaders. They met all the best teams and didn’t lose a single match, until round 10. After the full eleven rounds, they placed fourth. No medal for Poland. Strong grandmasters and people in general on Twitter seemed to regard this as a scandal. Had Poland not shown that they were worthy of a medal? They had played the strongest opposition by far and faltered just at the finish line. The system was unfair! Should Poland have gotten a medal?
Being a curmudgeonly and generally unpleasant sort of fellow, I thought to myself “Bah. Humbug. They just favor the underdog with their filthy emotions. I bet fourth place was even too good for that uppity Polish team. Let’s spend several hours proving that someone on the Internet is wrong.” Also, I thought it would serve humanity to show that Sweden is better than Norway at chess.
The Elo system
60 years ago, Arpad Elo devised the Elo system. The basic idea is that a player’s strength can be expressed through a number. In a particular game, the player’s (unmeasurable) performance is drawn from a normal distribution around the player’s rating, with a standard deviation chosen such that a 100-point difference in rating leads to an expected score of about 64% over time.
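Written out, player 1’s expected score in Elo’s model is

$$E_1 = \frac{1}{1 + 10^{-(R_1 - R_2)/400}}$$

so a 100-point advantage gives $E_1 = \frac{1}{1 + 10^{-100/400}} \approx 0.64$. This is the formula that appears directly in the code below.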
Below, I have chosen to write the formula directly and return the probabilities for player 1-win, draw and player 2-win. We don’t have to do this, though. Instead, I could have encoded how I believe the result is generated from probability distributions, as described above. Elo didn’t specify what percentage of the expected score will come from draws and what part will come from decided games, so I have left this as a parameter.
I think the point of a competition is to estimate the true Elo of the participants, at this particular point in time, using no knowledge of their prior performance. This means that a good pairing system should be aimed at resolving this question and that a good tiebreak system should favour the player with the best estimated performance. From a sporting point of view it is also good if a tiebreak system is easy to understand, so I don’t actually propose this computation as a better method in practice. Just as a measure.
import torch
def chess_game(p1_elo, p2_elo, draw_ratio):
    diff = p1_elo - p2_elo
    # Elo expected scores (wins plus half of the draws) for both players
    p1_winp = 1 / (1 + 10 ** (-diff/400))
    p2_winp = 1 - p1_winp
    # the weaker player's expected score bounds how much of the score can come from draws
    minp = 1 / (1 + 10 ** (abs(diff)/400))
    drawp = minp * torch.clamp(draw_ratio, 0.01, 0.99)
    # one row per game: [P(player 1 wins), P(draw), P(player 2 wins)], summing to 1
    return torch.stack([p1_winp - drawp, 2 * drawp, p2_winp - drawp], dim=1)
This is just a normal function that takes arguments and produces a 2D array of three probabilities (win1, draw, win2) per entry in p1_elo and p2_elo. We write our functions for tensors rather than floats, so that we can calculate them for all games in a tournament (see below) at the same time.
Below we test it for two games 1800 vs 2000 and 1800 vs 1600. The total score, $win \times \frac{draw}{2}$, for the first player should be the same as the total score for the second in the second example, even though we tried different settings for draw_ratio.
chess_game(torch.tensor([1800.0, 1800]),
torch.tensor([2000.0, 1600]),
torch.tensor([0.5, 0.33]))
OK, looks good. In the second example we get fewer draws, since we gave a lower draw_ratio, and the higher-rated player always has a higher win probability.
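As an extra sanity check, we can compute the expected scores directly from the returned probabilities. A small snippet reusing the same call as above; both values should land around 0.24:

p = chess_game(torch.tensor([1800.0, 1800]),
               torch.tensor([2000.0, 1600]),
               torch.tensor([0.5, 0.33]))
# expected score = P(win) + P(draw) / 2
print(p[0, 0] + p[0, 1] / 2)  # the 1800 player against 2000
print(p[1, 2] + p[1, 1] / 2)  # the 1600 player against 1800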
The Tournament
Now, let’s define a Tournament. A Tournament takes two tensors with players, p1s and p2s, each containing integers that identify each player. Finally it takes a tensor with the results of those games: 0 for player_1-win, 1 for draw and 2 for player_2-win.
We also define a model. The model defines priors for the two unknown parameters which we are trying to find.
- The true draw_ratio
- A tensor of the true Elo rating for each player.
The pyro.sample function is used to tell Pyro that this is an unknown parameter. Since I don’t know anything about the tournament, I give the elos a wide prior, centered around 0 for each player, with a standard deviation of 300. This basically says that I don’t know what the rating is, but I’d be very surprised if the difference between players is more than 1000, and it is not going to be 10000.
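As a back-of-the-envelope check on that claim: the difference between two independent $N(0, 300)$ variables is $N(0, 300\sqrt{2} \approx 424)$, so under the prior a rating gap above 1000 points has a probability of only about 2%. A tiny snippet, if you want to verify it:

from torch.distributions import Normal
# prior on the difference between two players' elos
prior_diff = Normal(0.0, 300.0 * 2 ** 0.5)
print(2 * (1 - prior_diff.cdf(torch.tensor(1000.0))))  # about 0.02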
We then send in the elos and the draw_ratio to our chess_game function and get a bunch of probabilities. Finally, we define a result and say that this parameter is observed. Pyro can use the observed data (the results of the game) to infer the most likely parameter settings.
Ignore the two guide methods for now. Also ignore why it says “.independent(1)” and “with pyro.iarange..” for a little while.
import pyro
import pyro.distributions as dist
pyro.enable_validation(True)
class Tournament:
    def __init__(self, p1s, p2s, results):
        self.p1s = torch.tensor(p1s)
        self.p2s = torch.tensor(p2s)
        self.results = torch.tensor(results)
        self.n_players = max(self.p1s.max(), self.p2s.max()).item() + 1

    def model(self):
        # priors: draw_ratio centered on 0.5, elos centered on 0 with sd 300
        draw_ratio = pyro.sample('draw_ratio', dist.Beta(5, 5))
        elodist = dist.Normal(torch.zeros(self.n_players), 300)
        elos = pyro.sample('elos', elodist.independent(1))
        p = chess_game(elos[self.p1s], elos[self.p2s], draw_ratio)
        with pyro.iarange("data_iarange", len(self.results)):
            result = pyro.sample('result',
                                 dist.Categorical(p),
                                 obs=self.results)

    def MAP_guide(self):
        # point estimates (Delta distributions); the torch.rand calls are random starting points
        pd = pyro.param('pdraw_ratio', torch.rand(1) * 0.6 + 0.3)
        pyro.sample('draw_ratio', dist.Delta(pd))
        elo_guess = torch.rand(self.n_players) * 200 - 100
        elos_mu = pyro.param("elos_mu", elo_guess)
        pyro.sample('elos', dist.Delta(elos_mu).independent(1))

    def norm_guide(self):
        # approximate posterior: a Beta for draw_ratio, independent Normals for the elos
        a = pyro.param('draw_alpha', torch.rand(1) * 10)
        b = pyro.param('draw_beta', torch.rand(1) * 10)
        pyro.sample('draw_ratio', dist.Beta(a, b))
        elo_guess = torch.rand(self.n_players) * 200 - 100
        sd_guess = torch.rand(self.n_players) * 200
        elos_mu = pyro.param("elos_mu", elo_guess)
        elos_sd = torch.clamp(pyro.param("elos_sd", sd_guess), min=1.0)
        elo_dist = dist.Normal(elos_mu, elos_sd)
        pyro.sample('elos', elo_dist.independent(1))
Now, the magic begins. We gave a starting prior on our parameters, just giving the shape and size of our ignorance. Pyro can now use the results of the games to infer the posterior of our parameters. The posterior is statistics talk for what we should believe after the data, just as prior is what we believe before the data. The more data we have, the more the posterior will differ from the prior.
There are three main ways of calculating the posterior. Presented in order of increasing CPU usage.
1. Maximum a posteriori or “MAP”
MAP means that we are just interested in the most probable combination of parameters. The peak of our multi-dimensional distribution. A MAP is the cheapest and simplest to calculate, but the less data you have, the more interested you should be in how uncertain you are about the parameters. The only thing you need to calculate this is a decent starting point. This is what we define in MAP_guide, above.
Unlike PyMC3, Pyro has no simple way of calculating a MAP. We have to pretend that it is an SVI calculation (see below), where each parameter is Delta distributed (just a point). The torch.rand calls are just starting points for the optimization.
2. Stochastic variational inference
In SVI, you are interested in how certain you are of your posterior. This is a somewhat harder problem. To help solve this, we give the algorithm two more things (this is what a norm_guide does, above), the approximate shape of our posterior and which parameters can be assumed to be independent of each other. The independence information can help the algorithm to solve the problem more efficiently.
The “with …” that I asked you to ignore earlier just says that all my results can be assumed to be independent. A common assumption. The “independent(1)” says that the elo tensor is also independent along the first dimension. We don’t have to give this information, but it helps SVI. In theory – I didn’t check.
3. Markov Chain Monte Carlo
In MCMC, we don’t make any assumptions about the posterior distribution. We just ask the algorithm to find a good starting point (e.g. the MAP), jump around and sample points according to a criterion that gives us a good estimate of our posterior shape. In SVI, we often assume that our posterior parameter belief will be bell shaped around some midpoint, but MCMC can tell us, for example, if our posterior belief should be multi-modal (several peaks), or if the distribution is skewed to one side around our peak.
MCMC is really slow, though. And it suffers from the problem that the samples can be correlated. And the more dimensions you add (players in your tournament), the slower it gets. Luckily, we have something called Hamiltonian Monte Carlo – HMC, which instead of jumping around randomly in our probability space, slides around with momentum, like a frictionless puck on an ice landscape with hills and valleys. This method turns out to be less likely to get stuck in a valley and much more efficient in many dimensions, if you have continuous parameters (if you have discrete parameters, you are out of luck – use MCMC or reparameterize your problem).
The problem with HMC is that it is finicky with parameters, but we are in luck again! NUTS-sampling was invented, which takes care of setting the parameters automatically.
Simplifying somewhat, we can say that machine learning is normally interested in just MAP – the point estimate. For example standard deep learning. Sometimes someone wants to make a neural network where the weights have uncertainty, which can for example help in transfer learning. HMC is usually (but not always!) too slow for this, so they say that the posterior weights are normally distributed, use SVI and get on with their lives.
The best approximation of the true posterior is given by HMC, so if time permits, that is what we will use.
Our first tournament
To test the inference algorithms, let’s start with a simple tournament with three players. First p0 plays p1 (draw!), then p0 plays p2 (win for p0!) and finally p1 draws p2. p0 got 1.5 points, p1 got 1.0 points and p2 got 0.5.
A typical sign that we need statistical learning and not machine learning is that we are optimizing four parameters (three elos and a draw ratio) using only three data points. That is why we need some priors (the less data we have, the more important are the priors) and we need to check our uncertainty in the end, since we will still be very uncertain after three data points.
easy_tourney = Tournament([0, 0, 1], [1, 2, 2], [1, 0, 1])
If we want, we can use one of our guides to define a simple loss function, whose parameters can be optimized in the usual way. The parameters are stored and optimized in a global object (ugly, IMHO, why not in a with-context or something?), so we need to clear the parameter store between runs.
ps = pyro.get_param_store()
ps.clear()
loss = pyro.infer.Trace_ELBO()
[loss.loss(easy_tourney.model, easy_tourney.norm_guide) for _ in range(20)]
Above, we have 20 runs of our loss function with the same parameters. It is non-deterministic, because in our guide where it says pyro.sample, a sample is drawn from the distributions whose parameters we specify. This means that the loss function is really noisy (more observations would make it less noisy, I think), which is a challenge for any optimization method. MAP and SVI optimization is done using the same methods that we use for neural networks. In my experiments, Adam repeatedly proved best able to cope with all this noise.
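To put a number on the noise, we can evaluate the loss a few hundred times and look at the spread (nothing Pyro-specific here, just the standard library's statistics module):

import statistics
# repeated ELBO evaluations with the same parameters, to see how noisy the estimate is
noisy = [loss.loss(easy_tourney.model, easy_tourney.norm_guide) for _ in range(200)]
print(statistics.mean(noisy), statistics.stdev(noisy))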
Let’s use Adam to define a function that runs a few epochs, plots the improving loss and prints the parameters in the final solution.
import matplotlib.pyplot as plt
import pyro.optim
def optimize_parameters(model, guide, iterations, print_solution=True):
    ps = pyro.get_param_store()
    ps.clear()
    svi = pyro.infer.SVI(model=model,
                         guide=guide,
                         optim=pyro.optim.Adam({"lr": 0.05}),
                         loss=pyro.infer.Trace_ELBO())
    losses = []
    for t in range(iterations):
        # one optimization step per iteration; svi.step() returns the current loss
        losses.append(svi.step())
    if print_solution:
        ps = pyro.get_param_store()
        print(ps.get_state()['params'])
    plt.plot(losses)
    plt.xlabel("Epochs")
    plt.ylabel("loss")
    plt.show()
First we calculate the MAP. When we have just Delta functions in our guide, which will always sample to the same value, there is no noise, and it is easy to optimize.
optimize_parameters(easy_tourney.model, easy_tourney.MAP_guide, 3000)
OK, $draw\_ratio=0.58$ and $p0-p1=63$, $p0-p2=135$ (only the differences matter).
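If you want to read the fitted MAP values programmatically instead of from the printout, they live in the parameter store (the exact numbers will vary between runs, since the starting points are random):

ps = pyro.get_param_store()
map_elos = ps.get_param('elos_mu')
print(ps.get_param('pdraw_ratio').item())
print(map_elos[0] - map_elos[1], map_elos[0] - map_elos[2])  # only the differences matter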
optimize_parameters(easy_tourney.model, easy_tourney.norm_guide, 3000)
Optimizing for the uncertainty as well turns out much noisier, but we get approximately the same Elo differences, though somewhat larger, and of course a huge uncertainty (the elos_sd tensor) after only three games. The draw_ratio ($\frac{\alpha}{\alpha+\beta}=0.54$) is also basically the same.
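The $\frac{\alpha}{\alpha+\beta}$ number can be read straight from the fitted guide parameters:

ps = pyro.get_param_store()
a = ps.get_param('draw_alpha')
b = ps.get_param('draw_beta')
print((a / (a + b)).item())  # mean of the fitted Beta, i.e. the estimated draw_ratio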
Now let’s use HMC to take a few samples and compare. warmup_steps denotes the number of steps that should be discarded because the algorithm has not converged yet. We take 300 samples. That is not a huge amount, but OK for now. If we wanted to do things right, we would produce several independent chains of these samples – several runs – to check that they agree (see the sketch after the code below).
import pyro.infer.mcmc
nuts_kernel = pyro.infer.mcmc.NUTS(easy_tourney.model,
adapt_step_size=True)
mcmc_run = pyro.infer.mcmc.MCMC(nuts_kernel,
num_samples=300,
warmup_steps=100).run()
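If we did want those extra chains, a minimal sketch (it simply repeats the run above, reusing the same kernel object, and multiplies the running time accordingly) could look like this:

# sketch only: two independent runs, compared on their mean elos
runs = [pyro.infer.mcmc.MCMC(nuts_kernel, num_samples=300, warmup_steps=100).run()
        for _ in range(2)]
means = [pyro.infer.EmpiricalMarginal(run, sites="elos").mean for run in runs]
print(means[0] - means[1])  # should be small compared to the posterior spread

For this post, one chain will have to do.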
Now we can create an empirical distribution from our trace, that we can sample from and calculate mean as well as plot the posterior.
marginal = pyro.infer.EmpiricalMarginal(mcmc_run, sites="elos")
marginal.mean
Now, we can plot the relative elos.
import matplotlib.pyplot as plt
def diff(i1, i2, telo):
    return (telo[i1] - telo[i2]).item()
trace = [[diff(0, 1, marginal()) for _ in range(1000)],
[diff(0, 2, marginal()) for _ in range(1000)]]
plt.figure(figsize=(20, 10))
plt.hist(trace, bins=30, label=['p0 - p1', 'p0 - p2'])
plt.title("P(Elo | result)")
plt.xlabel("Relative Elo")
plt.ylabel("#")
plt.legend()
plt.show()
From only three games, we can’t really tell who is the best player. We can see that both the distributions go over to the negative side and that they are overlapping, meaning that we don’t know if p1 is better than p2, either.
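To put numbers on this, we can count how often p0 comes out on top in samples from the posterior, reusing the diff helper defined above:

n = 1000
print(sum(diff(0, 1, marginal()) > 0 for _ in range(n)) / n)  # fraction of samples where p0 beats p1
print(sum(diff(0, 2, marginal()) > 0 for _ in range(n)) / n)  # fraction of samples where p0 beats p2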
We can also verify that the draw_ratio is approximately the same, below.
marginal = pyro.infer.EmpiricalMarginal(mcmc_run, sites="draw_ratio")
marginal.mean
A second tournament and a first scandalous result!
Before the main event, analyzing the Chess Olympiad, let’s verify our method one more time.
A common tiebreak in chess tournaments, when all players have played each other and two of them are on the same score, is to value results against the better players more. The question is whether this really implies that you have played better, or if it is more based on a gut feeling of status.
Our tricky tournament has four players. p0 and p1 both got 2 points out of 3. p2 got 1.5 and p3 only 0.5. The difference between p0 and p1 is that p0 beat p2 (the “better” player) and drew p3, while p1 beat p3 and drew p2. With conventional tiebreaks, p0 would win. Let’s check if this is reasonable.
First with SVI.
p1s = [0, 0, 0, 1, 1, 2]
p2s = [1, 2, 3, 2, 3, 3]
res = [1, 0, 1, 1, 0, 0]
tricky_tourney = Tournament(p1s, p2s, res)
optimize_parameters(tricky_tourney.model, tricky_tourney.norm_guide, 10000)
Very interesting. Our SVI run says that the expected elo of p1 is higher than for p0. This surprised me and at first I thought that maybe it was just noise, but the result holds when I run SVI several times.
To make sure that this bombshell is confirmed, let’s see if the results hold with a HMC run. This time we use longer warmup and more samples, to make sure that we have the necessary resolution.
nuts_kernel = pyro.infer.mcmc.NUTS(tricky_tourney.model,
adapt_step_size=True)
mcmc_run_tricky = pyro.infer.mcmc.MCMC(nuts_kernel,
num_samples=1000,
warmup_steps=300).run()
marginal = pyro.infer.EmpiricalMarginal(mcmc_run_tricky, sites="elos")
marginal.mean
Identical results! So it is true, then! The tiebreak rules are wrong! Seemingly… In this case… If Arpad Elo’s model holds true…
If you and another person played the exact same opposition (which is common in closed competitions) and got the same total score, beating a player with fewer points suggests that you are a stronger player. I think the reason is that my Elo priors, although very wide, give a slight preference for moving the Elo posteriors closer together, compared to a purely frequentist method (which would give no result at all for a player taking 0 points).
We must inform the world.
The true winner of the Chess Olympiad 2018
First, we will download all results from chess-results.com.
import requests
BASE = "http://chess-results.com/tnr368908.aspx?lan=1&art=2&rd=%d&flag=30"
def get_round(i):
    r = requests.get(BASE % i)
    return r.text
ALLTEAMS = []
import re
RES_REX = re.compile(r'<td class="CR">([^<]+)</td><td class="CRc">[^<]+</td><td class="CRc">[^<]+</td><td class="CRcErg">([^<]+)</td><td class="CRc">:</td><td class="CRcErg">([^<]+)</td><td class="CRc">[^<]+</td><td class="CRc">[^<]+</td><td class="CR">([^<]+)</td>')
def get_results(s):
    for x in RES_REX.findall(s):
        if x[1][0] == '&':
            # the score starts with an HTML entity for ½: half a point = 0 wins, 1 draw, 3 losses
            res = [0, 1, 3]
        else:
            win = int(x[1][0])
            draw = 0
            if x[1][-1] == ';':
                # a trailing ½ entity means an extra half point, i.e. one draw
                draw = 1
            res = [win, draw, 4 - win - draw]
        if x[0] not in ALLTEAMS:
            ALLTEAMS.append(x[0])
        if x[-1] not in ALLTEAMS:
            ALLTEAMS.append(x[-1])
        # yields (first team index, second team index, [wins, draws, losses] for the first team)
        yield ALLTEAMS.index(x[0]), ALLTEAMS.index(x[-1]), res
results = []
for i in range(1, 12):
    results.extend(list(get_results(get_round(i))))
len(results), len(ALLTEAMS), results[0]
So, we have 1009 matches of four games each and 184 different teams.
While we’re at it, let’s also get the final standings to compare our results. This link only gives the top 150, but that will be enough for our purposes.
STAND_REX = re.compile(r'</div></td><td class="CR">([^<]+)</td>')
STAND_URL = "http://chess-results.com/tnr368908.aspx?lan=1&art=0&rd=11&flag=30"
r = requests.get(STAND_URL)
print('Final standings:')
standings = STAND_REX.findall(r.text)
standings[:10], len(standings)
def get_rank(team):
    try:
        return standings.index(team)
    except ValueError:
        return -1
TEAMRANKS = [get_rank(team) for team in ALLTEAMS]
Now we convert the data into a format that our Tournament class can accept.
osp1 = []
osp2 = []
osres = []
for p1, p2, res in results:
    for i in range(3):
        for j in range(res[i]):
            osp1.append(p1)
            osp2.append(p2)
            osres.append(i)
len(osres)
4036 games. We model each match as four games between two “players” with one Elo each. So the full team gets one combined Elo.
Let’s once again start with solving the parameters with SVI.
os_tourney = Tournament(osp1, osp2, osres)
optimize_parameters(os_tourney.model,
os_tourney.norm_guide,
20000,
print_solution=False)
And here is a table of our calculated true Elo rating, together with standard deviation, what the “fair” final ranking of the team should be, and what their actual ranking was.
import pandas as pd
ps = pyro.get_param_store()
elos = [round(sol.item()) for sol in ps.get_param('elos_mu')]
sds = [round(solsd.item()) for solsd in ps.get_param('elos_sd')]
svi_ranks = sorted(zip(elos, sds, ALLTEAMS, TEAMRANKS), reverse=True)
pd.DataFrame(svi_ranks[:25],
columns=['Elo', 'Stdev', 'Country', 'Final result'])
Our true, fair and generally infallible method is quite close to the actual (zero-indexed) final results. We can see that Azerbaijan played much better than their actual final ranking suggests, and that Sweden’s position at 10 was way too high, but they still performed clearly – clearly – better than Norway.
On a slightly serious note, we can see by the standard deviation of about 50 and the difference between first and second place of about 4, that we don’t have nearly enough games to actually know which team is the best. There is just way too much randomness in this. The type of random noise that spectators love to interpret as significant.
We don’t quite have the final results yet. I trust HMC slightly more than SVI. The earlier promising results showed that the outcome will probably be the same for SVI and HMC, but let’s run a sampling for a few hours and check. If I used a GPU, it would be much faster.
nuts_kernel = pyro.infer.mcmc.NUTS(os_tourney.model, adapt_step_size=True)
mcmc_run_os = pyro.infer.mcmc.MCMC(nuts_kernel,
num_samples=1000,
warmup_steps=300).run()
First, let’s look at the draw_ratio.
marginal = pyro.infer.EmpiricalMarginal(mcmc_run_os, sites="elos")
marginal_draw = pyro.infer.EmpiricalMarginal(mcmc_run_os,
sites="draw_ratio")
marginal_draw.mean
Huh. Much smaller than I thought. Since we have a nice posterior to look at, let’s see how certain we are.
plt.figure(figsize=(20, 10))
plt.hist([marginal_draw.sample() for _ in range(1000)], bins=30)
plt.show()
Don’t let the plot’s automatic rescaling fool you. This is a mighty thin posterior. We are more or less certain that the parameter is between 0.16 and 0.19.
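Instead of eyeballing the histogram, we can sort a batch of samples and read off an approximate 95% interval:

draws = sorted(marginal_draw.sample().item() for _ in range(1000))
print(draws[25], draws[974])  # roughly the 2.5% and 97.5% quantiles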
And finally then, the HMC rankings, together with the real result and the SVI result.
svid = {x[2]: rank for rank, x in enumerate(svi_ranks)}
svi_results = [svid.get(team, -1) for team in ALLTEAMS]
mc_ranks = sorted(zip([round(sol.item()) for sol in marginal.mean],
ALLTEAMS,
TEAMRANKS,
svi_results), reverse=True)
pd.DataFrame(mc_ranks[:25],
columns=['Elo', 'Country', 'Final result', 'SVI result'])
We can see that HMC and SVI give an almost identical mean. This is good. Two completely different methods tell us the same thing. The opinion club on Twitter was.. right?! Poland should have won a medal! Also, Azerbaijan were even more robbed, but for some reason there was no public outcry for them.
We can now look at and compare the world’s top posteriors.
plt.figure(figsize=(20, 10))
teams = ['United States of America',
'Poland',
'Azerbaijan',
'China',
'Sweden']
dists = [[marginal.sample()[ALLTEAMS.index(team)] for _ in range(1000)]
for team in teams]
plt.hist(dists, bins=30, label=teams)
plt.legend()
plt.show()
Yeah, I could probably have visualized this better, but as you can see, they are almost identical. Sweden is clearly worse than the top teams, but even their distribution overlaps the top teams.
We can write a function that samples from the posteriors to see the probability that one team has a better true Elo than another.
def is_better(c1, c2, iterations=10000):
    i1 = ALLTEAMS.index(c1)
    i2 = ALLTEAMS.index(c2)
    samples = [marginal.sample() for _ in range(iterations)]
    tsamples = torch.stack(samples, dim=1)
    nr = (tsamples[i1] > tsamples[i2]).sum().item()
    percent = round(nr * 100 / iterations)
    print(f'{percent}% chance that {c1} was better than {c2}.')
is_better('United States of America', 'China')
is_better('United States of America', 'Poland')
is_better('United States of America', 'Sweden')
And, since the world needs to know.
is_better('Sweden', 'Norway')
Finally (for real this time), we can sample from the posterior and check the probability for each team, that they were in fact the best one.
from collections import Counter
rks = [Counter() for _ in ALLTEAMS]
for i in range(10000):
    rankings = zip([round(sol.item()) for sol in marginal.sample()],
                   ALLTEAMS,
                   range(len(ALLTEAMS)))
    for j, (_, c, _) in enumerate(sorted(rankings, reverse=True)):
        rks[j].update([c])
for country, nr in rks[0].most_common()[:10]:
    print(f'{nr/100:.1f}% chance that {country} was the best team.')
Conclusion
Pyro + Pytorch is really easy to use. You can define functions with no magic or decorators or anything like that. Just define the random process that you think generates the data that you see, and you can get the most likely parameters for this process.
Probabilistic programming is especially well suited for when you have many parameters and not so much data, giving you an opportunity to work with your prior knowledge and the uncertainty in the output.