Jack Bailey

Redshift at the Polls

Jack Bailey — Mon, 28 Aug 2023 23:00:00 GMT

As you read these words, the heavens are expanding. Soon, distant galaxies will retreat so far away that their light will never reach us again. When this happens, it will be as though they never existed at all. We’ll still be able to see those stars that share our galaxy, but those outside of it will be lost to us forever.

We know this because the light these objects emit is redder than it should be. This occurs because, as the universe expands, it stretches the space between light waves travelling towards us from distant objects. Our eyes then perceive these longer light waves as being more red, leading to a kind of “redshift” in the colour of the stars.

Something similar is happening in British politics. As time passes, historic elections get further and further away from us. While we might still know the key results, the details become more and more uncertain. And, like distant stars in the night sky, soon they too will be lost to us forever.

This is a crucial problem for students of British politics. Since the advent of equal suffrage, only 24 general elections have taken place. This is not a large sample. As a result, there is only so much these data can teach us about how British politics works. In an ideal world, the problem would solve itself with the passage of time. With more elections comes more information that we can use to make sense of things. The problem, of course, is that the world is far from ideal.

This is also a crucial problem for British democracy, since the past is a reference against which we orient the present. Two issues spring to mind. First, that electoral uncertainty undermines political legitimacy. If we don’t know what happened, how can we be sure that we arrived where we are in a fair and democratic manner? Second, that the less we know about the past, the worse our ability to make decisions in the present. Are we doomed to vote for the same old mistakes over and over again?

Let’s use the 1945 UK general election as an example. It was, without doubt, the most important in recent British history. If Labour had not won, there would be no NHS and much of the post-war welfare state would look very different. Thus, we would expect to know a lot about it. Some things we know for sure. For instance, the result was a Labour landslide and the election occurred on the 5th July 1945. But others we do not. The official results are, to all intents and purposes, lost to history. We can do our best to piece them together using whatever contemporaneous data we can find, but much uncertainty remains.

When it comes to what happened in 1945, our two best sources are the Constituency-Level Elections Archive and Resul Umit’s constituency-level results data. Yet the two don’t even agree on what the vote count was. According to CLEA, it was around 27 million. According to Umit, it was 25 million instead.

Given these differences, we might ask “what do the official statistics say?”. While this is a sensible question, there is one problem: there aren’t any. The House of Commons Library does host data on elections since 1918. Yet most of these data are second hand and exist only thanks to psephologists like FWS Craig who had the foresight to collect them at the time.

To be clear, I am not looking to deride any of CLEA, Umit, or the House of Commons Library. All do valiant work that we would be much poorer without. Rather, my point is that if we care about British democracy, we need to preserve evidence of it. Otherwise, data will degrade, errors will accumulate, and uncertainty will overwhelm us.

Fundamentally, this is a failing of the British state. Labour’s historic landslide occurred within living memory, only 78 years ago. Indeed, there will be thousands of people alive today who were also alive to witness Clement Attlee become Prime Minister. Yet, in this time, we have lost the ability to say what Labour’s vote share was at this most crucial point in its history. Though I have used 1945 as an example here, I’d wager that this is also true for all other intervening elections, since the state has no repository of election results, nor has one ever existed.

The question then is what we should do about it. I like FWS Craig’s suggestion: that Returning Officers should have to send data on every election to the Clerk of the Crown.¹ But this alone is not enough. The government should also make the raw data available online in a transparent and open-source format that anyone can access. Further, the data should be permanent, with any changes resulting in a new version of the data and an accompanying change log.

Now is the perfect time to be an astronomer. After all, we live in a time when telescopes exist that are powerful enough to see distant stars, galaxies, and other celestial way markers. That might not always be the case. The same is true for political science: we now know more and have more data than ever before. But if we want that to continue, and to protect our democracy, we need to avoid our own redshift at the polls and do what we can to preserve historic election data.

Footnotes

¹ Thanks to Elise Uberoi for making me aware of this.

Introducing Shannon Regression

Jack Bailey — Thu, 03 Aug 2023 23:00:00 GMT

I doubt many political scientists spend time dwelling on information theory. But, over the past year, I’ve spent a lot of time reading and thinking about it. And I am beginning to believe that it has a lot to offer the study of social and political systems.

I’ve found two books on the subject most enlightening. The first is James V. Stone’s “Information Theory: A Tutorial Introduction”. The second is Jimmy Soni and Rob Goodman’s “A Mind at Play: How Claude Shannon Invented the Information Age”. Together they provide a guide to the basics and the development of the field.

Information theory owes its existence to one man: Claude Shannon. Shannon more or less invented the field, then solved most of its problems, in his landmark paper “A Mathematical Theory of Communication”. His idea concerns communications first and foremost. But, really, it’s all about probability. As such, we can apply its insights to any other system that also involves probabilistic outcomes.

In this vein, this post describes a new type of regression model that I’ve developed. Like logistic or probit regression, it models binary outcomes. But, unlike logistic or probit regression, it uses Shannon’s information content as its link function. This model – which I call “Shannon regression” – has some useful properties. In particular, it measures its coefficients in bits.

I’ll probably write a paper on the method once I understand it better. But I am keen to get it out there so that others can use it and so that I can get some feedback. So, for now, this blog will serve as a kind of way marker. Both for myself, so that I can develop a better understanding of the model, and for others, who might find the project interesting and useful.

Understanding Information Content

To understand Shannon regression, you need to know what information is and how to measure it. If that’s something you know, feel free to skip to the next section. If not, let’s take a moment to spell it all out.

We measure information in “bits”. By definition, 1 bit of information reduces our uncertainty by half. For example, imagine that I toss a fair coin in the air, then prevent you from seeing how it lands. Before you see the coin, you have a 50/50 chance of guessing how it landed. I then reveal the result: a head. Now, things change. Because you know how the coin landed, you are certain to guess correctly. You went from choosing between two possible outcomes to one. In other words, your uncertainty halved. Such is the power of 1 bit.

To compute the information of some event is straightforward. This is true whether we’re talking about a coin toss or any other probabilistic event. All we need to do is use the equation for Shannon’s information content, where $p$ is the probability of the event and $h(p)$ is its information in bits:

$$h(p) = -\log_2 p$$

Likewise, we can convert bits of information back into probabilities using the following exponentiation:

$$p = 2^{-h(p)}$$

You might ask “why bother?”. Almost every single advance in the information age seems a good enough reason to me. But another is that doing so yields some nice properties that can be useful in certain contexts. Chief among them is that when probabilities multiply, information adds up. For instance, the chance of guessing 1 head is $1/2$, 2 heads is $1/4$, and 3 heads is $1/8$. But the amount of information you’d need is just 1 bit, 2 bits, and 3 bits.

This point is worth stressing: 1 bit of information is how much you’d need to guess a fair coin flip. Since information is additive, it follows that we can interpret it in terms of coin flips too.
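To make the additivity concrete, here is a small sketch in R. The helper name `info_bits` is my own label for the information content equation, not something from an existing package.

```r
# Shannon information content, in bits, of an event with probability p
# (info_bits is a hypothetical helper name used only for illustration)
info_bits <- function(p) -log2(p)

# Probability of correctly guessing 1, 2, and 3 fair coin flips in a row
p_heads <- 0.5
p_runs <- p_heads^(1:3)

# Probabilities multiply as the run gets longer...
p_runs

# ...while information adds: each extra flip costs one more bit
info_bits(p_runs)

# Summing the bits for three single flips matches the bits for a run of three
all.equal(sum(info_bits(rep(p_heads, 3))), info_bits(p_heads^3))
```

The last line checks the "multiply versus add" relationship directly rather than relying on eyeballing the printed values.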

Most of the people reading this are likely political scientists, so let’s use a political example. At the 2019 UK general election, the Labour Party got 32.1% of the vote. So how much information would we need to guess that someone was a Labour voter assuming we knew nothing about them? Well, $-\log_2(0.321) \approx 1.64$ bits, so not much at all. It would take more information than we’d need to guess 1 coin flip, but less than to guess 2. What about one of the smaller parties? In 2019, UKIP got only 0.07% of the vote (yes, really). That gives $-\log_2(0.0007) \approx 10.48$ bits of information. That’s a lot of information! You have more or less an equal chance of guessing that someone voted UKIP as you do guessing 10 fair coin flips in a row.
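The same calculation is a one-liner in R. This sketch uses only the two vote shares quoted above; `info_bits` is again a hypothetical helper name.

```r
# Shannon information content in bits (hypothetical helper name)
info_bits <- function(p) -log2(p)

# 2019 vote shares quoted in the text, as proportions
vote_share <- c(Labour = 0.321, UKIP = 0.0007)

# Bits needed to guess that a randomly chosen voter backed each party
bits <- info_bits(vote_share)
round(bits, 2)

# Converting the bits back recovers the original vote shares
2^-bits
```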

Building the Model

Now that we know a little information theory, we can move on to the modelling. All generalised linear models use something called a “link function”. As the name suggests, it is a function that links the outcome scale to some other scale. We do this because it’s often hard to fit a line to the original scale, so we do it on another one and then transform the resulting predictions back onto the original scale.

Logistic regression, for example, uses the logit link function. The word “logit” might sound complicated, but it’s really just a fancy way of saying that we compute the odds of something happening and then take its logarithm. Let’s use a fair coin toss again as an example. Since the coin’s fair, you know that the probability, $p$, of it landing heads up is 0.5. To convert this probability into logits, we just stick it into the following equation:

$$\text{logit}(p) = \ln\left(\frac{p}{1 - p}\right) = \ln\left(\frac{0.5}{0.5}\right) = 0$$

Here, we first compute the odds of getting a heads, $\frac{p}{1 - p}$. Then, we run the resulting odds through the natural logarithm function, $\ln$, to convert it to log-odds or logits, $\text{logit}(p)$. These models also make use of “inverse link functions” that perform the opposite operation. For example, the inverse logit function converts logits back into probabilities. So if we have some outcome that we have measured in logits, $x$, we can convert it back into probabilities as follows:

$$p = \frac{1}{1 + e^{-x}}$$

And if we sub in the answer from before we get $\frac{1}{1 + e^{-0}} = 0.5$, the probability of getting a heads on a fair coin.
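For anyone who wants to check the arithmetic, the round trip looks like this in R. Base R already ships the logit and inverse logit as `qlogis()` and `plogis()`, so the hand-rolled versions are only there to mirror the steps described above.

```r
# Probability of heads on a fair coin
p <- 0.5

# Step 1: convert the probability to odds
odds <- p / (1 - p)

# Step 2: take the natural log of the odds to get logits
logit <- log(odds)
logit

# Base R's built-in logit gives the same answer
qlogis(p)

# Inverse logit: convert logits back to a probability
inv_logit <- 1 / (1 + exp(-logit))
inv_logit

# Base R's built-in inverse logit
plogis(logit)
```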

Other models of binary data use other link functions. The probit model uses the inverse of the normal distribution’s cumulative distribution function. Likewise, the cauchit model uses the inverse of the Cauchy distribution’s cumulative distribution function. There is no right or wrong answer here. Different models have different properties that make them more or less useful in certain situations. But most of the time the choice comes down to habit. Economists tend to use probit regression because they always have done. Other social scientists tend to use logistic regression for the same reason.
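To get a feel for how these links differ, the sketch below pushes the same handful of probabilities through the logit, probit, and cauchit links using `make.link()` from base R's stats package. The probabilities themselves are arbitrary values chosen for illustration.

```r
# A few illustrative probabilities (arbitrary choices)
p <- c(0.1, 0.25, 0.5, 0.75, 0.9)

# Built-in link objects that ship with R's stats package
links <- c("logit", "probit", "cauchit")

# Apply each link function to the same probabilities
link_scale <- sapply(links, function(l) make.link(l)$linkfun(p))
rownames(link_scale) <- p
round(link_scale, 3)

# Each inverse link maps the transformed values back to the probabilities
back <- sapply(links, function(l) {
  lk <- make.link(l)
  lk$linkinv(lk$linkfun(p))
})
all(abs(back - p) < 1e-8)
```

All three links map a probability of 0.5 to zero and differ mainly in how quickly they stretch out probabilities near 0 and 1.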

In practice, we can use whatever link function we want. As the name suggests, Shannon regression uses the equation for Shannon’s information content as its link function:

$$h(p) = -\log_2 p$$

And, as its inverse link, it uses the exponentiation from above that turns bits of information back into probabilities:

$$p = 2^{-h(p)}$$

I thought that implementing this model was going to be hard. But it turns out that programming custom link functions in R is really easy. All you have to do is create a function that lays it all out. I’ve called mine “bit” to reflect the unit of measurement:

# Define "bit" link function

bit <- 
  function(){
    linkfun <- function(mu) -log(mu, 2)
    linkinv <- function(eta) 2^-eta
    mu.eta <- function(eta) -log(2)/(2^eta)
    valideta <- function(eta) all(is.finite(eta) & eta >= 0) 
    link <- "bit"
    structure(
      list(
        linkfun = linkfun,
        linkinv = linkinv, 
        mu.eta = mu.eta,
        valideta = valideta,
        name = link
      ),
      class = "link-glm"
    )
  }

Most of this should be pretty self-explanatory:

linkfun: Specifies the link function
linkinv: Specifies the inverse link function
mu.eta: Specifies the derivative of the inverse link with respect to eta
valideta: Specifies valid values that eta can take
link: Specifies the name of the custom link
structure: Bundles everything into an object of class "link-glm" so that the glm function knows what each part does
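Before fitting anything, it can be worth checking that the pieces agree with one another. The sketch below copies the three functions from the link definition above (renaming `mu.eta` to `mu_eta` to keep it a standalone object) and checks two things: that the inverse link undoes the link, and that the analytic derivative matches a finite-difference approximation. The test probabilities and step size are arbitrary choices.

```r
# The three functions from the "bit" link, copied out for checking
linkfun <- function(mu) -log(mu, 2)
linkinv <- function(eta) 2^-eta
mu_eta  <- function(eta) -log(2) / (2^eta)

# A few probabilities to test with (arbitrary values)
mu <- c(0.05, 0.25, 0.5, 0.9)
eta <- linkfun(mu)

# Check 1: the inverse link should recover the original probabilities
all.equal(linkinv(eta), mu)

# Check 2: mu_eta should match a central finite-difference derivative
# of the inverse link with respect to eta
h <- 1e-6
numeric_deriv <- (linkinv(eta + h) - linkinv(eta - h)) / (2 * h)
all.equal(numeric_deriv, mu_eta(eta), tolerance = 1e-6)
```

The derivative is negative everywhere, which is what we would expect: more bits always means a lower probability.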

Let’s simulate some data and run the model. Given the theme of this blog post, we’ll imagine that we’re running an experiment involving coin flips. We recruit participants, then assign them either a 0 or a 1 at random. Those in the control group, where treated = 0, get a fair coin where the probability of getting heads is 0.5. Those in the treatment group, where treated = 1, get an unfair coin where the probability of getting heads is only 0.25 instead. After giving our respondents their coin, we ask them to toss it, then record if they got a heads (1) or a tails (0).

# Specify simulation parameters

n <- 10000
fair_heads <- 0.5
unfair_heads <- 0.25


# Assign respondents to a treatment status

treated <- 
  sample(
    x = 0:1,
    size = n,
    replace = TRUE
    )


# Get coin toss outcomes

outcome <- 
  rbinom(
    n = n,
    size = 1,
    prob = ifelse(treated == 1, unfair_heads, fair_heads)
  )


# Fit the model

coin_model <- 
  glm(
    formula = outcome ~ 1 + treated,
    family = binomial(link = bit())
  )

Now that we’ve fit the model, let’s check its output:

	(1)
(Intercept)	0.971
	(0.020)
treated	0.992
	(0.040)

Because we used the equation for Shannon’s information content as our link function, the coefficients are measured in bits of information. Like almost any model of a simple experiment, ours has an intercept and a treatment effect. Let’s work out how to interpret them step by step.

The intercept tells us how many bits of information we would need to guess the outcome for someone in the control group. The coefficient itself is 0.971, or about 1 bit of information. This makes sense. Recall that we gave those in the control group a fair coin and that we need 1 bit of information to guess the outcome of a fair coin toss.

The treatment effect tells us how many additional bits of information we would need to guess the outcome for someone in the treatment group. The coefficient is 0.992, again about 1 bit of information. When added to the intercept, this gives 1.963, or about 2 bits of information. Again, this makes sense. We gave those in the treatment group an unfair coin that landed heads side up with probability 0.25. So to guess it right, we’d need 2 bits of information, which is how much information it takes to guess the outcome of 2 fair coin flips since $0.5 \times 0.5 = 0.25$.
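We can also run the coefficients back through the inverse link to see the probabilities they imply. This sketch hard-codes the two estimates reported above rather than refitting the model, so the object names are just labels for illustration.

```r
# Coefficients as reported in the model output above (in bits)
intercept <- 0.971
treatment <- 0.992

# Inverse link: turn bits back into probabilities of getting heads
p_control <- 2^-intercept
p_treated <- 2^-(intercept + treatment)
round(c(control = p_control, treated = p_treated), 3)

# Because bits add, the treatment multiplies the probability of heads
# by a constant factor of 2^-treatment (close to one half here)
p_treated / p_control
2^-treatment
```

The implied probabilities land close to the simulated values of 0.5 and 0.25, with the gap reflecting sampling noise in the simulated data.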

At first, it might seem unusual that the treatment caused a negative change in the probability of getting a head but yielded a positive coefficient. But it’s actually quite intuitive when you think about it in information theoretic terms: all we’re doing is counting up coins. And since rarer events are akin to guessing more coin flips correctly, it takes more bits of information to guess less frequent events.

Conclusion

I’m prepared to accept that one response to all this might be “so what?”. That’s fair enough. We don’t all learn about information theory and I don’t expect people to read this post then switch to using Shannon regression en masse. That said, I do think that there are certain use cases where Shannon regression might be useful.

The most obvious use case is to use the model to decompose the information content of some event into its constituent causes. For information theoretic studies, this would be especially useful. Another use case is where other parts of the study also use information theoretic quantities. Here, computing effect sizes in bits would allow all parameter estimates and quantities of interest to share a common scale.

But, ultimately, I don’t care what the use case is. And I’m sure I have missed some that are blindingly obvious. Shannon regression is neat whatever the case and by putting it out into the world it’ll hopefully find a use in its own time.