Random Variables
As with the other refresher courses, you may have seen the material below. If you have not, don't worry; every student's journey is different. This material will all be taught again once you arrive.
What is a Random Variable?
In the background material, we described a football match between Manchester United and Manchester City as an experiment, with the result of the match expressed as outcomes. Instead of representing the outcomes as Win, Lose or Draw, we could represent them as the number of points United received for each outcome, namely 3 for a Win, 1 for a Draw and 0 for a Loss. We could also represent the result as the final score, and use a positive or negative number for the difference in scores; e.g. -1 would mean City scored one more goal than United, 0 would be a draw, etc. The difference in score tells us who won, and thus how many points they got, while also providing information on performance.

Numerical outcomes are more useful to us mathematically. If we represent each outcome as the points received, then summing all outcomes over the season gives the final points earned. If we sum the goal difference in each game, we get the goal difference over the entire season.
A student answers 10 questions in an exam. We could represent the final result as a sequence of correct (C) or wrong (W) answers, e.g. CCWWCWCCWW. Instead, let C = 1 and W = 0 and write the sequence as 1100101100. Now, summing the sequence tells us how many correct answers were given, summarising the 10 digit sequence and the final result in just one integer.
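If you like to experiment, the encoding above is easy to try in Python; this is just a sketch of the idea in the text.

```python
# Encode each answer as 1 (correct) or 0 (wrong); the sum of the
# sequence is then the total number of correct answers.
answers = "CCWWCWCCWW"  # the sequence from the text
encoded = [1 if a == "C" else 0 for a in answers]

print(encoded)       # [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
print(sum(encoded))  # 5 correct answers
```

One integer, the sum, summarises the whole 10-answer sequence.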
The overall performance and outlook of a company depends on several factors, often interacting with each other in complex and random ways. However, its share (or stock) price says how much outside investors are willing to pay for the company, giving a summary of both past performance and future outlook. One price represents all of the complex underlying factors.

Not every experiment results in a number, but the above three examples demonstrate how useful it is to translate outcomes into numbers. A random variable makes this transformation.
Definition 1. A random variable X is a function that assigns a numerical value to each outcome of a random experiment. We call an observed value of X a realisation, denoted x.
The random variable now represents the experiment mathematically, with any outcome being a realisation. We define the probability of an outcome happening as P(X = x).
I think this definition of a random variable is rather weird and difficult to conceptualise, but it is one you will see at university. I simplify this to a random variable being the outcomes of an experiment written as numbers, ignoring the "function" aspect.
Types of Random Variables
Random variables are functions and so we use the appropriate terminology. The sample space of the experiment is the domain of X. The range of X, Range(X), is the set of numbers that represent the outcomes. For example, if X is the points United receive in the football match, we might let Range(X) = {0, 1, 3}, the corresponding points received. Here the range has a finite number of outcomes. If X corresponds to the number of correct answers out of 10, then Range(X) = {0, 1, ..., 10}, again finite.
If X for the football match is the goal difference instead of the points, the range then contains both positive and negative integers. In theory, there is no upper (or lower) bound to the goal difference, and so we could let Range(X) = Z, the set of all integers. The set Z is infinite but it is discrete. Sets like N are also discrete; you will learn exactly why later, but the following definition is enough for now.
Definition 2. A random variable is called discrete if its range is finite or a subset of Z.
Recall that N and finite sets such as {0, 1, ..., 10} are subsets of Z. One intuition for what makes a set discrete is that there are "gaps"; we can mark each number down distinctly. Another intuition is that we can "list" all the numbers.

However, for the share price example, the range is all positive real numbers, not just integers. Imagine in the football game we measured the distance all players ran; this would also not be discrete. We call such random variables continuous.
A continuous range does not need to stretch all the way to infinity; a common example is the interval [0, 1], all the numbers between 0 and 1. This is not discrete as there are no gaps; if we choose any two numbers, we can always find a number between them that is also in the interval. Despite the interval not being infinitely long, there are infinitely many numbers within it.
Definition 3. A random variable is called continuous if it has a continuous range.

You will see a more detailed explanation later this year.
Measuring the exact time an exam takes or the heights of students in a class are further examples of continuous random variables. In practice we often record these in rounded units, such as whole seconds or centimetres, but we can always measure more and more accurately.

Key Takeaways:
- A random variable is a mathematical formulation of an experiment
- The elements of the range of a random variable are the numbers representing the sample space of the experiment
- A discrete random variable has a range that is finite or infinite with "gaps"
- A continuous random variable has a continuous range with no "gaps".
Probability Distributions
It's helpful to translate experiments into numbers, but we also want some idea of how often these numbers occur; their probability.
In football, we want to know how often goals are scored or the probability a team wins a certain number of games. When it comes to share prices, we want to know the probability of the price increasing by a certain amount, or at what price we should sell. When measuring the heights of students in a class, we want to know what proportion of students should be below a certain height. For a random variable, we describe the probabilities of specific realisations through a probability distribution.
Probability Mass Function
We have already seen an example of a probability distribution in the refresher. When we roll a fair die, there are six equally likely outcomes. We give each of these outcomes a probability of 1/6 and put them in a table as below (now with random variable notation).
x | 1 | 2 | 3 | 4 | 5 | 6
---|---|---|---|---|---|---
P(X = x) | 1/6 | 1/6 | 1/6 | 1/6 | 1/6 | 1/6
The roll of a die has a finite range and so we write each probability explicitly in a table. When the range is something like Z, we need to be a bit more general (which we will see later).
Definition 4. For a discrete random variable X, we define the probability mass function (pmf) p such that p(x) = P(X = x), a function assigning a probability to each element of the range.

Note: The domain of p is R rather than just Range(X). For any real numbers not in Range(X) we define the probability to be 0 and don't need to list them.
In order to be a pmf, p must satisfy:
- 0 ≤ p(x) ≤ 1: each probability must be between 0 and 1
- Σ p(x) = 1, summing over all x in Range(X): the sum of all probabilities must be 1.
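The two conditions above are easy to check by machine. Here is a short Python sketch (the helper name is my own), using exact fractions to avoid rounding issues:

```python
from fractions import Fraction

def is_valid_pmf(pmf):
    """Check the two pmf conditions: each probability in [0, 1], total mass 1."""
    probs = pmf.values()
    return all(0 <= p <= 1 for p in probs) and sum(probs) == 1

# Fair six-sided die: each face has probability 1/6.
die = {x: Fraction(1, 6) for x in range(1, 7)}
print(is_valid_pmf(die))  # True

# A candidate "pmf" whose probabilities sum to 7/6 fails the second condition.
bad = {0: Fraction(1, 2), 1: Fraction(2, 3)}
print(is_valid_pmf(bad))  # False
```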
We can use the pmf to calculate probabilities on subsets of the range by summing as before.
Cumulative Distribution Function
For a continuous random variable, it is impossible to list each probability explicitly. In fact, each individual outcome must have a probability of 0; P(X = x) = 0. Why? There are too many options for any single one to have positive probability. This is one of the trickier concepts when thinking about infinity, so don't worry yet if you don't understand.
The probability a share price rises to exactly £280 is 0; there are too many possibilities. However, we can find the probability the price is between £279.99 and £280.01. So, rather than P(X = x), think of P(a ≤ X ≤ b) where a < b.
Definition 5. For a random variable X, we define the cumulative distribution function (cdf) F as F(x) = P(X ≤ x).
The cdf is the probability of X taking any value up to and including x; the cumulative probability up to x. Writing outcomes as numbers allows us to use set notation here.
We use the cdf to find the probability of X being in any interval as: P(a < X ≤ b) = F(b) - F(a).
Note: The set {a < X ≤ b} is the set {X ≤ b} with {X ≤ a} removed. For a continuous random variable, P(X = a) = 0 implies P(a < X ≤ b) = P(a ≤ X ≤ b).
The same definition of the cdf works for discrete random variables, and can actually be calculated explicitly as: F(x) = Σ p(t), summing over all t in Range(X) with t ≤ x.

We see the stepwise nature of the cdf for a discrete random variable; at each possible realisation there is an increase after adding on the corresponding probability. We sum the bars of the histogram for the pmf as we go along.
Just like with the pmf, we write the cdf as a table or create a general formula. See below for a table corresponding to the die roll.
x | 1 | 2 | 3 | 4 | 5 | 6
---|---|---|---|---|---|---
F(x) | 1/6 | 2/6 | 3/6 | 4/6 | 5/6 | 1
A key difference from the continuous case is that individual outcomes may have nonzero probabilities in the discrete case, so P(X < x) and P(X ≤ x) may differ for discrete X.
If we are asked to find P(X > x), we use the fact that probabilities sum to 1 to get: P(X > x) = 1 - F(x). However, if asked for P(X ≥ x), the discrete case has to add on P(X = x).
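These relations can be sketched in Python for the fair die; the cumulative sum below is exactly the "summing as we go along" described above (the helper name is my own):

```python
from fractions import Fraction

def cdf_from_pmf(pmf):
    """Cumulative sums of the pmf: F(x) = sum of p(t) for t <= x."""
    total, cdf = Fraction(0), {}
    for x in sorted(pmf):
        total += pmf[x]
        cdf[x] = total
    return cdf

die = {x: Fraction(1, 6) for x in range(1, 7)}  # fair six-sided die
F = cdf_from_pmf(die)

print(F[3])               # F(3) = 1/2
print(1 - F[4])           # P(X > 4) = 1 - F(4) = 1/3
print(1 - F[4] + die[4])  # P(X >= 4) = 1/2; the discrete case adds back P(X = 4)
```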
Example 1. A random variable X can take values 0, 1, 2 and 3. See below for a table detailing the pmf p(x). Use this table to create a new table for the cdf F(x).
x | 0 | 1 | 2 | 3
---|---|---|---|---
p(x) | | | |
First, we double-check that this satisfies the characteristics of a pmf. Each probability is nonnegative and at most 1. Further, the probabilities sum to 1.
Then, we calculate each element of the cdf table explicitly. Note there is a recursive nature to the calculation; rather than calculating the whole sum for each x, we just take the cumulative probability up to the previous realisation and add on p(x). This gives the following table:
x | 0 | 1 | 2 | 3
---|---|---|---|---
F(x) | | | | 1
The cdf of the largest outcome should always be 1, as everything must be less than or equal to it. Also, note how the cdf is a non-decreasing function; it either stays the same or increases at every jump.
A cdf table gives you the pmf by inverting the process.
Example 2. A discrete random variable X can take values 0, 2 and 4. The cdf is detailed below.
x | 0 | 2 | 4
---|---|---|---
F(x) | 0.3 | 0.7 | 1
What is P(X = 2)?
We have that F(2) = 0.7 and F(0) = 0.3. We also know that: P(X = 2) = P(X ≤ 2) - P(X ≤ 0) = F(2) - F(0).
So P(X = 2) = 0.7 - 0.3 = 0.4.
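Inverting a cdf table like this is just taking differences between consecutive entries, which a few lines of Python can sketch (the helper name is my own):

```python
cdf = {0: 0.3, 2: 0.7, 4: 1.0}  # the cdf table from Example 2

def pmf_from_cdf(cdf):
    """Invert a cdf table: p(x) is the jump F(x) - F(previous x)."""
    pmf, previous = {}, 0.0
    for x in sorted(cdf):
        pmf[x] = round(cdf[x] - previous, 10)  # round away float noise
        previous = cdf[x]
    return pmf

print(pmf_from_cdf(cdf))  # {0: 0.3, 2: 0.4, 4: 0.3}
```

Note that P(X = 2) = 0.4 appears as the jump in the cdf at x = 2.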
Probability Density Function
So for discrete random variables, we can find the probability, or "mass", of both points and intervals. For continuous random variables, only the mass of intervals is available; individual points have mass 0. So, in the continuous case, we instead think of the "density" of points. The mass of an interval is then a function of the density of the points within it and the width or "volume" of the interval.
To get from the pmf to the cdf we summed. Integration is essentially summing over an interval.
Definition 6. For a continuous random variable X, we define the probability density function (pdf) f as the function satisfying: F(x) = ∫ f(t) dt, integrating t from -∞ up to x.
Note: We use t in the integral as a "dummy" variable because it would not make sense to have x both as the variable being integrated and in the limits of the integral.
Conversely, we have: f(x) = dF(x)/dx.
So, we can get the pdf from the cdf or vice versa.
The pdf will usually be presented in the following way:
f(x) = { a formula in x, if x is in Range(X); 0, otherwise }.
See that f is a function on the whole real line; any value not within the range is given density zero.
As with the pmf, we need the "sum", namely the integral, to be 1: ∫ f(x) dx = 1, integrating over the whole real line.
However, as we are dealing with density not mass, the pdf is not restricted to being between 0 and 1 like the pmf is.
Note that defining f for every real number also allows us to integrate from -∞ to ∞. If we were to plot the pdf f, these requirements correspond to a function that is always nonnegative and has an area under the curve of 1.
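We can check these requirements numerically for a specific pdf. Below is a sketch with a pdf I have chosen for illustration, f(x) = 2x on [0, 1]; the midpoint-rule integrator is my own helper:

```python
def integrate(f, a, b, n=100_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

# A candidate pdf: f(x) = 2x on [0, 1], zero elsewhere. It is nonnegative.
f = lambda x: 2 * x if 0 <= x <= 1 else 0.0

area = integrate(f, 0, 1)
print(round(area, 6))  # 1.0 -- the total probability is 1, so f is a valid pdf

# The cdf at 0.5 is the area up to 0.5: F(0.5) = 0.5^2 = 0.25.
print(round(integrate(f, 0, 0.5), 6))  # 0.25
```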
The following is Question 1 in the November 2020 Further Maths statistics exam.
What is ?
Therefore, .
The following was Question 8 in the 2023 Further Maths Statistics paper and has a video solution accompanying it.
- What is ?
- Find
- Find the median of to 3 significant figures
- Find in the form , where
Solution 2. Click Here for Video
We refer to the pmf or pdf of a random variable as its probability distribution. We often use f rather than p to refer to the function in general.
Key Takeaways:
- A discrete random variable has a pmf which gives the probability of any realisation
- A continuous random variable has a pdf which gives the density of any realisation, and probabilities exist only over intervals
- The cdf sums the pmf or integrates the pdf to get probabilities of sets
- A pmf/pdf must always sum/integrate to 1
- A pmf must always have values between 0 and 1
Expectation and Variance
A probability distribution tells us how likely it is that United score 3 goals against City. It tells us what proportion of students should be taller than 180cm. It tells us the probability a share price falls below £200.
However, it’s often more interesting to know what happens on average. What is the average number of goals United score? What is the average height in the class? What is the average result I get when I roll a die? What is the average share price?
When we know these average values, it’s also interesting to know how variable the data is around this average. If United score 2 goals a game on average, are they scoring 2 goals consistently or do they sometimes score 0 and sometimes 4? Does my class have some very tall and very short students, or is everyone roughly the same? Is the share price very volatile, or is it a safe investment?
We explore ways you may have seen before of summarising the properties of probability distributions and random variables. If you have not seen these concepts in such detail, don’t worry, it will be taught once you arrive.
Expectation
For a random variable , we are interested in the average result, or what we expect to happen. Think about calculating the mean from a frequency distribution table, except with probability rather than frequency.
Definition 7. The expectation or expected value of a random variable X, E(X), is the mean of the outcomes X can take, weighted by their probabilities.
For a discrete random variable: E(X) = Σ x p(x), summing over all x in Range(X). This is a generalisation of the weighted average: we take each outcome and weight it by its corresponding probability. We then sum all the weighted outcomes.
So, when calculating the expected number of goals United score, we weight each number of goals by the probability they score that many goals and sum everything.
E extends to continuous random variables as we might expect: E(X) = ∫ x f(x) dx. For outcomes that lie outside the range, the probability and thus weight is 0. From this point onwards, we will use the pdf/integral definition for convenience unless otherwise stated; for discrete random variables use a sum instead of an integral.
For the roll of a fair die, p(x) = 1/6 for every x in {1, ..., 6}. Thus: E(X) = 1(1/6) + 2(1/6) + 3(1/6) + 4(1/6) + 5(1/6) + 6(1/6) = 21/6 = 3.5.
As the die is fair, the weighted and simple averages are the same.
For some function g, we can also define the expectation of g(X) as: E(g(X)) = ∫ g(x) f(x) dx. This is often called the law of the unconscious statistician, or LOTUS (although you probably have not seen it called this). For example, if g(x) = x², then: E(X²) = ∫ x² f(x) dx.
Note: We only replace the first x with g(x) in the formula; we do not change the f(x). We change the value but not the weight.
Using the formula for E(g(X)) (feel free to try to prove the following yourself), for constants a and b we get: E(aX + b) = aE(X) + b. We call this linearity of expectation. If b = 0, this shows E(aX) = aE(X); we can take the constant factor outside of the expectation.
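Rather than a proof, here is a quick numerical check of linearity on the fair die, with constants I have picked arbitrarily (the helper E is my own):

```python
from fractions import Fraction

def E(pmf, g=lambda x: x):
    """Expectation of g(X): sum of g(x) weighted by p(x) (LOTUS)."""
    return sum(g(x) * p for x, p in pmf.items())

die = {x: Fraction(1, 6) for x in range(1, 7)}  # fair die, E(X) = 7/2
a, b = 2, 5

lhs = E(die, lambda x: a * x + b)  # E(aX + b) computed directly
rhs = a * E(die) + b               # aE(X) + b using linearity
print(lhs, rhs)                    # 12 12 -- both sides agree
```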
Think about if we subtracted 5cm from the height of every student in the class. The new expected value would just be the previous minus 5cm. If we doubled their heights, we would double the average.
Suppose we have two random variables X and Y. Then: E(X + Y) = E(X) + E(Y).
Combining all our rules together we have: E(aX + bY + c) = aE(X) + bE(Y) + c.
Intuition - Games
In the background material, we mentioned how games (and gambling) were often a driving force for the development of probability. Games also provide some intuition for expectations.
In a game, think of E(X) as the average score. So, for a 6-sided die, I expect to get 3.5 each time I roll, despite the fact I can never actually roll 3.5. This is important when the score relates to some monetary payout; if I get £x when I roll an x, then I make £3.50 on average each time I play the game.
E(X) gives a fair price for the game. If I have to pay £3 to play this game then I should always play; I make £0.50 on average every time I play. If it's £4, I should find something else to do as I lose £0.50 each time (all casino games that are pure chance are similar). £3.50 would be a fair price; playing and not playing end up the same on average and it's up to you how much risk you want to take. This line of thinking will be covered in greater detail once you start studying game theory and financial mathematics.

Instead, suppose I roll the die and the amount I receive is the result squared. Now, my average payout will be E(X²) = (1² + 2² + 3² + 4² + 5² + 6²)/6 = 91/6 ≈ 15.17; I calculate how much I would get each time (x²) and weight it by the probability.
Now suppose the game consists of rolling a 6-sided die and a 12-sided die. I add both numbers together and receive the total. If I let X be the 6-sided roll and Y be the 12-sided roll, my return is X + Y. My average return will then be E(X + Y) = E(X) + E(Y) = 3.5 + 6.5 = 10, and so I can calculate both expectations separately and add them together.
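Both game calculations can be reproduced exactly with fractions (the helper E is my own):

```python
from fractions import Fraction

def E(pmf, g=lambda x: x):
    """Expectation of g(X): sum of g(x) weighted by p(x) (LOTUS)."""
    return sum(g(x) * p for x, p in pmf.items())

d6 = {x: Fraction(1, 6) for x in range(1, 7)}     # fair 6-sided die
d12 = {x: Fraction(1, 12) for x in range(1, 13)}  # fair 12-sided die

# Payout is the square of the roll: E(X^2) via LOTUS.
print(E(d6, lambda x: x * x))  # 91/6, about 15.17

# Sum of a 6-sided and a 12-sided die: add the two expectations.
print(E(d6) + E(d12))          # 7/2 + 13/2 = 10
```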
Median and Mode
When looking at the heights of students in a class, the expected value is just one measure of central tendency. Rather than thinking of the average height, instead we can think of the most common height. Alternatively, we might be interested in the height which is right in the middle; 50% of students are taller, 50% shorter. You may have seen these concepts before when analysing data sets, but they extend to probability distributions.
We begin with the most common outcome, the mode: mode(X) = argmax f(x).
The "argmax" stands for "argument max", the value of x that maximises f(x); you have probably not seen this notation before. If we just had max f(x), it would be the actual maximum probability rather than the value that achieves it.
If we plot a probability distribution, the mode will be the value corresponding to the highest point in the graph. To calculate the mode, we use techniques from calculus to find maximum points. In a game, the mode is the most likely outcome.
The median of a dataset is the value right in the middle, with half the possible outcomes on either side.
So, the median m is the value with a probability of 0.5 of X being lower (and thus higher) than it. In a game, it would be the score that half the time you beat, half the time you don't.
In a plot of the probability distribution, the median will be the x which has half the area under the graph on either side of it. To find the median, we solve the equation F(m) = 0.5.
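When F(m) = 0.5 has no nice algebraic solution, we can solve it numerically. Here is a bisection sketch (the helper name and the example cdf F(x) = x², which belongs to the pdf f(x) = 2x on [0, 1], are my own choices):

```python
def median_from_cdf(F, lo, hi, tol=1e-10):
    """Bisection search for m with F(m) = 0.5; F must be increasing on [lo, hi]."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if F(mid) < 0.5:
            lo = mid  # median is to the right of mid
        else:
            hi = mid  # median is at or to the left of mid
    return (lo + hi) / 2

F = lambda x: x * x  # cdf of the pdf f(x) = 2x on [0, 1]
m = median_from_cdf(F, 0.0, 1.0)
print(round(m, 4))  # 0.7071, i.e. the square root of 1/2
```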

Here we see a skewed distribution with the mean, median and mode labelled. The mode is the highest point in the graph. The median is moved slightly to the right, allowing for half the area under the curve to be either side of it. The mean is even further to the right than the median, as it depends not just on the probabilities but also the values; bigger values will increase the mean but not the median.
Variance
Knowing what happens on average can only tell us so much. Knowing how far the outcomes can be from the average is also valuable.
Suppose I record the heights of two (very large) classes of students and they both have the same mean. In Class 1, most students are pretty close to the expectation. In Class 2, there is a large spread to the heights; most students are in the middle but some are short and some are tall. Notice below how the increased spread decreases the height (of the curve, not the students).

Suppose instead I am deciding between two stocks to invest in, both with the same price. Stock 1 has major fluctuations in value while stock 2 is pretty consistent. Some way of measuring the volatility and comparing the two is very useful.

We want some way to describe the spread of the data. Well, think about the average distance from the mean; a large average distance would mean a large spread. So, we calculate X - E(X), the deviation from the mean. However, we generally square distances when calculating, and so we take the squared deviation from the mean, (X - E(X))². Then, we want the weighted average, so we take the expectation. This gives the variance: Var(X) = E[(X - E(X))²].
Note: E(X) is a constant in the above formula, so g(x) = (x - E(X))² for LOTUS.
The larger the variance, the more likely outcomes far from E(X) are to occur. A share price with high variance is risky, with a chance of making a big return but also a big loss. A share price with low variance is more predictable and will probably give a steady but unspectacular return.
Similar qualities to linearity of expectation exist for variance. For constants a and b: Var(aX + b) = a²Var(X). As the variance of a constant is zero, adding b to X has no effect on the variance. Think about subtracting 5cm from the height of every student to account for the machine used; it won't change the actual variability, just the values.
However, multiplying by a will increase the variance by a factor of a²; this makes sense given the power of 2 in the definition.
We often use a different formula to calculate the variance: Var(X) = E(X²) - E(X)². Generally, we know E(X) and it is easier to calculate E(X²) than E[(X - E(X))²].
Suppose we have two random variables X and Y. Then, if they are independent, we have: Var(X + Y) = Var(X) + Var(Y). If they are dependent in some way, there is an extra covariance term which we will not go into.
This gives us a general formula for constants a, b, c and independent X and Y: Var(aX + bY + c) = a²Var(X) + b²Var(Y). Note: Even if we are looking at X - Y (so b = -1), the variance will increase as b² is positive. If we include a new random variable, the total can only get more variable.
Example 5. Imagine X and Y are throws of two fair 6-sided dice. Then E(X + Y) = 3.5 + 3.5 = 7. However, Var(X + Y) = Var(X) + Var(Y) = 2Var(X); the sum is more spread out than a single die.
Think about what the range of X + Y is. It can go as high as 12, when X = 6 and Y = 6, but can also go as low as 2. That is, Range(X + Y) = {2, 3, ..., 12}. Each of these outcomes will have a probability based on the pairs of numbers that could generate them. The range of outcomes is now larger than before and so is the maximum distance from the expectation; it is now 5, while it used to be 2.5. Hence, the variance has increased.
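We can verify the variance rules exactly by building the pmf of the two-dice total (the helpers E and Var are my own):

```python
from fractions import Fraction

def E(pmf, g=lambda x: x):
    return sum(g(x) * p for x, p in pmf.items())

def Var(pmf):
    """Var(X) = E(X^2) - E(X)^2."""
    return E(pmf, lambda x: x * x) - E(pmf) ** 2

die = {x: Fraction(1, 6) for x in range(1, 7)}
print(Var(die))  # 35/12 for a single fair die

# pmf of the sum of two independent dice: combine every pair of faces.
total = {}
for x, px in die.items():
    for y, py in die.items():
        total[x + y] = total.get(x + y, 0) + px * py

print(Var(total))  # 35/6, i.e. Var(X) + Var(Y), double a single die
```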
Now try the following quick question; the first statistics question from 2019 Further Mathematics A Level.
Question 3. We know that . What is ?
The following is a modified version of Q4 in the 2020 Further Maths paper.
Now, , and so , as required.
Key Takeaways:
- The expected value of a random variable is a weighted average
- The mode is the value with the highest probability
- The median is the value that has a probability of 0.5 either side of it
- The variance measures the spread of the distribution
- E(aX + b) = aE(X) + b, and E(X + Y) = E(X) + E(Y)
- Var(aX + b) = a²Var(X), and Var(X + Y) = Var(X) + Var(Y) for independent X and Y
Specific Distributions
So far, we have seen examples of experiments and random variables, and talked about distributions. However, we have not discussed specific distributions for those examples.
In this section I introduce several such distributions. If you studied Further Maths for A-levels you may have seen them all. If you studied Maths, you may have seen just the first two. Nonetheless, no matter what your mathematical journey until now, it is ok to not have seen (or remember) some of these. I will recap them now as they are necessary to solve some of the sample exam questions but they will be covered again in detail once you get to Warwick.
Binomial Distribution
Many experiments have two possible outcomes: success and failure. This could be answering a question on an exam, playing a tennis match, or buying a car at an auction. Any experiment with two possible outcomes is called a Bernoulli trial, where p is the probability of success.

Flipping a coin is the classic example of a Bernoulli trial. Let X be the random variable where X = 1 for heads, a success, and X = 0 for tails, a failure. For a fair coin, p = 1/2, and X is a Bernoulli random variable.
However, often we are more interested in a series of Bernoulli trials. Rather than flipping a coin once, we flip it ten times and count the heads. Rather than answering one question we answer a whole exam and check the final score. Rather than playing one tennis match we play a whole tournament. If each of these is a Bernoulli random variable and 1 is a success, we add all values of the random variables together to count the number of successes. Then, we can find the probability of a particular number of successes (5 heads out of 10, 40% final score, 10 wins out of 12 matches).
The sum of Bernoulli random variables is called a Binomial random variable, under two conditions. First, the number of trials, n, must be fixed. We need to know how many coins will be flipped or how many questions are on the exam. The tennis fans among you will know that tournaments are usually knockout; you play until you lose. In this case, n is not fixed and a Binomial is not a suitable distribution.

Second, the probability p is fixed between trials. In other words, each trial is independent of the previous ones. If we were to draw a tree diagram, the edge probabilities would always be the same. In the coin flip example, flipping a heads should have no effect on the rest of the flips. In the exam, however, maybe the questions get more difficult as you go through, and so p actually decreases with each trial; the Binomial would not be appropriate.
Definition 11. Suppose X is the number of successes recorded in n Bernoulli trials with success probability p. Then, X follows a Binomial distribution if:
- The number of trials n is fixed
- The trials are independent of each other (p is fixed).
We write X ~ B(n, p) or X ~ Bin(n, p).
The Binomial is a discrete random variable, as Range(X) = {0, 1, ..., n}. The probability of recording x successes in n trials is given by the pmf: P(X = x) = C(n, x) p^x (1 - p)^(n - x), where C(n, x) = n!/(x!(n - x)!) is the Binomial coefficient.
Note: We don't use the notation p(x) for the pmf here so as not to get confused with the probability of success p.
For a B(n, p) distribution, we have: E(X) = np and Var(X) = np(1 - p).
The expectation is simply the number of trials multiplied by the probability of success. The variance is maximised (check yourself) when p = 1/2; success and failure being equally likely gives the most uncertainty.
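A short Python sketch can confirm the pmf, E(X) = np and Var(X) = np(1 - p) for a case I have chosen, ten fair coin flips (the helper name is my own):

```python
from math import comb

def binom_pmf(x, n, p):
    """P(X = x) = C(n, x) p^x (1 - p)^(n - x) for X ~ B(n, p)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

n, p = 10, 0.5  # e.g. counting heads in 10 fair coin flips
pmf = [binom_pmf(x, n, p) for x in range(n + 1)]

print(round(pmf[5], 4))  # P(5 heads) = 252/1024, about 0.2461

# Mean and variance computed directly from the pmf, as weighted averages.
mean = sum(x * q for x, q in enumerate(pmf))
var = sum(x * x * q for x, q in enumerate(pmf)) - mean**2
print(round(mean, 6), round(var, 6))  # 5.0 and 2.5, matching np and np(1 - p)
```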
As mentioned in the Tree Diagrams refresher, any discrete random variable can be transformed into a Bernoulli by picking a set of outcomes as success and the rest as failure. When rolling a die, let even numbers be success and odd be failure. When spinning a roulette wheel, let our chosen numbers be success and the rest be failure. In a football match, let winning be a success and losing or drawing be a failure. This makes the Binomial distribution extremely flexible once the conditions on n and p are satisfied.
The following is a modified version of Q15 on the 2024 A level mathematics paper 3.
Question 5.
- Find
- Show
- Find
- Find
(b)
(c) To get P(X ≥ 6) we could find the probability for each number 6 and above and add them together. It is quicker to use the relation P(X ≥ 6) = 1 - P(X ≤ 5). To get P(X ≤ 5), we get the probabilities for 0, 1, 2, 3, 4 and 5. Summing these together gives P(X ≤ 5), and so P(X ≥ 6) = 1 - P(X ≤ 5).
(d) To get P(9 ≤ X ≤ 15), we find the probability for each number between 9 and 15 and sum them.
Normal Distribution
The Binomial is the most common discrete random variable you may have encountered up until now. When dealing with continuous data, the Normal distribution is most common.

When measuring heights of students in a class, a reasonable first assumption to make on the distribution is symmetry; a student is as likely to be tall as they are short. Similarly, assume a share price is just as likely to increase as decrease. For a symmetric distribution, the mean and median are equal.
A second assumption we might make is unimodality; there is only one mode, a single peak in the curve. This assumes we have only one height in the class that is most common, or one share price that is most likely. For a symmetric distribution, the mean and mode (and thus median) are equal.
Definition 12. We say X follows a Normal (or Gaussian) distribution with mean μ and variance σ², written X ~ N(μ, σ²), if it has pdf f(x) = (1/(σ√(2π))) e^(-(x - μ)²/(2σ²)) for all real x.
The cdf of the Normal distribution is not available in a nice, closed form, and so is instead written as an integral of the pdf. The graph of the pdf for multiple values of μ and σ is below; μ dictates the location and σ dictates the width/spread of the graph.

The shape of the distribution is a bell curve. Often, the phrases "Normal" and "bell curve" are used interchangeably. However, other distributions have bell-shaped graphs too, so use Normal or Gaussian to be specific.
The mean μ tells us where the centre of the bell curve is, while the standard deviation σ describes its width. A large σ makes it more likely to observe extreme values and thus less likely to observe values close to μ. We see how a bigger σ (the green curve) leads to a more "squashed" graph. As the area under the curve is fixed at 1, increasing the width decreases the height.
The Normal is continuous with Range(X) = R; technically, any real number is possible. However, as we get further from the expectation, the probability decreases to the point where it is essentially impossible. The larger the variance, the larger this range of reasonable values.
The standard deviation leads to the following empirical rule:
- c.68% of the distribution is within one standard deviation of the mean
- c.95% of the distribution is within two standard deviations of the mean
- c.99.7% of the distribution is within three standard deviations of the mean

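The empirical rule can be checked numerically. The standard Normal cdf has no closed form, but Python's error function gives it exactly (the helper name Phi is my own):

```python
from math import erf, sqrt

def Phi(z):
    """Standard Normal cdf, written via the error function: Phi(z) = (1 + erf(z/sqrt 2))/2."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# Probability of being within k standard deviations of the mean.
for k in (1, 2, 3):
    print(k, round(Phi(k) - Phi(-k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```

These match the 68%, 95% and 99.7% figures above.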
This is why hypothesis tests often use a 5% significance level; the remaining 95% is roughly two standard deviations either side of the mean.
We can assume X is a Normal random variable, but until we record real data it is impossible to know if this is appropriate. A histogram of the data checks whether the requirements (symmetry, unimodality) are met. Nonetheless, the Normal is a good default choice for a distribution due to its nice properties (which we will see) and ubiquity in nature. I often think of dropping marbles down a board of pegs arranged like Pascal's triangle; most will settle in the middle but some will spread out towards the edges.

The symmetry of the Normal distribution gives us the following property: P(X ≤ μ - a) = P(X ≥ μ + a). Thus, complex probability statements can be generated from simply knowing the probabilities on one side of the mean.

The Standard Normal Distribution
The ability to break down complex statements is important because we cannot directly integrate the pdf. Instead, we use probabilities already given to us (such as in statistical tables). However, it is impossible to have these probabilities available for every possible μ and σ, so we use a reference distribution.
Rather than thinking of the heights of students, think of their difference from the average height. Then negative values will indicate shorter students and so on. For a share price, think of the change from some starting price instead.
Definition 13. We say Z follows the Standard Normal distribution if it is a Normal distribution with μ = 0 and σ = 1.
Note: We usually represent the Standard Normal with a Z, but Z can also be used for any Normal. Sometimes, φ is used for the Standard Normal pdf and Φ for its cdf.
To get from any Normal to the Standard Normal, we standardise.
Definition 14. Suppose X ~ N(μ, σ²). Let Z = (X - μ)/σ. Then, Z ~ N(0, 1).
So, to standardise any Normal random variable we:
- Shift by subtracting the expected value
- Scale by dividing by the standard deviation
Using the previous properties to help intuition: E(Z) = (E(X) - μ)/σ = 0 and Var(Z) = Var(X)/σ² = 1.
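Standardising is easy to sketch in code: any Normal probability reduces to one standard Normal lookup. The height numbers below (mean 170, standard deviation 8) are my own illustration, not from the text:

```python
from math import erf, sqrt

def Phi(z):
    """Standard Normal cdf via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def normal_cdf(x, mu, sigma):
    """P(X <= x) for X ~ N(mu, sigma^2), by standardising to Z = (X - mu)/sigma."""
    return Phi((x - mu) / sigma)

mu, sigma = 170, 8  # illustrative heights in cm
print(round(normal_cdf(170, mu, sigma), 4))  # 0.5 -- the mean is also the median
print(round(normal_cdf(178, mu, sigma), 4))  # 0.8413 = Phi(1), one sd above the mean
```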
Standardising the Normal is very useful because we only need the specific probabilities for this one distribution; everything else can be derived from them.
The Normal Approximation
The Normal distribution is widely studied not just because of its simplicity and ubiquity, but also because it has a very useful property for large sample sizes.
Suppose X counts how many twos you get in n rolls of a fair 6-sided die, such that X ~ B(n, 1/6). We can easily find P(X = x) for any single x. However, getting P(X ≤ 80) would require individually calculating the probability of all 81 numbers in the set {0, 1, ..., 80}. There is a quicker way.
Definition 15. We say a discrete random variable has a Normal approximation if its probabilities can be approximated by those of a Normal distribution under certain conditions.
By approximating the number of twos with a Normal distribution, we find P(X ≤ 80) through two calculations, the mean np and the variance np(1 - p), rather than 81.
Note: It is often recommended to perform a continuity correction when approximating a discrete random variable with the Normal. This entails adding or subtracting a small number, usually 0.5, to the value of interest to account for the end points of the interval. I will not do this in my solutions for simplicity.
The Binomial is the most common distribution to approximate. It becomes more symmetric as n grows and it is tedious to sum many Binomial probabilities. Thus, when n is large, the Normal approximation of the Binomial is appropriate.
Note: It is common to make statements about "large" n, but often what constitutes large is not given or justified. What it actually means is that the statement is true in the limit as n → ∞, and so we choose a large n along the way to get close. The larger n is, the better the approximation.
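We can compare the exact Binomial sum with the Normal approximation. The original number of rolls is not shown in the text, so I have assumed n = 480 purely for illustration (it gives a mean of 80 twos, matching the P(X ≤ 80) calculation above); the helper Phi is my own:

```python
from math import comb, erf, sqrt

def Phi(z):
    """Standard Normal cdf via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

n, p = 480, 1 / 6        # assumed: 480 rolls, counting twos, so X ~ B(480, 1/6)
mu = n * p               # np = 80
sigma = sqrt(n * p * (1 - p))

# Exact: sum all 81 Binomial probabilities for P(X <= 80).
exact = sum(comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(81))

# Approximate: one standard Normal lookup (with a 0.5 continuity correction).
approx = Phi((80.5 - mu) / sigma)

print(round(exact, 4), round(approx, 4))  # the two agree closely
```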
The reasons for the Normal approximation are rooted in the Central Limit Theorem and Law of Large Numbers; you will explore these in great detail but these are core properties as to why the Normal distribution is so popular.
The following is a modified version of Q17 from the 2024 A level Maths Paper 3.
Question 6. In 2019, the lengths of babies at a newborn clinic could be modelled by a Normal distribution with mean 50 cm and standard deviation 4 cm.
-
State the probability that the length of a new-born baby is less than 50 cm.
-
Find the probability that the length of a new-born baby is more than 56 cm.
-
Find the probability that the length of a new-born baby is more than 40 cm but less than 60 cm.
-
Determine the length exceeded by 95% of all new-born babies at the clinic.
(a) We want $P(X < 50)$. Notice that 50 cm is the mean and, by symmetry, also the median. Hence, the probability is $0.5$.
(b) We want $P(X > 56)$. To find this, we standardise. So, we need $P\left(Z > \frac{56 - 50}{4}\right)$, which is $P(Z > 1.5)$. Thus, $P(Z > 1.5) = 1 - \Phi(1.5) = 1 - 0.9332 = 0.0668$.
(c) To find $P(40 < X < 60)$ we standardise both end points. This gives the interval $P(-2.5 < Z < 2.5)$. Using some algebra: $\Phi(2.5) - \Phi(-2.5) = \Phi(2.5) - (1 - \Phi(2.5)) = 2\Phi(2.5) - 1 = 2(0.9938) - 1 = 0.9876$.
(d) This is the length $x$ such that $P(X > x) = 0.95$. This is equivalent to $P(X \le x) = 0.05$, and thus we need $z$ such that $\Phi(z) = 0.05$.
Certain Normal tables might give you this number, but generally we must take advantage of symmetry and instead find the $z^*$ with $\Phi(z^*) = 0.95$. We find $z^* = 1.6449$ and thus $z = -1.6449$.
Finding $z$ is the first step; we must now transform it back to $x$: $x = \mu + \sigma z = 50 + 4 \times (-1.6449) = 43.42$. So 95% of babies are longer than 43.42 cm when born.
Note: Part (d) references 95%, but this is not 2 standard deviations away as in the empirical rule. There is a difference between the number with 95% above it, as in (d), and the two numbers with 95% between them. Those two numbers have 2.5% on either side.
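As an optional numerical check of parts (b)-(d), the following Python sketch reproduces the answers, using the error function for $\Phi$ and a simple bisection for its inverse:

```python
import math

def phi(z):
    """Standard Normal cdf."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def phi_inv(q):
    """Inverse Normal cdf by bisection, for q in (0, 1)."""
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if phi(mid) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

mu, sigma = 50, 4  # lengths modelled as N(50, 4^2)

p_b = 1 - phi((56 - mu) / sigma)                       # (b) P(X > 56)
p_c = phi((60 - mu) / sigma) - phi((40 - mu) / sigma)  # (c) P(40 < X < 60)
x_d = mu + sigma * phi_inv(0.05)                       # (d) P(X > x) = 0.95

print(round(p_b, 4), round(p_c, 4), round(x_d, 2))  # 0.0668 0.9876 43.42
```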
A Quick Note
The following three distributions were not covered in A level Maths but may have been in Further Maths. They will be used to solve questions from Further Maths or STEP later but don’t panic if this is your first time seeing them.
Uniform Distribution
The simplest form of any probability distribution is to assume each outcome is equally likely. When we roll a fair die or flip a fair coin, we have no reason to believe any outcome is more likely than another; we assume the probabilities are uniform.

Definition 16. A random variable $X$ follows a uniform distribution if every outcome in the range has the same probability of occurring.
For a discrete random variable with a finite range, we get the pmf by assigning each outcome the same probability: 1 divided by the number of outcomes. An example would be flipping a fair coin or rolling a fair die; the word fair suggests uniformity. However, we cannot define a pmf this way for a continuous range such as $[0, 1]$; think about how you could prove this.
Nonetheless, we still define the uniform distribution for a continuous random variable. Suppose Range($X$) $= [a, b]$. Then, we assume each number in the interval is equally likely and we get the pdf to be: $f(x) = \frac{1}{b-a}$ for $a \le x \le b$, and $f(x) = 0$ otherwise. The cdf is then defined as: $F(x) = \frac{x-a}{b-a}$ for $a \le x \le b$, with $F(x) = 0$ below $a$ and $F(x) = 1$ above $b$. We write $X \sim U(a, b)$, with:
-
$E(X) = \frac{a+b}{2}$,
-
$\mathrm{Var}(X) = \frac{(b-a)^2}{12}$
As we might expect, the expectation is the average of the two end points, so right in the middle.
The Uniform distribution is often called the rectangular distribution because of how the graph of the pdf looks.
We see how it is a horizontal line between $a$ and $b$, and then zero elsewhere. The area under the curve being 1 is also clear: the base is $b - a$ and the height is $\frac{1}{b-a}$.
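The mean and variance formulas can be checked numerically; a small Python sketch with the arbitrary (assumed) choice $a = 2$, $b = 8$:

```python
# Assumed example: X ~ U(2, 8).
a, b = 2, 8
f = 1 / (b - a)  # constant pdf height on [a, b]

# Midpoint Riemann sums for E(X) and E(X^2) over [a, b].
N = 100_000
xs = [a + (b - a) * (i + 0.5) / N for i in range(N)]
mean = sum(x * f for x in xs) * (b - a) / N
var = sum(x * x * f for x in xs) * (b - a) / N - mean**2

print(mean, (a + b) / 2)       # both approximately 5.0
print(var, (b - a) ** 2 / 12)  # both approximately 3.0
```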
A random number generator is a common example of a continuous uniform distribution. The following was Q6 in the 2023 Further Maths paper.
Question 7. A game consists of two rounds. The first round of the game uses a random number generator to output the score $X$, a real number between 0 and 10.
- Find
The second round of the game uses an unbiased 6-sided die to give the score $Y$. The variables $X$ and $Y$ are independent.
-
Find the mean total score of the game.
-
Find the variance of the total score of the game.
(b) The total score is $X + Y$. We know $E(X + Y) = E(X) + E(Y)$. Using the formula for a uniform, $E(X) = \frac{0 + 10}{2} = 5$. We previously found $E(Y) = 3.5$, and so $E(X + Y) = 5 + 3.5 = 8.5$.
(c) We also have that $\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)$ by independence. We previously found $\mathrm{Var}(Y) = \frac{35}{12}$. Then, for a continuous uniform distribution: $\mathrm{Var}(X) = \frac{(10 - 0)^2}{12} = \frac{100}{12}$.
Hence, $\mathrm{Var}(X + Y) = \frac{100}{12} + \frac{35}{12} = \frac{135}{12} = 11.25$.
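A quick Python check of the mean and variance of the total score, using the uniform formulas and the fair die directly:

```python
# X ~ U(0, 10) from the random number generator, Y a fair 6-sided die.
a, b = 0, 10
mean_x = (a + b) / 2       # 5
var_x = (b - a) ** 2 / 12  # 100/12

faces = [1, 2, 3, 4, 5, 6]
mean_y = sum(faces) / 6                            # 3.5
var_y = sum(f * f for f in faces) / 6 - mean_y**2  # 35/12

# Independence lets us add the means and the variances.
print(mean_x + mean_y)  # 8.5
print(var_x + var_y)    # 11.25
```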
Question 4 was originally presented in the form of a discrete uniform distribution rather than as an n-sided die; they are equivalent.
Poisson Distribution
Often, we don’t have a fixed number of trials but instead a fixed interval of time, and we wish to count the number of events that happen in that interval. The Poisson distribution is applicable in these cases.

Rather than considering whether United win or lose a match, suppose we want to predict how many goals they score in a game. There is no fixed number of trials; the Binomial can model how many times in 10 games they score more than 2 goals, for example, but not how likely a given number of goals is in one game.
Instead, we consider the average number of goals they score in a game and use that to find the probability. This is the Poisson distribution, and it requires the following assumptions.
As there is no fixed number of trials, there is no upper limit to the number of events that can happen. In theory, United could score any number of goals, although physical (and tactical) limitations prevent that in reality.
The average rate of goals should be assumed consistent within the interval. If United score 1.5 goals on average, this rate should be the same no matter what point in the game we are at. This is often a simplification; it is more common to score goals near the end of a game than the start.
Finally, the time taken for a goal should be independent of when the last goal was scored. It doesn't matter if the last goal was 5 minutes or 50 minutes ago; the probability a goal is scored now is the same.
Essentially, the Poisson is applicable when we have an average rate of occurrence, each occurrence has no effect on the others, and there is no maximum that can happen.
Definition 17. Suppose a random variable $X$ counts how often an event occurs over a fixed interval. $X$ follows a Poisson distribution, written $X \sim \mathrm{Po}(\lambda)$, if:
-
Range($X$) $= \{0, 1, 2, \dots\}$
-
$X$ has a constant mean rate $\lambda$ over the fixed interval
-
$X$ is independent of the time since the last event
For a $\mathrm{Po}(\lambda)$ distribution, we have: $P(X = k) = \frac{e^{-\lambda} \lambda^k}{k!}$ for $k = 0, 1, 2, \dots$, with $E(X) = \lambda$ and $\mathrm{Var}(X) = \lambda$.
So for a Poisson distribution the mean and variance are equal.
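This equality of mean and variance can be verified numerically; a short Python sketch with an illustrative (assumed) rate $\lambda = 1.5$:

```python
import math

lam = 1.5  # illustrative rate, e.g. goals per game

def pois_pmf(k, lam):
    """P(X = k) for X ~ Po(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)

# Truncate the infinite sums at K; the tail beyond this is negligible here.
K = 100
mean = sum(k * pois_pmf(k, lam) for k in range(K))
var = sum(k * k * pois_pmf(k, lam) for k in range(K)) - mean**2

print(mean, var)  # both approximately 1.5
```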

Note: $\lambda$ is not just a count, but a count over a fixed interval. It is the average number of goals scored over a 90-minute game. If the rate is given for one interval but we are interested in another, we can scale the rate $\lambda$. For example, we can divide by 90 to get the goals per minute and go from there. It is important to make sure our rate matches the interval.
Note: The interval of interest for the Poisson is generally time, but it could be something like length (such as accidents over a stretch of road) or area (such as trees in a forest). In linear modelling courses you will encounter such cases.
There is a link between the Poisson and Binomial.
Example 6. Suppose we are interested in how often customers visit a shop. One approach would be to follow people who walk by the shop, and count how many come in and how many walk by. This would be a $B(n, p)$ random variable, where $p$ is the probability someone who walks by comes in, and $np$ would be our final estimate.
Another approach would be to wait for one hour in the shop and count how many people enter. This would be a $\mathrm{Po}(\lambda)$ random variable, where $\lambda$ is the average number of people who come in every hour, and $\lambda$ would be our final estimate.
In fact, when $n$ is large and $p$ is small (so $1 - p$ is close to 1 and $np \approx \lambda$), a $\mathrm{Po}(np)$ distribution is a good approximation of the Binomial $B(n, p)$ distribution. I like to think of the Poisson as the Binomial with an infinite number of trials; in each tiny interval of time you are checking whether the event happens or not. So when $n$ is large they are similar.
So when $n$ is large, the Poisson and Normal are both good approximations of the Binomial. As you might expect, this means the Normal is a good approximation of the Poisson as well when $\lambda$ is large.
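A minimal Python comparison of the two pmfs, with assumed illustrative values $n = 1000$ and $p = 0.003$ (so $np = 3$):

```python
import math

# Assumed illustration: n large and p small, so B(n, p) is close to Po(np).
n, p = 1000, 0.003
lam = n * p  # 3.0

binom = [math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(6)]
pois = [math.exp(-lam) * lam**k / math.factorial(k) for k in range(6)]

for k in range(6):
    print(k, round(binom[k], 4), round(pois[k], 4))  # the pairs agree closely
```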
Exponential Distribution
We saw the Poisson distribution as a way of counting how many events occur in a fixed interval. An alternative but equivalent way of thinking about this problem is as the time between events.
The number of customers entering a queue is often modelled with a Poisson. Suppose we know that in a one-hour period 2 people entered. Instead, I could say that 15 minutes in, one person came; then, 35 minutes after that, another person came; 20 minutes after that, a third person arrived. This third person came after the hour, and so we know only two people came in the first hour.
We write $X \sim \mathrm{Exp}(\lambda)$.

Note: Just as with the Poisson, the rate parameter $\lambda$ is the rate per unit time. It is common to transition between the Exponential and Poisson distributions, and you should be careful with what $\lambda$ represents.
The Exponential distribution has a property similar to the Poisson, called the memoryless property: $P(X > s + t \mid X > s) = P(X > t)$.
That is, it doesn't matter how much time has passed up to the current point; we only need to look from now onwards.
We have, for $x \ge 0$: $f(x) = \lambda e^{-\lambda x}$ and $F(x) = 1 - e^{-\lambda x}$, with $E(X) = \frac{1}{\lambda}$ and $\mathrm{Var}(X) = \frac{1}{\lambda^2}$.
The result for expectation should make sense when we think of the link to the Poisson; if we expect $\lambda$ events to occur in one unit of time and the rate is constant, then we should expect the average time between each event to be $\frac{1}{\lambda}$.
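Both the memoryless property and the expectation can be checked numerically; a Python sketch with an assumed rate $\lambda = 2$:

```python
import math

lam = 2.0  # assumed illustrative rate per unit time

def sf(t):
    """Survival function P(X > t) for X ~ Exp(lam)."""
    return math.exp(-lam * t)

# Memoryless property: P(X > s + t | X > s) = P(X > t).
s, t = 0.7, 1.3
cond = sf(s + t) / sf(s)
print(cond, sf(t))  # identical up to rounding

# Mean 1/lam, by a Riemann sum of t * pdf over [0, 20].
N = 200_000
h = 20 / N
mean = sum((i * h) * lam * math.exp(-lam * i * h) * h for i in range(N))
print(mean, 1 / lam)  # both approximately 0.5
```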
The Exponential distribution is very important when measuring times between events, such as in risk analysis for insurance companies.
Key Takeaways:
-
The Binomial distribution is for counting the number of successes in $n$ independent Bernoulli trials
-
The Normal distribution is common, simple and a good approximation of some discrete distributions
-
The Uniform distribution assumes each outcome is equally likely
-
The Poisson distribution counts the number of events in a fixed interval of time
-
The Exponential distribution measures the time between events, rather than counting them over time like the Poisson
Worked Questions
The following was Q8 in 2019 Further Maths Paper 3.
Question 8. The number of phone calls an office receives is modelled by a Poisson distribution with a mean rate of 3 calls per 10 minutes.
-
Find the probability exactly 2 calls are received in 10 minutes.
-
Find the probability the office receives more than 30 calls in an hour.
The office manager splits an hour into six 10-minute periods and records the number of calls taken in each period.
- Find the probability that the office receives exactly 2 calls in a 10-minute period exactly twice within an hour.
The office has just received a call.
Find the probability the next call is received more than 10 minutes later.
Mahah arrives at the office 5 minutes after the last call was received.
- State the probability that the next call received by the office is received more than 10 minutes later. Explain your answer.
Solution 8. Click Here for Video
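These are not the official solutions (see the video), but the quantities can be sketched numerically in Python under the stated model, labelling the five parts (a)-(e) in order:

```python
import math

def pois_pmf(k, lam):
    """P(X = k) for X ~ Po(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)

# (a) Exactly 2 calls in 10 minutes, at a rate of 3 per 10 minutes.
p_a = pois_pmf(2, 3)

# (b) More than 30 calls in an hour: the rate scales to 6 * 3 = 18.
p_b = 1 - sum(pois_pmf(k, 18) for k in range(31))

# (c) "Exactly 2 calls in a 10-minute period" in exactly 2 of the 6
#     periods: a Binomial B(6, p_a) evaluated at 2.
p_c = math.comb(6, 2) * p_a**2 * (1 - p_a) ** 4

# (d) The waiting time to the next call is Exponential, so
#     P(T > 10 minutes) = P(no calls in 10 minutes) = e^{-3}.
p_d = math.exp(-3)

# (e) By the memoryless property, the 5 minutes already elapsed do not
#     matter, so the answer equals part (d).
print(round(p_a, 4), round(p_b, 4), round(p_c, 4), round(p_d, 4))
```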
The following is Q11 in STEP II 2023.
Question 9.
Solution 9. Click Here for Video