Chapter 4: Continuous Random Variables and Their Probability Distributions
STAT6039 Principles of Mathematical Statistics
Cumulative Distribution Function
The cumulative distribution function (cdf) of a random variable Y is
defined to be
F (y) = P (Y ≤ y), for − ∞ < y < ∞.
We may also write F (y) as FY (y).
1 / 71
Cumulative Distribution Function
Example 1: Suppose that Y ∼ Bin(2, 1/2). Find and sketch F (y).
Solution:
The probability function of Y is given by f(y) = (2 choose y)(1/2)^y (1/2)^{2−y}, which
yields P (Y = 0) = 1/4, P (Y = 1) = 1/2, P (Y = 2) = 1/4.
For any y < 0, P (Y ≤ y) = 0 since the only values of Y that are
assigned positive probabilities are 0, 1, and 2 and none of these values are
less than or equal to y if y < 0.
For any 0 ≤ y < 1, P (Y ≤ y) = P (Y = 0) = 1/4,
For any 1 ≤ y < 2, P (Y ≤ y) = P (Y = 0) + P (Y = 1) = 3/4,
For any y ≥ 2, P (Y ≤ y) = P (Y = 0) + P (Y = 1) + P (Y = 2) = 1.
2 / 71
Cumulative Distribution Function
Solution (continued):
In general,
F (y) = P (Y ≤ y) = 0,   for y < 0,
                  = 1/4, for 0 ≤ y < 1,
                  = 3/4, for 1 ≤ y < 2,
                  = 1,   for y ≥ 2.
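The piecewise cdf above is easy to check numerically; the sketch below assumes Y ∼ Bin(2, 1/2) and the helper names are illustrative, not part of the course materials.

```python
from math import comb

def binom_pmf(y, n=2, p=0.5):
    # P(Y = y) for Y ~ Bin(n, p)
    return comb(n, y) * p**y * (1 - p)**(n - y)

def binom_cdf(y, n=2, p=0.5):
    # F(y) = P(Y <= y): sum the pmf over support points not exceeding y
    return sum(binom_pmf(k, n, p) for k in range(n + 1) if k <= y)

# The steps match the piecewise form derived above
assert binom_cdf(-0.5) == 0.0   # y < 0
assert binom_cdf(0.5) == 0.25   # 0 <= y < 1
assert binom_cdf(1.5) == 0.75   # 1 <= y < 2
assert binom_cdf(2.0) == 1.0    # y >= 2
```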
3 / 71
Cumulative Distribution Function
Note:
From Example 1, it is clear that the cumulative distribution function stays
flat between the possible values of Y and increases in jumps or steps at
each of the possible values of Y .
Functions that behave in such a manner are called step functions.
Cumulative distribution functions of discrete random variables are always
step functions because the cumulative distribution function increases only at
the finite or countable number of points with positive probabilities.
4 / 71
Cumulative Distribution Function
Theorem 1 (Properties of Cumulative Distribution Function):
If F (y) is a cumulative distribution function, then:
1. F (−∞) ≡ limy→−∞ F (y) = 0.
2. F (∞) ≡ limy→∞ F (y) = 1.
3. F (y) is a nondecreasing function of y, i.e., if y1 and y2 are any values
such that y1 < y2 , then F (y1 ) ≤ F (y2 ).
4. F (y) is right continuous, i.e. limδ→0+ F (y + δ) = F (y). (In Example
1 this corresponds to the fact that F (0) = 0.25, not 0.)
5 / 71
Continuous Random Variable
For a continuous random variable:
• Sample space is a continuous interval.
• There are an infinite number of possible outcomes and they cannot be
counted.
• P (Y = y) = 0 for all y. We need different rules for doing probability
calculations. Focus on P (Y ≤ y) instead.
A random variable Y is said to be continuous if its cdf F (y) is
continuous everywhere.
Note: If P (Y = y0 ) = p0 > 0, then F (y) would have a discontinuity (jump)
of size p0 at the point y0 , violating the assumption that F (y) was continuous.
6 / 71
Continuous Random Variable
Example 2: Let Y be a number chosen randomly between 0 and 1. Find Y ’s
cdf. Is Y a cts rv?
Solution:
F (0.1) = P (Y ≤ 0.1) = 0.1,
F (0.5) = P (Y ≤ 0.5) = 0.5,
F (0.9) = P (Y ≤ 0.9) = 0.9, etc. Thus, we conclude
F (y) = 0, for y < 0,
      = y, for 0 ≤ y ≤ 1,
      = 1, for y > 1.
7 / 71
Probability Density Function
If there exists a nonnegative function f (y) such that
F (y) = ∫_{−∞}^{y} f (t) dt,
we call f (y) the probability density function (pdf) of the continuous random
variable Y .
For any y at which the derivative of F (y) exists,
f (y) = dF (y)/dy = F ′(y).
We may also write it as fY (y).
Note: The area under the pdf of Y to the left of y is just F (y).
8 / 71
Probability Density Function
Theorem 2 (Properties of Probability Density Function):
If f (y) is a pdf of a continuous random variable Y , then
1. f (y) ≥ 0 for all y, −∞ < y < ∞.
2. ∫_{−∞}^{∞} f (y) dy = 1.
Note:
The pdf f (y) may be greater than 1 and need not be continuous everywhere.
The total area under the pdf equals 1.
9 / 71
Probability Density Function
Example 2 (continued): Let Y be a number chosen randomly between 0
and 1. Find the pdf of Y and graph it.
Solution:
The pdf f (y) is the derivative of F (y) wherever the derivative exists. Thus,
f (y) = F ′(y) = 0, for y < 0,
              = 1, for 0 < y < 1,
              = 0, for y > 1,
and f (y) is undefined at y = 0 and y = 1.
10 / 71
Probability Density Function
Solution (continued):
11 / 71
Conventions and Simplifications for Notations
• It may be convenient to consider undefined values of a pdf as being
equal to zero.
• It may be convenient not to specify where a pdf is 0, nor where a cdf is
0 or 1. Thus we may write: F (y) = y, 0 ≤ y ≤ 1 and
f (y) = 1, 0 ≤ y ≤ 1 in Example 2.
• These details have no effect on calculations if considering a
continuous distribution, but may be important when considering a
discrete or mixed distribution.
• Graphs and formulae of continuous pdfs and cdfs can be simplified
accordingly.
12 / 71
Probability Density Function
Theorem 3:
If a continuous random variable Y has pdf f (y) and a < b, then the
probability that Y falls in the interval [a, b] is
P (a ≤ Y ≤ b) = ∫_{a}^{b} f (y) dy.
Note:
Since P (Y = a) = P (Y = b) = 0, the above theorem implies that
P (a ≤ Y ≤ b) = P (a < Y ≤ b) = P (a ≤ Y < b) = P (a < Y < b).
13 / 71
Probability Density Function
Example 3: Given f (y) = cy 2 , 0 ≤ y ≤ 2, find the value of c for which
f (y) is a valid pdf of a random variable Y and then calculate P (1 < Y < 2).
Solution:
We require a value for c such that
1 = ∫_{−∞}^{∞} f (y) dy = ∫_{0}^{2} cy² dy = [cy³/3]_{0}^{2} = 8c/3.
Then we find that c = 3/8, so that f (y) = (3/8)y², 0 ≤ y ≤ 2.
Thus, we have
P (1 < Y < 2) = ∫_{1}^{2} (3/8)y² dy = [y³/8]_{1}^{2} = 7/8.
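Both the normalizing constant and the probability can be verified with a crude numerical integration; the Python sketch below uses a midpoint rule, and the helper names are illustrative.

```python
def f(y):
    # pdf from Example 3: f(y) = (3/8) y^2 on [0, 2], zero elsewhere
    return 3/8 * y**2 if 0 <= y <= 2 else 0.0

def integrate(func, a, b, n=100_000):
    # simple midpoint-rule approximation of the integral of func over [a, b]
    h = (b - a) / n
    return sum(func(a + (i + 0.5) * h) for i in range(n)) * h

total = integrate(f, 0, 2)   # should be 1 for a valid pdf
prob = integrate(f, 1, 2)    # should be 7/8
assert abs(total - 1) < 1e-6
assert abs(prob - 7/8) < 1e-6
```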
14 / 71
Expected Value
15 / 71
Expected Value of a Continuous Random Variable
Let Y be a continuous random variable with pdf f (y). Then the expected
value of Y , E(Y ), is defined to be
E(Y ) = ∫_{−∞}^{∞} y f (y) dy,
provided that the integral exists.
Let g(Y ) be a function of Y . Then the expected value of g(Y ) is given by
E[g(Y )] = ∫_{−∞}^{∞} g(y) f (y) dy,
provided that the integral exists.
16 / 71
Properties of Expected Value
Theorem 4:
Let a, b, c be constants and let g(Y ), g1 (Y ), g2 (Y ), ..., gk (Y ) be functions
of a continuous random variable Y . Then the following results hold:
1. E(c) = c.
2. E[ag(Y ) + b] = aE[g(Y )] + b.
3. E[g1 (Y ) + · · · + gk (Y )] = E[g1 (Y )] + · · · + E[gk (Y )].
Proof: Similar to the proof of the expected values of a discrete random
variable by replacing the sum with the integral.
17 / 71
Moments
The kth (raw) moment of a random variable Y : µ′k = E(Y k ).
The kth central moment of a random variable Y : µk = E[(Y − µ)k ].
Note:
µ′1 = µ and µ1 = 0.
V ar(Y ) = σ 2 = µ2 = µ′2 − µ2 .
18 / 71
Moments
Example 4: In Example 3, we determined that f (y) = 38 y 2 for 0 ≤ y ≤ 2 is
a valid pdf. If the random variable Y has this pdf, find µ = E(Y ) and
σ 2 = V ar(Y ).
Solution: By definition, we have
µ = E(Y ) = ∫_{0}^{2} y · (3/8)y² dy = [(3/8)(1/4)y⁴]_{0}^{2} = 1.5.
The variance of Y can be found once we determine E(Y ²).
E(Y ²) = ∫_{0}^{2} y² · (3/8)y² dy = [(3/8)(1/5)y⁵]_{0}^{2} = 2.4.
Thus, σ² = Var(Y ) = E(Y ²) − µ² = 2.4 − 1.5² = 0.15.
19 / 71
Uniform Distribution
20 / 71
Uniform Distribution
A random variable Y has a uniform distribution with parameters a and
b if and only if its pdf has the form
f (y) = 1/(b − a), a < y < b,
where −∞ < a < b < ∞.
We write Y ∼ U nif orm(a, b) or Y ∼ U nif (a, b) or Y ∼ U (a, b).
We call a the lower bound parameter, and we call b the upper bound
parameter.
If U ∼ U (0, 1), we say U has the standard uniform distribution.
21 / 71
Uniform Distribution
Example 5: Suppose Y ∼ U (a, b). Find the cdf of Y .
Solution:
F (y) = ∫_{−∞}^{y} f (t) dt = ∫_{a}^{y} 1/(b − a) dt = (y − a)/(b − a), a < y < b.
22 / 71
Uniform Distribution
Theorem 5: Let Y ∼ U (a, b). Then
µ = E(Y ) = (a + b)/2 and σ² = Var(Y ) = (b − a)²/12.
Proof: Left as an exercise.
23 / 71
Uniform Distribution
Example 6: The length of time patients wait to see a doctor is uniformly
distributed between 40 mins and 3 hrs. Find the probability of waiting
between 1 and 2 hrs, given you waited over 90 mins.
Solution: Let Y be the waiting time (in mins), so we have Y ∼ U (40, 180)
with pdf f (y) = 1/140, 40 < y < 180. Then,
P (60 < Y < 120 | Y > 90) = P (90 < Y < 120)/P (Y > 90).
Since
P (Y > 90) = ∫_{90}^{180} (1/140) dy = (180 − 90)/140 = 9/14,
P (90 < Y < 120) = ∫_{90}^{120} (1/140) dy = (120 − 90)/140 = 3/14,
we get
P (60 < Y < 120 | Y > 90) = (3/14)/(9/14) = 1/3.
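Uniform probabilities reduce to interval lengths, so this conditional probability can be checked in a few lines; the helper below is an illustrative sketch for Y ∼ U(40, 180).

```python
a, b = 40, 180  # waiting time Y ~ U(40, 180), in minutes

def unif_prob(lo, hi):
    # P(lo < Y < hi) for Y ~ U(a, b): overlap length divided by b - a
    lo, hi = max(lo, a), min(hi, b)
    return max(hi - lo, 0) / (b - a)

# P(60 < Y < 120 | Y > 90) = P(90 < Y < 120) / P(Y > 90)
p_cond = unif_prob(90, 120) / unif_prob(90, 180)
assert abs(unif_prob(90, 180) - 9/14) < 1e-12
assert abs(unif_prob(90, 120) - 3/14) < 1e-12
assert abs(p_cond - 1/3) < 1e-12
```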
24 / 71
More Discussion
• All intervals of the same length on the distribution’s support are
equally probable.
• Often used for bounded data. In practice, if we randomly select a value
from some fixed interval, say (a, b), then the value follows U (a, b).
• The standard uniform distribution U (0, 1) can be used to generate
random variables following other distributions: if U ∼ U (0, 1), then the
random variable FY^{−1}(U ) has the same distribution as Y .
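The inverse-transform idea in the last bullet can be sketched for the exponential distribution, whose cdf F(y) = 1 − e^{−y/β} inverts to F^{−1}(u) = −β ln(1 − u); the sample size, seed and β below are arbitrary illustrative choices.

```python
import math
import random

random.seed(42)
beta = 2.0  # illustrative scale parameter

def exp_inv_cdf(u, beta):
    # F^{-1}(u) = -beta * ln(1 - u), the quantile function of Exp(beta)
    return -beta * math.log(1 - u)

# Feeding standard uniforms through the inverse cdf yields Exp(beta) samples
samples = [exp_inv_cdf(random.random(), beta) for _ in range(20_000)]
sample_mean = sum(samples) / len(samples)
assert abs(sample_mean - beta) < 0.1   # E(Y) = beta for Y ~ Exp(beta)
```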
25 / 71
Normal Distribution
26 / 71
Normal Distribution
A random variable Y is said to have a normal distribution if and only if,
for σ > 0 and −∞ < µ < ∞, the pdf of Y is
f (y) = (1/(√(2π) σ)) e^{−(y−µ)²/(2σ²)}, −∞ < y < ∞.
We write Y ∼ N (µ, σ 2 ).
Theorem 6: Let Y ∼ N (µ, σ 2 ). Then
E(Y ) = µ and V ar(Y ) = σ 2 .
Note: We call µ the mean parameter and σ 2 the variance parameter. Or we
call σ the standard deviation parameter.
27 / 71
Normal Distribution
The pdf of the normal distribution is bell-shaped and symmetric about µ; it
reaches its highest point at y = µ and tends to zero as y → ±∞.
28 / 71
Normal Distribution
Changing µ (different means) will shift the pdf curve left and right.
Changing σ 2 (different variances) will make the pdf curve become more
peaked or more flattened.
29 / 71
Normal Distribution
The cdf of Y ∼ N (µ, σ²) is
F (y) = ∫_{−∞}^{y} (1/(√(2π) σ)) e^{−(t−µ)²/(2σ²)} dt.
For any −∞ < a < b < ∞,
P (a < Y < b) = F (b) − F (a) = ∫_{a}^{b} (1/(√(2π) σ)) e^{−(t−µ)²/(2σ²)} dt.
However, no closed-form expression exists for this integral; hence its
evaluation requires numerical integration techniques.
30 / 71
Standard Normal Distribution
If Z ∼ N (0, 1), we say that Z has the standard normal distribution.
The pdf can be written as
ϕ(y) = (1/√(2π)) e^{−y²/2}.
The cdf can be written as
Φ(y) = ∫_{−∞}^{y} (1/√(2π)) e^{−t²/2} dt.
31 / 71
z-Table
The table of probabilities for a standard normal distribution Z ∼ N (0, 1)
is called a z-table (Table 4 in the “statistical table” file) and it lists
probabilities of the form P (Z > z) for various values of z. Some books
have tables of P (Z < z).
From the table, for example we would get
P (Z > 1.67) = 0.0475,
P (Z > 1.96) = 0.0250.
Using symmetry, we get
Φ(−1.67) = P (Z < −1.67) = P (Z > 1.67) = 0.0475,
Φ(1.67) = P (Z < 1.67) = 1 − P (Z > 1.67) = 1 − 0.0475 = 0.9525,
P (0 < Z < 1.67) = P (Z < 1.67)−P (Z < 0) = 0.9525−0.5 = 0.4525,
P (−1.96 < Z < 1.96) = 1 − 2P (Z > 1.96) = 1 − 2 × 0.0250 = 0.95.
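These tabled values can be reproduced from the error function, since Φ(z) = (1 + erf(z/√2))/2; a Python sketch (the `Phi` helper is illustrative):

```python
from math import erf, sqrt

def Phi(z):
    # standard normal cdf expressed through the error function
    return 0.5 * (1 + erf(z / sqrt(2)))

assert abs((1 - Phi(1.67)) - 0.0475) < 5e-4     # P(Z > 1.67)
assert abs((1 - Phi(1.96)) - 0.0250) < 5e-4     # P(Z > 1.96)
assert abs((Phi(1.67) - Phi(0)) - 0.4525) < 5e-4  # P(0 < Z < 1.67)
assert abs((Phi(1.96) - Phi(-1.96)) - 0.95) < 1e-3
```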
32 / 71
Quantile
Let Y be a random variable with cdf F (y). For each p strictly between 0
and 1, define F −1 (p) to be the smallest value y such that F (y) ≥ p. Then
F −1 (p) is called the p quantile of Y or the 100p-th percentile of Y .
If Y is a continuous random variable, p quantile of Y or the 100p-th
percentile of Y is the value y such that F (y) = p.
For example, the median (50th percentile) of N (µ, σ²) is µ and the
median (50th percentile) of U (a, b) is (a + b)/2.
33 / 71
z-Table
z-Table can also be used to find quantiles.
The (lower) p quantile function of Z is Φ−1 (p).
For example, Φ−1 (0.9525) = 1.67 and Φ−1 (0.0475) = −1.67.
The upper p quantile function of Z is zp = Φ−1 (1 − p), i.e.
P (Z < zp ) = 1 − p so that P (Z > zp ) = p.
For example, z0.0475 = 1.67
and z0.025 = 1.96 (a common fact everyone should memorise).
Note: When looking up a probability in order to find zp , look up the closest
probability in the table, or if the probability lies exactly in the middle
between two probabilities in the table, choose the mid-point of the two
corresponding z-values as zp .
34 / 71
Normal Distribution
Theorem 7: If Y ∼ N (µ, σ 2 ), then the linear transformation
Z = (Y − µ)/σ ∼ N (0, 1)
standardizes Y to be a N (0, 1) random variable.
Note: “Standardizing” a random variable usually means subtracting its mean
and then (after that) dividing by the random variable’s standard deviation.
35 / 71
Normal Distribution
Example 7: Suppose that Y ∼ N (10, 16). Find P (7 < Y < 11).
Solution: Since Y ∼ N (10, 16), we have Z = (Y − 10)/4 ∼ N (0, 1).
Then,
P (7 < Y < 11) = P ((7 − 10)/4 < (Y − 10)/4 < (11 − 10)/4)
= P (−0.75 < Z < 0.25)
= P (Z < 0.25) − P (Z < −0.75)
= 1 − P (Z > 0.25) − P (Z > 0.75)
= 1 − 0.4013 − 0.2266
= 0.3721.
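The standardization in Example 7 can be checked against the error-function form of Φ; this is an illustrative sketch, not part of the original slides.

```python
from math import erf, sqrt

def Phi(z):
    # standard normal cdf via the error function
    return 0.5 * (1 + erf(z / sqrt(2)))

# P(7 < Y < 11) for Y ~ N(10, 16): standardize with mu = 10, sigma = 4
mu, sigma = 10, 4
p = Phi((11 - mu) / sigma) - Phi((7 - mu) / sigma)
assert abs(p - 0.3721) < 5e-4
```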
36 / 71
More Discussion
• The normal distribution is symmetric and bell-shaped. The mean, median
and mode (the value at which f (y) is maximized) are all equal to µ.
• If the histogram of a sample is bell-shaped and quite symmetric, then
the normal distribution can be used to model such data.
• We can use the property that ∫_{−∞}^{∞} (1/(√(2π) σ)) e^{−(y−µ)²/(2σ²)} dy = 1 to
calculate integrals of the form ∫_{−∞}^{∞} e^{−ay²+by} dy, where a > 0.
More properties about normal distribution will be discussed in the
following chapters.
37 / 71
Gamma Distribution
38 / 71
Gamma Distribution
A random variable Y is said to have a gamma distribution with
parameters α > 0 and β > 0 if and only if the pdf of Y is
f (y) = y^{α−1} e^{−y/β}/(β^α Γ(α)), 0 ≤ y < ∞,
where
Γ(α) = ∫_{0}^{∞} t^{α−1} e^{−t} dt
(gamma function).
We write Y ∼ Gamma(α, β), or Y ∼ G(α, β).
Note: We call α the shape parameter and β the scale parameter.
39 / 71
Gamma Distribution
Properties of the gamma function:
Γ(α) = (α − 1)Γ(α − 1) for any α > 1.
Γ(α) = (α − 1)! for any positive integer α.
Γ(1/2) = √π = 1.7725 (to four decimals).
Γ(1.5) = 0.5Γ(0.5) = 0.5√π = 0.8862 (to four decimals).
Γ(2.5) = 1.5Γ(1.5) = 1.3293 (to four decimals).
Γ(α) has a minimum of 0.8856 at α = 1.47 (to two decimals).
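These special values and the recursion can be confirmed directly with the standard-library gamma function:

```python
from math import gamma, pi, sqrt

# Recursion and special values listed above
assert abs(gamma(5) - 24) < 1e-9             # Gamma(5) = 4! = 24
assert abs(gamma(0.5) - sqrt(pi)) < 1e-12    # Gamma(1/2) = sqrt(pi)
assert abs(gamma(1.5) - 0.5 * sqrt(pi)) < 1e-12
assert abs(gamma(2.5) - 1.5 * gamma(1.5)) < 1e-12  # Gamma(a) = (a-1)Gamma(a-1)
assert abs(gamma(2.5) - 1.3293) < 5e-4
```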
40 / 71
Gamma Distribution
Note: Nonnegative and right skewed.
41 / 71
Gamma Distribution
Theorem 8: Let Y ∼ G(α, β). Then
µ = E(Y ) = αβ and σ 2 = V ar(Y ) = αβ 2 .
Theorem 9: Let Y ∼ G(α, β) and let k > 0 be a constant. Then
X = kY ∼ G(α, kβ)
42 / 71
Chi-square Distribution
If Y ∼ G(m/2, 2), we say that Y has the chi-square distribution with
parameter m. We write Y ∼ χ2 (m) and call m the degrees of freedom.
Theorem 10: Let Y ∼ χ2 (m). Then
µ = E(Y ) = m and σ 2 = V ar(Y ) = 2m.
Note: Quantiles of χ2 (m) can be obtained from Table 6 in the “statistical
table” file.
43 / 71
Exponential Distribution
If Y ∼ G(1, β), we say that Y has the exponential distribution with
parameter β. We write Y ∼ Exp(β).
The pdf of Y is f (y) = (1/β) e^{−y/β}, y > 0.
If Y ∼ Exp(1), we say Y has the standard exponential distribution.
Theorem 11: Let Y ∼ Exp(β). Then
µ = E(Y ) = β and σ 2 = V ar(Y ) = β 2 .
Note: Exp(2) = G(1, 2) = χ2 (2).
44 / 71
Exponential Distribution
Example 8: Find the cdf of Y ∼ Exp(β). Then show that if a > 0 and
b > 0, P (Y > a + b|Y > a) = P (Y > b).
Solution: The cdf of Y is
F (y) = ∫_{0}^{y} (1/β) e^{−t/β} dt = [−e^{−t/β}]_{0}^{y} = 1 − e^{−y/β}, y > 0.
We have P (Y > y) = 1 − P (Y ≤ y) = 1 − F (y) = e^{−y/β}, y > 0.
Therefore,
P (Y > a + b | Y > a) = P ({Y > a + b} ∩ {Y > a})/P (Y > a) = P (Y > a + b)/P (Y > a)
= e^{−(a+b)/β}/e^{−a/β} = e^{−b/β} = P (Y > b).
The conditional probability does not depend on the past information a.
45 / 71
Exponential Distribution
Memoryless Property: P (Y > a + b|Y > a) = P (Y > b).
The exponential distribution is memoryless because the past has no
impact on its future behaviour.
For example, suppose that jobs in our system have exponentially
distributed service times. If we have a job that’s been running for one hour,
the probability that a job runs for two additional hours is the same as the
probability that it ran for two hours originally, regardless of how long it’s
been running.
Every instant is like the beginning of a new random period, which has the
same distribution regardless of how much time has already elapsed.
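The memoryless identity from Example 8 can be verified numerically from the survival function P(Y > y) = e^{−y/β}; the values of β, a and b below are arbitrary illustrative choices.

```python
from math import exp

beta = 2.0          # illustrative scale parameter

def surv(y):
    # P(Y > y) = e^{-y/beta} for Y ~ Exp(beta)
    return exp(-y / beta)

a, b = 1.5, 3.0     # arbitrary positive values
lhs = surv(a + b) / surv(a)   # P(Y > a + b | Y > a)
rhs = surv(b)                 # P(Y > b)
assert abs(lhs - rhs) < 1e-12  # memoryless property
```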
46 / 71
More Discussion
Gamma Distribution:
• The pdf is right skewed, so the gamma distribution is often used for
modeling nonnegative skewed data, such as the sizes of insurance claims
and rainfall amounts.
• In Bayesian analysis, the gamma distribution is widely used as a
conjugate prior for some parameters. We can also use the property that
∫_{0}^{∞} y^{α−1} e^{−y/β}/(β^α Γ(α)) dy = 1 to calculate the integral
∫_{0}^{∞} y^a e^{−by} dy, where a > −1, b > 0.
47 / 71
More Discussion
Chi-Squared Distribution:
• One of the most widely used probability distributions in inferential
statistics, notably in hypothesis testing and construction of confidence
intervals.
• The sum of squares of m independent standard normal random variables
follows a chi-squared distribution with m degrees of freedom, i.e.
Y = Σ_{i=1}^{m} Z_i² ∼ χ²(m), where Z_i, i = 1, . . . , m, are independent
standard normal random variables.
48 / 71
More Discussion
Exponential Distribution:
• The exponential distribution is often used to model waiting times.
• For example, in queuing theory, the service times of agents in a system
(e.g. how long it takes for a bank teller to serve a customer) are often
modeled as exponentially distributed random variables.
49 / 71
Beta Distribution
50 / 71
Beta Distribution
A random variable Y is said to have a beta distribution with parameters
α > 0 and β > 0 if and only if the pdf of Y is
f (y) = y^{α−1}(1 − y)^{β−1}/B(α, β), 0 < y < 1,
where
B(α, β) = Γ(α)Γ(β)/Γ(α + β)
(beta function).
We write Y ∼ Beta(α, β).
Note: If α = β = 1, f (y) = 1, 0 < y < 1. Thus, Beta(1, 1) = U (0, 1).
51 / 71
Beta Distribution
52 / 71
Beta Distribution
Theorem 12: Let Y ∼ Beta(α, β). Then
µ = E(Y ) = α/(α + β) and σ² = Var(Y ) = αβ/((α + β)²(α + β + 1)).
53 / 71
Beta Distribution
Example 9: A gasoline wholesale distributor has bulk storage tanks that
hold fixed supplies and are filled every Monday. Of interest to the wholesaler
is the proportion of this supply that is sold during the week. Over many
weeks of observation, the distributor found that this proportion could be
modeled by a beta distribution with α = 4 and β = 2. Find the probability
that the wholesaler will sell at least 90% of her stock in a given week.
Solution: If Y denotes the proportion sold during the week, then
Y ∼ Beta(4, 2). So
f (y) = y^{4−1}(1 − y)^{2−1}/B(4, 2) = (Γ(6)/(Γ(4)Γ(2)))(y³ − y⁴) = 20(y³ − y⁴), 0 < y < 1.
P (Y > 0.9) = ∫_{0.9}^{1} 20(y³ − y⁴) dy = 20[(1/4)y⁴ − (1/5)y⁵]_{0.9}^{1} = 0.08.
It is not very likely that 90% of the stock will be sold in a given week.
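The tail probability in Example 9 follows from the antiderivative of the Beta(4, 2) pdf; a quick check (the helper name is illustrative):

```python
def F_beta42(y):
    # antiderivative of the Beta(4, 2) pdf 20(y^3 - y^4): 20(y^4/4 - y^5/5)
    return 20 * (y**4 / 4 - y**5 / 5)

p = F_beta42(1.0) - F_beta42(0.9)   # P(Y > 0.9)
assert abs(p - 0.08146) < 1e-4      # rounds to the slides' 0.08
```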
54 / 71
More Discussion
• Often used to model proportion or percentage data, since the support of
the beta distribution is (0, 1).
• The beta distribution has an important application in the theory of
order statistics of uniform distributions.
• In Bayesian analysis, the beta distribution is widely used as a conjugate
prior of parameter p for the Bernoulli, binomial, negative binomial and
geometric distributions.
• We can use the property that ∫_{0}^{1} y^{α−1}(1 − y)^{β−1}/B(α, β) dy = 1 to calculate
the integral ∫_{0}^{1} y^a (1 − y)^b dy, where a > −1, b > −1.
55 / 71
Moment Generating Functions
56 / 71
Moment Generating Functions
The moment generating function (mgf) of a random variable Y is defined
to be m(t) = E(etY ). It exists if there is a constant b > 0 such that m(t) is
finite for |t| ≤ b.
Theorem 13: Let Y be a continuous random variable with pdf f (y) and
g(Y ) be a function of Y . Then, the moment generating function of g(Y ) is
E[e^{tg(Y )}] = ∫_{−∞}^{∞} e^{tg(y)} f (y) dy.
57 / 71
Moment Generating Functions
Two important applications:
1. To compute raw moments, according to the equation µ′_k = m^{(k)}(0).
Here, m^{(k)}(0) denotes the kth derivative of m(t) evaluated at t = 0, i.e.
d^k m(t)/dt^k |_{t=0}. We may also write m^{(0)}(t) as m(t), m^{(1)}(t) as m′(t) and
m^{(2)}(t) as m′′(t), etc.
2. “If two random variables X and Y have the same mgf, then they also
have the same distribution”. It follows by “the uniqueness theorem”, a result
in pure mathematics.
58 / 71
Moment Generating Functions
Example 10: Find the moment generating function of a gamma distributed
random variable and calculate its µ′k .
Solution:
m(t) = E(e^{tY}) = ∫_{0}^{∞} e^{ty} · y^{α−1} e^{−y/β}/(β^α Γ(α)) dy
= (1/(β^α Γ(α))) ∫_{0}^{∞} y^{α−1} exp[−y(1/β − t)] dy
= (1/(β^α Γ(α))) ∫_{0}^{∞} y^{α−1} exp[−y/{β/(1 − βt)}] dy
= ({β/(1 − βt)}^α/β^α) ∫_{0}^{∞} (1/({β/(1 − βt)}^α Γ(α))) y^{α−1} exp[−y/{β/(1 − βt)}] dy
= ({β/(1 − βt)}^α/β^α) × 1
= 1/(1 − βt)^α = (1 − βt)^{−α}, t < 1/β,
where (1/({β/(1 − βt)}^α Γ(α))) y^{α−1} exp[−y/{β/(1 − βt)}] is the pdf of G(α, β/(1 − βt)).
59 / 71
Moment Generating Functions
Solution (continued): Since m(t) = (1 − βt)^{−α}, we have
m′(t) = dm(t)/dt = (−α)(1 − βt)^{−(α+1)}(−β) = αβ(1 − βt)^{−(α+1)},
m′′(t) = dm′(t)/dt = −(α + 1)αβ(1 − βt)^{−(α+2)}(−β) = α(α + 1)β²(1 − βt)^{−(α+2)},
m^{(3)}(t) = dm′′(t)/dt = α(α + 1)(α + 2)β³(1 − βt)^{−(α+3)}.
In general,
m^{(k)}(t) = d^k m(t)/dt^k = α(α + 1) · · · (α + k − 1)β^k (1 − βt)^{−(α+k)}.
Thus, we have µ′_k = m^{(k)}(0) = α(α + 1) · · · (α + k − 1)β^k.
For example, µ = µ′_1 = αβ and µ′_2 = α(α + 1)β² so that
σ² = µ′_2 − µ² = α(α + 1)β² − α²β² = αβ².
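The moment formula µ′_k = α(α+1)···(α+k−1)β^k obtained from the mgf can be cross-checked against direct numerical integration of y^k f(y); the parameter values and integration settings below are illustrative choices.

```python
from math import gamma, exp

alpha, beta = 3.0, 2.0   # illustrative parameter values

def pdf(y):
    # G(alpha, beta) density
    return y**(alpha - 1) * exp(-y / beta) / (beta**alpha * gamma(alpha))

def raw_moment(k, n=100_000, upper=100.0):
    # midpoint-rule approximation of E(Y^k); tail beyond `upper` is negligible
    h = upper / n
    return sum(((i + 0.5) * h)**k * pdf((i + 0.5) * h) for i in range(n)) * h

def mgf_moment(k):
    # mu'_k = alpha (alpha+1) ... (alpha+k-1) beta^k from the mgf derivation
    prod = 1.0
    for j in range(k):
        prod *= alpha + j
    return prod * beta**k

assert abs(raw_moment(1) - mgf_moment(1)) < 1e-3   # alpha * beta = 6
assert abs(raw_moment(2) - mgf_moment(2)) < 1e-2   # alpha(alpha+1)beta^2 = 48
```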
60 / 71
Summary of Continuous Random Variables
The table is from page 838 of: Wackerly, Mendenhall and Scheaffer (2008), Mathematical Statistics with Applications.
61 / 71
Chebyshev’s Theorem
62 / 71
Chebyshev’s Theorem (Review)
Theorem 14: Let Y be a random variable with mean µ and finite variance
σ². Then, for any constant k > 0,
P (|Y − µ| < kσ) ≥ 1 − 1/k²  or  P (|Y − µ| ≥ kσ) ≤ 1/k².
Note:
The result applies to any probability distribution (both discrete and
continuous).
63 / 71
Chebyshev’s Theorem
Example 11: Suppose that experience has shown that the length of time Y
(in minutes) required to conduct a periodic maintenance check on a dictating
machine follows a gamma distribution with α = 3.1 and β = 2. A new
maintenance worker takes 22.5 minutes to check the machine. Does this
length of time to perform a maintenance check disagree with prior
experience?
Solution:
Since Y ∼ G(3.1, 2), we have µ = αβ = 6.2 and σ² = αβ² = 12.4. It
follows that σ = √12.4 = 3.52.
By Chebyshev's Theorem, we have
P (Y ≥ 22.5 or Y ≤ −10.1) = P (|Y − µ| ≥ 4.63σ) ≤ 1/4.63² = 0.0466.
64 / 71
Chebyshev’s Theorem
Solution (continued):
Therefore, P (Y ≥ 22.5) ≤ P (Y ≥ 22.5 or Y ≤ −10.1) ≤ 0.0466.
This probability is based on the assumption that the distribution of
maintenance times has not changed from prior experience. Since
P (Y ≥ 22.5) is small, we must conclude either that our new maintenance
worker has by chance generated a lengthy maintenance time that occurs with
low probability, or that the new worker is somewhat slower than preceding
ones. Considering the low probability of P (Y ≥ 22.5), we favor the latter view.
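The numbers in Example 11 follow directly from the gamma moment formulas and the Chebyshev bound; a quick check:

```python
from math import sqrt

alpha, beta = 3.1, 2.0
mu = alpha * beta                 # 6.2
sigma = sqrt(alpha * beta**2)     # sqrt(12.4), about 3.52

k = (22.5 - mu) / sigma           # standard deviations from the mean
bound = 1 / k**2                  # Chebyshev bound on P(|Y - mu| >= k sigma)
assert abs(k - 4.63) < 0.01
assert abs(bound - 0.0466) < 5e-4
```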
65 / 71
Mixed Distribution
66 / 71
Mixed Distribution
A random variable Y is mixed and has a mixed distribution if its cdf
increases continuously over some intervals and also has jumps at some
isolated points with positive probabilities.
Equivalently, Y is mixed if its cdf has the form
F (y) = cFX (y) + (1 − c)FZ (y),
where 0 < c < 1, FX (y) is the cdf of a discrete random variable X, and FZ (y)
is the cdf of a continuous random variable Z.
Note: Y has a mixed distribution if it is neither discrete nor continuous but a
"mixture" of those two kinds, in the sense that the cdf of Y is a weighted
average of a discrete cdf and a continuous cdf.
67 / 71
Mixed Distribution
Theorem 15: If Y is mixed with cdf F (y) = cFX (y) + (1 − c)FZ (y),
where 0 < c < 1, FX (y) is the cdf of a discrete random variable X, FZ (y)
is the cdf of a continuous random variable Z, then:
E(Y ) = cE(X) + (1 − c)E(Z).
And for any function g(Y ),
E[g(Y )] = cE[g(X)] + (1 − c)E[g(Z)].
68 / 71
Mixed Distribution
Example 12: A light bulb has a 20% chance of failing immediately, and
otherwise its lifetime follows an exponential distribution with mean 100
hours. Find the cdf, mean and standard deviation of Y , the overall lifetime
of the lightbulb.
Solution: Let Y be the overall lifetime of the lightbulb (in hundreds of hours), so that Y is nonnegative.
Since a light bulb has a 20% chance of failing immediately, we know P (Y = 0) = 0.2 so that
P (Y ̸= 0) = P (Y > 0) = 0.8.
For any y < 0, F (y) = 0. For any y ≥ 0, we have
F (y) = P (Y ≤ y)
= P (Y ≤ y|Y = 0)P (Y = 0) + P (Y ≤ y|Y > 0)P (Y > 0)
= 1 × 0.2 + (1 − e^{−y}) × 0.8 = 0.2 + 0.8(1 − e^{−y}).
69 / 71
Mixed Distribution
Solution (continued):
Then, F (y) = 0.2FX (y) + (1 − 0.2)FZ (y),
where X ∼ Bern(0) (i.e. X = 0 with probability 1) and Z ∼ Exp(1).
It is easy to get E(X) = 0, E(X²) = 0, E(Z) = 1,
E(Z²) = Var(Z) + [E(Z)]² = 1 + 1² = 2.
Thus,
E(Y ) = 0.2E(X) + 0.8E(Z) = 0.8 (i.e. 80 hours),
E(Y ²) = 0.2E(X²) + 0.8E(Z²) = 1.6,
σ² = Var(Y ) = E(Y ²) − [E(Y )]² = 1.6 − 0.8² = 0.96,
σ = √0.96 = 0.9798 (i.e. 97.98 hours).
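The weighted-average moment rule of Theorem 15 makes these numbers a short computation; an illustrative sketch:

```python
from math import sqrt

c = 0.2                 # weight on the discrete part (immediate failure)
EX, EX2 = 0.0, 0.0      # X is 0 with probability 1
EZ, EZ2 = 1.0, 2.0      # Z ~ Exp(1): E(Z) = 1, E(Z^2) = Var(Z) + E(Z)^2 = 2

# Mixed-distribution moments are weighted averages of the component moments
EY = c * EX + (1 - c) * EZ
EY2 = c * EX2 + (1 - c) * EZ2
var = EY2 - EY**2
assert abs(EY - 0.8) < 1e-12
assert abs(var - 0.96) < 1e-12
assert abs(sqrt(var) - 0.9798) < 1e-4
```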
70 / 71
Conclusion
• Be able to distinguish different continuous random variables and apply
those in real applications.
• Be familiar with the probability density function, expected value,
variance and moment generating function of each commonly used
continuous random variable.
• Know how to calculate expected value, variance, quantile and moment
generating function by using definitions and properties.
• Be able to apply Chebyshev's Theorem to bound probabilities.
• Know how to obtain the cdf, expected value and variance for mixed
distribution.
71 / 71
Chapter 5: Multivariate Probability Distributions
STAT6039 Principles of Mathematical Statistics
Bivariate and Multivariate
Random Variable
1 / 65
Joint Probability Mass Function
Example 1: A die is rolled once. Let X = number of 6s and Y = number of
even numbers. Find the joint probability P (X = x, Y = y) where x, y are
the possible values of X and Y , respectively.
Solution:
Outcome:     1  2  3  4  5  6
Value of X:  0  0  0  0  0  1
Value of Y:  0  1  0  1  0  1
P (X = 1, Y = 1) = P ({6}) = 1/6
P (X = 0, Y = 1) = P ({2}) + P ({4}) = 1/3
P (X = 0, Y = 0) = P ({1}) + P ({3}) + P ({5}) = 1/2
2 / 65
Joint Probability Mass Function
Let X and Y be discrete random variables. The joint (or bivariate)
probability (mass) function of X and Y is defined to be
f (x, y) = P (X = x, Y = y), −∞ < x < ∞, −∞ < y < ∞.
Theorem 1: If X and Y are discrete random variables with joint pmf
f (x, y), then
1. 0 ≤ f (x, y) ≤ 1 for all x, y.
2. Σ_{all x,y} f (x, y) = 1, where the summation is over all values (x, y) that
are assigned nonzero probabilities.
Note: We may also write this function as fX,Y (x, y) or p(x, y) or
pX,Y (x, y) and refer to it as the joint pmf of X and Y .
3 / 65
Joint Probability Mass Function
Example 2: A die is rolled once. Let X = number of 6s and Y = number of
even numbers. Find the joint pmf of X and Y .
Solution:
The joint pmf of X and Y is
f (x, y) = 1/2, for x = y = 0,
         = 1/3, for x = 0, y = 1,
         = 1/6, for x = y = 1.
Table:
f (x, y)   y = 0   y = 1
x = 0      1/2     1/3
x = 1              1/6
4 / 65
Joint Cumulative Distribution Function
For any random variables X and Y , the joint (or bivariate) cumulative
distribution function is defined to be
F (x, y) = P (X ≤ x, Y ≤ y), for − ∞ < x < ∞, −∞ < y < ∞. We may
also write it as FX,Y (x, y).
Note:
For two discrete variables X and Y , F (x, y) is given by
F (x1, y1) = Σ_{x≤x1} Σ_{y≤y1} f (x, y).
5 / 65
Joint Cumulative Distribution Function
Example 3: Refer to Example 2. Find the joint cdf of X and Y .
Solution: The joint cdf of X and Y is
F (x, y) = 0,   for x < 0 or y < 0 (or both),
         = 1/2, for x ≥ 0, 0 ≤ y < 1,
         = 5/6, for 0 ≤ x < 1, y ≥ 1,
         = 1,   for x ≥ 1, y ≥ 1.
6 / 65
Joint Cumulative Distribution Function
Four properties of joint cdfs:
1. F (x, y) → 0 as x → −∞, or y → −∞ or both.
2. F (x, y) → 1 as x → ∞ and y → ∞.
3. F (x, y) is a nondecreasing function in both x and y directions.
4. F (x, y) is right continuous in both x and y directions.
7 / 65
Joint Probability Density Function
Let X and Y be continuous random variables with joint cdf F (x, y). If
there exists a nonnegative function f (x, y), such that
F (x, y) = ∫_{−∞}^{x} ∫_{−∞}^{y} f (t1, t2) dt2 dt1
for all −∞ < x < ∞, −∞ < y < ∞, then X and Y are said to be jointly
continuous random variables. The function f (x, y) is called the joint (or
bivariate) probability density function.
Note: The joint pdf f (x, y) of X and Y can be obtained from its joint cdf:
f (x, y) = ∂²F (x, y)/(∂x∂y), wherever the derivative exists.
8 / 65
Joint Probability Density Function
Theorem 2: If X and Y are jointly continuous random variables with joint
pdf f (x, y), then
1. f (x, y) ≥ 0 for all x, y.
2. ∫_{−∞}^{∞} ∫_{−∞}^{∞} f (x, y) dxdy = 1 (the entire volume under the density
surface is 1).
Theorem 3: If X and Y are jointly continuous random variables with joint
pdf f (x, y), then
P (x1 ≤ X ≤ x2, y1 ≤ Y ≤ y2) = ∫_{x1}^{x2} ∫_{y1}^{y2} f (x, y) dydx
(a volume under the joint pdf).
9 / 65
Joint Probability Density Function
10 / 65
Joint Probability Density Function
Example 4: Suppose that X and Y are two continuous random variables
with joint pdf f (x, y) = cxy, 0 < x < 2y < 4. Find P (X > 1, Y < 1).
Solution: First, we need to find c.
1 = ∫_{0}^{2} ∫_{0}^{2y} cxy dxdy = ∫_{0}^{2} [cx²/2]_{0}^{2y} y dy = ∫_{0}^{2} 2cy³ dy = [cy⁴/2]_{0}^{2} = 8c.
Thus, c = 1/8 and f (x, y) = (1/8)xy, 0 < x < 2y < 4.
P (X > 1, Y < 1) = ∫_{1/2}^{1} ∫_{1}^{2y} (1/8)xy dxdy = ∫_{1/2}^{1} [x²/16]_{1}^{2y} y dy
= ∫_{1/2}^{1} (4y³ − y)/16 dy = 9/256.
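A crude two-dimensional midpoint rule confirms both the normalizing constant and P(X > 1, Y < 1); the grid size below is an arbitrary accuracy/speed trade-off.

```python
def f(x, y):
    # joint pdf from Example 4: f(x, y) = xy/8 on 0 < x < 2y < 4
    return x * y / 8 if 0 < x < 2 * y < 4 else 0.0

def dbl_integral(xlo, xhi, ylo, yhi, n=500):
    # midpoint rule on an n-by-n grid; crude but adequate for a sanity check
    hx, hy = (xhi - xlo) / n, (yhi - ylo) / n
    return sum(f(xlo + (i + 0.5) * hx, ylo + (j + 0.5) * hy)
               for i in range(n) for j in range(n)) * hx * hy

assert abs(dbl_integral(0, 4, 0, 2) - 1) < 0.02        # total probability
assert abs(dbl_integral(1, 4, 0, 1) - 9 / 256) < 2e-3  # P(X > 1, Y < 1)
```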
11 / 65
Multivariate Random Variable
The joint cdf of more than one random variable is called a
multivariate cdf.
Suppose we have n random variables Y1 , Y2 , · · · , Yn , then the joint cdf is
defined to be
F (y1 , y2 , · · · , yn ) = P (Y1 ≤ y1 , Y2 ≤ y2 , · · · , Yn ≤ yn ).
Denote Y = (Y1 , Y2 , · · · , Yn )⊤ and let y = (y1 , y2 · · · , yn )⊤ , then the
cdf of Y becomes F (y) = F (y1 , y2 , · · · , yn ), which is defined on
n-dimensional space Rn . We call Y the multivariate random variable. If
n = 2, we often say it is a bivariate random variable.
12 / 65
Multivariate Random Variable
Discrete multivariate random variable:
If multivariate random variable Y = (Y1 , Y2 , · · · , Yn )⊤ can only take a
finite number or a countably infinite sequence of different possible values
(y1 , y2 , · · · , yn )⊤ in Rn , it is a discrete multivariate random variable.
Equivalently, if Y1 , Y2 , · · · , Yn are all discrete random variables, the
vector Y = (Y1 , Y2 , · · · , Yn )⊤ is a discrete multivariate random variable.
The pmf of Y or the joint pmf of Y1, Y2, · · · , Yn is defined to be
P (Y = y) = f (y) = P (Y1 = y1, Y2 = y2, · · · , Yn = yn) = f (y1, y2, · · · , yn).
13 / 65
Multivariate Random Variable
Continuous multivariate random variable:
If Y1 , Y2 , · · · , Yn are all continuous random variables, the vector
Y = (Y1 , Y2 , · · · , Yn )⊤ is a continuous multivariate random variable.
The nonnegative function f (y1 , y2 , · · · , yn ), such that
F (y) = F (y1, y2, · · · , yn) = ∫_{−∞}^{y1} ∫_{−∞}^{y2} · · · ∫_{−∞}^{yn} f (t1, t2, · · · , tn) dtn · · · dt2 dt1,
is said to be the pdf of Y or the joint pdf of Y1 , Y2 , · · · , Yn .
The pdf of Y or the joint pdf of Y1, Y2, · · · , Yn can be derived from
f (y) = f (y1, y2, · · · , yn) = ∂ⁿF (y1, y2, · · · , yn)/(∂y1 ∂y2 · · · ∂yn)
at all points y = (y1 , · · · , yn ) where the derivative exists.
14 / 65
Marginal and Conditional
Probability Distributions
15 / 65
Marginal Probability Distributions
Let X and Y be jointly discrete random variables with joint pmf
f (x, y). Then the marginal probability mass functions of X and Y ,
respectively, are given by
fX (x) = Σ_{all y} f (x, y) and fY (y) = Σ_{all x} f (x, y).
Let X and Y be jointly continuous random variables with joint pdf
f (x, y). Then the marginal density functions of X and Y , respectively, are
given by
fX (x) = ∫_{−∞}^{∞} f (x, y) dy and fY (y) = ∫_{−∞}^{∞} f (x, y) dx.
16 / 65
Marginal Probability Distributions
Example 5: The joint pmf of X and Y is given below. Find the
marginal pmfs of X and Y .
f (x, y)   y = 0   y = 1
x = 0      1/2     1/3
x = 1              1/6
Solution:
fX (0) = Σ_{all y} f (0, y) = f (0, 0) + f (0, 1) = 1/2 + 1/3 = 5/6,
fX (1) = Σ_{all y} f (1, y) = f (1, 1) = 1/6,
fY (0) = Σ_{all x} f (x, 0) = f (0, 0) = 1/2,
fY (1) = Σ_{all x} f (x, 1) = f (0, 1) + f (1, 1) = 1/3 + 1/6 = 1/2.
Therefore, X ∼ Bern(1/6) and Y ∼ Bern(1/2).
17 / 65
Marginal Probability Distributions
Example 5 (continued):
Solution (continued):
Equivalently, what we have done is to compute column and row totals.
f (x, y)   y = 0   y = 1   fX (x)
x = 0      1/2     1/3     5/6
x = 1              1/6     1/6
fY (y)     1/2     1/2
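Computing marginals as row and column totals translates directly into code; the sketch below stores the joint pmf of Example 5 as a dictionary, with illustrative helper names.

```python
joint = {(0, 0): 1/2, (0, 1): 1/3, (1, 1): 1/6}  # joint pmf from Example 5

def marginal(pmf, axis):
    # sum the joint pmf over the other coordinate (row/column totals)
    out = {}
    for pair, p in pmf.items():
        out[pair[axis]] = out.get(pair[axis], 0.0) + p
    return out

fX = marginal(joint, 0)   # marginal pmf of X
fY = marginal(joint, 1)   # marginal pmf of Y
assert abs(fX[0] - 5/6) < 1e-12 and abs(fX[1] - 1/6) < 1e-12
assert abs(fY[0] - 1/2) < 1e-12 and abs(fY[1] - 1/2) < 1e-12
```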
18 / 65
Marginal Probability Distributions
Example 6: Suppose that X and Y are two continuous random variables
with joint pdf f (x, y) = (1/8)xy, 0 < x < 2y < 4. Find the marginal pdfs of X and
Y , respectively.
Solution:
fX (x) = ∫_{x/2}^{2} (1/8)xy dy = (x/8)[y²/2]_{x/2}^{2} = x/4 − x³/64, 0 < x < 4.
fY (y) = ∫_{0}^{2y} (1/8)xy dx = (y/8)[x²/2]_{0}^{2y} = y³/4, 0 < y < 2.
19 / 65
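The marginal densities of Example 6 can be spot-checked numerically. The sketch below (an illustration, not from the slides) approximates each integral with a midpoint Riemann sum and compares against the closed forms x/4 − x³/64 and y³/4:

```python
# Illustrative check of Example 6: integrate f(x, y) = xy/8 over 0 < x < 2y < 4.
def f(x, y):
    return x * y / 8 if 0 < x < 2 * y < 4 else 0.0

def fX(x, n=20000):          # integrate over y from x/2 to 2
    h = (2 - x / 2) / n
    return sum(f(x, x / 2 + (i + 0.5) * h) for i in range(n)) * h

def fY(y, n=20000):          # integrate over x from 0 to 2y
    h = 2 * y / n
    return sum(f((i + 0.5) * h, y) for i in range(n)) * h

assert abs(fX(1.0) - (1.0 / 4 - 1.0 ** 3 / 64)) < 1e-6
assert abs(fY(1.5) - 1.5 ** 3 / 4) < 1e-6
```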
Marginal Probability Distributions
Marginal pmf (discrete):
f1 (y1 ) = Σ_{all y2} · · · Σ_{all yn} f (y1 , y2 , · · · , yn ),
f13 (y1 , y3 ) = Σ_{all y2} Σ_{all y4} · · · Σ_{all yn} f (y1 , y2 , · · · , yn ).
Marginal pdf (continuous):
f1 (y1 ) = ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} f (y1 , y2 , · · · , yn ) dy2 · · · dyn ,
f13 (y1 , y3 ) = ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} f (y1 , y2 , · · · , yn ) dy2 dy4 · · · dyn .
Marginal cdf (discrete or continuous):
F1 (y1 ) = P (Y1 ≤ y1 ) = P (Y1 ≤ y1 , Y2 < ∞, · · · , Yn < ∞) = lim_{yj →∞, j=2,··· ,n} F (y1 , y2 , · · · , yn ),
F13 (y1 , y3 ) = P (Y1 ≤ y1 , Y3 ≤ y3 ) = P (Y1 ≤ y1 , Y2 < ∞, Y3 ≤ y3 , Y4 < ∞, · · · , Yn < ∞) = lim_{yj →∞, j=2,4,··· ,n} F (y1 , y2 , · · · , yn ).
20 / 65
Conditional Probability Distributions
If X and Y are jointly discrete random variables with joint pmf f (x, y)
and marginal pmfs fX (x) and fY (y), respectively, then the conditional
probability mass function of X given Y = y is
f (x|y) = P (X = x|Y = y) = P (X = x, Y = y)/P (Y = y) = f (x, y)/fY (y),
provided that P (Y = y) > 0. We may write it as fX|Y (x|y).
Conditional cdf of X given Y = y is defined accordingly
F (x|y) = P (X ≤ x|Y = y).
Note: Similarly, we can define f (y|x) and F (y|x).
21 / 65
Conditional Probability Distributions
Example 7: The joint and marginal pmfs of X and Y are given below.
Please find the conditional pmf of X given Y = 1.
f (x, y)   y = 0   y = 1   fX (x)
x = 0      1/2     1/3     5/6
x = 1      0       1/6     1/6
fY (y)     1/2     1/2
Solution: f (x|1) = f (x, 1)/fY (1) for x = 0, 1.
Explicitly, fX|Y (0|1) = f (0, 1)/fY (1) = (1/3)/(1/2) = 2/3 and fX|Y (1|1) = f (1, 1)/fY (1) = (1/6)/(1/2) = 1/3.
So f (x|1) = 2/3 for x = 0 and 1/3 for x = 1.
Thus, (X|Y = 1) ∼ Bern(1/3).
22 / 65
Conditional Probability Distributions
If X and Y are jointly continuous random variables with joint pdf
f (x, y) and marginal pdf fX (x) and fY (y), then the conditional cdf of X
given Y = y is
F (x|y) = P (X ≤ x|Y = y).
For any y such that fY (y) > 0, it can be shown that
F (x|y) = ∫_{−∞}^{x} f (t, y)/fY (y) dt.
Denote f (x|y) = f (x, y)/fY (y) and call it the conditional pdf of X given Y = y.
Note: Similarly, we can define f (y|x) and F (y|x).
23 / 65
Conditional Probability Distributions
Example 8: Suppose that X and Y are two continuous random variables
with joint pdf f (x, y) = (1/8)xy, 0 < x < 2y < 4. Find the conditional pdf of X
given Y = y.
Solution: From Example 6, we know fY (y) = y³/4, 0 < y < 2.
Thus,
f (x|y) = f (x, y)/fY (y) = (xy/8)/(y³/4) = x/(2y²),  0 < x < 2y < 4.
24 / 65
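A conditional pdf must integrate to 1 in x for each fixed y. The short check below (illustrative, not from the slides) verifies this for f(x|y) = x/(2y²) on 0 < x < 2y:

```python
# Illustrative check: the conditional pdf of Example 8 integrates to 1 in x.
def check(y, n=100000):
    """Midpoint Riemann sum of x / (2 y^2) over 0 < x < 2y."""
    h = 2 * y / n
    return sum((i + 0.5) * h / (2 * y * y) for i in range(n)) * h

for y in (0.5, 1.0, 1.9):
    assert abs(check(y) - 1.0) < 1e-6
```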
Conditional Probability Distributions
The definition of conditional pmf or pdf can be generalized to
multivariate case.
The conditional pmf or pdf of Y1 given Y2 = y2 , · · · , Yn = yn is
f (y1 |y2 , · · · , yn ) = f (y1 , · · · , yn ) / f2,··· ,n (y2 , · · · , yn ).
The joint conditional pmf or pdf of Y1 and Y3 given
Y2 = y2 , Y4 = y4 , · · · , Yn = yn is
f (y1 , y3 |y2 , y4 , · · · , yn ) = f (y1 , · · · , yn ) / f2,4,··· ,n (y2 , y4 , · · · , yn ).
25 / 65
Independent Random
Variables
26 / 65
Independence
Let X have cdf FX (x), Y have cdf FY (y), and X and Y have joint cdf
F (x, y). Then X and Y are said to be independent if and only if
F (x, y) = FX (x)FY (y)
for every pair of real numbers (x, y).
If X and Y are not independent, they are said to be dependent.
27 / 65
Independence
Theorem 4:
Random variables X and Y are independent if and only if for all pairs of
real numbers (x, y),
f (x, y) = fX (x)fY (y).
Note:
For discrete random variables, f (x, y) is the joint pmf, and fX (x) and fY (y) are the marginal pmfs.
For continuous random variables, f (x, y) is the joint pdf, and fX (x) and fY (y) are the marginal pdfs.
28 / 65
Independence
If f (x|y) = f (x) or f (y|x) = f (y) for all pairs of real numbers (x, y),
then X and Y are independent.
If f (x|y) ̸= f (x) or f (y|x) ̸= f (y) for some pair of real numbers (x, y),
then X and Y are dependent.
Note:
For discrete random variables, f (x|y) and f (y|x) are the conditional pmfs, and fX (x) and fY (y) are the marginal pmfs.
For continuous random variables, f (x|y) and f (y|x) are the conditional pdfs, and fX (x) and fY (y) are the marginal pdfs.
29 / 65
Independence
The definition of independence can be generalized to n dimensions.
Suppose that we have n random variables, Y1 , · · · , Yn , where Yi has cdf
Fi (yi ), for i = 1, 2, · · · , n; and where Y1 , · · · , Yn have joint cdf
F (y1 , y2 , · · · , yn ).
Then Y1 , · · · , Yn are independent if and only if
F (y1 , y2 , · · · , yn ) = F1 (y1 ) · · · Fn (yn )
or equivalently f (y1 , y2 , · · · , yn ) = f1 (y1 )f2 (y2 ) · · · fn (yn ),
for all real numbers y1 , y2 , · · · , yn , where fi (yi ) is the marginal pmf or pdf
of Yi .
30 / 65
Independence
Example 9: Let f (x, y) = 6xy², 0 ≤ x ≤ 1, 0 ≤ y ≤ 1. Show that X and Y
are independent.
Solution:
It is easy to get
fX (x) = ∫_{0}^{1} 6xy² dy = 2xy³ |_{0}^{1} = 2x,  0 ≤ x ≤ 1,
fY (y) = ∫_{0}^{1} 6xy² dx = 3x²y² |_{0}^{1} = 3y²,  0 ≤ y ≤ 1.
Hence, f (x, y) = fX (x)fY (y) for all real numbers (x, y).
Therefore, X and Y are independent.
31 / 65
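The factorization in Example 9 can be spot-checked on a grid of points. A small illustrative sketch (not part of the slides):

```python
# Illustrative spot-check of Example 9: f(x, y) = fX(x) fY(y) on a grid.
def f(x, y):  return 6 * x * y * y
def fX(x):    return 2 * x
def fY(y):    return 3 * y * y

pts = [(i / 10, j / 10) for i in range(11) for j in range(11)]
assert all(abs(f(x, y) - fX(x) * fY(y)) < 1e-12 for x, y in pts)
```

A grid check is of course not a proof, but it catches a mistaken marginal immediately.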
Expected Value and
Covariance
32 / 65
Expected Value of a Function of Multivariate R.V.
Let g(Y) = g(Y1 , · · · , Yn ) be a function of discrete multivariate random
variable Y = (Y1 , · · · , Yn )⊤ which has pmf f (y) = f (y1 , · · · , yn ).
Then the expected value of g(Y) is
E[g(Y)] = Σ_{all y} g(y)f (y) = Σ_{all yn} · · · Σ_{all y1} g(y1 , · · · , yn )f (y1 , · · · , yn ).
If Y1 , · · · , Yn are continuous random variables with joint pdf
f (y1 , y2 , · · · , yn ), then
E[g(Y)] = ∫ g(y)f (y) dy = ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} g(y1 , · · · , yn )f (y1 , · · · , yn ) dy1 · · · dyn .
33 / 65
Expected Value of a Function of Multivariate R.V.
Note:
Suppose Y1 , · · · , Yn are continuous random variables and we wish to find
the expected value of g(Y1 , · · · , Yn ) = Y1 . We have
E(Y1 ) = ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} y1 f (y1 , · · · , yn ) dy1 · · · dyn
= ∫_{−∞}^{∞} y1 [ ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} f (y1 , · · · , yn ) dy2 · · · dyn ] dy1   (by definition of marginal pdf)
= ∫_{−∞}^{∞} y1 f1 (y1 ) dy1 ,
which agrees with the definition in the univariate case.
34 / 65
Expected Value of a Function of Multivariate R.V.
Example 10: The joint and marginal pmfs of X and Y are given below.
Please find the expected value for XY, X, Y .
f (x, y)   y = 0   y = 1   fX (x)
x = 0      1/2     1/3     5/6
x = 1      0       1/6     1/6
fY (y)     1/2     1/2
Solution:
E(XY ) = Σ_{all x,y} xy f (x, y) = 0 × f (0, 0) + 0 × f (0, 1) + 1 × f (1, 1) = 1/6,
E(X) = Σ_{all x,y} x f (x, y) = 0 × f (0, 0) + 0 × f (0, 1) + 1 × f (1, 1) = 1/6,
E(Y ) = Σ_{all x,y} y f (x, y) = 0 × f (0, 0) + 1 × f (0, 1) + 1 × f (1, 1) = 1/2.
35 / 65
Expected Value of a Function of Multivariate R.V.
Example 11: Suppose that X and Y are two continuous random variables
with joint pdf f (x, y) = (1/8)xy, 0 < x < 2y < 4. Find the expected values of
XY , X and Y , respectively.
Solution:
E(XY ) = ∫_{0}^{2} ∫_{0}^{2y} xy · (1/8)xy dx dy = (1/8) ∫_{0}^{2} y² (x³/3)|_{0}^{2y} dy = (1/3) ∫_{0}^{2} y⁵ dy = (1/18) y⁶ |_{0}^{2} = 32/9,
E(X) = ∫_{0}^{2} ∫_{0}^{2y} x · (1/8)xy dx dy = (1/8) ∫_{0}^{2} y (x³/3)|_{0}^{2y} dy = (1/3) ∫_{0}^{2} y⁴ dy = (1/15) y⁵ |_{0}^{2} = 32/15,
E(Y ) = ∫_{0}^{2} ∫_{0}^{2y} y · (1/8)xy dx dy = (1/8) ∫_{0}^{2} y² (x²/2)|_{0}^{2y} dy = (1/4) ∫_{0}^{2} y⁴ dy = (1/20) y⁵ |_{0}^{2} = 8/5.
36 / 65
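The three double integrals of Example 11 can be verified numerically. The sketch below (illustrative, not from the slides) sums over a midpoint grid on the triangle 0 < x < 2y < 4:

```python
# Illustrative numeric check of Example 11's expectations over 0 < x < 2y < 4.
def expect(g, n=400):
    """Midpoint Riemann sum of g(x, y) * f(x, y) with f(x, y) = xy/8."""
    total, hy = 0.0, 2 / n
    for i in range(n):
        y = (i + 0.5) * hy
        hx = 2 * y / n          # inner range 0 < x < 2y depends on y
        for j in range(n):
            x = (j + 0.5) * hx
            total += g(x, y) * (x * y / 8) * hx
    return total * hy

assert abs(expect(lambda x, y: x * y) - 32 / 9) < 1e-3
assert abs(expect(lambda x, y: x) - 32 / 15) < 1e-3
assert abs(expect(lambda x, y: y) - 8 / 5) < 1e-3
```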
Properties of Expected Values
Theorem 5:
Let a, b be constants and let g(Y1 , · · · , Yn ), g1 (Y1 , · · · , Yn ),
g2 (Y1 , · · · , Yn ), · · · , gk (Y1 , · · · , Yn ) be functions of Y1 , · · · , Yn . Then the
following results hold:
1. E[ag(Y1 , · · · , Yn ) + b] = aE[g(Y1 , · · · , Yn )] + b.
2. E[g1 (Y1 , · · · , Yn ) + · · · + gk (Y1 , · · · , Yn )] =
E[g1 (Y1 , · · · , Yn )] + · · · + E[gk (Y1 , · · · , Yn )].
Proof: Similar to the proof of the corresponding properties of expected values in the univariate case.
37 / 65
Properties of Expected Values
Theorem 6: Let X and Y be independent random variables and g(X) and
h(Y ) be functions of only X and Y , respectively. Then
E[g(X)h(Y )] = E[g(X)]E[h(Y )].
Proof: Here we show the continuous case. The discrete case can be proved
similarly.
E[g(X)h(Y )] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x)h(y)f (x, y) dx dy
= ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x)h(y)fX (x)fY (y) dx dy   (by independence)
= [∫_{−∞}^{∞} g(x)fX (x) dx] [∫_{−∞}^{∞} h(y)fY (y) dy]
= E[g(X)]E[h(Y )].
Note: Random variables X and Y could be multivariate.
38 / 65
Properties of Expected Values
Example 12: Let f (x, y) = 6xy², 0 ≤ x ≤ 1, 0 ≤ y ≤ 1. Find the expected
value of XY .
Solution: We have shown that X and Y are independent with
fX (x) = 2x, 0 ≤ x ≤ 1,
fY (y) = 3y², 0 ≤ y ≤ 1.
Then, it is easy to get E(X) = ∫_{0}^{1} 2x² dx = 2/3 and E(Y ) = ∫_{0}^{1} 3y³ dy = 3/4.
Hence, E(XY ) = E(X)E(Y ) = 1/2.
39 / 65
Covariance
If X and Y are random variables with means µX and µY , respectively,
the covariance of X and Y is
Cov(X, Y ) = E[(X − µX )(Y − µY )].
Note:
Positive values indicate that X increases as Y increases; negative values
indicate that X decreases as Y increases.
A zero value of the covariance indicates that the variables are
uncorrelated and that there is no linear dependence between X and Y .
40 / 65
Correlation Coefficient
The value of the covariance depends on the scale of the variables, so it is difficult to
determine at first glance whether a particular covariance is large or small.
This problem can be eliminated by using the correlation coefficient, ρ, a
quantity related to the covariance and defined as
ρ = Cov(X, Y )/(σX σY ) = Cov(X, Y )/√(V ar(X)V ar(Y )).
Note:
The sign of ρ is the same as the sign of the covariance, and the range is
−1 ≤ ρ ≤ 1.
ρ = 1 implies perfect positive linear correlation, with all points falling on
a straight line with positive slope. ρ = −1 implies perfect negative linear
correlation, with all points falling on a straight line with negative slope.
ρ = 0 implies zero covariance and no correlation.
41 / 65
Covariance
Theorem 7: If X and Y are random variables with means µX and µY ,
respectively, then
Cov(X, Y ) = E[(X − µX )(Y − µY )] = E(XY ) − E(X)E(Y ).
Theorem 8: Let a, b, c, d be constants and X, Y be random variables. Then,
Cov(a + bX, c + dY ) = bd Cov(X, Y ).
Theorem 9: If X and Y are independent random variables, then
Cov(X, Y ) = 0.
42 / 65
Covariance
Example 13: Suppose that X and Y are two continuous random variables
with joint pdf f (x, y) = (1/8)xy, 0 < x < 2y < 4. Find the covariance of X
and Y .
Solution:
We found E(XY ) = 32/9, E(X) = 32/15 and E(Y ) = 8/5 in Example 11.
Hence Cov(X, Y ) = E(XY ) − E(X)E(Y ) = 32/9 − (32/15)(8/5) = 32/225.
43 / 65
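Exact rational arithmetic makes the computation in Example 13 easy to double-check (note that 96/675 reduces to 32/225). An illustrative sketch:

```python
# Illustrative exact-arithmetic check of Example 13's covariance.
from fractions import Fraction as F

E_XY, E_X, E_Y = F(32, 9), F(32, 15), F(8, 5)
cov = E_XY - E_X * E_Y
print(cov)  # 32/225
```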
Expected Value of a Vector and Matrix
Let
Y = (Y1 , · · · , Yn )⊤ and X = (Xij ), the n × k matrix with (i, j) entry Xij .
The expected value is defined to be
E(Y) = (E(Y1 ), · · · , E(Yn ))⊤ and E(X) = (E(Xij )), the n × k matrix with (i, j) entry E(Xij ).
44 / 65
Covariance Matrix
Let Y = (Y1 , · · · , Yn )⊤ and U = (U1 , · · · , Uk )⊤ . The covariance matrix is defined to be
Cov(Y, U) = E[(Y − E(Y))(U − E(U))⊤ ], the n × k matrix with (i, j) entry Cov(Yi , Uj ), and
Cov(Y) = E[(Y − E(Y))(Y − E(Y))⊤ ], the n × n matrix with diagonal entries V ar(Yi ) and off-diagonal (i, j) entries Cov(Yi , Yj ).
45 / 65
Expected Value and Variance of Multivariate R.V.
Theorem 10:
For any m × n constant matrix A, l × k constant matrix B, m × 1
constant vector c, l × 1 constant vector d, multivariate random variables
Y = (Y1 , · · · , Yn )⊤ and U = (U1 , · · · , Uk )⊤ , we have
E(AY + c) = AE(Y) + c,
Cov(AY + c, BU + d) = ACov(Y, U)B ⊤ ,
Cov(AY + c) = ACov(Y)A⊤ .
46 / 65
Expected Value and Variance of Multivariate R.V.
Corollary of Theorem 10:
Let Y1 , Y2 , · · · , Yn and X1 , X2 , · · · , Xk be random variables. Define
U1 = Σ_{i=1}^{n} ai Yi  and  U2 = Σ_{j=1}^{k} bj Xj
for constants a1 , a2 , · · · , an and b1 , b2 , · · · , bk . Then the following hold:
(1) E(U1 ) = Σ_{i=1}^{n} ai E(Yi ).
(2) V ar(U1 ) = Σ_{i=1}^{n} ai² V ar(Yi ) + 2 Σ_{1≤i<j≤n} ai aj Cov(Yi , Yj ).
Multinomial Distribution
Suppose that each of n independent, identical trials can result in one of k
distinct cells, with cell probabilities p1 , p2 , · · · , pk , where Σ_{i=1}^{k} pi = 1 and pi > 0 for
i = 1, 2, · · · , k, and let Yi denote the number of trials falling into cell i.
The random variables Y1 , Y2 , · · · , Yk are said to have a
multinomial distribution with parameters n and p1 , p2 , · · · , pk if the joint
probability function of Y1 , Y2 , · · · , Yk is given by
f (y1 , y2 , · · · , yk ) = n!/(y1 ! y2 ! · · · yk !) · p1^{y1} p2^{y2} · · · pk^{yk},
where for each i, yi = 0, 1, 2, · · · , n and Σ_{i=1}^{k} yi = n. We write
Y1 , Y2 , · · · , Yk ∼ M N (n, p1 , p2 , · · · , pk ).
Note:
The binomial distribution is a special case of the multinomial distribution
with k = 2.
57 / 65
Multinomial Distribution
Theorem 12:
If Y1 , Y2 , · · · , Yk have a multinomial distribution with parameters n and
p1 , p2 , · · · , pk , then
1. The marginal distribution of Yi is Bin(n, pi ) so that
E(Yi ) = npi , V ar(Yi ) = npi (1 − pi ).
2. Cov(Ys , Yt ) = −nps pt if s ̸= t.
Note:
Recall that Yi is the number of trials falling into cell i. Imagine all of the cells, excluding
cell i, combined into a single large cell. Then every trial will result in cell i or in a cell other
than cell i, with probabilities pi and 1 − pi , respectively. Thus, Yi ∼ Bin(n, pi ).
The covariance is negative, which is to be expected because a large number of outcomes in
cell s would force the number in cell t to be small.
58 / 65
Multinomial Distribution
Example 16: A fair die is rolled 10 times independently. What is the
probability that three even numbers and two ones come up? Also calculate
the correlation between the number of even numbers and the number of ones.
Solution: Define Y1 = "the number of even numbers", Y2 = "the number of
ones", Y3 = "the number of threes or fives" (i.e., the other outcomes).
Then (Y1 , Y2 , Y3 ) ∼ M N (10, 1/2, 1/6, 1/3), and we have
P (Y1 = 3, Y2 = 2, Y3 = 5) = f (3, 2, 5) = 10!/(3! 2! 5!) · (1/2)³ (1/6)² (1/3)⁵ = 0.0360,
and the correlation is
ρ_{Y1,Y2} = Cov(Y1 , Y2 )/√(V ar(Y1 )V ar(Y2 )) = −np1 p2 /√(np1 (1 − p1 ) · np2 (1 − p2 )) = −p1 p2 /√(p1 (1 − p1 )p2 (1 − p2 ))
= −(1/2 × 1/6)/√(1/2 × 1/2 × 1/6 × 5/6) = −1/√5 = −0.4472.
59 / 65
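Example 16's numbers can be reproduced with the standard library alone. An illustrative sketch (not from the slides):

```python
# Illustrative reproduction of Example 16 (multinomial probability, correlation).
from math import factorial, sqrt, isclose

n, p = 10, (1 / 2, 1 / 6, 1 / 3)
y = (3, 2, 5)

coef = factorial(n) // (factorial(y[0]) * factorial(y[1]) * factorial(y[2]))
prob = coef * p[0] ** y[0] * p[1] ** y[1] * p[2] ** y[2]

# rho = -p1 p2 / sqrt(p1 (1 - p1) p2 (1 - p2)); the n's cancel.
rho = -p[0] * p[1] / sqrt(p[0] * (1 - p[0]) * p[1] * (1 - p[1]))

assert isclose(prob, 0.0360, abs_tol=5e-4)
assert isclose(rho, -1 / sqrt(5))
```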
Multivariate Normal
Distribution
60 / 65
Multivariate Normal Distribution
Univariate normal distribution: Y ∼ N (µ, σ²) with pdf
f (y) = (1/(√(2π) σ)) exp{−(y − µ)²/(2σ²)}.
Bivariate normal distribution:
(Y1 , Y2 )⊤ ∼ N (µ, Σ), where µ = (µ1 , µ2 )⊤ and Σ = [σ1², σ12 ; σ21 , σ2²],
with E(Yi ) = µi and Cov(Y1 , Y2 ) = σ12 = σ21 = ρσ1 σ2 . The joint pdf is
f (y1 , y2 ) = (1/(2π|Σ|^{1/2})) exp{−(1/2)(y1 − µ1 , y2 − µ2 ) Σ^{−1} (y1 − µ1 , y2 − µ2 )⊤}
= (1/(2πσ1 σ2 √(1 − ρ²))) exp{−Q/2},
where
Q = (1/(1 − ρ²)) [(y1 − µ1 )²/σ1² − 2ρ(y1 − µ1 )(y2 − µ2 )/(σ1 σ2 ) + (y2 − µ2 )²/σ2²].
61 / 65
Multivariate Normal Distribution
Multivariate normal distribution:
Y = (Y1 , · · · , Yn )⊤ ∼ N (µ, Σ), where µ = (µ1 , · · · , µn )⊤ and Σ is the
n × n matrix with diagonal entries σi² and (i, j) entries σij = σji , so that
E(Yi ) = µi and Cov(Yi , Yj ) = σij . The joint pdf is
f (y) = f (y1 , · · · , yn ) = (1/((2π)^{n/2} |Σ|^{1/2})) exp{−(1/2)(y − µ)⊤ Σ^{−1} (y − µ)}.
62 / 65
Properties of Multivariate Normal Distribution
1. If Y ∼ N (µ, Σ), then E(Y) = µ, Cov(Y) = Σ.
2. If the covariance (equivalently, the correlation coefficient) of jointly normal
Y1 ∼ N (µ1 , σ1²) and Y2 ∼ N (µ2 , σ2²) is zero, then Y1 and Y2 are independent.
3. For any k × n constant matrix A and k × 1 constant vector b, if
Y ∼ N (µ, Σ), then
AY + b ∼ N (Aµ + b, AΣA⊤ ).
63 / 65
Properties of Multivariate Normal Distribution
4. Any finite-dimensional random vector selected from Y has a multivariate
normal distribution.
For example,
Y1 ∼ N (µ1 , σ1²),
(Y1 , Y2 )⊤ ∼ N ((µ1 , µ2 )⊤ , [σ1², σ12 ; σ21 , σ2²]),
(Yt , Ys )⊤ ∼ N ((µt , µs )⊤ , [σt², σts ; σst , σs²]) when t ≠ s.
64 / 65
Conclusion
• Know how to obtain joint, marginal and conditional pmf and pdf.
• Be able to calculate expected value, covariance, correlation coefficient,
conditional expectation and be familiar with their properties.
• Understand multinomial distribution and multivariate normal
distribution and their properties.
65 / 65
Chapter 6: Functions of Random Variables
STAT6039 Principles of Mathematical Statistics
Functions of Discrete Random Variables
Example 1: A coin is tossed twice. Let Y be the number of heads that come
up. Find the distribution of X = 3Y − 1.
Solution: We know Y ∼ Bin(2, 0.5), so its pmf is
fY (y) = 1/4 for y = 0, 1/2 for y = 1, and 1/4 for y = 2.
If y = 0, then x = 3 × 0 − 1 = −1.
If y = 1, then x = 3 × 1 − 1 = 2.
If y = 2, then x = 3 × 2 − 1 = 5.
So the pmf of X is fX (x) = 1/4 for x = −1, 1/2 for x = 2, and 1/4 for x = 5
(the same probabilities but different values).
1 / 42
Functions of Discrete Random Variables
Note that the previous example involves a one-to-one correspondence between x
and y values. This made the solution fairly easy. The following is a more
general result.
General Result:
Suppose that Y is a discrete random variable and X is a function of Y ,
denoted by X = g(Y ). Then X is a discrete random variable with pmf given
by
fX (x) = Σ_{y: g(y)=x} fY (y).
2 / 42
Functions of Discrete Random Variables
Example 2: Let Y ∼ Bin(2, 0.5). Find the pmf of U = (Y − 1)².
Solution:
The correspondence is as follows:
If y = 1, then u = 0.
If y = 0 or 2, then u = 1.
The pmf of U is
fU (0) = Σ_{y: g(y)=0} fY (y) = fY (1) = 1/2,
fU (1) = Σ_{y: g(y)=1} fY (y) = fY (0) + fY (2) = 1/4 + 1/4 = 1/2.
In summary, the pmf of U is fU (u) = 1/2 for u = 0 and 1/2 for u = 1, i.e.,
U ∼ Bern(0.5).
3 / 42
Functions of Discrete Random Variables
Example 3: We roll two dice. Find the pmf of the absolute difference
between the two numbers that come up.
Solution:
Let Yi be the number which comes up on the ith die (i = 1, 2). We wish
to find the pmf of the absolute difference between Y1 and Y2 , namely
D = |Y1 − Y2 |.
So
fD (d) = Σ_{y1 ,y2 : g(y1 ,y2 )=d} f (y1 , y2 ),
where f (y1 , y2 ) = 1/36, y1 , y2 ∈ {1, 2, · · · , 6}.
4 / 42
Functions of Discrete Random Variables
Solution (continued):
It is convenient to do this calculation by tabulating all 36 equally likely
outcomes. Doing so, the pmf of D is
fD (d) = 6/36 for d = 0, 10/36 for d = 1, 8/36 for d = 2, 6/36 for d = 3,
4/36 for d = 4, and 2/36 for d = 5.
5 / 42
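The tabulation behind Example 3 can be automated by enumerating all 36 equally likely outcomes. An illustrative Python sketch:

```python
# Illustrative enumeration for Example 3: pmf of D = |Y1 - Y2| for two dice.
from fractions import Fraction as F
from collections import Counter

counts = Counter(abs(a - b) for a in range(1, 7) for b in range(1, 7))
pmf = {d: F(c, 36) for d, c in sorted(counts.items())}
print(pmf)
```

`Fraction` reduces automatically, so 6/36 prints as 1/6 and so on; the counts match the table above.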
Functions of Continuous Random Variables
There are three main strategies for the continuous case.
• the cdf method
• the transformation method
• the mgf method.
6 / 42
The CDF Method
7 / 42
The CDF Method
Let U = g(Y1 , Y2 , · · · Yn ) be a function of the continuous random
variables Y1 , Y2 , · · · , Yn . The pdf of U can be found by the following steps.
1. Find the region U ≤ u in the (y1 , y2 , · · · , yn ) space.
2. Find FU (u) = P (U ≤ u) by integrating f (y1 , y2 , ..., yn ) over the
region U ≤ u.
3. Find the probability density function fU (u) by differentiating FU (u).
Thus, fU (u) = dFU (u)/du.
8 / 42
The CDF Method
Example 4: Suppose that Y ∼ U (0, 2). Find the pdf of X = 3Y − 1.
Solution:
FX (x) = P (X ≤ x) = P (3Y − 1 ≤ x) = P (Y ≤ (x + 1)/3).
If x < −1, then (x + 1)/3 < 0 so that FX (x) = P (Y ≤ (x + 1)/3) = 0.
If x > 5, then (x + 1)/3 > 2 so that FX (x) = P (Y ≤ (x + 1)/3) = 1.
If −1 ≤ x ≤ 5, FX (x) = P (Y ≤ (x + 1)/3) = ∫_{0}^{(x+1)/3} (1/2) dy = (x + 1)/6.
Thus, fX (x) = 0 for x < −1, 1/6 for −1 ≤ x ≤ 5, and 0 for x > 5.
So X ∼ U (−1, 5).
9 / 42
The CDF Method
Example 5: Suppose that X, Y ∼ iid U (0, 1). Find the pdf of U = X + Y .
Solution: Since X and Y are independent, the joint pdf of X and Y is
f (x, y) = fX (x)fY (y) = 1, 0 < x < 1, 0 < y < 1.
So the cdf of U is FU (u) = P (X + Y ≤ u) = ∫∫_{x+y≤u} 1 dx dy.
If u ≤ 0, FU (u) = 0, and if u ≥ 2, FU (u) = 1.
If 0 < u < 1, FU (u) = ∫_{0}^{u} ∫_{0}^{u−y} 1 dx dy = u²/2.
If 1 ≤ u < 2, FU (u) = 1 − P (U > u) = 1 − ∫_{u−1}^{1} ∫_{u−y}^{1} 1 dx dy = −u²/2 + 2u − 1.
The pdf of U can be obtained by differentiating FU (u). Thus,
fU (u) = u for 0 < u < 1, 2 − u for 1 ≤ u < 2, and 0 otherwise.
11 / 42
The CDF Method
Example 6: Find the pdf of U = g(Y ) = Y ², where Y is a continuous
random variable with cdf FY (y) and pdf fY (y).
Solution:
If u ≤ 0, FU (u) = P (U ≤ u) = P (Y ² ≤ u) = 0.
If u > 0, FU (u) = P (U ≤ u) = P (Y ² ≤ u) = P (−√u < Y < √u) = FY (√u) − FY (−√u).
Differentiating with respect to u, we have, if u > 0,
fU (u) = fY (√u) · (1/(2√u)) − fY (−√u) · (−1/(2√u)).
To summarize, we get
fU (u) = (1/(2√u)){fY (√u) + fY (−√u)} for u > 0, and 0 otherwise.
12 / 42
The Transformation Method
13 / 42
The Transformation Method
Let U = g(Y ), where g(y) is an increasing function of y for all y such that
fY (y) > 0. Then we have
FU (u) = P (U ≤ u) = P (g(Y ) ≤ u) = P (Y ≤ g⁻¹(u)) = FY [g⁻¹(u)].
Thus,
fU (u) = fY [g⁻¹(u)] · d[g⁻¹(u)]/du.
If g(y) is a decreasing function of y for all y such that fY (y) > 0, we have
FU (u) = P (U ≤ u) = P (g(Y ) ≤ u) = P (Y ≥ g⁻¹(u)) = 1 − FY [g⁻¹(u)].
Thus,
fU (u) = −fY [g⁻¹(u)] · d[g⁻¹(u)]/du.
14 / 42
The Transformation Method
Let U = g(Y ), where g(y) is either an increasing or a decreasing function
of y for all y such that fY (y) > 0.
1. Find the inverse function, y = g⁻¹(u).
2. Evaluate d[g⁻¹(u)]/du.
3. Find fU (u) by
fU (u) = fY [g⁻¹(u)] · |d[g⁻¹(u)]/du|.
Note: It is a "shortcut version" of the cdf method.
15 / 42
The Transformation Method
Example 7: Suppose that Y ∼ U (0, 2). Find the pdf of X = 3Y − 1.
Solution: Since Y ranges from 0 to 2, X = 3Y − 1 ranges from −1 to 5.
As X = g(Y ) = 3Y − 1, g(y) is an increasing function of y for all
0 < y < 2.
It is easy to get y = g⁻¹(x) = (x + 1)/3 and d[g⁻¹(x)]/dx = 1/3.
Since fY (y) = 1/2, we have
fX (x) = fY (g⁻¹(x)) · |dg⁻¹(x)/dx| = (1/2)(1/3) = 1/6, −1 < x < 5.
So X ∼ U (−1, 5).
16 / 42
The Transformation Method
Example 8: Suppose that Y ∼ N (µ, σ²). Find the pdf of Z = (Y − µ)/σ.
Solution:
Since Z = (Y − µ)/σ, we know z = g(y) = (y − µ)/σ is an increasing function of y
for all −∞ < y < ∞.
Also it is easy to get y = g⁻¹(z) = σz + µ and d[g⁻¹(z)]/dz = σ.
Since fY (y) = (1/(√(2π) σ)) e^{−(y−µ)²/(2σ²)}, −∞ < y < ∞, we have
fZ (z) = fY (g⁻¹(z)) · |dg⁻¹(z)/dz|
= (1/(√(2π) σ)) e^{−(σz+µ−µ)²/(2σ²)} × |σ|
= (1/√(2π)) e^{−z²/2}, −∞ < z < ∞.
So Z ∼ N (0, 1).
17 / 42
The MGF Method
18 / 42
The MGF Method
Theorem 1 (Uniqueness of MGF):
Let mX (t) and mY (t) denote the moment generating functions of
random variables X and Y , respectively. If both moment generating
functions exist and mX (t) = mY (t) for all values of t, then X and Y have
the same probability distribution.
Note: The proof is beyond the scope of this course.
19 / 42
The MGF Method
Let U be a function of the random variables Y1 , Y2 , · · · , Yn .
1. Find the moment generating function of U , mU (t) = E(etU ).
2. Compare mU (t) with other well-known moment generating functions.
If mU (t) = mV (t) for all values of t, Theorem 1 implies that U and V have
identical distributions.
20 / 42
The MGF Method
Example 9: Find the probability distribution of U = Z², where
Z ∼ N (0, 1).
Solution:
mU (t) = E(e^{tU}) = E(e^{tZ²}) = ∫_{−∞}^{∞} e^{tz²} · (1/√(2π)) e^{−z²/2} dz
= ∫_{−∞}^{∞} (1/√(2π)) e^{−z²(1−2t)/2} dz
= ∫_{−∞}^{∞} (1/√(2π)) e^{−z²/(2σ²)} dz,  where σ² = (1 − 2t)⁻¹, t < 1/2,
= σ ∫_{−∞}^{∞} (1/(√(2π) σ)) e^{−z²/(2σ²)} dz
= σ  (since the last integral equals 1).
Thus, mU (t) = (1 − 2t)^{−1/2}. Comparing this mgf with those well-known
mgfs, we can get U ∼ χ²(1), i.e. G(α = 1/2, β = 2).
21 / 42
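The integral evaluated in Example 9 can be checked numerically for a particular t < 1/2. The sketch below (illustrative, not from the slides) integrates e^{tz²}ϕ(z) over [−10, 10], which is effectively the whole real line here:

```python
# Illustrative numeric check: E(e^{t Z^2}) = (1 - 2t)^{-1/2} for Z ~ N(0, 1).
from math import exp, pi, sqrt

def m_U(t, n=200000, a=10.0):
    """Midpoint Riemann sum of exp(t z^2) * phi(z) over [-a, a]."""
    h = 2 * a / n
    return sum(exp(t * z * z) * exp(-z * z / 2) / sqrt(2 * pi)
               for z in ((-a + (i + 0.5) * h) for i in range(n))) * h

t = 0.25
assert abs(m_U(t) - (1 - 2 * t) ** -0.5) < 1e-6
```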
The MGF Method
Theorem 2 :
1. If X = a + bY , then
mX (t) = e^{at} mY (bt).
2. Let Y1 , Y2 , · · · , Yn be independent random variables and
U = Y1 + Y2 + · · · + Yn . Then
mU (t) = Π_{i=1}^{n} mYi (t) = mY1 (t) × mY2 (t) × · · · × mYn (t).
22 / 42
The MGF Method
Example 10: Find the probability distribution of Y = µ + σZ, where
Z ∼ N (0, 1).
Solution: Since Z ∼ N (0, 1), we have mZ (t) = e^{t²/2}.
So
mY (t) = e^{µt} mZ (σt) = e^{µt} e^{(σt)²/2} = e^{µt + σ²t²/2}.
This mgf mY (t) is just the mgf of N (µ, σ²).
Hence, Y ∼ N (µ, σ²).
23 / 42
The MGF Method
Example 11: Suppose that Y1 , Y2 , · · · , Yn are independent normally
distributed random variables with parameters µi and σi², respectively. Find
the distribution of X = Σ_{i=1}^{n} ai Yi (a linear combination).
Solution:
For each i = 1, · · · , n, the mgf of Yi is mYi (t) = e^{µi t + σi² t²/2}.
Then,
mX (t) = m_{a1 Y1}(t) m_{a2 Y2}(t) · · · m_{an Yn}(t) = mY1 (a1 t) mY2 (a2 t) · · · mYn (an t)
= e^{a1 µ1 t + σ1²(a1 t)²/2} e^{a2 µ2 t + σ2²(a2 t)²/2} · · · e^{an µn t + σn²(an t)²/2}
= e^{(Σ_{i=1}^{n} ai µi ) t + (1/2)(Σ_{i=1}^{n} ai² σi²) t²}.
Therefore, X ∼ N (Σ_{i=1}^{n} ai µi , Σ_{i=1}^{n} ai² σi²).
24 / 42
The MGF Method
Example 12: Suppose that Y1 , Y2 , · · · , Yn are independent gamma
distributed random variables with parameters αi and β, respectively. Find the
distribution of X = Σ_{i=1}^{n} Yi = Y1 + Y2 + · · · + Yn .
Solution:
For each i = 1, · · · , n, the mgf of Yi is mYi (t) = (1 − βt)^{−αi}.
Then,
mX (t) = mY1 (t) × mY2 (t) × · · · × mYn (t) = (1 − βt)^{−α1} (1 − βt)^{−α2} · · · (1 − βt)^{−αn} = (1 − βt)^{−Σ_{i=1}^{n} αi}.
Therefore, X ∼ G(Σ_{i=1}^{n} αi , β).
25 / 42
The MGF Method
Important Properties:
1. If Yi ∼ χ²(ni ), i = 1, · · · , n, and all Yi 's are independent, then
Σ_{i=1}^{n} Yi ∼ χ²(Σ_{i=1}^{n} ni ). For example, if Yi ∼ iid χ²(1), i = 1, · · · , n,
then Σ_{i=1}^{n} Yi ∼ χ²(n).
2. If Yi ∼ iid Exp(β), then Σ_{i=1}^{n} Yi ∼ G(n, β).
3. Let Yi ∼ N (µi , σi²), i = 1, · · · , n, and assume all Yi 's are independent.
Define
Zi = (Yi − µi )/σi , i = 1, · · · , n.
Then Σ_{i=1}^{n} Zi² ∼ χ²(n).
26 / 42
Order Statistics
27 / 42
Order Statistics
Let Y1 , Y2 , · · · , Yn denote independent continuous random variables with
common cdf F (y) and pdf f (y).
We denote the ordered random variables of {Yi , i = 1, · · · , n} by
Y(1) , Y(2) , · · · , Y(n) , where Y(1) ≤ Y(2) ≤ · · · ≤ Y(n) . (Because the random
variables are continuous, ties occur with probability zero, so the equality
signs can be ignored.) Using this notation,
Y(1) = min(Y1 , Y2 , · · · , Yn ) is the minimum of the random variables
{Yi , i = 1, · · · , n},
and
Y(n) = max(Y1 , Y2 , · · · , Yn ) is the maximum of the random variables
{Yi , i = 1, · · · , n}.
We call Y(k) the kth order statistic.
28 / 42
Order Statistics
The pdf of Y(1) and the pdf of Y(n) can be found using the cdf method.
FY(n) (y) = P (Y(n) ≤ y) = P (Y1 ≤ y, · · · , Yn ≤ y) = P (Y1 ≤ y)P (Y2 ≤ y) · · · P (Yn ≤ y) = [F (y)]ⁿ.
FY(1) (y) = 1 − P (Y(1) > y) = 1 − P (Y1 > y, · · · , Yn > y) = 1 − P (Y1 > y)P (Y2 > y) · · · P (Yn > y) = 1 − [1 − F (y)]ⁿ.
The pdf of Y(n) is fY(n) (y) = n[F (y)]^{n−1} f (y).
The pdf of Y(1) is fY(1) (y) = n[1 − F (y)]^{n−1} f (y).
29 / 42
Order Statistics
Let us now consider the case n = 2, so we only have two random
variables Y1 and Y2 . We would like to find the joint pdf of Y(1) and Y(2) . For
any y1 ≤ y2 ,
FY(1) ,Y(2) (y1 , y2 ) = P [(Y1 ≤ y1 , Y2 ≤ y2 ) ∪ (Y2 ≤ y1 , Y1 ≤ y2 )]
= P (Y1 ≤ y1 , Y2 ≤ y2 ) + P (Y2 ≤ y1 , Y1 ≤ y2 ) − P (Y1 ≤ y1 , Y2 ≤ y1 )
= F (y1 )F (y2 ) + F (y2 )F (y1 ) − F (y1 )F (y1 )
= 2F (y1 )F (y2 ) − [F (y1 )]².
The joint pdf of Y(1) and Y(2) can be obtained by differentiating
FY(1) ,Y(2) (y1 , y2 ) first with respect to y2 and then with respect to y1 , which is
fY(1) ,Y(2) (y1 , y2 ) = 2f (y1 )f (y2 ),  y1 ≤ y2 .
30 / 42
Order Statistics
Theorem 3: Let Y1 , . . . , Yn be independent identically distributed
continuous random variables with cdf F (y) and pdf f (y). Then the pdf of
Y(k) is given by
fY(k) (yk ) = n!/[(k − 1)!(n − k)!] · [F (yk )]^{k−1} [1 − F (yk )]^{n−k} f (yk ), −∞ < yk < ∞.
And the joint pdf of Y(j) and Y(k) (1 ≤ j < k ≤ n) is
fY(j) ,Y(k) (yj , yk ) = n!/[(j − 1)!(k − 1 − j)!(n − k)!] · [F (yj )]^{j−1}
× [F (yk ) − F (yj )]^{k−1−j} × [1 − F (yk )]^{n−k} f (yj ) f (yk ),  −∞ < yj < yk < ∞.
The joint pdf of Y(1) , . . . , Y(n) is
fY(1) ,··· ,Y(n) (y1 , · · · , yn ) = n! f (y1 ) · · · f (yn ),  −∞ < y1 < · · · < yn < ∞.
31 / 42
Order Statistics
Example 13: Electronic components of a certain type have a length of life
Y ∼ Exp(100), measured in hours. Suppose that two components operate
independently and in series in a certain system (hence, the system fails when
either component fails). Find the probability distribution of X, the length of
life of the system.
Solution: Because the system fails at the first component failure,
X = min(Y1 , Y2 ), where Y1 and Y2 are independent random variables with
the same pdf f (y) = (1/100)e−y/100 and cdf F (y) = 1 − e−y/100 , y > 0.
Then,
fX (y) = fY(1) (y) = n[1 − F (y)]^{n−1} f (y) = 2e^{−y/100} · (1/100)e^{−y/100} = (1/50)e^{−y/50},  y > 0.
Hence, X ∼ Exp(50).
32 / 42
Order Statistics
Example 14: Suppose that the components in Example 13 operate in
parallel (hence, the system does not fail until both components fail). Find the
pdf of X, the length of life of the system.
Solution: Now X = max(Y1 , Y2 ). Then
fX (y) = fY(2) (y) = n[F (y)]^{n−1} f (y) = 2(1 − e^{−y/100}) · (1/100)e^{−y/100} = (1/50)(e^{−y/100} − e^{−y/50}),  y > 0.
Hence, the maximum of two exponential random variables is not an
exponential random variable.
33 / 42
Order Statistics
Example 15: Suppose that Y1 , · · · , Y5 ∼ iid U (0, 1). Find the pdf of Y(2) .
Also, give the joint pdf of Y(2) and Y(4) .
Solution: Since Y1 , · · · , Y5 ∼ iid U (0, 1), we have f (y) = 1, 0 < y < 1
and F (y) = y, 0 < y < 1.
The pdf of Y(2) can be obtained directly from Theorem 3 with
n = 5, k = 2. So
fY(2) (y2 ) = 5!/[(2 − 1)!(5 − 2)!] · [F (y2 )]^{2−1} [1 − F (y2 )]^{5−2} f (y2 ) = 20 y2 (1 − y2 )³,  0 < y2 < 1.
Hence, Y(2) ∼ Beta(2, 4).
In general, the kth order statistic of Y1 , · · · , Yn ∼ iid U (0, 1) has a beta
distribution with α = k and β = n − k + 1.
34 / 42
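The Beta(2, 4) conclusion of Example 15 is easy to verify numerically. An illustrative sketch comparing 20y(1 − y)³ with the Beta(2, 4) density and checking that it integrates to 1:

```python
# Illustrative check of Example 15: pdf of Y_(2) vs the Beta(2, 4) density.
from math import gamma

def f2(y):
    return 20 * y * (1 - y) ** 3

def beta_pdf(y, a=2, b=4):
    return gamma(a + b) / (gamma(a) * gamma(b)) * y ** (a - 1) * (1 - y) ** (b - 1)

n = 100000
h = 1 / n
total = sum(f2((i + 0.5) * h) for i in range(n)) * h   # midpoint Riemann sum
assert abs(total - 1.0) < 1e-6
assert all(abs(f2(y) - beta_pdf(y)) < 1e-9 for y in (0.1, 0.5, 0.9))
```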
Order Statistics
Solution (continued):
The joint pdf of Y(2) and Y(4) can also be obtained from Theorem 3 with
n = 5, j = 2, k = 4. So it has the form
fY(2) ,Y(4) (y2 , y4 ) = 5!/[(2 − 1)!(4 − 1 − 2)!(5 − 4)!] · [F (y2 )]^{2−1} × [F (y4 ) − F (y2 )]^{4−1−2} × [1 − F (y4 )]^{5−4} f (y2 ) f (y4 )
= 120 y2 (y4 − y2 )(1 − y4 ),  0 < y2 < y4 < 1.
This joint density can be used to evaluate joint probabilities about Y(2) and
Y(4) or to evaluate the expected value of functions of these two variables.
35 / 42
Range Restricted Distributions
36 / 42
Range Restricted Distributions
Suppose we have a random variable Y . If some restrictions are put on the
range of Y , then the new random variable X has a range restricted
distribution.
Example 16: Suppose that the number of accidents which occur each year at
a certain intersection follows a Poisson distribution with mean λ. Find the
pmf and expectation of the number of accidents at this intersection last year
if it is known that at least one accident occurred at the intersection during
that year.
37 / 42
Range Restricted Distributions
Solution: Let Y be the number of accidents at the intersection last year so
that Y ∼ P oi(λ). Then the random variable of interest is X = (Y |Y > 0),
and the pmf of X is
fX (x) = P (X = x) = P (Y = x|Y > 0) = P (Y = x, Y > 0)/P (Y > 0)
= P (Y = x)/(1 − P (Y = 0)) = (λ^x e^{−λ}/x!)/(1 − e^{−λ}),  x = 1, 2, · · ·
Also, it is easy to see that 0 = fX (0) < fY (0) = e^{−λ}, and since
1 − e^{−λ} < 1, we have fX (x) > fY (x) for x = 1, 2, · · · .
38 / 42
Range Restricted Distributions
Solution (continued):
E(X) = E(Y |Y > 0) = Σ_{x=1}^{∞} x · (λ^x e^{−λ}/x!)/(1 − e^{−λ})
= (1/(1 − e^{−λ})) Σ_{x=0}^{∞} x · λ^x e^{−λ}/x!   (where the first term in the sum is 0)
= E(Y )/(1 − e^{−λ})
= λ/(1 − e^{−λ}).
So E(X) > E(Y ).
39 / 42
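The closed form E(Y|Y > 0) = λ/(1 − e^{−λ}) can be compared against a direct evaluation of the truncated series. An illustrative sketch for λ = 2:

```python
# Illustrative check of Example 16: truncated-Poisson mean, series vs closed form.
from math import exp, factorial, isclose

lam = 2.0
series = sum(x * lam ** x * exp(-lam) / factorial(x)
             for x in range(1, 60)) / (1 - exp(-lam))   # tail beyond 60 is negligible
closed = lam / (1 - exp(-lam))
assert isclose(series, closed, rel_tol=1e-12)
```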
Range Restricted Distributions
Example 17: Y ∼ P oi(λ). Find the pmf of X = (Y |Y > 1).
Solution: Using the same logic as in Example 16, we get
fX (x) = (λ^x e^{−λ}/x!)/(1 − e^{−λ} − λe^{−λ}),  x = 2, 3, · · ·
Example 18: Y ∼ P oi(λ). Find the pmf of X = Y I(Y > 1).
Solution: Here X = Y × I(Y > 1) = 0 if Y ≤ 1, and Y if Y > 1.
Since P (X = 0) = P (Y ≤ 1) = e^{−λ} + λe^{−λ}, we have
fX (x) = e^{−λ} + λe^{−λ} for x = 0, and λ^x e^{−λ}/x! for x = 2, 3, · · · .
Note: These two kinds of range restrictions are different.
40 / 42
Range Restricted Distributions
Example 19: Z ∼ N (0, 1). Find the pdf of X = (Z|Z > 0).
Solution: fX (x) = fZ (x)/P (Z > 0) = 2ϕ(x) = (2/√(2π)) e^{−x²/2},  x > 0.
Example 20: Z ∼ N (0, 1). Find the pdf of X = Z I(Z > 0).
Solution: Here X = Z × I(Z > 0) = 0 if Z ≤ 0, and Z if Z > 0.
Since P (Z ≤ 0) = 1/2, we have
fX (x) = 1/2 for x = 0 (discrete), and ϕ(x) = (1/√(2π)) e^{−x²/2} for x > 0 (continuous).
So X here has a mixed distribution.
Note: We could use the notation X = max(0, Z) in Example 20.
41 / 42
Conclusion
• Know when and how to use the three methods to derive the
distribution of functions of random variables.
• Be able to obtain the marginal or joint distribution of order statistics by using
the formulae.
• Be able to get the range restricted distributions.
• Be familiar with those important results, like properties of gamma
distribution, normal distributions, chi-square distributions and so on.
42 / 42
Chapter 7: Sampling Distributions and
the Central Limit Theorem
STAT6039 Principles of Mathematical Statistics
Population and Random Sample
Population:
• Every observation of interest available in the physical world.
• In most cases, we are interested in some unknown population
parameters, e.g., population mean µ, population variance σ 2 , etc.
Random Sample:
• A selection of observations drawn randomly from the population of
interest.
• Denoted as Y1 , · · · , Yn . We assume they are independent and identically
distributed (iid) random variables throughout this course.
• We can use the sample mean Ȳ = Σ_{i=1}^{n} Yi / n to estimate µ, and the
sample variance S² = Σ_{i=1}^{n} (Yi − Ȳ )² / (n − 1) to estimate σ².
1 / 45
Statistics
A statistic is a function of the observable random variables in a sample
and known constants.
• The sample mean Ȳ is a function of the random variables Y1 , · · · , Yn and
the (constant) sample size n, so it is a statistic.
• Other examples: the sample variance S² = Σ_{i=1}^{n} (Yi − Ȳ )²/(n − 1), the order
statistics Y(n) = max(Y1 , · · · , Yn ) and Y(1) = min(Y1 , · · · , Yn ), the range
Y(n) − Y(1) , etc.
• Σ_{i=1}^{n} (Yi − µ)² is not a statistic, since it involves the unknown parameter µ.
2 / 45
Sampling Distributions
Normally we use these statistics to make inferences about population parameters. All statistics are functions of the random variables observed in a sample, so if we take a new sample, we will probably get a different value of the statistic. Therefore, all statistics are random variables; there is variability associated with them.
Consequently, all statistics have probability distributions, which we call their sampling distributions. Knowing the sampling distribution of a statistic tells us how it varies from sample to sample and how accurately we are estimating a population parameter.
3 / 45
Sampling Distribution
Example 1: Suppose we roll a fair four-sided die and let Y be the number
that comes up. Let’s repeatedly draw samples of size 2, i.e., roll the die
twice. Find the sampling distribution of the sample mean Ȳ .
Solution: First, the following table lists the pmf of Y .

y      1    2    3    4
f(y)  1/4  1/4  1/4  1/4

We can calculate the population mean and population variance from the probability distribution:
µ = E(Y ) = 5/2 and σ² = Var(Y ) = 5/4.
4 / 45
Sampling Distribution
Solution (continued): Then we calculate the sample mean ȳ for all possible samples of size 2 (rows: first roll; columns: second roll):

First \ Second    1     2     3     4
1                1.0   1.5   2.0   2.5
2                1.5   2.0   2.5   3.0
3                2.0   2.5   3.0   3.5
4                2.5   3.0   3.5   4.0
Note that each possible sample of size 2 has equal probability of being
drawn.
5 / 45
Sampling Distribution
Solution (continued): We can derive the sampling distribution of Ȳ from
the previous table:
ȳ      1     1.5    2     2.5    3     3.5    4
f(ȳ)  1/16  2/16  3/16  4/16  3/16  2/16  1/16

Because we could completely list all possible samples of size 2, this sampling distribution could be determined exactly.
We can also get E(Ȳ) = 5/2 = µ and Var(Ȳ) = 5/8 = σ²/2.
6 / 45
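Because the sample space is small, Example 1 can be checked exactly by enumeration. A Python sketch (not part of the notes) listing all 16 equally likely samples of size 2 and confirming E(Ȳ) = 5/2 and Var(Ȳ) = 5/8:

```python
from itertools import product
from fractions import Fraction

# Enumerate all 16 equally likely samples of size 2 from a fair
# four-sided die and compute the sample mean of each.
faces = [1, 2, 3, 4]
means = [Fraction(a + b, 2) for a, b in product(faces, repeat=2)]

e_ybar = sum(means) / len(means)                          # E(Ybar)
var_ybar = sum((m - e_ybar) ** 2 for m in means) / len(means)

print(e_ybar)    # 5/2, the population mean mu
print(var_ybar)  # 5/8, i.e. sigma^2 / n with sigma^2 = 5/4, n = 2
```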
Sampling Distribution
What happens when we can’t list all possible samples of a certain size or
the number of possible samples of a certain size is too large?
One approach is as follows.
• Randomly draw a sample of size n and calculate the sample statistic.
• Repeat the step above m times to obtain m sample statistics in total, e.g., {Ȳ⁽ᵏ⁾, k = 1, · · · , m}.
• Draw a histogram of the m sample statistics; the relative frequencies can be used to approximate the sampling distribution.
https://college.cengage.com/nextbook/statistics/
wackerly_966371/student/html/
7 / 45
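The resampling recipe above can be sketched in a few lines of Python (a stdlib-only illustration, not from the notes), using the four-sided die of Example 1 as the population so the answer is known:

```python
import random

# Repeated sampling: draw m samples of size n, compute each sample
# mean, and use the relative frequencies of the m means to
# approximate the sampling distribution of Ybar.
random.seed(1)
n, m = 2, 100_000
ybars = []
for _ in range(m):
    sample = [random.choice([1, 2, 3, 4]) for _ in range(n)]
    ybars.append(sum(sample) / n)

# The relative frequency approximates the exact sampling
# distribution, e.g. P(Ybar = 2.5) = 4/16 = 0.25 from Example 1.
freq_25 = ybars.count(2.5) / m
print(round(freq_25, 2))  # close to 0.25
```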
Sampling Distribution
What we find from the experiment can be summarized as follows.
• As the sample size n increases, the histogram of all the observations in a sample begins to look more like the distribution of the original population.
• The sample mean of the sample means is very close to the population mean µ, and the sample variance of the sample means gets smaller as the sample size n increases.
• As the sample size n increases, the histogram of the sample means looks more like a normal distribution. This phenomenon is due to the Central Limit Theorem.
8 / 45
Properties of Sample Mean
Theorem 1: Suppose Y1 , · · · , Yn are iid with mean µ and finite variance σ². Then,
µ_Ȳ = E(Ȳ) = µ and σ²_Ȳ = Var(Ȳ) = σ²/n.
Proof:
µ_Ȳ = E(Ȳ) = E[(Y1 + · · · + Yn )/n] = E(Y1 )/n + · · · + E(Yn )/n = µ.
σ²_Ȳ = Var(Ȳ) = Var[(Y1 + · · · + Yn )/n]
= Var(Y1 )/n² + · · · + Var(Yn )/n² (by independence)
= σ²/n² + · · · + σ²/n² = σ²/n.
Note: The more data we have, the more accurately the sample mean Ȳ estimates the true population mean µ.
9 / 45
Sampling Distributions
Related to
Normal Distribution
10 / 45
Sampling Distribution of Sample Mean (known σ)
Theorem 2: Suppose Y1 , · · · , Yn ∼ iid N (µ, σ²). Then,
Ȳ = Σᵢ₌₁ⁿ Yᵢ / n ∼ N (µ, σ²/n), i.e. √n(Ȳ − µ)/σ ∼ N (0, 1).
Proof: We know Ȳ = Σᵢ₌₁ⁿ Yᵢ / n is a linear combination of the independent normally distributed random variables Y1 , · · · , Yn with weights 1/n. According to the result of Example 11 in Chapter 6, Ȳ follows a normal distribution with mean
µ_Ȳ = E(Ȳ) = E[(Y1 + · · · + Yn )/n] = µ
and variance
σ²_Ȳ = Var(Ȳ) = Var[(Y1 + · · · + Yn )/n] = σ²/n.
Thus, Ȳ ∼ N (µ, σ²/n), i.e. √n(Ȳ − µ)/σ ∼ N (0, 1).
11 / 45
Sampling Distribution of Sample Mean (known σ)
Example 2: A bottling machine can be regulated so that it discharges an
average of µ ml per bottle. It has been observed that the amount of fill
dispensed by the machine is normally distributed with σ = 1ml. A random
sample of n = 9 filled bottles is selected from the output of the machine.
Find the probability that the sample mean will be within 0.3ml of the true
mean µ for the chosen machine setting.
Solution: Let Yi be the volume of the ith bottle in the sample, i = 1, · · · , n,
where n = 9. Then Y1 , · · · , Yn ∼ iid N (µ, σ 2 ) where σ 2 = 1 and µ is
unknown.
So we have Ȳ ∼ N (µ, σ 2 /9).
12 / 45
Sampling Distribution of Sample Mean (known σ)
Solution (continued): Hence
P (|Ȳ − µ| ≤ 0.3) = P (−0.3/(σ/√n) ≤ (Ȳ − µ)/(σ/√n) ≤ 0.3/(σ/√n))
= P (−0.3/(1/√9) ≤ Z ≤ 0.3/(1/√9))
= P (−0.9 ≤ Z ≤ 0.9)
= 1 − 2P (Z > 0.9)
= 1 − 2 × 0.1841
= 0.6318.
Note: This calculation does not depend on µ.
13 / 45
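Example 2 can be reproduced without a z-table using Python's standard library (a sketch; the notes use tables and R):

```python
from statistics import NormalDist

# Example 2: with sigma = 1 and n = 9,
# P(|Ybar - mu| <= 0.3) = P(-0.9 <= Z <= 0.9) for Z ~ N(0, 1).
Z = NormalDist()  # standard normal
p = Z.cdf(0.9) - Z.cdf(-0.9)
print(round(p, 4))  # ~0.6319 (the table value 0.1841 gives 0.6318)
```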
Sampling Distribution of Sample Mean (known σ)
Example 3: Refer to Example 2. How large n should be if we wish Ȳ to be
within 0.3ml of µ with probability 0.95?
Solution: Now we want
0.95 = P (|Ȳ − µ| ≤ 0.3)
= P (−0.3/(σ/√n) ≤ (Ȳ − µ)/(σ/√n) ≤ 0.3/(σ/√n))
= P (−0.3√n ≤ Z ≤ 0.3√n)
= 1 − 2P (Z > 0.3√n).
Thus, we need to find n such that P (Z > 0.3√n) = 0.025. From the z-table, we know P (Z > 1.96) = 0.025. So n = (1.96/0.3)² = 42.68. However, it is impossible to take a sample of size 42.68; our solution indicates that a sample of size 42 is not quite large enough to reach our objective. So n should be 43, and then P (|Ȳ − µ| ≤ 0.3) will slightly exceed 0.95.
14 / 45
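The sample-size calculation of Example 3 in stdlib Python (a sketch; `inv_cdf` plays the role of the z-table):

```python
import math
from statistics import NormalDist

# Example 3: we need P(Z > 0.3*sqrt(n)) = 0.025, i.e.
# 0.3*sqrt(n) = z_{0.025}, so n = (z_{0.025}/0.3)^2, rounded up.
z = NormalDist().inv_cdf(0.975)  # upper 0.025 quantile, ~1.96
n_exact = (z / 0.3) ** 2         # ~42.68
n = math.ceil(n_exact)           # smallest integer sample size
print(round(z, 2), n)            # 1.96, 43
```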
Sampling Distribution of Sample Variance
Theorem 3: Suppose Y1 , · · · , Yn ∼ iid N (µ, σ²). Define the sample variance
S² = Σᵢ₌₁ⁿ (Yᵢ − Ȳ)² / (n − 1).
Then,
(a) (n − 1)S²/σ² ∼ χ²(n − 1);
(b) S² and Ȳ are independent.
Proof: The proof of this theorem is beyond the scope of this course.
15 / 45
Chi-square Table
Table 6 in the “statistical table” file lists the values of χ²_α(m) such that
P (χ²(m) > χ²_α(m)) = α
for different α and degrees of freedom m.
So χ²_α(m) is the upper α quantile of χ²(m), or the (lower) 1 − α quantile of χ²(m), since P (χ²(m) ≤ χ²_α(m)) = 1 − α.
16 / 45
Sampling Distribution of Sample Variance
Example 4: Refer to Example 2. If those 9 observations are used to
calculate S 2 , it might be useful to specify an interval of values that will
include S 2 with a high probability. Find numbers a and b such that
P (a < S 2 < b) = 0.9.
Solution: By Theorem 3, we get
0.9 = P (a < S² < b)
= P ((n − 1)a/σ² < (n − 1)S²/σ² < (n − 1)b/σ²)
= P (8a < U < 8b), where U = (n − 1)S²/σ² ∼ χ²(8).
One method of doing this is to find the value of 8b that cuts off an area of 0.05 in the upper tail and the value of 8a that cuts off 0.05 in the lower tail (0.95 in the upper tail).
17 / 45
Sampling Distribution of Sample Variance
Solution (continued):
Therefore, 8b = χ²₀.₀₅(8) = 15.5073 and 8a = χ²₀.₉₅(8) = 2.73264. Then we get a = 0.34158 and b = 1.93841, so the required interval is about (0.3416, 1.9384).
Note: The required interval is not unique. Another interval is obtained from 0.9 = P (0 < U < 13.3616), where 13.3616 is χ²₀.₁(8); equating 13.3616 = 8b gives (0, 1.6702).
Or 0.9 = P (3.48954 < U < ∞), where 3.48954 is χ²₀.₉(8); equating 3.48954 = 8a gives (0.4362, ∞).
The ideal case is to find the shortest interval such that P (a < S² < b) = 0.9. However, the values of a and b cannot be obtained analytically and must be computed numerically. Therefore, we prefer the interval that cuts off an area of 0.05 in each of the upper and lower tails.
18 / 45
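The interval in Example 4 can be sanity-checked by simulation (a stdlib-only sketch; with scipy one would query the χ² cdf directly). A χ²(8) variate is a sum of 8 squared standard normals:

```python
import random

# Monte Carlo check of Example 4: U = (n-1)S^2/sigma^2 with n = 9 is
# chi-square with 8 df; the interval (2.73264, 15.5073) taken from
# the chi-square table should have probability about 0.90.
random.seed(2)
reps = 200_000
inside = 0
for _ in range(reps):
    u = sum(random.gauss(0.0, 1.0) ** 2 for _ in range(8))
    inside += (2.73264 < u < 15.5073)
p = inside / reps
print(round(p, 2))  # close to 0.90
```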
Sampling Distribution of Sample Mean (unknown σ)
Theorem 2 tells us that √n(Ȳ − µ)/σ ∼ N (0, 1), so it provides the basis for the development of inference-making procedures about the mean µ of a normal population with known variance σ².
What if σ is unknown?
It can be estimated by the sample standard deviation S = √(S²). So we have
√n(Ȳ − µ)/S,
which can provide the basis for developing methods for inferences about µ if we know the distribution of this random variable.
We can show that √n(Ȳ − µ)/S has a distribution known as Student’s t distribution with n − 1 degrees of freedom.
19 / 45
Student’s t Distribution
Suppose Z ∼ N (0, 1) and U ∼ χ2 (m). Then, if Z and U are
independent,
T = Z / √(U/m)
is said to have a t distribution with m df. The pdf of T is
f (y) = [Γ((m + 1)/2) / (√(πm) Γ(m/2))] (1 + y²/m)^(−(m+1)/2), −∞ < y < ∞.
We may write T ∼ t(m) or T ∼ tₘ and f (y) as f_t(m)(y).
The pdf f (y) is symmetric about zero.
If m > 1, E(T ) = 0. If m > 2, Var(T ) = m/(m − 2).
20 / 45
Student’s t Distribution
The pdfs of N (0, 1) and t(m) are sketched in the following figure. Notice that both density functions are symmetric about the origin, but the t density has more probability mass in its tails; we say it has “fatter tails” than the standard normal distribution.
When m → ∞, t(m) converges to N (0, 1).
21 / 45
Student’s t Distribution
Table 5 in the “statistical table” file lists the upper α quantiles tα (m) (i.e.,
1 − α quantiles) of t(m) for different α and degrees of freedom m, which
implies
P (T > tα (m)) = α,
i.e.,
P (T ≤ tα (m)) = 1 − α.
For example, t₀.₀₅(8) = 1.860. Let T ∼ t(8). We have
P (T > 1.860) = 0.05, P (T ≤ 1.860) = 0.95, P (T ≤ −1.860) = 0.05 and
P (−1.860 < T < 1.860) = 0.90.
22 / 45
Sampling Distribution of Sample Mean (unknown σ)
Theorem 4: Suppose Y1 , · · · , Yn ∼ iid N (µ, σ²). Then,
T = √n(Ȳ − µ)/S ∼ t(n − 1).
Proof: From Theorem 2, we get Z = √n(Ȳ − µ)/σ ∼ N (0, 1).
By Theorem 3, we have U = (n − 1)S²/σ² ∼ χ²(n − 1). Also, S² and Ȳ are independent, so Z and U are independent. By the definition of the t distribution, we know Z / √(U/(n − 1)) ∼ t(n − 1).
Since
Z / √(U/(n − 1)) = [√n(Ȳ − µ)/σ] / √[(n − 1)S² / ((n − 1)σ²)] = √n(Ȳ − µ)/S = T,
we have T = √n(Ȳ − µ)/S ∼ t(n − 1).
23 / 45
Sampling Distribution of Sample Mean (unknown σ)
Example 5: The tensile strength for a type of wire is normally distributed with unknown mean µ and unknown variance σ². Six pieces of wire were randomly selected; denote by Yᵢ the tensile strength of piece i. Because σ²_Ȳ = σ²/n, σ²_Ȳ can be estimated by S²/n, the estimated variance of Ȳ. Find the approximate probability that Ȳ will be within 2S/√n of the true population mean µ.
Solution: We want to find
P (−2S/√n ≤ Ȳ − µ ≤ 2S/√n) = P (−2 ≤ √n(Ȳ − µ)/S ≤ 2)
= P (−2 ≤ T ≤ 2) = 1 − 2P (T > 2),
where T ∼ t(n − 1) = t(5).
24 / 45
Sampling Distribution of Sample Mean (unknown σ)
Solution (continued):
Looking at the t-table, we see that P (T > 2.015) = 0.05. So the
probability that Ȳ will be within 2 estimated standard deviations of µ is
slightly less than 0.90.
Using R, we can get the exact probability of interest, 0.8981, with the command 1-2*pt(-2, df=5).
Notice that, if σ² were known, the probability that Ȳ will fall within 2σ_Ȳ of µ would be given by
P (−2σ/√n ≤ Ȳ − µ ≤ 2σ/√n) = P (−2 ≤ √n(Ȳ − µ)/σ ≤ 2)
= P (−2 ≤ Z ≤ 2) = 0.9544.
25 / 45
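Since the standard library has no t distribution, Example 5 can be checked by simulating T = √n(Ȳ − µ)/S directly (a sketch, not from the notes; R's `pt` or scipy would give the exact value 0.8981):

```python
import random
from statistics import mean, stdev

# Monte Carlo check of Example 5: with n = 6 iid N(0, 1) draws,
# T = sqrt(n) * Ybar / S follows t(5), and P(-2 <= T <= 2) ~ 0.898.
random.seed(3)
n, reps = 6, 100_000
inside = 0
for _ in range(reps):
    ys = [random.gauss(0.0, 1.0) for _ in range(n)]
    t = n ** 0.5 * mean(ys) / stdev(ys)  # stdev uses the n-1 divisor
    inside += (-2.0 <= t <= 2.0)
p = inside / reps
print(round(p, 2))  # close to 0.90, slightly below
```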
Comparing Variances of Two Normal Distributions
Suppose we have two independent normal populations X ∼ N (µ_X, σ²_X) and Y ∼ N (µ_Y, σ²_Y), and all the parameters are unknown. We are interested in comparing σ²_X and σ²_Y.
We can randomly select a sample from each population:
X1 , · · · , Xn ∼ iid N (µ_X, σ²_X) (sample from population X),
Y1 , · · · , Ym ∼ iid N (µ_Y, σ²_Y) (sample from population Y ).
It seems intuitive that the ratio S²_X/S²_Y could be used to make inferences about the relative magnitudes of σ²_X and σ²_Y, where S²_X and S²_Y are the sample variances of these two samples.
What is the distribution related to S²_X/S²_Y?
26 / 45
F Distribution
Let W1 ∼ χ²(m1 ) and W2 ∼ χ²(m2 ) be independent. Then,
F = (W1 /m1 ) / (W2 /m2 )
is said to have an F distribution with m1 numerator degrees of freedom and m2 denominator degrees of freedom. The pdf is
f (y) = [Γ((m1 + m2 )/2) / (Γ(m1 /2) Γ(m2 /2))] m1^(m1/2) m2^(m2/2) y^(m1/2 − 1) (m2 + m1 y)^(−(m1+m2)/2), y > 0.
We may write F ∼ F (m1 , m2 ) and f (y) as f_F(m1,m2)(y).
If m2 > 2, E(F ) = m2 /(m2 − 2). If m2 > 4, Var(F ) = 2m2²(m1 + m2 − 2) / [m1 (m2 − 2)²(m2 − 4)].
A basic fact about the F distribution is that if F ∼ F (m1 , m2 ), then 1/F ∼ F (m2 , m1 ).
27 / 45
F Distribution
The pdf of the F distribution looks like a gamma pdf.
Table 7 in the “statistical table” file tabulates the upper α quantiles Fα (m1 , m2 ) (i.e., 1 − α quantiles) of F (m1 , m2 ) for different α and degrees of freedom m1 , m2 , which implies
P (F > Fα (m1 , m2 )) = α,
i.e.,
P (F ≤ Fα (m1 , m2 )) = 1 − α.
For example, F0.025 (4, 17) = 3.66 and F0.05 (4, 17) = 2.96.
28 / 45
Comparing Variances of Two Normal Distributions
Theorem 5: Suppose X1 , · · · , Xn ∼ iid N (µ_X, σ²_X) (first sample) and Y1 , · · · , Ym ∼ iid N (µ_Y, σ²_Y) (second sample), and the two samples are independent. Define S²_X = Σᵢ₌₁ⁿ (Xᵢ − X̄)²/(n − 1) and S²_Y = Σᵢ₌₁ᵐ (Yᵢ − Ȳ)²/(m − 1). Then,
F = (S²_X/σ²_X) / (S²_Y/σ²_Y) ∼ F (n − 1, m − 1).
Proof: From Theorem 3, we get W1 = (n − 1)S²_X/σ²_X ∼ χ²(n − 1) and W2 = (m − 1)S²_Y/σ²_Y ∼ χ²(m − 1). Since the two samples are independent, W1 and W2 are independent. By the definition of the F distribution, we obtain
F = [W1 /(n − 1)] / [W2 /(m − 1)] = (S²_X/σ²_X) / (S²_Y/σ²_Y) ∼ F (n − 1, m − 1).
29 / 45
Comparing Variances of Two Normal Distributions
Example 6: Refer to Example 2. Suppose that another sample of 5 bottles is
to be taken from the output of the same bottling machine. Find the
probability that the sample variance of the volumes in these 5 bottles will be
at least 7 times as large as the sample variance of the volumes in the 9 bottles
that were initially sampled.
Solution: Since the two samples are taken from the same population, we have σ²_X = σ²_Y. So
P (S²_X > 7S²_Y) = P ((S²_X/σ²_X) / (S²_Y/σ²_Y) > 7)
= P (F > 7), where F ∼ F (4, 8)
≈ P (F > 7.01) = 0.01, by the F table.
Using R, we get P (F > 7) = 0.01002557 with the command 1-pf(7,df1=4,df2=8).
30 / 45
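Example 6 can also be checked by simulating the two samples directly (a stdlib sketch, not from the notes; the R command above gives the exact value):

```python
import random
from statistics import variance

# Monte Carlo check of Example 6: both samples come from the same
# normal population, so P(S_X^2 > 7 S_Y^2) = P(F > 7) with
# F ~ F(4, 8), which should be about 0.01.
random.seed(4)
reps = 100_000
count = 0
for _ in range(reps):
    xs = [random.gauss(0.0, 1.0) for _ in range(5)]  # n = 5 bottles
    ys = [random.gauss(0.0, 1.0) for _ in range(9)]  # m = 9 bottles
    count += variance(xs) > 7 * variance(ys)
p = count / reps
print(round(p, 3))  # close to 0.010
```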
The Central Limit Theorem
31 / 45
The Central Limit Theorem
In the previous section, we assumed Y1 , · · · , Yn ∼ iid N (µ, σ 2 ). What if
Y1 , · · · , Yn are not normally distributed (e.g. Example 1)?
Can we still find the sampling distribution of Ȳ ? We can rely on “The
Central Limit Theorem”.
32 / 45
Convergence in Distribution
Consider a random variable Y and a sequence of random variables Xn
indexed by n = 1, 2, 3, · · · (that is, X1 , X2 , X3 , · · · ). Let FY (·) and FXn (·)
be the cdfs of Y and Xn , respectively.
Suppose
FXn (y) → FY (y) as n → ∞
for each value y ∈ R at which FY (y) is continuous. Then we say that
Xn converges to Y in distribution as n → ∞,
and we may express this statement mathematically as
Xn →ᵈ Y as n → ∞.
33 / 45
The Central Limit Theorem
Theorem 6 (CLT): Let Y1 , · · · , Yn be independent and identically
distributed random variables with E(Yi ) = µ and V ar(Yi ) = σ 2 < ∞.
Define
Un = (Ȳ − µ)/(σ/√n) = √n(Ȳ − µ)/σ.
Then,
Un →ᵈ Z as n → ∞,
where Z ∼ N (0, 1). Or you can write Un →ᵈ N (0, 1) as n → ∞.
Note: The CLT implies that when n is large, it is reasonable to make the following equivalent approximating statements: Ȳ ∼̇ N (µ, σ²/n) or Σᵢ₌₁ⁿ Yᵢ = nȲ ∼̇ N (nµ, nσ²).
34 / 45
The Central Limit Theorem
Note: The CLT makes no particular distributional assumption about the Yᵢ. It assumes only that they are iid with some common finite mean µ and common finite non-zero variance σ².
Since the cdf of Z, Φ(z), is continuous at all values z ∈ R, we have P (Un ≤ z) = F_Un(z) → Φ(z) as n → ∞ for any z ∈ R. This implies that probabilities involving Un can be approximated by N (0, 1) if n is large. Usually, a value of n greater than 30 will ensure that the distribution of Un can be closely approximated by N (0, 1).
35 / 45
The Central Limit Theorem
Example 7: 200 numbers are randomly chosen from between 0 and 1. Find
the probability that the average of these numbers is greater than 0.53.
Solution: Let Yi be the ith number, i = 1, · · · , n, where n = 200. Then,
Y1 , · · · , Yn ∼ iid U (0, 1) so that µ = E(Yi ) = 1/2 and
σ 2 = V ar(Yi ) = 1/12. Applying the CLT, we find
P (Ȳ > 0.53) = P (√n(Ȳ − µ)/σ > √n(0.53 − µ)/σ)
= P (Z > √200 (0.53 − 0.5)/√(1/12))
≈ P (Z > 1.47) = 0.0708.
Hence, the probability of interest is approximately 0.0708, which is very close to the true probability because the sample size is large.
36 / 45
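The arithmetic of Example 7 in stdlib Python (a sketch; `NormalDist` replaces the z-table):

```python
import math
from statistics import NormalDist

# Example 7: z = sqrt(200)(0.53 - 0.5)/sqrt(1/12) and
# P(Ybar > 0.53) ~ P(Z > z) by the CLT.
z = math.sqrt(200) * (0.53 - 0.5) / math.sqrt(1 / 12)
p = 1 - NormalDist().cdf(z)
print(round(z, 2), round(p, 4))  # 1.47, ~0.0708
```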
The Central Limit Theorem
Example 8: The service times for customers coming through a checkout
counter in a retail store are independent and identically distributed random
variables with mean 1.5 minutes and variance 1. Find the probability that
100 customers can be served in less than 2 hours of total service time.
Solution: If we let Yᵢ denote the service time for the ith customer, then we want
P (Σᵢ₌₁¹⁰⁰ Yᵢ ≤ 120) = P (Ȳ ≤ 120/100) = P (Ȳ ≤ 1.2).
Since the sample size n = 100 is large, we apply the CLT and get
P (Ȳ ≤ 1.2) = P (√100 (Ȳ − 1.5)/1 ≤ √100 (1.2 − 1.5)/1)
≈ P (Z ≤ −3) = 0.0013.
This small probability indicates that it is virtually impossible to serve 100
customers in only 2 hours.
37 / 45
The Normal Approximation to
the Binomial Distribution
38 / 45
Normal Approximation to Bin(n, p)
Theorem 7 (CLT for Sample Proportion):
Suppose that Y ∼ Bin(n, p). Then Y = Σᵢ₌₁ⁿ Yᵢ, where Y1 , · · · , Yn ∼ iid Bern(p). Define p̂ = Y /n = Σᵢ₌₁ⁿ Yᵢ / n, which is regarded as the sample proportion. Since µ = E(Yᵢ) = p and σ² = Var(Yᵢ) = p(1 − p), by the CLT, we have
√n(p̂ − p) / √(p(1 − p)) →ᵈ N (0, 1) as n → ∞.
Note: Equivalently, you can say p̂ ∼̇ N (p, p(1 − p)/n) or Y = Σᵢ₌₁ⁿ Yᵢ ∼̇ N (np, np(1 − p)) when n is large.
39 / 45
Normal Approximation to Bin(n, p)
Example 9: Candidate A believes that she can win a city election if she can
earn at least 55% of the votes in precinct 1. She also believes that about 50%
of the city’s voters favor her. If n = 100 voters show up to vote at precinct 1,
what is the probability that candidate A will receive at least 55% of their
votes?
Solution: Let Y denote the number of voters at precinct 1 who vote for
candidate A. If we think of the n = 100 voters at precinct 1 as a random
sample from the city, then Y ∼ Bin(n = 100, p = 0.5). We want to know
P (p̂ ≥ 0.55) where p̂ = Y /n.
Since p̂ ∼̇ N (p = 0.5, p(1 − p)/n = 0.0025), we get
P (p̂ ≥ 0.55) = P ((p̂ − 0.5)/√0.0025 ≥ (0.55 − 0.5)/√0.0025) ≈ P (Z ≥ 1) = 0.1587.
40 / 45
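Because Bin(100, 0.5) is small enough to enumerate, the normal approximation of Example 9 can be compared with the exact binomial tail (a stdlib sketch, not from the notes):

```python
import math
from statistics import NormalDist

# Example 9: normal approximation vs the exact binomial tail
# P(Y >= 55) for Y ~ Bin(100, 0.5), i.e. P(p_hat >= 0.55).
approx = 1 - NormalDist(mu=0.5, sigma=math.sqrt(0.0025)).cdf(0.55)
exact = sum(math.comb(100, y) for y in range(55, 101)) / 2 ** 100

print(round(approx, 4))  # 0.1587, as on the slide
print(round(exact, 4))   # somewhat larger; a continuity correction helps
```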
Normal Approximation to Bin(n, p)
Example 10: A die is rolled n = 120 times. Find the probability that at least
27 sixes come up.
Solution: Let Y be the number of 6s. Then Y ∼ Bin(120, 1/6), so Y ∼̇ N (120 × (1/6), 120 × (1/6) × (5/6)), i.e., N (20, 50/3). So
P (Y ≥ 27) ≈ P (U ≥ 27), where U ∼ N (20, 50/3)
= P (Z ≥ (27 − 20)/√(50/3))
= P (Z > 1.7146) = 0.0432.
41 / 45
The Continuity Correction
The exact probability is the area of the boxes above 27, 28, 29, · · · . We have approximated this probability by 0.0432, which is the area under the approximating normal density to the right of 27. But this area seems to be too small by about half the area of the box above 27 (i.e., the left half of that box). Thus, a better approximation would be the area to the right of 27 − 0.5 = 26.5 (as shaded below). We call the “−0.5” here the continuity correction.
42 / 45
The Continuity Correction
Let’s now apply this continuity correction to see what difference it makes:
P (Y ≥ 27) ≈ P (U ≥ 27 − 0.5), where U ∼ N (20, 50/3)
= P (Z ≥ (27 − 0.5 − 20)/√(50/3))
= P (Z > 1.59) = 0.0559.
And the exact probability can in fact be calculated:
P (Y ≥ 27) = Σ_{y=27}^{120} C(120, y) (1/6)^y (5/6)^(120−y) = 0.0597.
We see that the continuity correction here does indeed improve the
approximation.
43 / 45
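The comparison above can be reproduced exactly with integer arithmetic (a stdlib sketch; `Fraction` avoids rounding error in the binomial sum):

```python
import math
from fractions import Fraction
from statistics import NormalDist

# Continuity correction check: exact P(Y >= 27) for Y ~ Bin(120, 1/6)
# versus the corrected normal approximation P(U >= 26.5).
p, q = Fraction(1, 6), Fraction(5, 6)
exact = float(sum(math.comb(120, y) * p**y * q**(120 - y)
                  for y in range(27, 121)))

u = NormalDist(mu=20, sigma=math.sqrt(50 / 3))
corrected = 1 - u.cdf(26.5)

print(round(exact, 4))      # ~0.0597
print(round(corrected, 3))  # ~0.056, much closer than 0.0432
```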
The Continuity Correction
Similarly, if you are interested in P (Y ≤ 10), P (U ≤ 10 + 0.5) would be
a better approximation than P (U ≤ 10).
In summary, the 0.5 that we added to the largest value of interest (making
it a little larger) and subtracted from the smallest value of interest (making it
a little smaller) is commonly called the continuity correction associated
with the normal approximation.
The normal approximation to binomial probabilities works well even for moderately large n as long as p is not close to zero or one. A useful rule of thumb is that the normal approximation to the binomial distribution is appropriate when 0 < p − 3√(pq/n) and p + 3√(pq/n) < 1, where q = 1 − p.
44 / 45
Conclusion
• Have a good knowledge of the meaning of statistics and sampling
distributions.
• Master the sampling distributions related to normal distributions.
• Be familiar with the properties of the chi-square, t and F distributions and be able to find their quantiles.
• Know how to use the CLT to make inferences about the sample mean
and sample proportion.
45 / 45
Chapter 8: Estimation
Part (a): Point Estimation
STAT6039 Principles of Mathematical Statistics
Introduction
The purpose of statistics is to use the information contained in a sample to
make inferences about the population from which the sample is taken.
Since populations can be characterized by some numerical descriptive
measures called parameters, the objective of many statistical investigations
is to estimate the value of one or more relevant parameters.
Some important population parameters are the population mean µ,
population proportion p, population variance σ 2 , and population standard
deviation σ, or the functions of these parameters, etc. For example, we might
wish to estimate the mean waiting time µ at a supermarket checkout station
or the standard deviation σ of the error of measurement of an electronic
instrument.
1 / 48
Terminology
Target parameter is the parameter of interest in the experiment.
Point estimation:
• Use a single number to estimate the target parameter.
• For example, the point estimate of the average height of ANU students,
µ, is 165cm.
Interval estimation:
• Use an interval to estimate the target parameter.
• For example, the average height of ANU students, µ, will fall between
150cm and 180cm, which is an interval estimate of µ.
2 / 48
Terminology
Estimator:
• An estimator is a rule, often expressed as a formula, that tells how to estimate a population parameter based on the measurements contained in a sample. So it is a statistic, or an interval with at least one endpoint being a statistic.
• For example, one point estimator of µ is Ȳ = Σᵢ₌₁ⁿ Yᵢ / n, and one interval estimator of µ is [Ȳ − z_{α/2} S/√n, Ȳ + z_{α/2} S/√n].
Estimate:
• An estimate is a realized value of an estimator, calculated based on the sample data.
• For example, ȳ = (160 + 164 + 170 + 173 + 158)/5 = 165 (cm).
3 / 48
Point Estimation
4 / 48
Point Estimation
Many different estimators may be obtained for the same population
parameter. For example, sample mean and sample median can both be used
to estimate population mean.
How to evaluate the performance of an estimator?
5 / 48
Bias of a Point Estimator
Let θ be some target population parameter and let θ̂ denote a point
estimator of θ.
The bias of a point estimator θ̂ is given by B(θ̂) = E(θ̂) − θ.
If B(θ̂) = 0, or equivalently E(θ̂) = θ, we say θ̂ is an unbiased
estimator of θ. If B(θ̂) ̸= 0, θ̂ is said to be biased.
Unbiasedness is a desirable quality of a point estimator.
If B(θ̂) → 0 as n → ∞, we say θ̂ is asymptotically unbiased.
6 / 48
Bias of a Point Estimator
Theorem 1: Let Y1 , · · · , Yn ∼ iid (µ, σ²) where σ² < ∞. Define Ȳ = Σᵢ₌₁ⁿ Yᵢ / n and S² = Σᵢ₌₁ⁿ (Yᵢ − Ȳ)² / (n − 1). Then,
(a) Ȳ is an unbiased estimator of the population mean µ.
(b) S² is an unbiased estimator of σ².
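Part (b) and the role of the n − 1 divisor can be verified exactly for a small population (a stdlib sketch, not from the notes), reusing the fair four-sided die of Chapter 7, which has µ = 5/2 and σ² = 5/4:

```python
from itertools import product
from fractions import Fraction

# Exact check of Theorem 1(b): over all 16 equally likely samples of
# size 2 from a fair four-sided die, the n-1 divisor gives
# E(S^2) = sigma^2 = 5/4 exactly, while dividing by n underestimates.
faces = [1, 2, 3, 4]
samples = list(product(faces, repeat=2))

def sum_sq(sample, divisor):
    ybar = Fraction(sum(sample), len(sample))
    return sum((Fraction(y) - ybar) ** 2 for y in sample) / divisor

e_s2_unbiased = sum(sum_sq(s, 1) for s in samples) / len(samples)  # n-1 = 1
e_s2_biased = sum(sum_sq(s, 2) for s in samples) / len(samples)    # n = 2

print(e_s2_unbiased)  # 5/4 = sigma^2 (unbiased)
print(e_s2_biased)    # 5/8, biased downward
```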
Corollary:
Sample proportion p̂ = Y /...