Chapter 4: Continuous Random Variables and Their Probability Distributions
STAT6039 Principles of Mathematical Statistics
Cumulative Distribution Function
The cumulative distribution function (cdf) of a random variable Y is
defined to be
F (y) = P (Y ≤ y), for − ∞ < y < ∞.
We may also write F (y) as FY (y).
1 / 71
Cumulative Distribution Function
Example 1: Suppose that Y ∼ Bin(2, 1/2). Find and sketch F (y).
Solution:
The probability function of Y is given by f(y) = (2 choose y)(1/2)^y (1/2)^{2−y}, which
yields P (Y = 0) = 1/4, P (Y = 1) = 1/2, P (Y = 2) = 1/4.
For any y < 0, P (Y ≤ y) = 0 since the only values of Y that are
assigned positive probabilities are 0, 1, and 2 and none of these values are
less than or equal to y if y < 0.
For any 0 ≤ y < 1, P (Y ≤ y) = P (Y = 0) = 1/4,
For any 1 ≤ y < 2, P (Y ≤ y) = P (Y = 0) + P (Y = 1) = 3/4,
For any y ≥ 2, P (Y ≤ y) = P (Y = 0) + P (Y = 1) + P (Y = 2) = 1.
2 / 71
Cumulative Distribution Function
Solution (continued):
In general,
F (y) = P (Y ≤ y) = 0,   for y < 0,
                  = 1/4, for 0 ≤ y < 1,
                  = 3/4, for 1 ≤ y < 2,
                  = 1,   for y ≥ 2.
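The piecewise cdf above is easy to check numerically; the sketch below assumes Y ∼ Bin(2, 1/2) and the helper names are illustrative, not part of the course materials.

```python
from math import comb

def binom_pmf(y, n=2, p=0.5):
    # P(Y = y) for Y ~ Bin(n, p)
    return comb(n, y) * p**y * (1 - p)**(n - y)

def binom_cdf(y, n=2, p=0.5):
    # F(y) = P(Y <= y): sum the pmf over support points not exceeding y
    return sum(binom_pmf(k, n, p) for k in range(n + 1) if k <= y)

# The steps match the piecewise form derived above
assert binom_cdf(-0.5) == 0.0   # y < 0
assert binom_cdf(0.5) == 0.25   # 0 <= y < 1
assert binom_cdf(1.5) == 0.75   # 1 <= y < 2
assert binom_cdf(2.0) == 1.0    # y >= 2
```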
3 / 71
Cumulative Distribution Function
Note:
From Example 1, it is clear that the cumulative distribution function stays
flat between the possible values of Y and increases in jumps or steps at
each of the possible values of Y .
Functions that behave in such a manner are called step functions.
Cumulative distribution functions of discrete random variables are always
step functions because the cumulative distribution function increases only at
the finite or countable number of points with positive probabilities.
4 / 71
Cumulative Distribution Function
Theorem 1 (Properties of Cumulative Distribution Function):
If F (y) is a cumulative distribution function, then:
1. F (−∞) ≡ limy→−∞ F (y) = 0.
2. F (∞) ≡ limy→∞ F (y) = 1.
3. F (y) is a nondecreasing function of y, i.e., if y1 and y2 are any values
such that y1 < y2 , then F (y1 ) ≤ F (y2 ).
4. F (y) is right continuous, i.e. limδ→0+ F (y + δ) = F (y). (In Example
1 this corresponds to the fact that F (0) = 0.25, not 0.)
5 / 71
Continuous Random Variable
For a continuous random variable:
• Sample space is a continuous interval.
• There are an infinite number of possible outcomes and they cannot be
counted.
• P (Y = y) = 0 for all y. We need different rules for doing probability
calculations. Focus on P (Y ≤ y) instead.
A random variable Y is said to be continuous if its cdf F (y) is
continuous everywhere.
Note: If P (Y = y0 ) = p0 > 0, then F (y) would have a discontinuity (jump)
of size p0 at the point y0 , violating the assumption that F (y) was continuous.
6 / 71
Continuous Random Variable
Example 2: Let Y be a number chosen randomly between 0 and 1. Find Y ’s
cdf. Is Y a cts rv?
Solution:
F (0.1) = P (Y ≤ 0.1) = 0.1,
F (0.5) = P (Y ≤ 0.5) = 0.5,
F (0.9) = P (Y ≤ 0.9) = 0.9, etc. Thus, we conclude
F (y) = 0, for y < 0,
      = y, for 0 ≤ y ≤ 1,
      = 1, for y > 1.
7 / 71
Probability Density Function
If there exists a nonnegative function f (y) such that
F (y) = ∫_{−∞}^{y} f (t) dt,
we call f (y) the probability density function (pdf) of the continuous random
variable Y .
For any y at which the derivative of F (y) exists,
f (y) = dF (y)/dy = F ′(y).
We may also write it as fY (y).
Note: The area under the pdf of Y to the left of y is just F (y).
8 / 71
Probability Density Function
Theorem 2 (Properties of Probability Density Function):
If f (y) is a pdf of a continuous random variable Y , then
1. f (y) ≥ 0 for all y, −∞ < y < ∞.
2. ∫_{−∞}^{∞} f (y) dy = 1.
Note:
The pdf f (y) may be greater than 1 and need not be continuous everywhere.
The total area under the pdf equals 1.
9 / 71
Probability Density Function
Example 2 (continued): Let Y be a number chosen randomly between 0
and 1. Find the pdf of Y and graph it.
Solution:
The pdf f (y) is the derivative of F (y) wherever the derivative exists. Thus,
f (y) = F ′(y) = 0, for y < 0,
              = 1, for 0 < y < 1,
              = 0, for y > 1,
and f (y) is undefined at y = 0 and y = 1.
10 / 71
Probability Density Function
Solution (continued):
11 / 71
Conventions and Simplifications for Notations
• It may be convenient to consider undefined values of a pdf as being
equal to zero.
• It may be convenient not to specify where a pdf is 0, nor where a cdf is
0 or 1. Thus we may write: F (y) = y, 0 ≤ y ≤ 1 and
f (y) = 1, 0 ≤ y ≤ 1 in Example 2.
• These details have no effect on calculations if considering a
continuous distribution, but may be important when considering a
discrete or mixed distribution.
• Graphs and formulae of continuous pdfs and cdfs can be simplified
accordingly.
12 / 71
Probability Density Function
Theorem 3:
If a continuous random variable Y has pdf f (y) and a < b, then the
probability that Y falls in the interval [a, b] is
P (a ≤ Y ≤ b) = ∫_{a}^{b} f (y) dy.
Note:
Since P (Y = a) = P (Y = b) = 0, the above theorem implies that
P (a ≤ Y ≤ b) = P (a < Y ≤ b) = P (a ≤ Y < b) = P (a < Y < b).
13 / 71
Probability Density Function
Example 3: Given f (y) = cy 2 , 0 ≤ y ≤ 2, find the value of c for which
f (y) is a valid pdf of a random variable Y and then calculate P (1 < Y < 2).
Solution:
We require a value for c such that
1 = ∫_{−∞}^{∞} f (y) dy = ∫_{0}^{2} cy² dy = [cy³/3]_{0}^{2} = 8c/3.
Then we find that c = 3/8, so that f (y) = (3/8)y², 0 ≤ y ≤ 2.
Thus, we have
P (1 < Y < 2) = ∫_{1}^{2} (3/8)y² dy = [y³/8]_{1}^{2} = 7/8.
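Both the normalizing constant and the probability can be verified with a crude numerical integration; the Python sketch below uses a midpoint rule, and the helper names are illustrative.

```python
def f(y):
    # pdf from Example 3: f(y) = (3/8) y^2 on [0, 2], zero elsewhere
    return 3/8 * y**2 if 0 <= y <= 2 else 0.0

def integrate(func, a, b, n=100_000):
    # simple midpoint-rule approximation of the integral of func over [a, b]
    h = (b - a) / n
    return sum(func(a + (i + 0.5) * h) for i in range(n)) * h

total = integrate(f, 0, 2)   # should be 1 for a valid pdf
prob = integrate(f, 1, 2)    # should be 7/8
assert abs(total - 1) < 1e-6
assert abs(prob - 7/8) < 1e-6
```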
14 / 71
Expected Value
15 / 71
Expected Value of a Continuous Random Variable
Let Y be a continuous random variable with pdf f (y). Then the expected
value of Y , E(Y ), is defined to be
E(Y ) = ∫_{−∞}^{∞} y f (y) dy,
provided that the integral exists.
Let g(Y ) be a function of Y . Then the expected value of g(Y ) is given by
E[g(Y )] = ∫_{−∞}^{∞} g(y) f (y) dy,
provided that the integral exists.
16 / 71
Properties of Expected Value
Theorem 4:
Let a, b, c be constants and let g(Y ), g1 (Y ), g2 (Y ), ..., gk (Y ) be functions
of a continuous random variable Y . Then the following results hold:
1. E(c) = c.
2. E[ag(Y ) + b] = aE[g(Y )] + b.
3. E[g1 (Y ) + · · · + gk (Y )] = E[g1 (Y )] + · · · + E[gk (Y )].
Proof: Similar to the proof of the expected values of a discrete random
variable by replacing the sum with the integral.
17 / 71
Moments
The kth (raw) moment of a random variable Y : µ′k = E(Y k ).
The kth central moment of a random variable Y : µk = E[(Y − µ)k ].
Note:
µ′1 = µ and µ1 = 0.
V ar(Y ) = σ 2 = µ2 = µ′2 − µ2 .
18 / 71
Moments
Example 4: In Example 3, we determined that f (y) = 38 y 2 for 0 ≤ y ≤ 2 is
a valid pdf. If the random variable Y has this pdf, find µ = E(Y ) and
σ 2 = V ar(Y ).
Solution: By definition, we have
µ = E(Y ) = ∫_{0}^{2} y · (3/8)y² dy = [(3/8)(1/4)y⁴]_{0}^{2} = 1.5.
The variance of Y can be found once we determine E(Y ²).
E(Y ²) = ∫_{0}^{2} y² · (3/8)y² dy = [(3/8)(1/5)y⁵]_{0}^{2} = 2.4.
Thus, σ² = Var(Y ) = E(Y ²) − µ² = 2.4 − 1.5² = 0.15.
19 / 71
Uniform Distribution
20 / 71
Uniform Distribution
A random variable Y has a uniform distribution with parameters a and
b if and only if its pdf has the form
f (y) = 1/(b − a), a < y < b,
where −∞ < a < b < ∞.
We write Y ∼ U nif orm(a, b) or Y ∼ U nif (a, b) or Y ∼ U (a, b).
We call a the lower bound parameter, and we call b the upper bound
parameter.
If U ∼ U (0, 1), we say U has the standard uniform distribution.
21 / 71
Uniform Distribution
Example 5: Suppose Y ∼ U (a, b). Find the cdf of Y .
Solution:
F (y) = ∫_{−∞}^{y} f (t) dt = ∫_{a}^{y} 1/(b − a) dt = (y − a)/(b − a), a < y < b.
22 / 71
Uniform Distribution
Theorem 5: Let Y ∼ U (a, b). Then
µ = E(Y ) = (a + b)/2 and σ² = Var(Y ) = (b − a)²/12.
Proof: Left as an exercise.
23 / 71
Uniform Distribution
Example 6: The length of time patients wait to see a doctor is uniformly
distributed between 40 mins and 3 hrs. Find the probability of waiting
between 1 and 2 hrs, given you waited over 90 mins.
Solution: Let Y be the waiting time (in mins), so we have Y ∼ U (40, 180)
with pdf f (y) = 1/140, 40 < y < 180. Then,
P (60 < Y < 120 | Y > 90) = P (90 < Y < 120)/P (Y > 90).
Since
P (Y > 90) = ∫_{90}^{180} (1/140) dy = (180 − 90)/140 = 9/14,
P (90 < Y < 120) = ∫_{90}^{120} (1/140) dy = (120 − 90)/140 = 3/14,
we get
P (60 < Y < 120 | Y > 90) = (3/14)/(9/14) = 1/3.
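Uniform probabilities reduce to interval lengths, so this conditional probability can be checked in a few lines; the helper below is an illustrative sketch for Y ∼ U(40, 180).

```python
a, b = 40, 180  # waiting time Y ~ U(40, 180), in minutes

def unif_prob(lo, hi):
    # P(lo < Y < hi) for Y ~ U(a, b): overlap length divided by b - a
    lo, hi = max(lo, a), min(hi, b)
    return max(hi - lo, 0) / (b - a)

# P(60 < Y < 120 | Y > 90) = P(90 < Y < 120) / P(Y > 90)
p_cond = unif_prob(90, 120) / unif_prob(90, 180)
assert abs(unif_prob(90, 180) - 9/14) < 1e-12
assert abs(unif_prob(90, 120) - 3/14) < 1e-12
assert abs(p_cond - 1/3) < 1e-12
```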
24 / 71
More Discussion
• All intervals of the same length on the distribution’s support are
equally probable.
• Often used for bounded data. In practice, if we randomly select a value
from some fixed interval, say (a, b), then the value follows U (a, b).
• The standard uniform distribution U (0, 1) can be used to generate
random variables following other distributions: if U ∼ U (0, 1), then the
random variable FY^{−1}(U ) has the same distribution as Y .
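The inverse-transform idea in the last bullet can be sketched for the exponential distribution, whose cdf F(y) = 1 − e^{−y/β} inverts to F^{−1}(u) = −β ln(1 − u); the sample size, seed and β below are arbitrary illustrative choices.

```python
import math
import random

random.seed(42)
beta = 2.0  # illustrative scale parameter

def exp_inv_cdf(u, beta):
    # F^{-1}(u) = -beta * ln(1 - u), the quantile function of Exp(beta)
    return -beta * math.log(1 - u)

# Feeding standard uniforms through the inverse cdf yields Exp(beta) samples
samples = [exp_inv_cdf(random.random(), beta) for _ in range(20_000)]
sample_mean = sum(samples) / len(samples)
assert abs(sample_mean - beta) < 0.1   # E(Y) = beta for Y ~ Exp(beta)
```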
25 / 71
Normal Distribution
26 / 71
Normal Distribution
A random variable Y is said to have a normal distribution if and only if,
for σ > 0 and −∞ < µ < ∞, the pdf of Y is
f (y) = (1/(√(2π) σ)) e^{−(y−µ)²/(2σ²)}, −∞ < y < ∞.
We write Y ∼ N (µ, σ 2 ).
Theorem 6: Let Y ∼ N (µ, σ 2 ). Then
E(Y ) = µ and V ar(Y ) = σ 2 .
Note: We call µ the mean parameter and σ 2 the variance parameter. Or we
call σ the standard deviation parameter.
27 / 71
Normal Distribution
The pdf of the normal distribution is bell-shaped and symmetric about µ; it
reaches its highest point at y = µ and tends to zero as y → ±∞.
28 / 71
Normal Distribution
Changing µ (different means) will shift the pdf curve left and right.
Changing σ 2 (different variances) will make the pdf curve become more
peaked or more flattened.
29 / 71
Normal Distribution
The cdf of Y ∼ N (µ, σ²) is
F (y) = ∫_{−∞}^{y} (1/(√(2π) σ)) e^{−(t−µ)²/(2σ²)} dt.
For any −∞ < a < b < ∞,
P (a < Y < b) = F (b) − F (a) = ∫_{a}^{b} (1/(√(2π) σ)) e^{−(t−µ)²/(2σ²)} dt.
However, no closed-form expression exists for this integral; hence its
evaluation requires numerical integration techniques.
30 / 71
Standard Normal Distribution
If Z ∼ N (0, 1), we say that Z has the standard normal distribution.
The pdf can be written as
ϕ(y) = (1/√(2π)) e^{−y²/2}.
The cdf can be written as
Φ(y) = ∫_{−∞}^{y} (1/√(2π)) e^{−t²/2} dt.
31 / 71
z-Table
The table of probabilities for a standard normal distribution Z ∼ N (0, 1)
is called a z-table (Table 4 in the “statistical table” file) and it lists
probabilities of the form P (Z > z) for various values of z. Some books
have tables of P (Z < z).
From the table, for example we would get
P (Z > 1.67) = 0.0475,
P (Z > 1.96) = 0.0250.
Using symmetry, we get
Φ(−1.67) = P (Z < −1.67) = P (Z > 1.67) = 0.0475,
Φ(1.67) = P (Z < 1.67) = 1 − P (Z > 1.67) = 1 − 0.0475 = 0.9525,
P (0 < Z < 1.67) = P (Z < 1.67)−P (Z < 0) = 0.9525−0.5 = 0.4525,
P (−1.96 < Z < 1.96) = 1 − 2P (Z > 1.96) = 1 − 2 × 0.0250 = 0.95.
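These tabled values can be reproduced from the error function, since Φ(z) = (1 + erf(z/√2))/2; a Python sketch (the `Phi` helper is illustrative):

```python
from math import erf, sqrt

def Phi(z):
    # standard normal cdf expressed through the error function
    return 0.5 * (1 + erf(z / sqrt(2)))

assert abs((1 - Phi(1.67)) - 0.0475) < 5e-4     # P(Z > 1.67)
assert abs((1 - Phi(1.96)) - 0.0250) < 5e-4     # P(Z > 1.96)
assert abs((Phi(1.67) - Phi(0)) - 0.4525) < 5e-4  # P(0 < Z < 1.67)
assert abs((Phi(1.96) - Phi(-1.96)) - 0.95) < 1e-3
```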
32 / 71
Quantile
Let Y be a random variable with cdf F (y). For each p strictly between 0
and 1, define F −1 (p) to be the smallest value y such that F (y) ≥ p. Then
F −1 (p) is called the p quantile of Y or the 100p-th percentile of Y .
If Y is a continuous random variable, p quantile of Y or the 100p-th
percentile of Y is the value y such that F (y) = p.
For example, the median (50th percentile) of N (µ, σ²) is µ and the
median (50th percentile) of U (a, b) is (a + b)/2.
33 / 71
z-Table
z-Table can also be used to find quantiles.
The (lower) p quantile function of Z is Φ−1 (p).
For example, Φ−1 (0.9525) = 1.67 and Φ−1 (0.0475) = −1.67.
The upper p quantile function of Z is zp = Φ−1 (1 − p), i.e.
P (Z < zp ) = 1 − p so that P (Z > zp ) = p.
For example, z0.0475 = 1.67
and z0.025 = 1.96 (a common fact everyone should memorise).
Note: When looking up a probability in order to find zp , look up the closest
probability in the table, or if the probability lies exactly in the middle
between two probabilities in the table, choose the mid-point of the two
corresponding z-values as zp .
34 / 71
Normal Distribution
Theorem 7: If Y ∼ N (µ, σ 2 ), then the linear transformation
Z = (Y − µ)/σ ∼ N (0, 1)
standardizes Y to be a N (0, 1) random variable.
Note: “Standardizing” a random variable usually means subtracting its mean
and then (after that) dividing by the random variable’s standard deviation.
35 / 71
Normal Distribution
Example 7: Suppose that Y ∼ N (10, 16). Find P (7 < Y < 11).
Solution: Since Y ∼ N (10, 16), we have Z = (Y − 10)/4 ∼ N (0, 1).
Then,
P (7 < Y < 11) = P ((7 − 10)/4 < (Y − 10)/4 < (11 − 10)/4)
= P (−0.75 < Z < 0.25)
= P (Z < 0.25) − P (Z < −0.75)
= 1 − P (Z > 0.25) − P (Z > 0.75)
= 1 − 0.4013 − 0.2266
= 0.3721.
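The standardization in Example 7 can be checked against the error-function form of Φ; this is an illustrative sketch, not part of the original slides.

```python
from math import erf, sqrt

def Phi(z):
    # standard normal cdf via the error function
    return 0.5 * (1 + erf(z / sqrt(2)))

# P(7 < Y < 11) for Y ~ N(10, 16): standardize with mu = 10, sigma = 4
mu, sigma = 10, 4
p = Phi((11 - mu) / sigma) - Phi((7 - mu) / sigma)
assert abs(p - 0.3721) < 5e-4
```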
36 / 71
More Discussion
• The normal distribution is symmetric and bell-shaped. The mean, median
and mode (the value at which f (y) is maximized) are all equal to µ.
• If the histogram of a sample is bell-shaped and quite symmetric, then
the normal distribution can be used to model such data.
• We can use the property that ∫_{−∞}^{∞} (1/(√(2π) σ)) e^{−(y−µ)²/(2σ²)} dy = 1 to
calculate integrals of the form ∫_{−∞}^{∞} e^{−ay²+by} dy, where a > 0.
More properties about normal distribution will be discussed in the
following chapters.
37 / 71
Gamma Distribution
38 / 71
Gamma Distribution
A random variable Y is said to have a gamma distribution with
parameters α > 0 and β > 0 if and only if the pdf of Y is
f (y) = y^{α−1} e^{−y/β}/(β^α Γ(α)), 0 ≤ y < ∞,
where
Γ(α) = ∫_{0}^{∞} t^{α−1} e^{−t} dt
(gamma function).
We write Y ∼ Gamma(α, β), or Y ∼ G(α, β).
Note: We call α the shape parameter and β the scale parameter.
39 / 71
Gamma Distribution
Properties of the gamma function:
Γ(α) = (α − 1)Γ(α − 1) for any α > 1.
Γ(α) = (α − 1)! for any positive integer α.
Γ(1/2) = √π = 1.7725 (to four decimals).
Γ(1.5) = 0.5Γ(0.5) = 0.5√π = 0.8862 (to four decimals).
Γ(2.5) = 1.5Γ(1.5) = 1.3293 (to four decimals).
Γ(α) has a minimum of 0.8856 at α = 1.47 (to two decimals).
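These special values and the recursion can be confirmed directly with the standard-library gamma function:

```python
from math import gamma, pi, sqrt

# Recursion and special values listed above
assert abs(gamma(5) - 24) < 1e-9             # Gamma(5) = 4! = 24
assert abs(gamma(0.5) - sqrt(pi)) < 1e-12    # Gamma(1/2) = sqrt(pi)
assert abs(gamma(1.5) - 0.5 * sqrt(pi)) < 1e-12
assert abs(gamma(2.5) - 1.5 * gamma(1.5)) < 1e-12  # Gamma(a) = (a-1)Gamma(a-1)
assert abs(gamma(2.5) - 1.3293) < 5e-4
```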
40 / 71
Gamma Distribution
Note: Nonnegative and right skewed.
41 / 71
Gamma Distribution
Theorem 8: Let Y ∼ G(α, β). Then
µ = E(Y ) = αβ and σ 2 = V ar(Y ) = αβ 2 .
Theorem 9: Let Y ∼ G(α, β) and let k > 0 be a constant. Then
X = kY ∼ G(α, kβ)
42 / 71
Chi-square Distribution
If Y ∼ G(m/2, 2), we say that Y has the chi-square distribution with
parameter m. We write Y ∼ χ2 (m) and call m the degrees of freedom.
Theorem 10: Let Y ∼ χ2 (m). Then
µ = E(Y ) = m and σ 2 = V ar(Y ) = 2m.
Note: Quantiles of χ2 (m) can be obtained from Table 6 in the “statistical
table” file.
43 / 71
Exponential Distribution
If Y ∼ G(1, β), we say that Y has the exponential distribution with
parameter β. We write Y ∼ Exp(β).
The pdf of Y is f (y) = (1/β) e^{−y/β}, y > 0.
If Y ∼ Exp(1), we say Y has the standard exponential distribution.
Theorem 11: Let Y ∼ Exp(β). Then
µ = E(Y ) = β and σ 2 = V ar(Y ) = β 2 .
Note: Exp(2) = G(1, 2) = χ2 (2).
44 / 71
Exponential Distribution
Example 8: Find the cdf of Y ∼ Exp(β). Then show that if a > 0 and
b > 0, P (Y > a + b|Y > a) = P (Y > b).
Solution: The cdf of Y is
F (y) = ∫_{0}^{y} (1/β) e^{−t/β} dt = [−e^{−t/β}]_{0}^{y} = 1 − e^{−y/β}, y > 0.
We have P (Y > y) = 1 − P (Y ≤ y) = 1 − F (y) = e^{−y/β}, y > 0.
Therefore,
P (Y > a + b | Y > a) = P ({Y > a + b} ∩ {Y > a})/P (Y > a) = P (Y > a + b)/P (Y > a)
= e^{−(a+b)/β}/e^{−a/β} = e^{−b/β} = P (Y > b).
The conditional probability does not depend on the past information a.
45 / 71
Exponential Distribution
Memoryless Property: P (Y > a + b|Y > a) = P (Y > b).
The exponential distribution is memoryless because the past has no
impact on its future behaviour.
For example, suppose that jobs in our system have exponentially
distributed service times. If we have a job that’s been running for one hour,
the probability that a job runs for two additional hours is the same as the
probability that it ran for two hours originally, regardless of how long it’s
been running.
Every instant is like the beginning of a new random period, which has the
same distribution regardless of how much time has already elapsed.
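The memoryless identity from Example 8 can be verified numerically from the survival function P(Y > y) = e^{−y/β}; the values of β, a and b below are arbitrary illustrative choices.

```python
from math import exp

beta = 2.0          # illustrative scale parameter

def surv(y):
    # P(Y > y) = e^{-y/beta} for Y ~ Exp(beta)
    return exp(-y / beta)

a, b = 1.5, 3.0     # arbitrary positive values
lhs = surv(a + b) / surv(a)   # P(Y > a + b | Y > a)
rhs = surv(b)                 # P(Y > b)
assert abs(lhs - rhs) < 1e-12  # memoryless property
```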
46 / 71
More Discussion
Gamma Distribution:
• The pdf is right skewed, so the gamma distribution is often used for
modeling nonnegative skewed data, such as the sizes of insurance claims
and rainfall amounts.
• In Bayesian analysis, the gamma distribution is widely used as a
conjugate prior for some parameters. We can also use the property that
∫_{0}^{∞} y^{α−1} e^{−y/β}/(β^α Γ(α)) dy = 1 to calculate the integral
∫_{0}^{∞} y^a e^{−by} dy, where a > −1, b > 0.
47 / 71
More Discussion
Chi-Squared Distribution:
• One of the most widely used probability distributions in inferential
statistics, notably in hypothesis testing and construction of confidence
intervals.
• The sum of squares of m independent standard normal random variables
follows a chi-squared distribution with m degrees of freedom, i.e.
Y = Σ_{i=1}^{m} Z_i² ∼ χ²(m), where Z_i, i = 1, . . . , m, are independent
standard normal random variables.
48 / 71
More Discussion
Exponential Distribution:
• The exponential distribution is often used to model waiting times.
• For example, in queuing theory, the service times of agents in a system
(e.g. how long it takes for a bank teller to serve a customer) are often
modeled as exponentially distributed random variables.
49 / 71
Beta Distribution
50 / 71
Beta Distribution
A random variable Y is said to have a beta distribution with parameters
α > 0 and β > 0 if and only if the pdf of Y is
f (y) = y^{α−1}(1 − y)^{β−1}/B(α, β), 0 < y < 1,
where
B(α, β) = Γ(α)Γ(β)/Γ(α + β)
(beta function).
We write Y ∼ Beta(α, β).
Note: If α = β = 1, f (y) = 1, 0 < y < 1. Thus, Beta(1, 1) = U (0, 1).
51 / 71
Beta Distribution
52 / 71
Beta Distribution
Theorem 12: Let Y ∼ Beta(α, β). Then
µ = E(Y ) = α/(α + β) and σ² = Var(Y ) = αβ/((α + β)²(α + β + 1)).
53 / 71
Beta Distribution
Example 9: A gasoline wholesale distributor has bulk storage tanks that
hold fixed supplies and are filled every Monday. Of interest to the wholesaler
is the proportion of this supply that is sold during the week. Over many
weeks of observation, the distributor found that this proportion could be
modeled by a beta distribution with α = 4 and β = 2. Find the probability
that the wholesaler will sell at least 90% of her stock in a given week.
Solution: If Y denotes the proportion sold during the week, then
Y ∼ Beta(4, 2). So
f (y) = y^{4−1}(1 − y)^{2−1}/B(4, 2) = (Γ(6)/(Γ(4)Γ(2)))(y³ − y⁴) = 20(y³ − y⁴), 0 < y < 1.
P (Y > 0.9) = ∫_{0.9}^{1} 20(y³ − y⁴) dy = 20[(1/4)y⁴ − (1/5)y⁵]_{0.9}^{1} = 0.08.
It is not very likely that 90% of the stock will be sold in a given week.
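The tail probability in Example 9 follows from the antiderivative of the Beta(4, 2) pdf; a quick check (the helper name is illustrative):

```python
def F_beta42(y):
    # antiderivative of the Beta(4, 2) pdf 20(y^3 - y^4): 20(y^4/4 - y^5/5)
    return 20 * (y**4 / 4 - y**5 / 5)

p = F_beta42(1.0) - F_beta42(0.9)   # P(Y > 0.9)
assert abs(p - 0.08146) < 1e-4      # rounds to the slides' 0.08
```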
54 / 71
More Discussion
• Often used to model proportion or percentage data, since the support of
the beta distribution is (0, 1).
• The beta distribution has an important application in the theory of
order statistics of uniform distributions.
• In Bayesian analysis, the beta distribution is widely used as a conjugate
prior of parameter p for the Bernoulli, binomial, negative binomial and
geometric distributions.
• We can use the property that ∫_{0}^{1} y^{α−1}(1 − y)^{β−1}/B(α, β) dy = 1 to calculate
the integral ∫_{0}^{1} y^a (1 − y)^b dy, where a > −1, b > −1.
55 / 71
Moment Generating Functions
56 / 71
Moment Generating Functions
The moment generating function (mgf) of a random variable Y is defined
to be m(t) = E(etY ). It exists if there is a constant b > 0 such that m(t) is
finite for |t| ≤ b.
Theorem 13: Let Y be a continuous random variable with pdf f (y) and
g(Y ) be a function of Y . Then, the moment generating function of g(Y ) is
E[e^{tg(Y )}] = ∫_{−∞}^{∞} e^{tg(y)} f (y) dy.
57 / 71
Moment Generating Functions
Two important applications:
1. To compute raw moments, according to the equation µ′_k = m^{(k)}(0).
Here, m^{(k)}(0) denotes the kth derivative of m(t) evaluated at t = 0, i.e.
d^k m(t)/dt^k |_{t=0}. We may also write m^{(0)}(t) as m(t), m^{(1)}(t) as m′(t) and
m^{(2)}(t) as m′′(t), etc.
2. “If two random variables X and Y have the same mgf, then they also
have the same distribution”. It follows by “the uniqueness theorem”, a result
in pure mathematics.
58 / 71
Moment Generating Functions
Example 10: Find the moment generating function of a gamma distributed
random variable and calculate its µ′k .
Solution:
m(t) = E(e^{tY}) = ∫_{0}^{∞} e^{ty} · y^{α−1} e^{−y/β}/(β^α Γ(α)) dy
= (1/(β^α Γ(α))) ∫_{0}^{∞} y^{α−1} exp[−y(1/β − t)] dy
= (1/(β^α Γ(α))) ∫_{0}^{∞} y^{α−1} exp[−y/{β/(1 − βt)}] dy
= ({β/(1 − βt)}^α/β^α) ∫_{0}^{∞} (1/({β/(1 − βt)}^α Γ(α))) y^{α−1} exp[−y/{β/(1 − βt)}] dy
= ({β/(1 − βt)}^α/β^α) × 1
= 1/(1 − βt)^α = (1 − βt)^{−α}, t < 1/β,
where (1/({β/(1 − βt)}^α Γ(α))) y^{α−1} exp[−y/{β/(1 − βt)}] is the pdf of G(α, β/(1 − βt)).
59 / 71
Moment Generating Functions
Solution (continued): Since m(t) = (1 − βt)^{−α}, we have
m′(t) = dm(t)/dt = (−α)(1 − βt)^{−(α+1)}(−β) = αβ(1 − βt)^{−(α+1)},
m′′(t) = dm′(t)/dt = −(α + 1)αβ(1 − βt)^{−(α+2)}(−β) = α(α + 1)β²(1 − βt)^{−(α+2)},
m^{(3)}(t) = dm′′(t)/dt = α(α + 1)(α + 2)β³(1 − βt)^{−(α+3)}.
In general,
m^{(k)}(t) = d^k m(t)/dt^k = α(α + 1) · · · (α + k − 1)β^k (1 − βt)^{−(α+k)}.
Thus, we have µ′_k = m^{(k)}(0) = α(α + 1) · · · (α + k − 1)β^k.
For example, µ = µ′_1 = αβ and µ′_2 = α(α + 1)β² so that
σ² = µ′_2 − µ² = α(α + 1)β² − α²β² = αβ².
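The moment formula µ′_k = α(α+1)···(α+k−1)β^k obtained from the mgf can be cross-checked against direct numerical integration of y^k f(y); the parameter values and integration settings below are illustrative choices.

```python
from math import gamma, exp

alpha, beta = 3.0, 2.0   # illustrative parameter values

def pdf(y):
    # G(alpha, beta) density
    return y**(alpha - 1) * exp(-y / beta) / (beta**alpha * gamma(alpha))

def raw_moment(k, n=100_000, upper=100.0):
    # midpoint-rule approximation of E(Y^k); tail beyond `upper` is negligible
    h = upper / n
    return sum(((i + 0.5) * h)**k * pdf((i + 0.5) * h) for i in range(n)) * h

def mgf_moment(k):
    # mu'_k = alpha (alpha+1) ... (alpha+k-1) beta^k from the mgf derivation
    prod = 1.0
    for j in range(k):
        prod *= alpha + j
    return prod * beta**k

assert abs(raw_moment(1) - mgf_moment(1)) < 1e-3   # alpha * beta = 6
assert abs(raw_moment(2) - mgf_moment(2)) < 1e-2   # alpha(alpha+1)beta^2 = 48
```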
60 / 71
Summary of Continuous Random Variables
The table is from page 838 of: Wackerly, Mendenhall and Scheaffer (2008), Mathematical Statistics with Applications.
61 / 71
Chebyshev’s Theorem
62 / 71
Chebyshev’s Theorem (Review)
Theorem 14: Let Y be a random variable with mean µ and finite variance
σ². Then, for any constant k > 0,
P (|Y − µ| < kσ) ≥ 1 − 1/k²  or  P (|Y − µ| ≥ kσ) ≤ 1/k².
Note:
The result applies to any probability distribution (both discrete and
continuous).
63 / 71
Chebyshev’s Theorem
Example 11: Suppose that experience has shown that the length of time Y
(in minutes) required to conduct a periodic maintenance check on a dictating
machine follows a gamma distribution with α = 3.1 and β = 2. A new
maintenance worker takes 22.5 minutes to check the machine. Does this
length of time to perform a maintenance check disagree with prior
experience?
Solution:
Since Y ∼ G(3.1, 2), we have µ = αβ = 6.2 and σ² = αβ² = 12.4. It
follows that σ = √12.4 = 3.52.
By Chebyshev's Theorem, we have
P (Y ≥ 22.5 or Y ≤ −10.1) = P (|Y − µ| ≥ 4.63σ) ≤ 1/4.63² = 0.0466.
64 / 71
Chebyshev’s Theorem
Solution (continued):
Therefore, P (Y ≥ 22.5) ≤ P (Y ≥ 22.5 or Y ≤ −10.1) ≤ 0.0466.
This probability is based on the assumption that the distribution of
maintenance times has not changed from prior experience. Since
P (Y ≥ 22.5) is small, we must conclude either that our new maintenance
worker has by chance generated a lengthy maintenance time that occurs with
low probability, or that the new worker is somewhat slower than preceding
ones. Considering the low probability of P (Y ≥ 22.5), we favor the latter view.
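The numbers in Example 11 follow directly from the gamma moment formulas and the Chebyshev bound; a quick check:

```python
from math import sqrt

alpha, beta = 3.1, 2.0
mu = alpha * beta                 # 6.2
sigma = sqrt(alpha * beta**2)     # sqrt(12.4), about 3.52

k = (22.5 - mu) / sigma           # standard deviations from the mean
bound = 1 / k**2                  # Chebyshev bound on P(|Y - mu| >= k sigma)
assert abs(k - 4.63) < 0.01
assert abs(bound - 0.0466) < 5e-4
```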
65 / 71
Mixed Distribution
66 / 71
Mixed Distribution
A random variable Y is mixed and has a mixed distribution if its cdf
increases continuously over some intervals and also has jumps at some
isolated points with positive probabilities.
Equivalently, Y is mixed if its cdf has the form
F (y) = cFX (y) + (1 − c)FZ (y),
where 0 < c < 1, FX (y) is the cdf of a discrete random variable X, and FZ (y)
is the cdf of a continuous random variable Z.
Note: Y has a mixed distribution if it is neither discrete nor continuous but a
"mixture" of those two kinds, in the sense that the cdf of Y is a weighted
average of a discrete cdf and a continuous cdf.
67 / 71
Mixed Distribution
Theorem 15: If Y is mixed with cdf F (y) = cFX (y) + (1 − c)FZ (y),
where 0 < c < 1, FX (y) is the cdf of a discrete random variable X, FZ (y)
is the cdf of a continuous random variable Z, then:
E(Y ) = cE(X) + (1 − c)E(Z).
And for any function g(Y ),
E[g(Y )] = cE[g(X)] + (1 − c)E[g(Z)].
68 / 71
Mixed Distribution
Example 12: A light bulb has a 20% chance of failing immediately, and
otherwise its lifetime follows an exponential distribution with mean 100
hours. Find the cdf, mean and standard deviation of Y , the overall lifetime
of the lightbulb.
Solution: Let Y be the overall lifetime of the lightbulb (in hundreds of hours), so that Y is nonnegative.
Since a light bulb has a 20% chance of failing immediately, we know P (Y = 0) = 0.2 so that
P (Y ̸= 0) = P (Y > 0) = 0.8.
For any y < 0, F (y) = 0. For any y ≥ 0, we have
F (y) = P (Y ≤ y)
= P (Y ≤ y|Y = 0)P (Y = 0) + P (Y ≤ y|Y > 0)P (Y > 0)
= 1 × 0.2 + (1 − e^{−y}) × 0.8 = 0.2 + 0.8(1 − e^{−y}).
69 / 71
Mixed Distribution
Solution (continued):
Then, F (y) = 0.2FX (y) + (1 − 0.2)FZ (y),
where X ∼ Bern(0) (i.e. X = 0 with probability 1) and Z ∼ Exp(1).
It is easy to get E(X) = 0, E(X²) = 0, E(Z) = 1,
E(Z²) = Var(Z) + [E(Z)]² = 1 + 1² = 2.
Thus,
E(Y ) = 0.2E(X) + 0.8E(Z) = 0.8 (i.e. 80 hours),
E(Y ²) = 0.2E(X²) + 0.8E(Z²) = 1.6,
σ² = Var(Y ) = E(Y ²) − [E(Y )]² = 1.6 − 0.8² = 0.96,
σ = √0.96 = 0.9798 (i.e. 97.98 hours).
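The weighted-average moment rule of Theorem 15 makes these numbers a short computation; an illustrative sketch:

```python
from math import sqrt

c = 0.2                 # weight on the discrete part (immediate failure)
EX, EX2 = 0.0, 0.0      # X is 0 with probability 1
EZ, EZ2 = 1.0, 2.0      # Z ~ Exp(1): E(Z) = 1, E(Z^2) = Var(Z) + E(Z)^2 = 2

# Mixed-distribution moments are weighted averages of the component moments
EY = c * EX + (1 - c) * EZ
EY2 = c * EX2 + (1 - c) * EZ2
var = EY2 - EY**2
assert abs(EY - 0.8) < 1e-12
assert abs(var - 0.96) < 1e-12
assert abs(sqrt(var) - 0.9798) < 1e-4
```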
70 / 71
Conclusion
• Be able to distinguish different continuous random variables and apply
those in real applications.
• Be familiar with the probability density function, expected value,
variance and moment generating function of each commonly used
continuous random variable.
• Know how to calculate expected value, variance, quantile and moment
generating function by using definitions and properties.
• Be able to apply Chebyshev's Theorem to bound probabilities.
• Know how to obtain the cdf, expected value and variance for mixed
distribution.
71 / 71
Chapter 5: Multivariate Probability Distributions
STAT6039 Principles of Mathematical Statistics
Bivariate and Multivariate
Random Variable
1 / 65
Joint Probability Mass Function
Example 1: A die is rolled once. Let X = number of 6s and Y = number of
even numbers. Find the joint probability P (X = x, Y = y) where x, y are
the possible values of X and Y , respectively.
Solution:
Outcome:     1  2  3  4  5  6
Value of X:  0  0  0  0  0  1
Value of Y:  0  1  0  1  0  1
P (X = 1, Y = 1) = P ({6}) = 1/6
P (X = 0, Y = 1) = P ({2}) + P ({4}) = 1/3
P (X = 0, Y = 0) = P ({1}) + P ({3}) + P ({5}) = 1/2
2 / 65
Joint Probability Mass Function
Let X and Y be discrete random variables. The joint (or bivariate)
probability (mass) function of X and Y is defined to be
f (x, y) = P (X = x, Y = y), −∞ < x < ∞, −∞ < y < ∞.
Theorem 1: If X and Y are discrete random variables with joint pmf
f (x, y), then
1. 0 ≤ f (x, y) ≤ 1 for all x, y.
2. Σ_{all x,y} f (x, y) = 1, where the summation is over all values (x, y) that
are assigned nonzero probabilities.
Note: We may also write this function as fX,Y (x, y) or p(x, y) or
pX,Y (x, y) and refer to it as the joint pmf of X and Y .
3 / 65
Joint Probability Mass Function
Example 2: A die is rolled once. Let X = number of 6s and Y = number of
even numbers. Find the joint pmf of X and Y .
Solution:
The joint pmf of X and Y is
f (x, y) = 1/2, for x = y = 0,
         = 1/3, for x = 0, y = 1,
         = 1/6, for x = y = 1.
Table:
f (x, y)   y = 0   y = 1
x = 0      1/2     1/3
x = 1              1/6
4 / 65
Joint Cumulative Distribution Function
For any random variables X and Y , the joint (or bivariate) cumulative
distribution function is defined to be
F (x, y) = P (X ≤ x, Y ≤ y), for − ∞ < x < ∞, −∞ < y < ∞. We may
also write it as FX,Y (x, y).
Note:
For two discrete variables X and Y , F (x, y) is given by
F (x1, y1) = Σ_{x≤x1} Σ_{y≤y1} f (x, y).
5 / 65
Joint Cumulative Distribution Function
Example 3: Refer to Example 2. Find the joint cdf of X and Y .
Solution: The joint cdf of X and Y is
F (x, y) = 0,   for x < 0 or y < 0 (or both),
         = 1/2, for x ≥ 0, 0 ≤ y < 1,
         = 5/6, for 0 ≤ x < 1, y ≥ 1,
         = 1,   for x ≥ 1, y ≥ 1.
6 / 65
Joint Cumulative Distribution Function
Four properties of joint cdfs:
1. F (x, y) → 0 as x → −∞, or y → −∞ or both.
2. F (x, y) → 1 as x → ∞ and y → ∞.
3. F (x, y) is a nondecreasing function in both x and y directions.
4. F (x, y) is right continuous in both x and y directions.
7 / 65
Joint Probability Density Function
Let X and Y be continuous random variables with joint cdf F (x, y). If
there exists a nonnegative function f (x, y), such that
F (x, y) = ∫_{−∞}^{x} ∫_{−∞}^{y} f (t1, t2) dt2 dt1
for all −∞ < x < ∞, −∞ < y < ∞, then X and Y are said to be jointly
continuous random variables. The function f (x, y) is called the joint (or
bivariate) probability density function.
Note: The joint pdf f (x, y) of X and Y can be obtained from its joint cdf:
f (x, y) = ∂²F (x, y)/(∂x∂y), wherever the derivative exists.
8 / 65
Joint Probability Density Function
Theorem 2: If X and Y are jointly continuous random variables with joint
pdf f (x, y), then
1. f (x, y) ≥ 0 for all x, y.
2. ∫_{−∞}^{∞} ∫_{−∞}^{∞} f (x, y) dxdy = 1 (the entire volume under the density
surface is 1).
Theorem 3: If X and Y are jointly continuous random variables with joint
pdf f (x, y), then
P (x1 ≤ X ≤ x2, y1 ≤ Y ≤ y2) = ∫_{x1}^{x2} ∫_{y1}^{y2} f (x, y) dydx
(a volume under the joint pdf).
9 / 65
Joint Probability Density Function
10 / 65
Joint Probability Density Function
Example 4: Suppose that X and Y are two continuous random variables
with joint pdf f (x, y) = cxy, 0 < x < 2y < 4. Find P (X > 1, Y < 1).
Solution: First, we need to find c.
1 = ∫_{0}^{2} ∫_{0}^{2y} cxy dxdy = ∫_{0}^{2} [cx²/2]_{0}^{2y} y dy = ∫_{0}^{2} 2cy³ dy = [cy⁴/2]_{0}^{2} = 8c.
Thus, c = 1/8 and f (x, y) = (1/8)xy, 0 < x < 2y < 4.
P (X > 1, Y < 1) = ∫_{1/2}^{1} ∫_{1}^{2y} (1/8)xy dxdy = ∫_{1/2}^{1} [x²/16]_{1}^{2y} y dy
= ∫_{1/2}^{1} (4y³ − y)/16 dy = 9/256.
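A crude two-dimensional midpoint rule confirms both the normalizing constant and P(X > 1, Y < 1); the grid size below is an arbitrary accuracy/speed trade-off.

```python
def f(x, y):
    # joint pdf from Example 4: f(x, y) = xy/8 on 0 < x < 2y < 4
    return x * y / 8 if 0 < x < 2 * y < 4 else 0.0

def dbl_integral(xlo, xhi, ylo, yhi, n=500):
    # midpoint rule on an n-by-n grid; crude but adequate for a sanity check
    hx, hy = (xhi - xlo) / n, (yhi - ylo) / n
    return sum(f(xlo + (i + 0.5) * hx, ylo + (j + 0.5) * hy)
               for i in range(n) for j in range(n)) * hx * hy

assert abs(dbl_integral(0, 4, 0, 2) - 1) < 0.02        # total probability
assert abs(dbl_integral(1, 4, 0, 1) - 9 / 256) < 2e-3  # P(X > 1, Y < 1)
```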
11 / 65
Multivariate Random Variable
The joint cdf of more than one random variable is called a
multivariate cdf.
Suppose we have n random variables Y1 , Y2 , · · · , Yn , then the joint cdf is
defined to be
F (y1 , y2 , · · · , yn ) = P (Y1 ≤ y1 , Y2 ≤ y2 , · · · , Yn ≤ yn ).
Denote Y = (Y1 , Y2 , · · · , Yn )⊤ and let y = (y1 , y2 · · · , yn )⊤ , then the
cdf of Y becomes F (y) = F (y1 , y2 , · · · , yn ), which is defined on
n-dimensional space Rn . We call Y the multivariate random variable. If
n = 2, we often say it is a bivariate random variable.
12 / 65
Multivariate Random Variable
Discrete multivariate random variable:
If multivariate random variable Y = (Y1 , Y2 , · · · , Yn )⊤ can only take a
finite number or a countably infinite sequence of different possible values
(y1 , y2 , · · · , yn )⊤ in Rn , it is a discrete multivariate random variable.
Equivalently, if Y1 , Y2 , · · · , Yn are all discrete random variables, the
vector Y = (Y1 , Y2 , · · · , Yn )⊤ is a discrete multivariate random variable.
The pmf of Y or the joint pmf of Y1, Y2, · · · , Yn is defined to be
P (Y = y) = f (y) = P (Y1 = y1, Y2 = y2, · · · , Yn = yn) = f (y1, y2, · · · , yn).
13 / 65
Multivariate Random Variable
Continuous multivariate random variable:
If Y1 , Y2 , · · · , Yn are all continuous random variables, the vector
Y = (Y1 , Y2 , · · · , Yn )⊤ is a continuous multivariate random variable.
The nonnegative function f (y1 , y2 , · · · , yn ), such that
F (y) = F (y1, y2, · · · , yn) = ∫_{−∞}^{y1} ∫_{−∞}^{y2} · · · ∫_{−∞}^{yn} f (t1, t2, · · · , tn) dtn · · · dt2 dt1,
is said to be the pdf of Y or the joint pdf of Y1 , Y2 , · · · , Yn .
The pdf of Y or the joint pdf of Y1, Y2, · · · , Yn can be derived from
f (y) = f (y1, y2, · · · , yn) = ∂ⁿF (y1, y2, · · · , yn)/(∂y1 ∂y2 · · · ∂yn)
at all points y = (y1 , · · · , yn ) where the derivative exists.
14 / 65
Marginal and Conditional
Probability Distributions
15 / 65
Marginal Probability Distributions
Let X and Y be jointly discrete random variables with joint pmf
f (x, y). Then the marginal probability mass functions of X and Y ,
respectively, are given by
fX (x) = Σ_{all y} f (x, y) and fY (y) = Σ_{all x} f (x, y).
Let X and Y be jointly continuous random variables with joint pdf
f (x, y). Then the marginal density functions of X and Y , respectively, are
given by
fX (x) = ∫_{−∞}^{∞} f (x, y) dy and fY (y) = ∫_{−∞}^{∞} f (x, y) dx.
16 / 65
Marginal Probability Distributions
Example 5: The joint pmf of X and Y is given below. Find the
marginal pmfs of X and Y .
f (x, y)   y = 0   y = 1
x = 0      1/2     1/3
x = 1              1/6
Solution:
fX (0) = Σ_{all y} f (0, y) = f (0, 0) + f (0, 1) = 1/2 + 1/3 = 5/6,
fX (1) = Σ_{all y} f (1, y) = f (1, 1) = 1/6,
fY (0) = Σ_{all x} f (x, 0) = f (0, 0) = 1/2,
fY (1) = Σ_{all x} f (x, 1) = f (0, 1) + f (1, 1) = 1/3 + 1/6 = 1/2.
Therefore, X ∼ Bern(1/6) and Y ∼ Bern(1/2).
17 / 65
Marginal Probability Distributions
Example 5 (continued):
Solution (continued):
Equivalently, what we have done is to compute column and row totals.
f (x, y)   y = 0   y = 1   fX (x)
x = 0      1/2     1/3     5/6
x = 1              1/6     1/6
fY (y)     1/2     1/2
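Computing marginals as row and column totals translates directly into code; the sketch below stores the joint pmf of Example 5 as a dictionary, with illustrative helper names.

```python
joint = {(0, 0): 1/2, (0, 1): 1/3, (1, 1): 1/6}  # joint pmf from Example 5

def marginal(pmf, axis):
    # sum the joint pmf over the other coordinate (row/column totals)
    out = {}
    for pair, p in pmf.items():
        out[pair[axis]] = out.get(pair[axis], 0.0) + p
    return out

fX = marginal(joint, 0)   # marginal pmf of X
fY = marginal(joint, 1)   # marginal pmf of Y
assert abs(fX[0] - 5/6) < 1e-12 and abs(fX[1] - 1/6) < 1e-12
assert abs(fY[0] - 1/2) < 1e-12 and abs(fY[1] - 1/2) < 1e-12
```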
18 / 65
Marginal Probability Distributions
Example 6: Suppose that X and Y are two continuous random variables
with joint pdf f (x, y) = (1/8)xy, 0 < x < 2y < 4. Find the marginal pdfs of X and
Y , respectively.
Solution:
fX (x) = ∫_{x/2}^{2} (1/8)xy dy = (x/8)[y²/2]_{x/2}^{2} = x/4 − x³/64, 0 < x < 4.
fY (y) = ∫_{0}^{2y} (1/8)xy dx = (y/8)[x²/2]_{0}^{2y} = y³/4, 0 < y < 2.
19 / 65
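The marginal densities of Example 6 can be spot-checked numerically. The sketch below (an illustration, not from the slides) approximates each integral with a midpoint Riemann sum and compares against the closed forms x/4 − x³/64 and y³/4:

```python
# Illustrative check of Example 6: integrate f(x, y) = xy/8 over 0 < x < 2y < 4.
def f(x, y):
    return x * y / 8 if 0 < x < 2 * y < 4 else 0.0

def fX(x, n=20000):          # integrate over y from x/2 to 2
    h = (2 - x / 2) / n
    return sum(f(x, x / 2 + (i + 0.5) * h) for i in range(n)) * h

def fY(y, n=20000):          # integrate over x from 0 to 2y
    h = 2 * y / n
    return sum(f((i + 0.5) * h, y) for i in range(n)) * h

assert abs(fX(1.0) - (1.0 / 4 - 1.0 ** 3 / 64)) < 1e-6
assert abs(fY(1.5) - 1.5 ** 3 / 4) < 1e-6
```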
Marginal Probability Distributions
Marginal pmf (discrete):
f1 (y1 ) = Σ_{all y2} · · · Σ_{all yn} f (y1 , y2 , · · · , yn ),
f13 (y1 , y3 ) = Σ_{all y2} Σ_{all y4} · · · Σ_{all yn} f (y1 , y2 , · · · , yn ).
Marginal pdf (continuous):
f1 (y1 ) = ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} f (y1 , y2 , · · · , yn ) dy2 · · · dyn ,
f13 (y1 , y3 ) = ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} f (y1 , y2 , · · · , yn ) dy2 dy4 · · · dyn .
Marginal cdf (discrete or continuous):
F1 (y1 ) = P (Y1 ≤ y1 ) = P (Y1 ≤ y1 , Y2 < ∞, · · · , Yn < ∞) = lim_{yj →∞, j=2,··· ,n} F (y1 , y2 , · · · , yn ),
F13 (y1 , y3 ) = P (Y1 ≤ y1 , Y3 ≤ y3 ) = P (Y1 ≤ y1 , Y2 < ∞, Y3 ≤ y3 , Y4 < ∞, · · · , Yn < ∞) = lim_{yj →∞, j=2,4,··· ,n} F (y1 , y2 , · · · , yn ).
20 / 65
Conditional Probability Distributions
If X and Y are jointly discrete random variables with joint pmf f (x, y)
and marginal pmfs fX (x) and fY (y), respectively, then the conditional
probability mass function of X given Y = y is
f (x|y) = P (X = x|Y = y) = P (X = x, Y = y)/P (Y = y) = f (x, y)/fY (y),
provided that P (Y = y) > 0. We may write it as fX|Y (x|y).
Conditional cdf of X given Y = y is defined accordingly
F (x|y) = P (X ≤ x|Y = y).
Note: Similarly, we can define f (y|x) and F (y|x).
21 / 65
Conditional Probability Distributions
Example 7: The joint and marginal pmfs of X and Y are given below.
Please find the conditional pmf of X given Y = 1.
f (x, y)   y = 0   y = 1   fX (x)
x = 0      1/2     1/3     5/6
x = 1      0       1/6     1/6
fY (y)     1/2     1/2
Solution: f (x|1) = f (x, 1)/fY (1) for x = 0, 1.
Explicitly, fX|Y (0|1) = f (0, 1)/fY (1) = (1/3)/(1/2) = 2/3 and fX|Y (1|1) = f (1, 1)/fY (1) = (1/6)/(1/2) = 1/3.
So f (x|1) = 2/3 for x = 0 and 1/3 for x = 1.
Thus, (X|Y = 1) ∼ Bern(1/3).
22 / 65
Conditional Probability Distributions
If X and Y are jointly continuous random variables with joint pdf
f (x, y) and marginal pdf fX (x) and fY (y), then the conditional cdf of X
given Y = y is
F (x|y) = P (X ≤ x|Y = y).
For any y such that fY (y) > 0, it can be shown that
F (x|y) = ∫_{−∞}^{x} f (t, y)/fY (y) dt.
Denote f (x|y) = f (x, y)/fY (y) and call it the conditional pdf of X given Y = y.
Note: Similarly, we can define f (y|x) and F (y|x).
23 / 65
Conditional Probability Distributions
Example 8: Suppose that X and Y are two continuous random variables
with joint pdf f (x, y) = (1/8)xy, 0 < x < 2y < 4. Find the conditional pdf of X
given Y = y.
Solution: From Example 6, we know fY (y) = y³/4, 0 < y < 2.
Thus,
f (x|y) = f (x, y)/fY (y) = (xy/8)/(y³/4) = x/(2y²),  0 < x < 2y < 4.
24 / 65
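A conditional pdf must integrate to 1 in x for each fixed y. The short check below (illustrative, not from the slides) verifies this for f(x|y) = x/(2y²) on 0 < x < 2y:

```python
# Illustrative check: the conditional pdf of Example 8 integrates to 1 in x.
def check(y, n=100000):
    """Midpoint Riemann sum of x / (2 y^2) over 0 < x < 2y."""
    h = 2 * y / n
    return sum((i + 0.5) * h / (2 * y * y) for i in range(n)) * h

for y in (0.5, 1.0, 1.9):
    assert abs(check(y) - 1.0) < 1e-6
```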
Conditional Probability Distributions
The definition of conditional pmf or pdf can be generalized to
multivariate case.
The conditional pmf or pdf of Y1 given Y2 = y2 , · · · , Yn = yn is
f (y1 |y2 , · · · , yn ) = f (y1 , · · · , yn ) / f2,··· ,n (y2 , · · · , yn ).
The joint conditional pmf or pdf of Y1 and Y3 given
Y2 = y2 , Y4 = y4 , · · · , Yn = yn is
f (y1 , y3 |y2 , y4 , · · · , yn ) = f (y1 , · · · , yn ) / f2,4,··· ,n (y2 , y4 , · · · , yn ).
25 / 65
Independent Random
Variables
26 / 65
Independence
Let X have cdf FX (x), Y have cdf FY (y), and X and Y have joint cdf
F (x, y). Then X and Y are said to be independent if and only if
F (x, y) = FX (x)FY (y)
for every pair of real numbers (x, y).
If X and Y are not independent, they are said to be dependent.
27 / 65
Independence
Theorem 4:
Random variables X and Y are independent if and only if for all pairs of
real numbers (x, y),
f (x, y) = fX (x)fY (y).
Note:
For discrete random variables, f (x, y) is the joint pmf, and fX (x) and fY (y) are the marginal pmfs.
For continuous random variables, f (x, y) is the joint pdf, and fX (x) and fY (y) are the marginal pdfs.
28 / 65
Independence
If f (x|y) = f (x) or f (y|x) = f (y) for all pairs of real numbers (x, y),
then X and Y are independent.
If f (x|y) ̸= f (x) or f (y|x) ̸= f (y) for some pair of real numbers (x, y),
then X and Y are dependent.
Note:
For discrete random variables, f (x|y) and f (y|x) are the conditional pmfs, and fX (x) and fY (y) are the marginal pmfs.
For continuous random variables, f (x|y) and f (y|x) are the conditional pdfs, and fX (x) and fY (y) are the marginal pdfs.
29 / 65
Independence
The definition of independence can be generalized to n dimensions.
Suppose that we have n random variables, Y1 , · · · , Yn , where Yi has cdf
Fi (yi ), for i = 1, 2, · · · , n; and where Y1 , · · · , Yn have joint cdf
F (y1 , y2 , · · · , yn ).
Then Y1 , · · · , Yn are independent if and only if
F (y1 , y2 , · · · , yn ) = F1 (y1 ) · · · Fn (yn )
or equivalently f (y1 , y2 , · · · , yn ) = f1 (y1 )f2 (y2 ) · · · fn (yn ),
for all real numbers y1 , y2 , · · · , yn , where fi (yi ) is the marginal pmf or pdf
of Yi .
30 / 65
Independence
Example 9: Let f (x, y) = 6xy², 0 ≤ x ≤ 1, 0 ≤ y ≤ 1. Show that X and Y
are independent.
Solution:
It is easy to get
fX (x) = ∫_{0}^{1} 6xy² dy = 2xy³ |_{0}^{1} = 2x,  0 ≤ x ≤ 1,
fY (y) = ∫_{0}^{1} 6xy² dx = 3x²y² |_{0}^{1} = 3y²,  0 ≤ y ≤ 1.
Hence, f (x, y) = fX (x)fY (y) for all real numbers (x, y).
Therefore, X and Y are independent.
31 / 65
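The factorization in Example 9 can be spot-checked on a grid of points. A small illustrative sketch (not part of the slides):

```python
# Illustrative spot-check of Example 9: f(x, y) = fX(x) fY(y) on a grid.
def f(x, y):  return 6 * x * y * y
def fX(x):    return 2 * x
def fY(y):    return 3 * y * y

pts = [(i / 10, j / 10) for i in range(11) for j in range(11)]
assert all(abs(f(x, y) - fX(x) * fY(y)) < 1e-12 for x, y in pts)
```

A grid check is of course not a proof, but it catches a mistaken marginal immediately.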
Expected Value and
Covariance
32 / 65
Expected Value of a Function of Multivariate R.V.
Let g(Y) = g(Y1 , · · · , Yn ) be a function of discrete multivariate random
variable Y = (Y1 , · · · , Yn )⊤ which has pmf f (y) = f (y1 , · · · , yn ).
Then the expected value of g(Y) is
E[g(Y)] = Σ_{all y} g(y)f (y) = Σ_{all yn} · · · Σ_{all y1} g(y1 , · · · , yn )f (y1 , · · · , yn ).
If Y1 , · · · , Yn are continuous random variables with joint pdf
f (y1 , y2 , · · · , yn ), then
E[g(Y)] = ∫ g(y)f (y) dy = ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} g(y1 , · · · , yn )f (y1 , · · · , yn ) dy1 · · · dyn .
33 / 65
Expected Value of a Function of Multivariate R.V.
Note:
Suppose Y1 , · · · , Yn are continuous random variables and we wish to find
the expected value of g(Y1 , · · · , Yn ) = Y1 . We have
E(Y1 ) = ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} y1 f (y1 , · · · , yn ) dy1 · · · dyn
= ∫_{−∞}^{∞} y1 [ ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} f (y1 , · · · , yn ) dy2 · · · dyn ] dy1   (by definition of marginal pdf)
= ∫_{−∞}^{∞} y1 f1 (y1 ) dy1 ,
which agrees with the definition in the univariate case.
34 / 65
Expected Value of a Function of Multivariate R.V.
Example 10: The joint and marginal pmfs of X and Y are given below.
Please find the expected value for XY, X, Y .
f (x, y)   y = 0   y = 1   fX (x)
x = 0      1/2     1/3     5/6
x = 1      0       1/6     1/6
fY (y)     1/2     1/2
Solution:
E(XY ) = Σ_{all x,y} xy f (x, y) = 0 × f (0, 0) + 0 × f (0, 1) + 1 × f (1, 1) = 1/6,
E(X) = Σ_{all x,y} x f (x, y) = 0 × f (0, 0) + 0 × f (0, 1) + 1 × f (1, 1) = 1/6,
E(Y ) = Σ_{all x,y} y f (x, y) = 0 × f (0, 0) + 1 × f (0, 1) + 1 × f (1, 1) = 1/2.
35 / 65
Expected Value of a Function of Multivariate R.V.
Example 11: Suppose that X and Y are two continuous random variables
with joint pdf f (x, y) = (1/8)xy, 0 < x < 2y < 4. Find the expected values of
XY , X and Y , respectively.
Solution:
E(XY ) = ∫_{0}^{2} ∫_{0}^{2y} xy · (1/8)xy dx dy = (1/8) ∫_{0}^{2} y² (x³/3)|_{0}^{2y} dy = (1/3) ∫_{0}^{2} y⁵ dy = (1/18) y⁶ |_{0}^{2} = 32/9,
E(X) = ∫_{0}^{2} ∫_{0}^{2y} x · (1/8)xy dx dy = (1/8) ∫_{0}^{2} y (x³/3)|_{0}^{2y} dy = (1/3) ∫_{0}^{2} y⁴ dy = (1/15) y⁵ |_{0}^{2} = 32/15,
E(Y ) = ∫_{0}^{2} ∫_{0}^{2y} y · (1/8)xy dx dy = (1/8) ∫_{0}^{2} y² (x²/2)|_{0}^{2y} dy = (1/4) ∫_{0}^{2} y⁴ dy = (1/20) y⁵ |_{0}^{2} = 8/5.
36 / 65
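The three double integrals of Example 11 can be verified numerically. The sketch below (illustrative, not from the slides) sums over a midpoint grid on the triangle 0 < x < 2y < 4:

```python
# Illustrative numeric check of Example 11's expectations over 0 < x < 2y < 4.
def expect(g, n=400):
    """Midpoint Riemann sum of g(x, y) * f(x, y) with f(x, y) = xy/8."""
    total, hy = 0.0, 2 / n
    for i in range(n):
        y = (i + 0.5) * hy
        hx = 2 * y / n          # inner range 0 < x < 2y depends on y
        for j in range(n):
            x = (j + 0.5) * hx
            total += g(x, y) * (x * y / 8) * hx
    return total * hy

assert abs(expect(lambda x, y: x * y) - 32 / 9) < 1e-3
assert abs(expect(lambda x, y: x) - 32 / 15) < 1e-3
assert abs(expect(lambda x, y: y) - 8 / 5) < 1e-3
```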
Properties of Expected Values
Theorem 5:
Let a, b be constants and let g(Y1 , · · · , Yn ), g1 (Y1 , · · · , Yn ),
g2 (Y1 , · · · , Yn ), · · · , gk (Y1 , · · · , Yn ) be functions of Y1 , · · · , Yn . Then the
following results hold:
1. E[ag(Y1 , · · · , Yn ) + b] = aE[g(Y1 , · · · , Yn )] + b.
2. E[g1 (Y1 , · · · , Yn ) + · · · + gk (Y1 , · · · , Yn )] =
E[g1 (Y1 , · · · , Yn )] + · · · + E[gk (Y1 , · · · , Yn )].
Proof: Similar to the proof of the corresponding properties of expected values in the univariate case.
37 / 65
Properties of Expected Values
Theorem 6: Let X and Y be independent random variables and g(X) and
h(Y ) be functions of only X and Y , respectively. Then
E[g(X)h(Y )] = E[g(X)]E[h(Y )].
Proof: Here we show the continuous case. The discrete case can be proved
similarly.
E[g(X)h(Y )] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x)h(y)f (x, y) dx dy
= ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x)h(y)fX (x)fY (y) dx dy   (by independence)
= [∫_{−∞}^{∞} g(x)fX (x) dx] [∫_{−∞}^{∞} h(y)fY (y) dy]
= E[g(X)]E[h(Y )].
Note: Random variables X and Y could be multivariate.
38 / 65
Properties of Expected Values
Example 12: Let f (x, y) = 6xy², 0 ≤ x ≤ 1, 0 ≤ y ≤ 1. Find the expected
value of XY .
Solution: We have shown that X and Y are independent with
fX (x) = 2x, 0 ≤ x ≤ 1,
fY (y) = 3y², 0 ≤ y ≤ 1.
Then, it is easy to get E(X) = ∫_{0}^{1} 2x² dx = 2/3 and E(Y ) = ∫_{0}^{1} 3y³ dy = 3/4.
Hence, E(XY ) = E(X)E(Y ) = 1/2.
39 / 65
Covariance
If X and Y are random variables with means µX and µY , respectively,
the covariance of X and Y is
Cov(X, Y ) = E[(X − µX )(Y − µY )].
Note:
Positive values indicate that X increases as Y increases; negative values
indicate that X decreases as Y increases.
A zero value of the covariance indicates that the variables are
uncorrelated and that there is no linear dependence between X and Y .
40 / 65
Correlation Coefficient
The value of the covariance depends on the scale of the variables, so it is difficult to
determine at first glance whether a particular covariance is large or small.
This problem can be eliminated by using the correlation coefficient, ρ, a
quantity related to the covariance and defined as
ρ = Cov(X, Y )/(σX σY ) = Cov(X, Y )/√(V ar(X)V ar(Y )).
Note:
The sign of ρ is the same as the sign of the covariance, and the range is
−1 ≤ ρ ≤ 1.
ρ = 1 implies perfect positive linear correlation, with all points falling on
a straight line with positive slope. ρ = −1 implies perfect negative linear
correlation, with all points falling on a straight line with negative slope.
ρ = 0 implies zero covariance and no correlation.
41 / 65
Covariance
Theorem 7: If X and Y are random variables with means µX and µY ,
respectively, then
Cov(X, Y ) = E[(X − µX )(Y − µY )] = E(XY ) − E(X)E(Y ).
Theorem 8: Let a, b, c, d be constants and X, Y be random variables. Then,
Cov(a + bX, c + dY ) = bd Cov(X, Y ).
Theorem 9: If X and Y are independent random variables, then
Cov(X, Y ) = 0.
42 / 65
Covariance
Example 13: Suppose that X and Y are two continuous random variables
with joint pdf f (x, y) = (1/8)xy, 0 < x < 2y < 4. Find the covariance of X
and Y .
Solution:
We found E(XY ) = 32/9, E(X) = 32/15 and E(Y ) = 8/5 in Example 11.
Hence Cov(X, Y ) = E(XY ) − E(X)E(Y ) = 32/9 − (32/15)(8/5) = 32/225.
43 / 65
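Exact rational arithmetic makes the computation in Example 13 easy to double-check (note that 96/675 reduces to 32/225). An illustrative sketch:

```python
# Illustrative exact-arithmetic check of Example 13's covariance.
from fractions import Fraction as F

E_XY, E_X, E_Y = F(32, 9), F(32, 15), F(8, 5)
cov = E_XY - E_X * E_Y
print(cov)  # 32/225
```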
Expected Value of a Vector and Matrix
Let
Y = (Y1 , · · · , Yn )⊤ and X = (Xij ), the n × k matrix with (i, j) entry Xij .
The expected value is defined to be
E(Y) = (E(Y1 ), · · · , E(Yn ))⊤ and E(X) = (E(Xij )), the n × k matrix with (i, j) entry E(Xij ).
44 / 65
Covariance Matrix
Let Y = (Y1 , · · · , Yn )⊤ and U = (U1 , · · · , Uk )⊤ . The covariance matrix is defined to be
Cov(Y, U) = E[(Y − E(Y))(U − E(U))⊤ ], the n × k matrix with (i, j) entry Cov(Yi , Uj ), and
Cov(Y) = E[(Y − E(Y))(Y − E(Y))⊤ ], the n × n matrix with diagonal entries V ar(Yi ) and off-diagonal (i, j) entries Cov(Yi , Yj ).
45 / 65
Expected Value and Variance of Multivariate R.V.
Theorem 10:
For any m × n constant matrix A, l × k constant matrix B, m × 1
constant vector c, l × 1 constant vector d, multivariate random variables
Y = (Y1 , · · · , Yn )⊤ and U = (U1 , · · · , Uk )⊤ , we have
E(AY + c) = AE(Y) + c,
Cov(AY + c, BU + d) = ACov(Y, U)B ⊤ ,
Cov(AY + c) = ACov(Y)A⊤ .
46 / 65
Expected Value and Variance of Multivariate R.V.
Corollary of Theorem 10:
Let Y1 , Y2 , · · · , Yn and X1 , X2 , · · · , Xk be random variables. Define
U1 = Σ_{i=1}^{n} ai Yi  and  U2 = Σ_{j=1}^{k} bj Xj
for constants a1 , a2 , · · · , an and b1 , b2 , · · · , bk . Then the following hold:
(1) E(U1 ) = Σ_{i=1}^{n} ai E(Yi ).
(2) V ar(U1 ) = Σ_{i=1}^{n} ai² V ar(Yi ) + 2 Σ_{1≤i<j≤n} ai aj Cov(Yi , Yj ).
Multinomial Distribution
Suppose that each of n independent, identical trials can result in one of k
distinct cells, with cell probabilities p1 , p2 , · · · , pk , where Σ_{i=1}^{k} pi = 1 and pi > 0 for
i = 1, 2, · · · , k, and let Yi denote the number of trials falling into cell i.
The random variables Y1 , Y2 , · · · , Yk are said to have a
multinomial distribution with parameters n and p1 , p2 , · · · , pk if the joint
probability function of Y1 , Y2 , · · · , Yk is given by
f (y1 , y2 , · · · , yk ) = n!/(y1 ! y2 ! · · · yk !) · p1^{y1} p2^{y2} · · · pk^{yk},
where for each i, yi = 0, 1, 2, · · · , n and Σ_{i=1}^{k} yi = n. We write
Y1 , Y2 , · · · , Yk ∼ M N (n, p1 , p2 , · · · , pk ).
Note:
The binomial distribution is a special case of the multinomial distribution
with k = 2.
57 / 65
Multinomial Distribution
Theorem 12:
If Y1 , Y2 , · · · , Yk have a multinomial distribution with parameters n and
p1 , p2 , · · · , pk , then
1. The marginal distribution of Yi is Bin(n, pi ) so that
E(Yi ) = npi , V ar(Yi ) = npi (1 − pi ).
2. Cov(Ys , Yt ) = −nps pt if s ̸= t.
Note:
Recall that Yi is the number of trials falling into cell i. Imagine all of the cells, excluding
cell i, combined into a single large cell. Then every trial will result in cell i or in a cell other
than cell i, with probabilities pi and 1 − pi , respectively. Thus, Yi ∼ Bin(n, pi ).
The covariance is negative, which is to be expected because a large number of outcomes in
cell s would force the number in cell t to be small.
58 / 65
Multinomial Distribution
Example 16: A fair die is rolled 10 times independently. What is the
probability that three even numbers and two ones come up? Also calculate
the correlation between the number of even numbers and the number of ones.
Solution: Define Y1 = "the number of even numbers", Y2 = "the number of
ones", Y3 = "the number of threes or fives" (i.e., the other outcomes).
Then (Y1 , Y2 , Y3 ) ∼ M N (10, 1/2, 1/6, 1/3), and we have
P (Y1 = 3, Y2 = 2, Y3 = 5) = f (3, 2, 5) = 10!/(3! 2! 5!) · (1/2)³ (1/6)² (1/3)⁵ = 0.0360,
and the correlation is
ρ_{Y1,Y2} = Cov(Y1 , Y2 )/√(V ar(Y1 )V ar(Y2 )) = −np1 p2 /√(np1 (1 − p1 ) · np2 (1 − p2 )) = −p1 p2 /√(p1 (1 − p1 )p2 (1 − p2 ))
= −(1/2 × 1/6)/√(1/2 × 1/2 × 1/6 × 5/6) = −1/√5 = −0.4472.
59 / 65
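Example 16's numbers can be reproduced with the standard library alone. An illustrative sketch (not from the slides):

```python
# Illustrative reproduction of Example 16 (multinomial probability, correlation).
from math import factorial, sqrt, isclose

n, p = 10, (1 / 2, 1 / 6, 1 / 3)
y = (3, 2, 5)

coef = factorial(n) // (factorial(y[0]) * factorial(y[1]) * factorial(y[2]))
prob = coef * p[0] ** y[0] * p[1] ** y[1] * p[2] ** y[2]

# rho = -p1 p2 / sqrt(p1 (1 - p1) p2 (1 - p2)); the n's cancel.
rho = -p[0] * p[1] / sqrt(p[0] * (1 - p[0]) * p[1] * (1 - p[1]))

assert isclose(prob, 0.0360, abs_tol=5e-4)
assert isclose(rho, -1 / sqrt(5))
```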
Multivariate Normal
Distribution
60 / 65
Multivariate Normal Distribution
Univariate normal distribution: Y ∼ N (µ, σ²) with pdf
f (y) = (1/(√(2π) σ)) exp{−(y − µ)²/(2σ²)}.
Bivariate normal distribution:
(Y1 , Y2 )⊤ ∼ N (µ, Σ), where µ = (µ1 , µ2 )⊤ and Σ = [σ1², σ12 ; σ21 , σ2²],
with E(Yi ) = µi and Cov(Y1 , Y2 ) = σ12 = σ21 = ρσ1 σ2 . The joint pdf is
f (y1 , y2 ) = (1/(2π|Σ|^{1/2})) exp{−(1/2)(y1 − µ1 , y2 − µ2 ) Σ^{−1} (y1 − µ1 , y2 − µ2 )⊤}
= (1/(2πσ1 σ2 √(1 − ρ²))) exp{−Q/2},
where
Q = (1/(1 − ρ²)) [(y1 − µ1 )²/σ1² − 2ρ(y1 − µ1 )(y2 − µ2 )/(σ1 σ2 ) + (y2 − µ2 )²/σ2²].
61 / 65
Multivariate Normal Distribution
Multivariate normal distribution:
Y = (Y1 , · · · , Yn )⊤ ∼ N (µ, Σ), where µ = (µ1 , · · · , µn )⊤ and Σ is the
n × n matrix with diagonal entries σi² and (i, j) entries σij = σji , so that
E(Yi ) = µi and Cov(Yi , Yj ) = σij . The joint pdf is
f (y) = f (y1 , · · · , yn ) = (1/((2π)^{n/2} |Σ|^{1/2})) exp{−(1/2)(y − µ)⊤ Σ^{−1} (y − µ)}.
62 / 65
Properties of Multivariate Normal Distribution
1. If Y ∼ N (µ, Σ), then E(Y) = µ, Cov(Y) = Σ.
2. If the covariance (equivalently, the correlation coefficient) of jointly normal
Y1 ∼ N (µ1 , σ1²) and Y2 ∼ N (µ2 , σ2²) is zero, then Y1 and Y2 are independent.
3. For any k × n constant matrix A and k × 1 constant vector b, if
Y ∼ N (µ, Σ), then
AY + b ∼ N (Aµ + b, AΣA⊤ ).
63 / 65
Properties of Multivariate Normal Distribution
4. Any finite-dimensional random vector selected from Y has a multivariate
normal distribution.
For example,
Y1 ∼ N (µ1 , σ1²),
(Y1 , Y2 )⊤ ∼ N ((µ1 , µ2 )⊤ , [σ1², σ12 ; σ21 , σ2²]),
(Yt , Ys )⊤ ∼ N ((µt , µs )⊤ , [σt², σts ; σst , σs²]) when t ≠ s.
64 / 65
Conclusion
• Know how to obtain joint, marginal and conditional pmf and pdf.
• Be able to calculate expected value, covariance, correlation coefficient,
conditional expectation and be familiar with their properties.
• Understand multinomial distribution and multivariate normal
distribution and their properties.
65 / 65
Chapter 6: Functions of Random Variables
STAT6039 Principles of Mathematical Statistics
Functions of Discrete Random Variables
Example 1: A coin is tossed twice. Let Y be the number of heads that come
up. Find the distribution of X = 3Y − 1.
Solution: We know Y ∼ Bin(2, 0.5), so its pmf is
fY (y) = 1/4 for y = 0, 1/2 for y = 1, and 1/4 for y = 2.
If y = 0, then x = 3 × 0 − 1 = −1.
If y = 1, then x = 3 × 1 − 1 = 2.
If y = 2, then x = 3 × 2 − 1 = 5.
So the pmf of X is fX (x) = 1/4 for x = −1, 1/2 for x = 2, and 1/4 for x = 5
(the same probabilities but different values).
1 / 42
Functions of Discrete Random Variables
Note that the previous example involves a one-to-one correspondence between x
and y values. This made the solution fairly easy. The following is a more
general result.
General Result:
Suppose that Y is a discrete random variable and X is a function of Y ,
denoted by X = g(Y ). Then X is a discrete random variable with pmf given
by
fX (x) = Σ_{y: g(y)=x} fY (y).
2 / 42
Functions of Discrete Random Variables
Example 2: Let Y ∼ Bin(2, 0.5). Find the pmf of U = (Y − 1)².
Solution:
The correspondence is as follows:
If y = 1, then u = 0.
If y = 0 or 2, then u = 1.
The pmf of U is
fU (0) = Σ_{y: g(y)=0} fY (y) = fY (1) = 1/2,
fU (1) = Σ_{y: g(y)=1} fY (y) = fY (0) + fY (2) = 1/4 + 1/4 = 1/2.
In summary, the pmf of U is fU (u) = 1/2 for u = 0 and 1/2 for u = 1, i.e.,
U ∼ Bern(0.5).
3 / 42
Functions of Discrete Random Variables
Example 3: We roll two dice. Find the pmf of the absolute difference
between the two numbers that come up.
Solution:
Let Yi be the number which comes up on the ith die (i = 1, 2). We wish
to find the pmf of the absolute difference between Y1 and Y2 , namely
D = |Y1 − Y2 |.
So
fD (d) = Σ_{y1 ,y2 : g(y1 ,y2 )=d} f (y1 , y2 ),
where f (y1 , y2 ) = 1/36, y1 , y2 ∈ {1, 2, · · · , 6}.
4 / 42
Functions of Discrete Random Variables
Solution (continued):
It is convenient to do this calculation by tabulating all 36 equally likely
outcomes. Doing so, the pmf of D is
fD (d) = 6/36 for d = 0, 10/36 for d = 1, 8/36 for d = 2, 6/36 for d = 3,
4/36 for d = 4, and 2/36 for d = 5.
5 / 42
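The tabulation behind Example 3 can be automated by enumerating all 36 equally likely outcomes. An illustrative Python sketch:

```python
# Illustrative enumeration for Example 3: pmf of D = |Y1 - Y2| for two dice.
from fractions import Fraction as F
from collections import Counter

counts = Counter(abs(a - b) for a in range(1, 7) for b in range(1, 7))
pmf = {d: F(c, 36) for d, c in sorted(counts.items())}
print(pmf)
```

`Fraction` reduces automatically, so 6/36 prints as 1/6 and so on; the counts match the table above.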
Functions of Continuous Random Variables
There are three main strategies for the continuous case.
• the cdf method
• the transformation method
• the mgf method.
6 / 42
The CDF Method
7 / 42
The CDF Method
Let U = g(Y1 , Y2 , · · · Yn ) be a function of the continuous random
variables Y1 , Y2 , · · · , Yn . The pdf of U can be found by the following steps.
1. Find the region U ≤ u in the (y1 , y2 , · · · , yn ) space.
2. Find FU (u) = P (U ≤ u) by integrating f (y1 , y2 , ..., yn ) over the
region U ≤ u.
3. Find the probability density function fU (u) by differentiating FU (u).
Thus, fU (u) = dFU (u)/du.
8 / 42
The CDF Method
Example 4: Suppose that Y ∼ U (0, 2). Find the pdf of X = 3Y − 1.
Solution:
FX (x) = P (X ≤ x) = P (3Y − 1 ≤ x) = P (Y ≤ (x + 1)/3).
If x < −1, then (x + 1)/3 < 0 so that FX (x) = P (Y ≤ (x + 1)/3) = 0.
If x > 5, then (x + 1)/3 > 2 so that FX (x) = P (Y ≤ (x + 1)/3) = 1.
If −1 ≤ x ≤ 5, FX (x) = P (Y ≤ (x + 1)/3) = ∫_{0}^{(x+1)/3} (1/2) dy = (x + 1)/6.
Thus, fX (x) = 0 for x < −1, 1/6 for −1 ≤ x ≤ 5, and 0 for x > 5.
So X ∼ U (−1, 5).
9 / 42
The CDF Method
Example 5: Suppose that X, Y ∼ iid U (0, 1). Find the pdf of U = X + Y .
Solution: Since X and Y are independent, the joint pdf of X and Y is
f (x, y) = fX (x)fY (y) = 1, 0 < x < 1, 0 < y < 1.
So the cdf of U is FU (u) = P (X + Y ≤ u) = ∫∫_{x+y≤u} 1 dx dy.
If u ≤ 0, FU (u) = 0, and if u ≥ 2, FU (u) = 1.
If 0 < u < 1, FU (u) = ∫_{0}^{u} ∫_{0}^{u−y} 1 dx dy = u²/2.
If 1 ≤ u < 2, FU (u) = 1 − P (U > u) = 1 − ∫_{u−1}^{1} ∫_{u−y}^{1} 1 dx dy = −u²/2 + 2u − 1.
The pdf of U can be obtained by differentiating FU (u). Thus,
fU (u) = u for 0 < u < 1, 2 − u for 1 ≤ u < 2, and 0 otherwise.
11 / 42
The CDF Method
Example 6: Find the pdf of U = g(Y ) = Y ², where Y is a continuous
random variable with cdf FY (y) and pdf fY (y).
Solution:
If u ≤ 0, FU (u) = P (U ≤ u) = P (Y ² ≤ u) = 0.
If u > 0, FU (u) = P (U ≤ u) = P (Y ² ≤ u) = P (−√u < Y < √u) = FY (√u) − FY (−√u).
Differentiating with respect to u, we have, if u > 0,
fU (u) = fY (√u) · (1/(2√u)) − fY (−√u) · (−1/(2√u)).
To summarize, we get
fU (u) = (1/(2√u)){fY (√u) + fY (−√u)} for u > 0, and 0 otherwise.
12 / 42
The Transformation Method
13 / 42
The Transformation Method
Let U = g(Y ), where g(y) is an increasing function of y for all y such that
fY (y) > 0. Then we have
FU (u) = P (U ≤ u) = P (g(Y ) ≤ u) = P (Y ≤ g⁻¹(u)) = FY [g⁻¹(u)].
Thus,
fU (u) = fY [g⁻¹(u)] · d[g⁻¹(u)]/du.
If g(y) is a decreasing function of y for all y such that fY (y) > 0, we have
FU (u) = P (U ≤ u) = P (g(Y ) ≤ u) = P (Y ≥ g⁻¹(u)) = 1 − FY [g⁻¹(u)].
Thus,
fU (u) = −fY [g⁻¹(u)] · d[g⁻¹(u)]/du.
14 / 42
The Transformation Method
Let U = g(Y ), where g(y) is either an increasing or a decreasing function
of y for all y such that fY (y) > 0.
1. Find the inverse function, y = g⁻¹(u).
2. Evaluate d[g⁻¹(u)]/du.
3. Find fU (u) by
fU (u) = fY [g⁻¹(u)] · |d[g⁻¹(u)]/du|.
Note: It is a "shortcut version" of the cdf method.
15 / 42
The Transformation Method
Example 7: Suppose that Y ∼ U (0, 2). Find the pdf of X = 3Y − 1.
Solution: Since Y ranges from 0 to 2, X = 3Y − 1 ranges from −1 to 5.
As X = g(Y ) = 3Y − 1, g(y) is an increasing function of y for all
0 < y < 2.
It is easy to get y = g⁻¹(x) = (x + 1)/3 and d[g⁻¹(x)]/dx = 1/3.
Since fY (y) = 1/2, we have
fX (x) = fY (g⁻¹(x)) · |dg⁻¹(x)/dx| = (1/2)(1/3) = 1/6, −1 < x < 5.
So X ∼ U (−1, 5).
16 / 42
The Transformation Method
Example 8: Suppose that Y ∼ N (µ, σ²). Find the pdf of Z = (Y − µ)/σ.
Solution:
Since Z = (Y − µ)/σ, we know z = g(y) = (y − µ)/σ is an increasing function of y
for all −∞ < y < ∞.
Also it is easy to get y = g⁻¹(z) = σz + µ and d[g⁻¹(z)]/dz = σ.
Since fY (y) = (1/(√(2π) σ)) e^{−(y−µ)²/(2σ²)}, −∞ < y < ∞, we have
fZ (z) = fY (g⁻¹(z)) · |dg⁻¹(z)/dz|
= (1/(√(2π) σ)) e^{−(σz+µ−µ)²/(2σ²)} × |σ|
= (1/√(2π)) e^{−z²/2}, −∞ < z < ∞.
So Z ∼ N (0, 1).
17 / 42
The MGF Method
18 / 42
The MGF Method
Theorem 1 (Uniqueness of MGF):
Let mX (t) and mY (t) denote the moment generating functions of
random variables X and Y , respectively. If both moment generating
functions exist and mX (t) = mY (t) for all values of t, then X and Y have
the same probability distribution.
Note: The proof is beyond the scope of this course.
19 / 42
The MGF Method
Let U be a function of the random variables Y1 , Y2 , · · · , Yn .
1. Find the moment generating function of U , mU (t) = E(etU ).
2. Compare mU (t) with other well-known moment generating functions.
If mU (t) = mV (t) for all values of t, Theorem 1 implies that U and V have
identical distributions.
20 / 42
The MGF Method
Example 9: Find the probability distribution of U = Z², where
Z ∼ N (0, 1).
Solution:
mU (t) = E(e^{tU}) = E(e^{tZ²}) = ∫_{−∞}^{∞} e^{tz²} · (1/√(2π)) e^{−z²/2} dz
= ∫_{−∞}^{∞} (1/√(2π)) e^{−z²(1−2t)/2} dz
= ∫_{−∞}^{∞} (1/√(2π)) e^{−z²/(2σ²)} dz,  where σ² = (1 − 2t)⁻¹, t < 1/2,
= σ ∫_{−∞}^{∞} (1/(√(2π) σ)) e^{−z²/(2σ²)} dz
= σ  (since the last integral equals 1).
Thus, mU (t) = (1 − 2t)^{−1/2}. Comparing this mgf with those well-known
mgfs, we can get U ∼ χ²(1), i.e. G(α = 1/2, β = 2).
21 / 42
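The integral evaluated in Example 9 can be checked numerically for a particular t < 1/2. The sketch below (illustrative, not from the slides) integrates e^{tz²}ϕ(z) over [−10, 10], which is effectively the whole real line here:

```python
# Illustrative numeric check: E(e^{t Z^2}) = (1 - 2t)^{-1/2} for Z ~ N(0, 1).
from math import exp, pi, sqrt

def m_U(t, n=200000, a=10.0):
    """Midpoint Riemann sum of exp(t z^2) * phi(z) over [-a, a]."""
    h = 2 * a / n
    return sum(exp(t * z * z) * exp(-z * z / 2) / sqrt(2 * pi)
               for z in ((-a + (i + 0.5) * h) for i in range(n))) * h

t = 0.25
assert abs(m_U(t) - (1 - 2 * t) ** -0.5) < 1e-6
```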
The MGF Method
Theorem 2 :
1. If X = a + bY , then
mX (t) = e^{at} mY (bt).
2. Let Y1 , Y2 , · · · , Yn be independent random variables and
U = Y1 + Y2 + · · · + Yn . Then
mU (t) = Π_{i=1}^{n} mYi (t) = mY1 (t) × mY2 (t) × · · · × mYn (t).
22 / 42
The MGF Method
Example 10: Find the probability distribution of Y = µ + σZ, where
Z ∼ N (0, 1).
Solution: Since Z ∼ N (0, 1), we have mZ (t) = e^{t²/2}.
So
mY (t) = e^{µt} mZ (σt) = e^{µt} e^{(σt)²/2} = e^{µt + σ²t²/2}.
This mgf mY (t) is just the mgf of N (µ, σ²).
Hence, Y ∼ N (µ, σ²).
23 / 42
The MGF Method
Example 11: Suppose that Y1 , Y2 , · · · , Yn are independent normally
distributed random variables with parameters µi and σi², respectively. Find
the distribution of X = Σ_{i=1}^{n} ai Yi (a linear combination).
Solution:
For each i = 1, · · · , n, the mgf of Yi is mYi (t) = e^{µi t + σi² t²/2}.
Then,
mX (t) = m_{a1 Y1}(t) m_{a2 Y2}(t) · · · m_{an Yn}(t) = mY1 (a1 t) mY2 (a2 t) · · · mYn (an t)
= e^{a1 µ1 t + σ1²(a1 t)²/2} e^{a2 µ2 t + σ2²(a2 t)²/2} · · · e^{an µn t + σn²(an t)²/2}
= e^{(Σ_{i=1}^{n} ai µi ) t + (1/2)(Σ_{i=1}^{n} ai² σi²) t²}.
Therefore, X ∼ N (Σ_{i=1}^{n} ai µi , Σ_{i=1}^{n} ai² σi²).
24 / 42
The MGF Method
Example 12: Suppose that Y1 , Y2 , · · · , Yn are independent gamma
distributed random variables with parameters αi and β, respectively. Find the
distribution of X = Σ_{i=1}^{n} Yi = Y1 + Y2 + · · · + Yn .
Solution:
For each i = 1, · · · , n, the mgf of Yi is mYi (t) = (1 − βt)^{−αi}.
Then,
mX (t) = mY1 (t) × mY2 (t) × · · · × mYn (t) = (1 − βt)^{−α1} (1 − βt)^{−α2} · · · (1 − βt)^{−αn} = (1 − βt)^{−Σ_{i=1}^{n} αi}.
Therefore, X ∼ G(Σ_{i=1}^{n} αi , β).
25 / 42
The MGF Method
Important Properties:
1. If Yi ∼ χ²(ni ), i = 1, · · · , n, and all Yi 's are independent, then
Σ_{i=1}^{n} Yi ∼ χ²(Σ_{i=1}^{n} ni ). For example, if Yi ∼ iid χ²(1), i = 1, · · · , n,
then Σ_{i=1}^{n} Yi ∼ χ²(n).
2. If Yi ∼ iid Exp(β), then Σ_{i=1}^{n} Yi ∼ G(n, β).
3. Let Yi ∼ N (µi , σi²), i = 1, · · · , n, and assume all Yi 's are independent.
Define
Zi = (Yi − µi )/σi , i = 1, · · · , n.
Then Σ_{i=1}^{n} Zi² ∼ χ²(n).
26 / 42
Order Statistics
27 / 42
Order Statistics
Let Y1 , Y2 , · · · , Yn denote independent continuous random variables with
common cdf F (y) and pdf f (y).
We denote the ordered random variables of {Yi , i = 1, · · · , n} by
Y(1) , Y(2) , · · · , Y(n) , where Y(1) ≤ Y(2) ≤ · · · ≤ Y(n) . (Because the random
variables are continuous, ties occur with probability zero, so the equality
signs can be ignored.) Using this notation,
Y(1) = min(Y1 , Y2 , · · · , Yn ) is the minimum of the random variables
{Yi , i = 1, · · · , n},
and
Y(n) = max(Y1 , Y2 , · · · , Yn ) is the maximum of the random variables
{Yi , i = 1, · · · , n}.
We call Y(k) the kth order statistic.
28 / 42
Order Statistics
The pdf of Y(1) and the pdf of Y(n) can be found using the cdf method.
FY(n) (y) = P (Y(n) ≤ y) = P (Y1 ≤ y, · · · , Yn ≤ y) = P (Y1 ≤ y)P (Y2 ≤ y) · · · P (Yn ≤ y) = [F (y)]ⁿ.
FY(1) (y) = 1 − P (Y(1) > y) = 1 − P (Y1 > y, · · · , Yn > y) = 1 − P (Y1 > y)P (Y2 > y) · · · P (Yn > y) = 1 − [1 − F (y)]ⁿ.
The pdf of Y(n) is fY(n) (y) = n[F (y)]^{n−1} f (y).
The pdf of Y(1) is fY(1) (y) = n[1 − F (y)]^{n−1} f (y).
29 / 42
Order Statistics
Let us now consider the case n = 2, so we only have two random
variables Y1 and Y2 . We would like to find the joint pdf of Y(1) and Y(2) . For
any y1 ≤ y2 ,
FY(1) ,Y(2) (y1 , y2 ) = P [(Y1 ≤ y1 , Y2 ≤ y2 ) ∪ (Y2 ≤ y1 , Y1 ≤ y2 )]
= P (Y1 ≤ y1 , Y2 ≤ y2 ) + P (Y2 ≤ y1 , Y1 ≤ y2 ) − P (Y1 ≤ y1 , Y2 ≤ y1 )
= F (y1 )F (y2 ) + F (y2 )F (y1 ) − F (y1 )F (y1 )
= 2F (y1 )F (y2 ) − [F (y1 )]².
The joint pdf of Y(1) and Y(2) can be obtained by differentiating
FY(1) ,Y(2) (y1 , y2 ) first with respect to y2 and then with respect to y1 , which is
fY(1) ,Y(2) (y1 , y2 ) = 2f (y1 )f (y2 ),  y1 ≤ y2 .
30 / 42
Order Statistics
Theorem 3: Let Y1 , . . . , Yn be independent identically distributed
continuous random variables with cdf F (y) and pdf f (y). Then the pdf of
Y(k) is given by
fY(k) (yk ) = n!/[(k − 1)!(n − k)!] · [F (yk )]^{k−1} [1 − F (yk )]^{n−k} f (yk ), −∞ < yk < ∞.
And the joint pdf of Y(j) and Y(k) (1 ≤ j < k ≤ n) is
fY(j) ,Y(k) (yj , yk ) = n!/[(j − 1)!(k − 1 − j)!(n − k)!] · [F (yj )]^{j−1}
× [F (yk ) − F (yj )]^{k−1−j} × [1 − F (yk )]^{n−k} f (yj ) f (yk ),  −∞ < yj < yk < ∞.
The joint pdf of Y(1) , . . . , Y(n) is
fY(1) ,··· ,Y(n) (y1 , · · · , yn ) = n! f (y1 ) · · · f (yn ),  −∞ < y1 < · · · < yn < ∞.
31 / 42
Order Statistics
Example 13: Electronic components of a certain type have a length of life
Y ∼ Exp(100), measured in hours. Suppose that two components operate
independently and in series in a certain system (hence, the system fails when
either component fails). Find the probability distribution of X, the length of
life of the system.
Solution: Because the system fails at the first component failure,
X = min(Y1 , Y2 ), where Y1 and Y2 are independent random variables with
the same pdf f (y) = (1/100)e−y/100 and cdf F (y) = 1 − e−y/100 , y > 0.
Then,
fX (y) = fY(1) (y) = n[1 − F (y)]^{n−1} f (y) = 2e^{−y/100} · (1/100)e^{−y/100} = (1/50)e^{−y/50},  y > 0.
Hence, X ∼ Exp(50).
32 / 42
Order Statistics
Example 14: Suppose that the components in Example 13 operate in
parallel (hence, the system does not fail until both components fail). Find the
pdf of X, the length of life of the system.
Solution: Now X = max(Y1 , Y2 ). Then
fX (y) = fY(2) (y) = n[F (y)]^{n−1} f (y) = 2(1 − e^{−y/100}) · (1/100)e^{−y/100} = (1/50)(e^{−y/100} − e^{−y/50}),  y > 0.
Hence, the maximum of two exponential random variables is not an
exponential random variable.
33 / 42
Order Statistics
Example 15: Suppose that Y1 , · · · , Y5 ∼ iid U (0, 1). Find the pdf of Y(2) .
Also, give the joint pdf of Y(2) and Y(4) .
Solution: Since Y1 , · · · , Y5 ∼ iid U (0, 1), we have f (y) = 1, 0 < y < 1
and F (y) = y, 0 < y < 1.
The pdf of Y(2) can be obtained directly from Theorem 3 with
n = 5, k = 2. So
fY(2) (y2 ) = 5!/[(2 − 1)!(5 − 2)!] · [F (y2 )]^{2−1} [1 − F (y2 )]^{5−2} f (y2 ) = 20 y2 (1 − y2 )³,  0 < y2 < 1.
Hence, Y(2) ∼ Beta(2, 4).
In general, the kth order statistic of Y1 , · · · , Yn ∼ iid U (0, 1) has a beta
distribution with α = k and β = n − k + 1.
34 / 42
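The Beta(2, 4) conclusion of Example 15 is easy to verify numerically. An illustrative sketch comparing 20y(1 − y)³ with the Beta(2, 4) density and checking that it integrates to 1:

```python
# Illustrative check of Example 15: pdf of Y_(2) vs the Beta(2, 4) density.
from math import gamma

def f2(y):
    return 20 * y * (1 - y) ** 3

def beta_pdf(y, a=2, b=4):
    return gamma(a + b) / (gamma(a) * gamma(b)) * y ** (a - 1) * (1 - y) ** (b - 1)

n = 100000
h = 1 / n
total = sum(f2((i + 0.5) * h) for i in range(n)) * h   # midpoint Riemann sum
assert abs(total - 1.0) < 1e-6
assert all(abs(f2(y) - beta_pdf(y)) < 1e-9 for y in (0.1, 0.5, 0.9))
```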
Order Statistics
Solution (continued):
The joint pdf of Y(2) and Y(4) can also be obtained from Theorem 3 with
n = 5, j = 2, k = 4. So it has the form
fY(2) ,Y(4) (y2 , y4 ) = 5!/[(2 − 1)!(4 − 1 − 2)!(5 − 4)!] · [F (y2 )]^{2−1} × [F (y4 ) − F (y2 )]^{4−1−2} × [1 − F (y4 )]^{5−4} f (y2 ) f (y4 )
= 120 y2 (y4 − y2 )(1 − y4 ),  0 < y2 < y4 < 1.
This joint density can be used to evaluate joint probabilities about Y(2) and
Y(4) or to evaluate the expected value of functions of these two variables.
35 / 42
Range Restricted Distributions
36 / 42
Range Restricted Distributions
Suppose we have a random variable Y . If some restrictions are put on the
range of Y , then the new random variable X has a range restricted
distribution.
Example 16: Suppose that the number of accidents which occur each year at
a certain intersection follows a Poisson distribution with mean λ. Find the
pmf and expectation of the number of accidents at this intersection last year
if it is known that at least one accident occurred at the intersection during
that year.
37 / 42
Range Restricted Distributions
Solution: Let Y be the number of accidents at the intersection last year so
that Y ∼ P oi(λ). Then the random variable of interest is X = (Y |Y > 0),
and the pmf of X is
fX (x) = P (X = x) = P (Y = x|Y > 0) = P (Y = x, Y > 0)/P (Y > 0)
= P (Y = x)/(1 − P (Y = 0)) = (λ^x e^{−λ}/x!)/(1 − e^{−λ}),  x = 1, 2, · · ·
Also, it is easy to see that 0 = fX (0) < fY (0) = e^{−λ}, and since
1 − e^{−λ} < 1, we have fX (x) > fY (x) for x = 1, 2, · · · .
38 / 42
Range Restricted Distributions
Solution (continued):
E(X) = E(Y |Y > 0) = Σ_{x=1}^{∞} x · (λ^x e^{−λ}/x!)/(1 − e^{−λ})
= (1/(1 − e^{−λ})) Σ_{x=0}^{∞} x · λ^x e^{−λ}/x!   (where the first term in the sum is 0)
= E(Y )/(1 − e^{−λ})
= λ/(1 − e^{−λ}).
So E(X) > E(Y ).
39 / 42
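The closed form E(Y|Y > 0) = λ/(1 − e^{−λ}) can be compared against a direct evaluation of the truncated series. An illustrative sketch for λ = 2:

```python
# Illustrative check of Example 16: truncated-Poisson mean, series vs closed form.
from math import exp, factorial, isclose

lam = 2.0
series = sum(x * lam ** x * exp(-lam) / factorial(x)
             for x in range(1, 60)) / (1 - exp(-lam))   # tail beyond 60 is negligible
closed = lam / (1 - exp(-lam))
assert isclose(series, closed, rel_tol=1e-12)
```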
Range Restricted Distributions
Example 17: Y ∼ P oi(λ). Find the pmf of X = (Y |Y > 1).
Solution: Using the same logic as in Example 16, we get
fX (x) = (λ^x e^{−λ}/x!)/(1 − e^{−λ} − λe^{−λ}),  x = 2, 3, · · ·
Example 18: Y ∼ P oi(λ). Find the pmf of X = Y I(Y > 1).
Solution: Here X = Y × I(Y > 1) = 0 if Y ≤ 1, and Y if Y > 1.
Since P (X = 0) = P (Y ≤ 1) = e^{−λ} + λe^{−λ}, we have
fX (x) = e^{−λ} + λe^{−λ} for x = 0, and λ^x e^{−λ}/x! for x = 2, 3, · · · .
Note: These two kinds of range restrictions are different.
40 / 42
Range Restricted Distributions
Example 19: Z ∼ N (0, 1). Find the pdf of X = (Z|Z > 0).
Solution: fX (x) = fZ (x)/P (Z > 0) = 2ϕ(x) = (2/√(2π)) e^{−x²/2},  x > 0.
Example 20: Z ∼ N (0, 1). Find the pdf of X = Z I(Z > 0).
Solution: Here X = Z × I(Z > 0) = 0 if Z ≤ 0, and Z if Z > 0.
Since P (Z ≤ 0) = 1/2, we have
fX (x) = 1/2 for x = 0 (discrete), and ϕ(x) = (1/√(2π)) e^{−x²/2} for x > 0 (continuous).
So X here has a mixed distribution.
Note: We could use the notation X = max(0, Z) in Example 20.
41 / 42
Conclusion
• Know when and how to use the three methods to derive the
distribution of functions of random variables.
• Be able to obtain the marginal or joint distribution of order statistics by using
the formulae.
• Be able to get the range restricted distributions.
• Be familiar with those important results, like properties of gamma
distribution, normal distributions, chi-square distributions and so on.
42 / 42
Chapter 7: Sampling Distributions and
the Central Limit Theorem
STAT6039 Principles of Mathematical Statistics
Population and Random Sample
Population:
• Every observation of interest available in the physical world.
• In most cases, we are interested in some unknown population
parameters, e.g., population mean µ, population variance σ 2 , etc.
Random Sample:
• A selection of observations drawn randomly from the population of
interest.
• Denoted as Y1 , · · · , Yn . We assume they are independent and identically
distributed (iid) random variables throughout this course.
• We can use the sample mean Ȳ = Σ_{i=1}^{n} Yi / n to estimate µ, and the
sample variance S² = Σ_{i=1}^{n} (Yi − Ȳ )² / (n − 1) to estimate σ².
1 / 45
Statistics
A statistic is a function of the observable random variables in a sample
and known constants.
• The sample mean Ȳ is a function of the random variables Y1 , · · · , Yn and
the (constant) sample size n, so it is a statistic.
• Other examples: the sample variance S² = Σ_{i=1}^{n} (Yi − Ȳ )²/(n − 1), the order
statistics Y(n) = max(Y1 , · · · , Yn ) and Y(1) = min(Y1 , · · · , Yn ), the range
Y(n) − Y(1) , etc.
• Σ_{i=1}^{n} (Yi − µ)² is not a statistic, since it involves the unknown parameter µ.
2 / 45
Sampling Distributions
Normally we use these statistics to make inferences about population parameters. All statistics are functions of the random variables observed in a sample, so if we take a new sample, we will probably get a different value of the statistic. Therefore, all statistics are random variables; there is variability associated with them.
Consequently, all statistics have probability distributions, which we call their sampling distributions. Knowing the sampling distribution of a statistic tells us how it varies from sample to sample and how accurately we are estimating a population parameter.
3 / 45
Sampling Distribution
Example 1: Suppose we roll a fair four-sided die and let Y be the number
that comes up. Let’s repeatedly draw samples of size 2, i.e., roll the die
twice. Find the sampling distribution of the sample mean Ȳ .
Solution: First, the following table lists the pmf of Y .

y      1    2    3    4
f(y)  1/4  1/4  1/4  1/4

We can calculate the population mean and population variance from the probability distribution:
µ = E(Y ) = 5/2 and σ² = Var(Y ) = 5/4.
4 / 45
Sampling Distribution
Solution (continued): Then we calculate the sample mean ȳ for all possible samples of size 2 (rows: first roll; columns: second roll):

First \ Second    1     2     3     4
1                1.0   1.5   2.0   2.5
2                1.5   2.0   2.5   3.0
3                2.0   2.5   3.0   3.5
4                2.5   3.0   3.5   4.0
Note that each possible sample of size 2 has equal probability of being
drawn.
5 / 45
Sampling Distribution
Solution (continued): We can derive the sampling distribution of Ȳ from
the previous table:
ȳ      1     1.5    2     2.5    3     3.5    4
f(ȳ)  1/16  2/16  3/16  4/16  3/16  2/16  1/16

Because we could completely list all possible samples of size 2, this sampling distribution could be determined exactly.
We can also get E(Ȳ) = 5/2 = µ and Var(Ȳ) = 5/8 = σ²/2.
6 / 45
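Because the sample space is small, Example 1 can be checked exactly by enumeration. A Python sketch (not part of the notes) listing all 16 equally likely samples of size 2 and confirming E(Ȳ) = 5/2 and Var(Ȳ) = 5/8:

```python
from itertools import product
from fractions import Fraction

# Enumerate all 16 equally likely samples of size 2 from a fair
# four-sided die and compute the sample mean of each.
faces = [1, 2, 3, 4]
means = [Fraction(a + b, 2) for a, b in product(faces, repeat=2)]

e_ybar = sum(means) / len(means)                          # E(Ybar)
var_ybar = sum((m - e_ybar) ** 2 for m in means) / len(means)

print(e_ybar)    # 5/2, the population mean mu
print(var_ybar)  # 5/8, i.e. sigma^2 / n with sigma^2 = 5/4, n = 2
```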
Sampling Distribution
What happens when we can’t list all possible samples of a certain size or
the number of possible samples of a certain size is too large?
One approach is as follows.
• Randomly draw a sample of size n and calculate the sample statistic.
• Repeat the step above m times to obtain m sample statistics in total, e.g., {Ȳ⁽ᵏ⁾, k = 1, · · · , m}.
• Draw a histogram of the m sample statistics; the relative frequencies can be used to approximate the sampling distribution.
https://college.cengage.com/nextbook/statistics/
wackerly_966371/student/html/
7 / 45
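The resampling recipe above can be sketched in a few lines of Python (a stdlib-only illustration, not from the notes), using the four-sided die of Example 1 as the population so the answer is known:

```python
import random

# Repeated sampling: draw m samples of size n, compute each sample
# mean, and use the relative frequencies of the m means to
# approximate the sampling distribution of Ybar.
random.seed(1)
n, m = 2, 100_000
ybars = []
for _ in range(m):
    sample = [random.choice([1, 2, 3, 4]) for _ in range(n)]
    ybars.append(sum(sample) / n)

# The relative frequency approximates the exact sampling
# distribution, e.g. P(Ybar = 2.5) = 4/16 = 0.25 from Example 1.
freq_25 = ybars.count(2.5) / m
print(round(freq_25, 2))  # close to 0.25
```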
Sampling Distribution
What we find from the experiment can be summarized as follows.
• As the sample size n increases, the histogram of all the observations in a sample begins to look more like the distribution of the original population.
• The sample mean of the sample means is very close to the population mean µ, and the sample variance of the sample means gets smaller as the sample size n increases.
• As the sample size n increases, the histogram of the sample means looks more like a normal distribution. This phenomenon is due to the Central Limit Theorem.
8 / 45
Properties of Sample Mean
Theorem 1: Suppose Y1 , · · · , Yn are iid with mean µ and finite variance σ². Then,
µ_Ȳ = E(Ȳ) = µ and σ²_Ȳ = Var(Ȳ) = σ²/n.
Proof:
µ_Ȳ = E(Ȳ) = E[(Y1 + · · · + Yn )/n] = E(Y1 )/n + · · · + E(Yn )/n = µ.
σ²_Ȳ = Var(Ȳ) = Var[(Y1 + · · · + Yn )/n]
= Var(Y1 )/n² + · · · + Var(Yn )/n² (by independence)
= σ²/n² + · · · + σ²/n² = σ²/n.
Note: The more data we have, the more accurately the sample mean Ȳ estimates the true population mean µ.
9 / 45
Sampling Distributions
Related to
Normal Distribution
10 / 45
Sampling Distribution of Sample Mean (known σ)
Theorem 2: Suppose Y1 , · · · , Yn ∼ iid N (µ, σ²). Then,
Ȳ = Σᵢ₌₁ⁿ Yᵢ / n ∼ N (µ, σ²/n), i.e. √n(Ȳ − µ)/σ ∼ N (0, 1).
Proof: We know Ȳ = Σᵢ₌₁ⁿ Yᵢ / n is a linear combination of the independent normally distributed random variables Y1 , · · · , Yn with weights 1/n. According to the result of Example 11 in Chapter 6, Ȳ follows a normal distribution with mean
µ_Ȳ = E(Ȳ) = E[(Y1 + · · · + Yn )/n] = µ
and variance
σ²_Ȳ = Var(Ȳ) = Var[(Y1 + · · · + Yn )/n] = σ²/n.
Thus, Ȳ ∼ N (µ, σ²/n), i.e. √n(Ȳ − µ)/σ ∼ N (0, 1).
11 / 45
Sampling Distribution of Sample Mean (known σ)
Example 2: A bottling machine can be regulated so that it discharges an
average of µ ml per bottle. It has been observed that the amount of fill
dispensed by the machine is normally distributed with σ = 1ml. A random
sample of n = 9 filled bottles is selected from the output of the machine.
Find the probability that the sample mean will be within 0.3ml of the true
mean µ for the chosen machine setting.
Solution: Let Yi be the volume of the ith bottle in the sample, i = 1, · · · , n,
where n = 9. Then Y1 , · · · , Yn ∼ iid N (µ, σ 2 ) where σ 2 = 1 and µ is
unknown.
So we have Ȳ ∼ N (µ, σ 2 /9).
12 / 45
Sampling Distribution of Sample Mean (known σ)
Solution (continued): Hence
P (|Ȳ − µ| ≤ 0.3) = P (−0.3/(σ/√n) ≤ (Ȳ − µ)/(σ/√n) ≤ 0.3/(σ/√n))
= P (−0.3/(1/√9) ≤ Z ≤ 0.3/(1/√9))
= P (−0.9 ≤ Z ≤ 0.9)
= 1 − 2P (Z > 0.9)
= 1 − 2 × 0.1841
= 0.6318.
Note: This calculation does not depend on µ.
13 / 45
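Example 2 can be reproduced without a z-table using Python's standard library (a sketch; the notes use tables and R):

```python
from statistics import NormalDist

# Example 2: with sigma = 1 and n = 9,
# P(|Ybar - mu| <= 0.3) = P(-0.9 <= Z <= 0.9) for Z ~ N(0, 1).
Z = NormalDist()  # standard normal
p = Z.cdf(0.9) - Z.cdf(-0.9)
print(round(p, 4))  # ~0.6319 (the table value 0.1841 gives 0.6318)
```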
Sampling Distribution of Sample Mean (known σ)
Example 3: Refer to Example 2. How large n should be if we wish Ȳ to be
within 0.3ml of µ with probability 0.95?
Solution: Now we want
0.95 = P (|Ȳ − µ| ≤ 0.3)
= P (−0.3/(σ/√n) ≤ (Ȳ − µ)/(σ/√n) ≤ 0.3/(σ/√n))
= P (−0.3√n ≤ Z ≤ 0.3√n)
= 1 − 2P (Z > 0.3√n).
Thus, we need to find n such that P (Z > 0.3√n) = 0.025. From the z-table, we know P (Z > 1.96) = 0.025. So n = (1.96/0.3)² = 42.68. However, it is impossible to take a sample of size 42.68; our solution indicates that a sample of size 42 is not quite large enough to reach our objective. So n should be 43, and then P (|Ȳ − µ| ≤ 0.3) will slightly exceed 0.95.
14 / 45
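The sample-size calculation of Example 3 in stdlib Python (a sketch; `inv_cdf` plays the role of the z-table):

```python
import math
from statistics import NormalDist

# Example 3: we need P(Z > 0.3*sqrt(n)) = 0.025, i.e.
# 0.3*sqrt(n) = z_{0.025}, so n = (z_{0.025}/0.3)^2, rounded up.
z = NormalDist().inv_cdf(0.975)  # upper 0.025 quantile, ~1.96
n_exact = (z / 0.3) ** 2         # ~42.68
n = math.ceil(n_exact)           # smallest integer sample size
print(round(z, 2), n)            # 1.96, 43
```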
Sampling Distribution of Sample Variance
Theorem 3: Suppose Y1 , · · · , Yn ∼ iid N (µ, σ²). Define the sample variance
S² = Σᵢ₌₁ⁿ (Yᵢ − Ȳ)² / (n − 1).
Then,
(a) (n − 1)S²/σ² ∼ χ²(n − 1);
(b) S² and Ȳ are independent.
Proof: The proof of this theorem is beyond the scope of this course.
15 / 45
Chi-square Table
Table 6 in the “statistical table” file lists the values of χ²_α(m) such that
P (χ²(m) > χ²_α(m)) = α
for different α and degrees of freedom m.
So χ²_α(m) is the upper α quantile of χ²(m), or the (lower) 1 − α quantile of χ²(m), since P (χ²(m) ≤ χ²_α(m)) = 1 − α.
16 / 45
Sampling Distribution of Sample Variance
Example 4: Refer to Example 2. If those 9 observations are used to
calculate S 2 , it might be useful to specify an interval of values that will
include S 2 with a high probability. Find numbers a and b such that
P (a < S 2 < b) = 0.9.
Solution: By Theorem 3, we get
0.9 = P (a < S² < b)
= P ((n − 1)a/σ² < (n − 1)S²/σ² < (n − 1)b/σ²)
= P (8a < U < 8b), where U = (n − 1)S²/σ² ∼ χ²(8).
One method of doing this is to find the value of 8b that cuts off an area of 0.05 in the upper tail and the value of 8a that cuts off 0.05 in the lower tail (0.95 in the upper tail).
17 / 45
Sampling Distribution of Sample Variance
Solution (continued):
Therefore, 8b = χ²₀.₀₅(8) = 15.5073 and 8a = χ²₀.₉₅(8) = 2.73264. Then we get a = 0.34158 and b = 1.93841, so the required interval is about (0.3416, 1.9384).
Note: The required interval is not unique. Another interval is obtained from 0.9 = P (0 < U < 13.3616), where 13.3616 is χ²₀.₁(8); equating 13.3616 = 8b gives (0, 1.6702).
Or 0.9 = P (3.48954 < U < ∞), where 3.48954 is χ²₀.₉(8); equating 3.48954 = 8a gives (0.4362, ∞).
The ideal case is to find the shortest interval such that P (a < S² < b) = 0.9. However, the values of a and b cannot be obtained analytically and must be computed numerically. Therefore, we prefer the interval that cuts off an area of 0.05 in each of the upper and lower tails.
18 / 45
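The interval in Example 4 can be sanity-checked by simulation (a stdlib-only sketch; with scipy one would query the χ² cdf directly). A χ²(8) variate is a sum of 8 squared standard normals:

```python
import random

# Monte Carlo check of Example 4: U = (n-1)S^2/sigma^2 with n = 9 is
# chi-square with 8 df; the interval (2.73264, 15.5073) taken from
# the chi-square table should have probability about 0.90.
random.seed(2)
reps = 200_000
inside = 0
for _ in range(reps):
    u = sum(random.gauss(0.0, 1.0) ** 2 for _ in range(8))
    inside += (2.73264 < u < 15.5073)
p = inside / reps
print(round(p, 2))  # close to 0.90
```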
Sampling Distribution of Sample Mean (unknown σ)
Theorem 2 tells us that √n(Ȳ − µ)/σ ∼ N (0, 1), so it provides the basis for the development of inference-making procedures about the mean µ of a normal population with known variance σ².
What if σ is unknown?
It can be estimated by the sample standard deviation S = √(S²). So we have
√n(Ȳ − µ)/S,
which can provide the basis for developing methods for inferences about µ if we know the distribution of this random variable.
We can show that √n(Ȳ − µ)/S has a distribution known as Student’s t distribution with n − 1 degrees of freedom.
19 / 45
Student’s t Distribution
Suppose Z ∼ N (0, 1) and U ∼ χ2 (m). Then, if Z and U are
independent,
T = Z / √(U/m)
is said to have a t distribution with m df. The pdf of T is
f (y) = [Γ((m + 1)/2) / (√(πm) Γ(m/2))] (1 + y²/m)^(−(m+1)/2), −∞ < y < ∞.
We may write T ∼ t(m) or T ∼ tₘ and f (y) as f_t(m)(y).
The pdf f (y) is symmetric about zero.
If m > 1, E(T ) = 0. If m > 2, Var(T ) = m/(m − 2).
20 / 45
Student’s t Distribution
The pdfs of N (0, 1) and t(m) are sketched in the following figure. Notice that both density functions are symmetric about the origin, but the t density has more probability mass in its tails; we say it has “fatter tails” than the standard normal distribution.
When m → ∞, t(m) converges to N (0, 1).
21 / 45
Student’s t Distribution
Table 5 in the “statistical table” file lists the upper α quantiles tα (m) (i.e.,
1 − α quantiles) of t(m) for different α and degrees of freedom m, which
implies
P (T > tα (m)) = α,
i.e.,
P (T ≤ tα (m)) = 1 − α.
For example, t₀.₀₅(8) = 1.860. Let T ∼ t(8). We have
P (T > 1.860) = 0.05, P (T ≤ 1.860) = 0.95, P (T ≤ −1.860) = 0.05 and
P (−1.860 < T < 1.860) = 0.90.
22 / 45
Sampling Distribution of Sample Mean (unknown σ)
Theorem 4: Suppose Y1 , · · · , Yn ∼ iid N (µ, σ²). Then,
T = √n(Ȳ − µ)/S ∼ t(n − 1).
Proof: From Theorem 2, we get Z = √n(Ȳ − µ)/σ ∼ N (0, 1).
By Theorem 3, we have U = (n − 1)S²/σ² ∼ χ²(n − 1). Also, S² and Ȳ are independent, so Z and U are independent. By the definition of the t distribution, we know Z / √(U/(n − 1)) ∼ t(n − 1).
Since
Z / √(U/(n − 1)) = [√n(Ȳ − µ)/σ] / √[(n − 1)S² / ((n − 1)σ²)] = √n(Ȳ − µ)/S = T,
we have T = √n(Ȳ − µ)/S ∼ t(n − 1).
23 / 45
Sampling Distribution of Sample Mean (unknown σ)
Example 5: The tensile strength for a type of wire is normally distributed with unknown mean µ and unknown variance σ². Six pieces of wire were randomly selected; denote by Yᵢ the tensile strength of piece i. Because σ²_Ȳ = σ²/n, σ²_Ȳ can be estimated by S²/n, the estimated variance of Ȳ. Find the approximate probability that Ȳ will be within 2S/√n of the true population mean µ.
Solution: We want to find
P (−2S/√n ≤ Ȳ − µ ≤ 2S/√n) = P (−2 ≤ √n(Ȳ − µ)/S ≤ 2)
= P (−2 ≤ T ≤ 2) = 1 − 2P (T > 2),
where T ∼ t(n − 1) = t(5).
24 / 45
Sampling Distribution of Sample Mean (unknown σ)
Solution (continued):
Looking at the t-table, we see that P (T > 2.015) = 0.05. So the
probability that Ȳ will be within 2 estimated standard deviations of µ is
slightly less than 0.90.
Using R, we can get the exact probability of interest, 0.8981, with the command 1-2*pt(-2, df=5).
Notice that, if σ² were known, the probability that Ȳ will fall within 2σ_Ȳ of µ would be given by
P (−2σ/√n ≤ Ȳ − µ ≤ 2σ/√n) = P (−2 ≤ √n(Ȳ − µ)/σ ≤ 2)
= P (−2 ≤ Z ≤ 2) = 0.9544.
25 / 45
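Since the standard library has no t distribution, Example 5 can be checked by simulating T = √n(Ȳ − µ)/S directly (a sketch, not from the notes; R's `pt` or scipy would give the exact value 0.8981):

```python
import random
from statistics import mean, stdev

# Monte Carlo check of Example 5: with n = 6 iid N(0, 1) draws,
# T = sqrt(n) * Ybar / S follows t(5), and P(-2 <= T <= 2) ~ 0.898.
random.seed(3)
n, reps = 6, 100_000
inside = 0
for _ in range(reps):
    ys = [random.gauss(0.0, 1.0) for _ in range(n)]
    t = n ** 0.5 * mean(ys) / stdev(ys)  # stdev uses the n-1 divisor
    inside += (-2.0 <= t <= 2.0)
p = inside / reps
print(round(p, 2))  # close to 0.90, slightly below
```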
Comparing Variances of Two Normal Distributions
Suppose we have two independent normal populations X ∼ N (µ_X, σ²_X) and Y ∼ N (µ_Y, σ²_Y), and all the parameters are unknown. We are interested in comparing σ²_X and σ²_Y.
We can randomly select a sample from each population:
X1 , · · · , Xn ∼ iid N (µ_X, σ²_X) (sample from population X),
Y1 , · · · , Ym ∼ iid N (µ_Y, σ²_Y) (sample from population Y ).
It seems intuitive that the ratio S²_X/S²_Y could be used to make inferences about the relative magnitudes of σ²_X and σ²_Y, where S²_X and S²_Y are the sample variances of these two samples.
What is the distribution related to S²_X/S²_Y?
26 / 45
F Distribution
Let W1 ∼ χ²(m1 ) and W2 ∼ χ²(m2 ) be independent. Then,
F = (W1 /m1 ) / (W2 /m2 )
is said to have an F distribution with m1 numerator degrees of freedom and m2 denominator degrees of freedom. The pdf is
f (y) = [Γ((m1 + m2 )/2) / (Γ(m1 /2) Γ(m2 /2))] m1^(m1/2) m2^(m2/2) y^(m1/2 − 1) (m2 + m1 y)^(−(m1+m2)/2), y > 0.
We may write F ∼ F (m1 , m2 ) and f (y) as f_F(m1,m2)(y).
If m2 > 2, E(F ) = m2 /(m2 − 2). If m2 > 4, Var(F ) = 2m2²(m1 + m2 − 2) / [m1 (m2 − 2)²(m2 − 4)].
A basic fact about the F distribution is that if F ∼ F (m1 , m2 ), then 1/F ∼ F (m2 , m1 ).
27 / 45
F Distribution
The pdf of the F distribution looks like a gamma pdf.
Table 7 in the “statistical table” file tabulates the upper α quantiles Fα (m1 , m2 ) (i.e., 1 − α quantiles) of F (m1 , m2 ) for different α and degrees of freedom m1 , m2 , which implies
P (F > Fα (m1 , m2 )) = α,
i.e.,
P (F ≤ Fα (m1 , m2 )) = 1 − α.
For example, F0.025 (4, 17) = 3.66 and F0.05 (4, 17) = 2.96.
28 / 45
Comparing Variances of Two Normal Distributions
Theorem 5: Suppose X1 , · · · , Xn ∼ iid N (µ_X, σ²_X) (first sample) and Y1 , · · · , Ym ∼ iid N (µ_Y, σ²_Y) (second sample), and the two samples are independent. Define S²_X = Σᵢ₌₁ⁿ (Xᵢ − X̄)²/(n − 1) and S²_Y = Σᵢ₌₁ᵐ (Yᵢ − Ȳ)²/(m − 1). Then,
F = (S²_X/σ²_X) / (S²_Y/σ²_Y) ∼ F (n − 1, m − 1).
Proof: From Theorem 3, we get W1 = (n − 1)S²_X/σ²_X ∼ χ²(n − 1) and W2 = (m − 1)S²_Y/σ²_Y ∼ χ²(m − 1). Since the two samples are independent, W1 and W2 are independent. By the definition of the F distribution, we obtain
F = [W1 /(n − 1)] / [W2 /(m − 1)] = (S²_X/σ²_X) / (S²_Y/σ²_Y) ∼ F (n − 1, m − 1).
29 / 45
Comparing Variances of Two Normal Distributions
Example 6: Refer to Example 2. Suppose that another sample of 5 bottles is
to be taken from the output of the same bottling machine. Find the
probability that the sample variance of the volumes in these 5 bottles will be
at least 7 times as large as the sample variance of the volumes in the 9 bottles
that were initially sampled.
Solution: Since the two samples are taken from the same population, we have σ²_X = σ²_Y. So
P (S²_X > 7S²_Y) = P ((S²_X/σ²_X) / (S²_Y/σ²_Y) > 7)
= P (F > 7), where F ∼ F (4, 8)
≈ P (F > 7.01) = 0.01, by the F table.
Using R, we get P (F > 7) = 0.01002557 with the command 1-pf(7,df1=4,df2=8).
30 / 45
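Example 6 can also be checked by simulating the two samples directly (a stdlib sketch, not from the notes; the R command above gives the exact value):

```python
import random
from statistics import variance

# Monte Carlo check of Example 6: both samples come from the same
# normal population, so P(S_X^2 > 7 S_Y^2) = P(F > 7) with
# F ~ F(4, 8), which should be about 0.01.
random.seed(4)
reps = 100_000
count = 0
for _ in range(reps):
    xs = [random.gauss(0.0, 1.0) for _ in range(5)]  # n = 5 bottles
    ys = [random.gauss(0.0, 1.0) for _ in range(9)]  # m = 9 bottles
    count += variance(xs) > 7 * variance(ys)
p = count / reps
print(round(p, 3))  # close to 0.010
```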
The Central Limit Theorem
31 / 45
The Central Limit Theorem
In the previous section, we assumed Y1 , · · · , Yn ∼ iid N (µ, σ 2 ). What if
Y1 , · · · , Yn are not normally distributed (e.g. Example 1)?
Can we still find the sampling distribution of Ȳ ? We can rely on “The
Central Limit Theorem”.
32 / 45
Convergence in Distribution
Consider a random variable Y and a sequence of random variables Xn
indexed by n = 1, 2, 3, · · · (that is, X1 , X2 , X3 , · · · ). Let FY (·) and FXn (·)
be the cdfs of Y and Xn , respectively.
Suppose
FXn (y) → FY (y) as n → ∞
for each value y ∈ R at which FY (y) is continuous. Then we say that
Xn converges to Y in distribution as n → ∞,
and we may express this statement mathematically as
Xn →ᵈ Y as n → ∞.
33 / 45
The Central Limit Theorem
Theorem 6 (CLT): Let Y1 , · · · , Yn be independent and identically
distributed random variables with E(Yi ) = µ and V ar(Yi ) = σ 2 < ∞.
Define
Un = (Ȳ − µ)/(σ/√n) = √n(Ȳ − µ)/σ.
Then,
Un →ᵈ Z as n → ∞,
where Z ∼ N (0, 1). Or you can write Un →ᵈ N (0, 1) as n → ∞.
Note: The CLT implies that when n is large, it is reasonable to make the following equivalent approximating statements: Ȳ ∼̇ N (µ, σ²/n) or Σᵢ₌₁ⁿ Yᵢ = nȲ ∼̇ N (nµ, nσ²).
34 / 45
The Central Limit Theorem
Note: The CLT makes no particular distributional assumption about the Yᵢ. It assumes only that they are iid with some common finite mean µ and common finite non-zero variance σ².
Since the cdf of Z, Φ(z), is continuous at all values z ∈ R, we have P (Un ≤ z) = F_Un(z) → Φ(z) as n → ∞ for any z ∈ R. This implies that probabilities involving Un can be approximated by N (0, 1) if n is large. Usually, a value of n greater than 30 will ensure that the distribution of Un can be closely approximated by N (0, 1).
35 / 45
The Central Limit Theorem
Example 7: 200 numbers are randomly chosen from between 0 and 1. Find
the probability that the average of these numbers is greater than 0.53.
Solution: Let Yi be the ith number, i = 1, · · · , n, where n = 200. Then,
Y1 , · · · , Yn ∼ iid U (0, 1) so that µ = E(Yi ) = 1/2 and
σ 2 = V ar(Yi ) = 1/12. Applying the CLT, we find
P (Ȳ > 0.53) = P (√n(Ȳ − µ)/σ > √n(0.53 − µ)/σ)
= P (Z > √200 (0.53 − 0.5)/√(1/12))
≈ P (Z > 1.47) = 0.0708.
Hence, the probability of interest is approximately 0.0708, which is very close to the true probability because the sample size is large.
36 / 45
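The arithmetic of Example 7 in stdlib Python (a sketch; `NormalDist` replaces the z-table):

```python
import math
from statistics import NormalDist

# Example 7: z = sqrt(200)(0.53 - 0.5)/sqrt(1/12) and
# P(Ybar > 0.53) ~ P(Z > z) by the CLT.
z = math.sqrt(200) * (0.53 - 0.5) / math.sqrt(1 / 12)
p = 1 - NormalDist().cdf(z)
print(round(z, 2), round(p, 4))  # 1.47, ~0.0708
```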
The Central Limit Theorem
Example 8: The service times for customers coming through a checkout
counter in a retail store are independent and identically distributed random
variables with mean 1.5 minutes and variance 1. Find the probability that
100 customers can be served in less than 2 hours of total service time.
Solution: If we let Yᵢ denote the service time for the ith customer, then we want
P (Σᵢ₌₁¹⁰⁰ Yᵢ ≤ 120) = P (Ȳ ≤ 120/100) = P (Ȳ ≤ 1.2).
Since the sample size n = 100 is large, we apply the CLT and get
P (Ȳ ≤ 1.2) = P (√100 (Ȳ − 1.5)/1 ≤ √100 (1.2 − 1.5)/1)
≈ P (Z ≤ −3) = 0.0013.
This small probability indicates that it is virtually impossible to serve 100
customers in only 2 hours.
37 / 45
The Normal Approximation to
the Binomial Distribution
38 / 45
Normal Approximation to Bin(n, p)
Theorem 7 (CLT for Sample Proportion):
Suppose that Y ∼ Bin(n, p). Then Y = Σᵢ₌₁ⁿ Yᵢ, where Y1 , · · · , Yn ∼ iid Bern(p). Define p̂ = Y /n = Σᵢ₌₁ⁿ Yᵢ / n, which is regarded as the sample proportion. Since µ = E(Yᵢ) = p and σ² = Var(Yᵢ) = p(1 − p), by the CLT, we have
√n(p̂ − p) / √(p(1 − p)) →ᵈ N (0, 1) as n → ∞.
Note: Equivalently, you can say p̂ ∼̇ N (p, p(1 − p)/n) or Y = Σᵢ₌₁ⁿ Yᵢ ∼̇ N (np, np(1 − p)) when n is large.
39 / 45
Normal Approximation to Bin(n, p)
Example 9: Candidate A believes that she can win a city election if she can
earn at least 55% of the votes in precinct 1. She also believes that about 50%
of the city’s voters favor her. If n = 100 voters show up to vote at precinct 1,
what is the probability that candidate A will receive at least 55% of their
votes?
Solution: Let Y denote the number of voters at precinct 1 who vote for
candidate A. If we think of the n = 100 voters at precinct 1 as a random
sample from the city, then Y ∼ Bin(n = 100, p = 0.5). We want to know
P (p̂ ≥ 0.55) where p̂ = Y /n.
Since p̂ ∼̇ N (p = 0.5, p(1 − p)/n = 0.0025), we get
P (p̂ ≥ 0.55) = P ((p̂ − 0.5)/√0.0025 ≥ (0.55 − 0.5)/√0.0025) ≈ P (Z ≥ 1) = 0.1587.
40 / 45
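Because Bin(100, 0.5) is small enough to enumerate, the normal approximation of Example 9 can be compared with the exact binomial tail (a stdlib sketch, not from the notes):

```python
import math
from statistics import NormalDist

# Example 9: normal approximation vs the exact binomial tail
# P(Y >= 55) for Y ~ Bin(100, 0.5), i.e. P(p_hat >= 0.55).
approx = 1 - NormalDist(mu=0.5, sigma=math.sqrt(0.0025)).cdf(0.55)
exact = sum(math.comb(100, y) for y in range(55, 101)) / 2 ** 100

print(round(approx, 4))  # 0.1587, as on the slide
print(round(exact, 4))   # somewhat larger; a continuity correction helps
```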
Normal Approximation to Bin(n, p)
Example 10: A die is rolled n = 120 times. Find the probability that at least
27 sixes come up.
Solution: Let Y be the number of 6s. Then Y ∼ Bin(120, 1/6), so Y ∼̇ N (120 × (1/6), 120 × (1/6) × (5/6)), i.e., N (20, 50/3). So
P (Y ≥ 27) ≈ P (U ≥ 27), where U ∼ N (20, 50/3)
= P (Z ≥ (27 − 20)/√(50/3))
= P (Z > 1.7146) = 0.0432.
41 / 45
The Continuity Correction
The exact probability is the area of the boxes above 27, 28, 29, · · · . We have approximated this probability by 0.0432, which is the area under the approximating normal density to the right of 27. But this area seems to be too small by about half the area of the box above 27 (i.e., the left half of that box). Thus, a better approximation would be the area to the right of 27 − 0.5 = 26.5 (as shaded below). We call the “−0.5” here the continuity correction.
42 / 45
The Continuity Correction
Let’s now apply this continuity correction to see what difference it makes:
P (Y ≥ 27) ≈ P (U ≥ 27 − 0.5), where U ∼ N (20, 50/3)
= P (Z ≥ (27 − 0.5 − 20)/√(50/3))
= P (Z > 1.59) = 0.0559.
And the exact probability can in fact be calculated:
P (Y ≥ 27) = Σ_{y=27}^{120} C(120, y) (1/6)^y (5/6)^(120−y) = 0.0597.
We see that the continuity correction here does indeed improve the
approximation.
43 / 45
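The comparison above can be reproduced exactly with integer arithmetic (a stdlib sketch; `Fraction` avoids rounding error in the binomial sum):

```python
import math
from fractions import Fraction
from statistics import NormalDist

# Continuity correction check: exact P(Y >= 27) for Y ~ Bin(120, 1/6)
# versus the corrected normal approximation P(U >= 26.5).
p, q = Fraction(1, 6), Fraction(5, 6)
exact = float(sum(math.comb(120, y) * p**y * q**(120 - y)
                  for y in range(27, 121)))

u = NormalDist(mu=20, sigma=math.sqrt(50 / 3))
corrected = 1 - u.cdf(26.5)

print(round(exact, 4))      # ~0.0597
print(round(corrected, 3))  # ~0.056, much closer than 0.0432
```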
The Continuity Correction
Similarly, if you are interested in P (Y ≤ 10), P (U ≤ 10 + 0.5) would be
a better approximation than P (U ≤ 10).
In summary, the 0.5 that we added to the largest value of interest (making
it a little larger) and subtracted from the smallest value of interest (making it
a little smaller) is commonly called the continuity correction associated
with the normal approximation.
The normal approximation to binomial probabilities works well even for moderately large n as long as p is not close to zero or one. A useful rule of thumb is that the normal approximation to the binomial distribution is appropriate when 0 < p − 3√(pq/n) and p + 3√(pq/n) < 1, where q = 1 − p.
44 / 45
Conclusion
• Have a good knowledge of the meaning of statistics and sampling
distributions.
• Master the sampling distributions related to normal distributions.
• Be familiar with the properties of the chi-square, t and F distributions and be able to find their quantiles.
• Know how to use the CLT to make inferences about the sample mean
and sample proportion.
45 / 45
Chapter 8: Estimation
Part (a): Point Estimation
STAT6039 Principles of Mathematical Statistics
Introduction
The purpose of statistics is to use the information contained in a sample to
make inferences about the population from which the sample is taken.
Since populations can be characterized by some numerical descriptive
measures called parameters, the objective of many statistical investigations
is to estimate the value of one or more relevant parameters.
Some important population parameters are the population mean µ,
population proportion p, population variance σ 2 , and population standard
deviation σ, or the functions of these parameters, etc. For example, we might
wish to estimate the mean waiting time µ at a supermarket checkout station
or the standard deviation σ of the error of measurement of an electronic
instrument.
1 / 48
Terminology
Target parameter is the parameter of interest in the experiment.
Point estimation:
• Use a single number to estimate the target parameter.
• For example, the point estimate of the average height of ANU students,
µ, is 165cm.
Interval estimation:
• Use an interval to estimate the target parameter.
• For example, the average height of ANU students, µ, will fall between
150cm and 180cm, which is an interval estimate of µ.
2 / 48
Terminology
Estimator:
• An estimator is a rule, often expressed as a formula, that tells how to estimate a population parameter based on the measurements contained in a sample. So it is a statistic, or an interval with at least one endpoint being a statistic.
• For example, one point estimator of µ is Ȳ = Σᵢ₌₁ⁿ Yᵢ / n, and one interval estimator of µ is [Ȳ − z_{α/2} S/√n, Ȳ + z_{α/2} S/√n].
Estimate:
• An estimate is a realized value of an estimator, calculated based on the sample data.
• For example, ȳ = (160 + 164 + 170 + 173 + 158)/5 = 165 (cm).
3 / 48
Point Estimation
4 / 48
Point Estimation
Many different estimators may be obtained for the same population
parameter. For example, sample mean and sample median can both be used
to estimate population mean.
How to evaluate the performance of an estimator?
5 / 48
Bias of a Point Estimator
Let θ be some target population parameter and let θ̂ denote a point
estimator of θ.
The bias of a point estimator θ̂ is given by B(θ̂) = E(θ̂) − θ.
If B(θ̂) = 0, or equivalently E(θ̂) = θ, we say θ̂ is an unbiased
estimator of θ. If B(θ̂) ̸= 0, θ̂ is said to be biased.
Unbiasedness is a desirable quality of a point estimator.
If B(θ̂) → 0 as n → ∞, we say θ̂ is asymptotically unbiased.
6 / 48
Bias of a Point Estimator
Theorem 1: Let Y1 , · · · , Yn ∼ iid (µ, σ²) where σ² < ∞. Define Ȳ = Σᵢ₌₁ⁿ Yᵢ / n and S² = Σᵢ₌₁ⁿ (Yᵢ − Ȳ)² / (n − 1). Then,
(a) Ȳ is an unbiased estimator of the population mean µ.
(b) S² is an unbiased estimator of σ².
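Part (b) and the role of the n − 1 divisor can be verified exactly for a small population (a stdlib sketch, not from the notes), reusing the fair four-sided die of Chapter 7, which has µ = 5/2 and σ² = 5/4:

```python
from itertools import product
from fractions import Fraction

# Exact check of Theorem 1(b): over all 16 equally likely samples of
# size 2 from a fair four-sided die, the n-1 divisor gives
# E(S^2) = sigma^2 = 5/4 exactly, while dividing by n underestimates.
faces = [1, 2, 3, 4]
samples = list(product(faces, repeat=2))

def sum_sq(sample, divisor):
    ybar = Fraction(sum(sample), len(sample))
    return sum((Fraction(y) - ybar) ** 2 for y in sample) / divisor

e_s2_unbiased = sum(sum_sq(s, 1) for s in samples) / len(samples)  # n-1 = 1
e_s2_biased = sum(sum_sq(s, 2) for s in samples) / len(samples)    # n = 2

print(e_s2_unbiased)  # 5/4 = sigma^2 (unbiased)
print(e_s2_biased)    # 5/8, biased downward
```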
Corollary:
Sample proportion p̂ = Y /...