Sampling distributions - some illustrative examples
Background
What is the purpose of these notes?
- Provide a few small examples of sampling distributions;
- Introduce you to
R Markdown
; - Give you several lines of
R
code you can use; - Motivate a few questions about
- why do we care about sampling distributions?
- why do we care about udnerstanding the models that best fit the data?
What is the format of this document?
This document was created using R Markdown
. You can read
more about it here and check
out a cheat
sheet here, which will guide you through installing RStudio, and
from there the moment you create a new .Rmd
document, it
will be a working template to start from. If you are used to using
LaTeX
, no worries: it can be embedded into
Markdown
, with overall simpler formatting. I hope you find
this useful!
Installing and loading packages
Just like every other programming language you may be familiar with, R’s capabilities can be greatly extended by installing additional “packages” and “libraries”.
To install a package, use the
install.packages()
command. You’ll want to run the
following commands to get the necessary packages for today’s lab:
install.packages("rmdformats")
install.packages("ggplot2")
install.packages("knitr")
You only need to install packages once. Once they’re installed, you
may use them by loading the libraries using the
library()
command. For today’s lab, you’ll want to run the
following code
library(ggplot2) # graphics library
library(knitr) # contains kable() function
options(scipen = 4) # Suppresses scientific notation
Context
As we learned in this week’s lectures, statistics are functions of random variables and therefore are random variables themselves. In particular, they have their own distributions, called sampling distributions. Here we take a look at a few examples of those distributions.
Sampling distributions from a coin simulation
This is a very simple example of simulating a coin toss, borrowed from my colleague at UIC: http://homepages.math.uic.edu/~jyang06/stat411/handouts/handout3.pdf. Let \(x\) be the random variable recording the outcome of a coin toss: \(x_i=0\) if we see Tail on the \(i\)-th trial (toss), and \(x_i=1\) if we see heads on the \(i\)-th trial.
<- 100000 # number of random experiments
n <- c(0,1) # sample space for tossing a coin, 0--Tail, 1--Head
x <- sample(x, size=n, replace=TRUE) coin.simulation
If you’d like to see what options we are using for sampling, and why, try this:
help(sample)
Let us count the number of heads, \(Y\), in this experiment. The number of heads observed when tossing a coin \(n\) times is a statistics computed from the sample of \(n\) coin tosses. \(Y\) is itself a random variable. It has a distribution, just like \(X\), the coin toss.
= sum(coin.simulation)
y y
[1] 49901
What is this number? It is a statistic, being the following function of the sample: \[\sum_{i=1}^n x_i,\] where \(n\) is the total sample size (in this case it was \(n=\) 100000), and \(x_i\) is the \(i\)-th observation of the random variable \(x\). The observed value of \(Y\) is \(y=\) 49901.
What can we say about the number of heads?
Let us fix the sample size to be \(n=10000\) and explore the number of heads seen when the coin is tossed \(n\) times.
- Some questions that should be on your mind are: is the number of heads supposed to be \(n/2\)? How far off is it? does it vary? what does this mean?
Let’s explore how this value varies in repeated experiments.
=10000
n.10reps<-replicate(10, sum(sample(x, size=n, replace=TRUE)))
sums.in.repeated.sampling.100reps<-replicate(100, sum(sample(x, size=n, replace=TRUE)))
sums.in.repeated.sampling.1Kreps<-replicate(10000, sum(sample(x, size=n, replace=TRUE)))
sums.in.repeated.samplingpar(mfrow=c(1,3))
hist(sums.in.repeated.sampling.10reps,main = paste("Number of heads: 10 reps"))
hist(sums.in.repeated.sampling.100reps,main = paste("Number of heads: 100 reps"))
hist(sums.in.repeated.sampling.1Kreps,main = paste("Number of heads: 1K reps"))
Conclusion
The experiments above suggest that the sampling distribution of \(Y\) appears to have a mean around the expected number of heads when a fair coin is tossed, which is about \(n/2\). The more times we repeat the experiment of \(n\) coin tosses, the closer \(Y\) gets to its expected value – this can be measured by looking at both the mean and the variance of Y.
<-c(mean(sums.in.repeated.sampling.10reps),
means.of.Ymean(sums.in.repeated.sampling.100reps),
mean(sums.in.repeated.sampling.1Kreps))
<-c(var(sums.in.repeated.sampling.100reps),
vars.of.Yvar(sums.in.repeated.sampling.100reps),
var(sums.in.repeated.sampling.1Kreps))
library(knitr)
kable(means.of.Y, align = "l",col.names="Means of Y")
Means of Y |
---|
5007.000 |
4998.050 |
4999.444 |
kable(vars.of.Y, align = "l",col.names="Vars of Y")
Vars of Y |
---|
2490.351 |
2490.351 |
2496.438 |
Question: is it possible that something similar to this always happens?
As we will see, the sampling distribution of \(Y\) is approximately normal with mean equal to the expected value of \(X\). In other words, the example above illustrates a known result–the Central Limit Theorem, one of the cornerstone results used in inference. You should already be familiar with it from your probability class.
Sampling distribution from casting a die
Consider casting a (regular, 6-sided) die. Perhaps you are interested in the number of times a “3” is observed.
Let’s set up the experiment:
<- c(1,2,3,4,5,6) # sample space for casting a die
x # in case we forgot what n was :) n
[1] 10000
<- sample(x, size=n, replace=TRUE)
die.simulation sum(die.simulation==3) # count number of "3"
[1] 1693
What do you see? We just rolled a die \(n\) times and this is how many times we saw the number 3. Why don’t we re-run that experiment for various \(n\)?
par(mfrow=c(2,3))
for(n in c(1,10,100,1000,10000,100000)){
<- sample(x, size=n, replace=TRUE)
die.simulation barplot(table(die.simulation),
main = paste("Rolling a die", n, "times") ,
xlim=c(0,7), col="mistyrose3"
#,ylab="Relative frequency"
) }
Importance of sampling distributions
Sampling distributions tell a story about the model behind the data (i.e., the probability distribution or population from which the data was sampled) and give a glimpse into how it was generated.
This example illustrates their importance.
<- 1000
N <- sample(1:3,prob=c(0.3,0.5,0.2),size=N,replace=TRUE)
components <- c(0,40,10)
mus <- sqrt(c(1,1,0.1))
sds <- rnorm(n=N,mean=mus[components],sd=sds[components])
samples hist(samples)
summary(samples)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-3.456 1.502 37.700 22.488 39.988 42.684
Hmm… is it strange to see “two bumps” in the histogram instead of one, as usual? Maybe the sample size is too small, we need to simulate more data (from the same model). Let’s do that:
<- 10000
N <- sample(1:3,prob=c(0.3,0.5,0.2),size=N,replace=TRUE)
components <- rnorm(n=N,mean=mus[components],sd=sds[components])
samples par(mfrow=c(1,2))
hist(samples)
plot(density(samples))
The picture only became clearer: this data is not being drawn from anything like a normal distribution. Consequently, knowing simply the mean and the variance … is not enough to understand the data, that is, the data-generating mechanism behind it.
The above is an example of a mixture of normal distributions.
Let’s explore mixture distributions: what do they capture?
An example of a mixture of discrete distributions
Here is an example from http://web.stanford.edu/class/bios221/book/Chap-Mixtures.html.
Suppose we have two unfair coins, whose probabilities of heads are \(p_1 = 0.125\) and \(p_2=0.25\). With probability \(\pi\) we pick coin 1, with probability \(1-\pi\), coin 2. We then toss that coin twice and record the number of heads \(K\).
Simulate 100 instances of this procedure, with \(\pi=\frac{1}{8}\), and compute the contingency table of \(K\).
Do the same with \(\pi=\frac{1}{4}\).
If you were not told \(p_1\), \(p_2\) and \(\pi\), could you infer them from the contingency table?
= c(0.125, 0.25)
probHead for (pi in c(1/8, 1/4)) {
= sample(2, 100, replace = TRUE, prob = c(pi, 1-pi))
whCoin = rbinom(length(whCoin), size = 2, prob = probHead[whCoin])
K print(table(K))
}
K
0 1 2
66 25 9
K
0 1 2
71 28 1
hist(K) #print(K)
Ponder the last question: If you were not told \(p_1\), \(p_2\) and \(\pi\), could you infer them from the contingency table? This is at the heart of statistical reasnoning. In this course, Math 563, we will learn how to answer this question and how to determine when a valid answer cannot be obtained with certainty.
Appendix
Some resources and further reading:
- sampling distribution overview offers a low-level overview of what the topic is. I do not recommend it for PhD students, however, it may give additional examples to some of you who would like to brush up on some basic stats.
- View a short video tutorial about the Central Limit Theorem here, and view another set of examples introducing sampling distributions.
License
This document was created for Math 563, Spring 2021, at Illinois Tech. While the course materials are generally not to be distributed outside the course without permission of the instructor, this particular set of notes is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.