Sampling distributions - some illustrative examples


Background

What is the purpose of these notes?

  1. Provide a few small examples of sampling distributions;
  2. Introduce you to R Markdown;
  3. Give you several lines of R code you can use;
  4. Motivate a few questions about
  • why do we care about sampling distributions?
  • why do we care about understanding the models that best fit the data?

What is the format of this document?

This document was created using R Markdown. You can read more about it here and check out a cheat sheet here; these will guide you through installing RStudio, and from there, the moment you create a new .Rmd document, it serves as a working template to start from. If you are used to LaTeX, no worries: LaTeX can be embedded into Markdown, and the overall formatting is simpler. I hope you find this useful!
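For reference, a brand-new .Rmd file looks roughly like the sketch below (a minimal skeleton; the default template RStudio generates for you is longer, and the title here is just a placeholder):

---
title: "My first R Markdown document"
output: html_document
---

Some text, with *Markdown* formatting and LaTeX math like $\bar{x}$.

```{r}
# An R code chunk: the code and its output appear in the rendered document
summary(cars)
```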

Installing and loading packages

As with most other programming languages you may be familiar with, R’s capabilities can be greatly extended by installing additional “packages” and “libraries”.

To install a package, use the install.packages() command. You’ll want to run the following commands to get the necessary packages for today’s lab:

install.packages("rmdformats")
install.packages("ggplot2")
install.packages("knitr")

You only need to install packages once. Once they’re installed, you can use them by loading the libraries with the library() command. For today’s lab, you’ll want to run the following code:

library(ggplot2) # graphics library
library(knitr)   # contains kable() function

options(scipen = 4)  # Suppresses scientific notation
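Since packages need to be installed only once, you may prefer a guard that installs a package only when it is missing. Here is one common idiom (a sketch; not required for this lab):

for (pkg in c("rmdformats", "ggplot2", "knitr")) {
  if (!requireNamespace(pkg, quietly = TRUE)) install.packages(pkg)  # install only if absent
}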

Context

As we learned in this week’s lectures, statistics are functions of random variables and therefore are random variables themselves. In particular, they have their own distributions, called sampling distributions. Here we take a look at a few examples of those distributions.

Sampling distributions from a coin simulation

This is a very simple example of simulating a coin toss, borrowed from my colleague at UIC: http://homepages.math.uic.edu/~jyang06/stat411/handouts/handout3.pdf. Let \(X\) be the random variable recording the outcome of a coin toss: \(x_i=0\) if we see tails on the \(i\)-th trial (toss), and \(x_i=1\) if we see heads on the \(i\)-th trial.

n <- 100000 # number of random experiments
x <- c(0,1) # sample space for tossing a coin, 0--Tail, 1--Head
coin.simulation <- sample(x, size=n, replace=TRUE)
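As a quick sanity check on the simulation, we can tabulate the outcomes; for a fair coin, the two counts should each be near \(n/2\):

table(coin.simulation)  # counts of tails (0) and heads (1)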

If you’d like to see what options we are using for sampling, and why, try this:

help(sample)
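For instance, the prob argument lets us simulate an unfair coin. A small sketch (the 0.6 bias is an arbitrary illustrative choice):

unfair.coin.simulation <- sample(x, size=n, replace=TRUE, prob=c(0.4, 0.6))  # P(Tail)=0.4, P(Head)=0.6
mean(unfair.coin.simulation)  # proportion of heads; should be near 0.6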

Let us count the number of heads, \(Y\), in this experiment. The number of heads observed when tossing a coin \(n\) times is a statistic computed from the sample of \(n\) coin tosses. \(Y\) is itself a random variable: it has a distribution, just like \(X\), the coin toss.

y <- sum(coin.simulation)
y
[1] 49901

What is this number? It is a statistic, namely the following function of the sample: \[\sum_{i=1}^n x_i,\] where \(n\) is the total sample size (in this case \(n=\) 100000) and \(x_i\) is the \(i\)-th observation of the random variable \(X\). The observed value of \(Y\) is \(y=\) 49901.
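Note that the familiar sample mean \(\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i\) is just this statistic rescaled by \(1/n\), i.e., the proportion of heads:

mean(coin.simulation)   # equals y/n = 49901/100000 = 0.49901
sum(coin.simulation)/n  # the same number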

What can we say about the number of heads?

Let us fix the sample size to be \(n=10000\) and explore the number of heads seen when the coin is tossed \(n\) times.

  • Some questions that should be on your mind: Is the number of heads supposed to be \(n/2\)? How far off is it? Does it vary? What does this mean?

Let’s explore how this value varies in repeated experiments.

n <- 10000
sums.in.repeated.sampling.10reps  <- replicate(10,    sum(sample(x, size=n, replace=TRUE)))
sums.in.repeated.sampling.100reps <- replicate(100,   sum(sample(x, size=n, replace=TRUE)))
sums.in.repeated.sampling.10Kreps <- replicate(10000, sum(sample(x, size=n, replace=TRUE)))
par(mfrow=c(1,3))
hist(sums.in.repeated.sampling.10reps,  main = "Number of heads: 10 reps")
hist(sums.in.repeated.sampling.100reps, main = "Number of heads: 100 reps")
hist(sums.in.repeated.sampling.10Kreps, main = "Number of heads: 10K reps")

Conclusion

The experiments above suggest that the sampling distribution of \(Y\) is centered around the expected number of heads when a fair coin is tossed, which is \(n/2\). The more times we repeat the experiment of \(n\) coin tosses, the closer the average of the observed values of \(Y\) gets to this expected value; we can see this by looking at both the mean and the variance of \(Y\) across repetitions.

means.of.Y <- c(mean(sums.in.repeated.sampling.10reps),
                mean(sums.in.repeated.sampling.100reps),
                mean(sums.in.repeated.sampling.10Kreps))
vars.of.Y <- c(var(sums.in.repeated.sampling.10reps),
               var(sums.in.repeated.sampling.100reps),
               var(sums.in.repeated.sampling.10Kreps))
kable(means.of.Y, align = "l", col.names = "Means of Y")
Means of Y
5007.000
4998.050
4999.444
kable(vars.of.Y, align = "l",col.names="Vars of Y")
Vars of Y
2490.351
2490.351
2496.438
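
For comparison: under a fair coin, \(Y \sim \mathrm{Binomial}(n, 1/2)\), so \(E[Y] = n/2 = 5000\) and \(\mathrm{Var}(Y) = n/4 = 2500\), and the tables above are close to both. In R:

c(expected.mean = n/2, expected.var = n/4)
expected.mean  expected.var 
         5000          2500 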

Question: is it possible that something similar to this always happens?

As we will see, the sampling distribution of \(Y\) is approximately normal with mean \(n\,E[X] = n/2\) and variance \(n\,\mathrm{Var}(X) = n/4\). In other words, the example above illustrates a well-known result, the Central Limit Theorem, one of the cornerstone results used in inference. You should already be familiar with it from your probability class.
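
We can check the normal approximation visually by standardizing the simulated sums and overlaying the standard normal density. A quick sketch, using the 10K-replication sample from above:

z <- (sums.in.repeated.sampling.10Kreps - n/2)/sqrt(n/4)  # center at n/2, scale by sd sqrt(n/4)
hist(z, freq=FALSE, main = "Standardized number of heads")
curve(dnorm(x), add=TRUE, col="red")  # standard normal density for comparison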

Sampling distribution from casting a die

Consider casting a (regular, 6-sided) die. Perhaps you are interested in the number of times a “3” is observed.

Let’s set up the experiment:

x <- c(1,2,3,4,5,6) # sample space for casting a die
n # in case we forgot what n was :) 
[1] 10000
die.simulation <- sample(x, size=n, replace=TRUE)
sum(die.simulation==3) # count number of "3"
[1] 1693

What do you see? We just rolled a die \(n\) times and this is how many times we saw the number 3. Why don’t we re-run that experiment for various \(n\)?

par(mfrow=c(2,3))
for(n in c(1,10,100,1000,10000,100000)){
  die.simulation <- sample(x, size=n, replace=TRUE)
  barplot(table(die.simulation),
          main = paste("Rolling a die", n, "times"),
          xlim = c(0,7), col = "mistyrose3")
}
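
As with the coin, the number of 3s is itself a statistic, with its own sampling distribution. A quick sketch for \(n=10000\) (the choice of 1000 repetitions is arbitrary):

n <- 10000  # reset n, since the loop above changed it
counts.of.threes <- replicate(1000, sum(sample(x, size=n, replace=TRUE) == 3))
hist(counts.of.threes, main = "Number of 3s in 10000 rolls")
abline(v = n/6, col = "red")  # the expected count, n*(1/6)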

Importance of sampling distributions

Sampling distributions tell a story about the model behind the data (i.e., the probability distribution or population from which the data was sampled) and give a glimpse into how it was generated.

The following example illustrates their importance.

N <- 1000
components <- sample(1:3, prob=c(0.3,0.5,0.2), size=N, replace=TRUE)  # pick a component for each draw
mus <- c(0,40,10)        # component means
sds <- sqrt(c(1,1,0.1))  # component standard deviations
samples <- rnorm(n=N, mean=mus[components], sd=sds[components])
hist(samples)

summary(samples)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 -3.456   1.502  37.700  22.488  39.988  42.684 

Hmm… is it strange to see “two bumps” in the histogram instead of one, as usual? Notice also how far apart the mean (22.49) and the median (37.70) are in the summary above. Maybe the sample size is too small and we need to simulate more data (from the same model). Let’s do that:

N <- 10000
components <- sample(1:3,prob=c(0.3,0.5,0.2),size=N,replace=TRUE)
samples <- rnorm(n=N,mean=mus[components],sd=sds[components])
par(mfrow=c(1,2))
hist(samples)
plot(density(samples))

The picture has only become clearer: this data is not being drawn from anything like a normal distribution. Consequently, knowing only the mean and the variance is not enough to understand the data, that is, the data-generating mechanism behind it.

The above is an example of a mixture of normal distributions.
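
To see exactly what we simulated, here is a sketch of the true density of this mixture, written out from the weights (0.3, 0.5, 0.2) and the component means and standard deviations used in the code above:

mixture.density <- function(t) {
  0.3 * dnorm(t, mean = 0,  sd = 1) +
    0.5 * dnorm(t, mean = 40, sd = 1) +
    0.2 * dnorm(t, mean = 10, sd = sqrt(0.1))
}
curve(mixture.density, from = -5, to = 45, n = 1000, ylab = "density")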

Let’s explore mixture distributions: what do they capture?

An example of a mixture of discrete distributions

Here is an example from http://web.stanford.edu/class/bios221/book/Chap-Mixtures.html.

Suppose we have two unfair coins, whose probabilities of heads are \(p_1 = 0.125\) and \(p_2=0.25\). With probability \(\pi\) we pick coin 1, with probability \(1-\pi\), coin 2. We then toss that coin twice and record the number of heads \(K\).

  1. Simulate 100 instances of this procedure, with \(\pi=\frac{1}{8}\), and compute the contingency table of \(K\).

  2. Do the same with \(\pi=\frac{1}{4}\).

  3. If you were not told \(p_1\), \(p_2\) and \(\pi\), could you infer them from the contingency table?

probHead = c(0.125, 0.25)
for (pi in c(1/8, 1/4)) {
  whCoin = sample(2, 100, replace = TRUE, prob = c(pi, 1-pi))   # which coin: 1 w.p. pi, 2 w.p. 1-pi
  K = rbinom(length(whCoin), size = 2, prob = probHead[whCoin]) # number of heads in 2 tosses
  print(table(K))
}
K
 0  1  2 
66 25  9 
K
 0  1  2 
71 28  1 
hist(K)  # note: K holds only the draws from the last loop iteration (pi = 1/4)
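
To see what such inference is up against, we can compare the observed tables to the theoretical distribution of \(K\): \[P(K=k) = \pi\binom{2}{k}p_1^k(1-p_1)^{2-k} + (1-\pi)\binom{2}{k}p_2^k(1-p_2)^{2-k}.\] A sketch of the expected counts out of 100 draws, under the parameters above:

for (pi in c(1/8, 1/4)) {
  pmf <- pi * dbinom(0:2, size = 2, prob = 0.125) +
    (1 - pi) * dbinom(0:2, size = 2, prob = 0.25)
  print(round(100 * pmf, 1))  # expected counts of K = 0, 1, 2
}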

Ponder the last question: if you were not told \(p_1\), \(p_2\), and \(\pi\), could you infer them from the contingency table? This is at the heart of statistical reasoning. In this course, Math 563, we will learn how to answer this question and how to determine when a valid answer cannot be obtained with certainty.

Appendix

Some resources and further reading:

  • The sampling distribution overview offers a basic overview of the topic. I do not recommend it for PhD students; however, it may give additional examples to those of you who would like to brush up on some basic stats.
  • View a short video tutorial about the Central Limit Theorem here, and another set of examples introducing sampling distributions.

License

This document was created for Math 563, Spring 2021, at Illinois Tech. While the course materials are generally not to be distributed outside the course without permission of the instructor, this particular set of notes is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.



  1. Sonja Petrović, Associate Professor of Applied Mathematics, College of Computing, Illinois Tech. Homepage, Email.↩︎