Topic 1.1 - Descriptive statistics

Background

What is the purpose of these notes?

Provide a few small examples of descriptive statistics;
See bar plots, histograms, and boxplots in action;
Visualize frequency distributions.

What is the format of this document?

(I am including this note one more time.)
This document was created using R Markdown. You can read more about it here and check out a cheat sheet here, which will guide you through installing RStudio. From there, the moment you create a new .Rmd document, you get a working template to start from. If you are used to LaTeX, no worries: it can be embedded into Markdown, and the overall formatting is simpler. I hope you find this useful!

Loading required packages

To run some of the commands below, you need to load ggplot2:

library(ggplot2)

Example of a small data set

In R, the function c() (short for “combine”) is used to create a data vector of numbers. Think of it simply as an ordered list of values.
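As a minimal sketch of what c() does, the lines below build a short vector, check its length, pick out one entry, and append a value. The names v and w are hypothetical and used only for illustration.

```r
# Build a small numeric vector; the order of the entries is preserved.
v <- c(2.2, 4.1, 3.5)

length(v)  # number of entries in the vector
v[2]       # second entry (R indexes from 1)

# c() also combines existing vectors with new values.
w <- c(v, 4.5)
w
```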

Example from a book on car battery data

Note that we can manually enter data, in a list, and store it under the name car.batteries. (Of course this is not how you’ll be loading the data in practice!)

car.batteries<-c(
  2.2,4.1,3.5, 4.5, 3.2, 3.7, 3.0, 2.6, 3.4, 1.6, 3.1, 3.3, 3.8, 3.1, 4.7, 3.7, 2.5, 
  4.3, 3.4, 3.6, 2.9, 3.3, 3.9, 3.1, 3.3, 3.1, 3.7, 4.4, 3.2, 4.1, 1.9, 3.4, 4.7, 
  3.8, 3.2, 2.6, 3.9, 3.0, 4.2, 3.5
)

Let’s plot the histogram of the above!

hist(car.batteries)

hist(car.batteries,freq=FALSE)

hist(car.batteries,freq=FALSE,ylim = c(0,1),xlim=c(1,6),main = paste("Histogram of..."),col="blue")
lines(density(car.batteries),col="red") # kernel density estimate, overlaid on the histogram

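Note that we can change the options; study the code. Setting freq=FALSE puts the vertical axis on the density scale, so the areas of the bars add up to 1 and a density curve can be drawn over them. The figure width can be controlled, as well as the limits on the x and y axes (xlim, ylim), the main title (main), and the color (col).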
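The histogram is a picture of a frequency distribution, and the same counts can be read off as a table. The sketch below is one possible way to do this: it asks hist() for its bins without plotting and arranges them in a data frame. The names h and freq.table are hypothetical.

```r
# Same car battery data as above, repeated so this block runs on its own.
car.batteries <- c(
  2.2, 4.1, 3.5, 4.5, 3.2, 3.7, 3.0, 2.6, 3.4, 1.6, 3.1, 3.3, 3.8, 3.1, 4.7, 3.7, 2.5,
  4.3, 3.4, 3.6, 2.9, 3.3, 3.9, 3.1, 3.3, 3.1, 3.7, 4.4, 3.2, 4.1, 1.9, 3.4, 4.7,
  3.8, 3.2, 2.6, 3.9, 3.0, 4.2, 3.5
)

# plot = FALSE returns the bin information without drawing anything.
h <- hist(car.batteries, plot = FALSE)

# One row per bin: bin edges, count, and relative frequency.
freq.table <- data.frame(
  lower    = head(h$breaks, -1),
  upper    = tail(h$breaks, -1),
  count    = h$counts,
  relative = h$counts / sum(h$counts)
)
freq.table

# On the density scale (freq = FALSE), the bar areas add up to 1.
sum(h$density * diff(h$breaks))
```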

Using descriptive statistics to classify ‘shape of data’

As the sample size n grows, the shape of the histogram may change. Here are examples of samples drawn from various normal distributions (you don’t need to know what that really means yet).

# set sample size so the code below can be re-run by just changing this quantity: 
sample.size <- 100

What descriptive statistics would you use to summarize this kind of data?

par(mfrow = c(1,3)) # aligns figures in a grid! in this case, a 1x3 grid. 
# sample the data: 
x <- rnorm(n=sample.size)
y <- rnorm(n=sample.size,mean=250,sd=300)
z <- rnorm(n=sample.size,mean=10,sd=1)
# plot the histograms: 
hist(x)
hist(y)
hist(z)

Let us introduce the “five-number summary” (minimum, first quartile, median, third quartile, maximum) + mean:

summary(x)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-2.4472 -0.7830 -0.1021 -0.1202  0.5223  2.4056

summary(y)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 -529.3    58.8   219.1   242.1   417.6  1298.7

summary(z)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  7.676   9.468   9.984   9.957  10.646  12.138

What happens when we change the sample size to \(n=1000\) and re-run the same commands as above?
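One way to explore this question is to compute the summaries for both sample sizes side by side. The sketch below is only an illustration: the seed value and the helper summarize.sample are assumptions that do not appear in the notes, and the seed is there solely so the numbers can be reproduced.

```r
# Assumption: a fixed seed (not in the original notes) for reproducibility.
set.seed(514)

# Hypothetical helper: draw n values from a normal distribution and
# return the summary() values as a plain named vector.
summarize.sample <- function(n, mean = 0, sd = 1) {
  unclass(summary(rnorm(n = n, mean = mean, sd = sd)))
}

# Stack the two summaries so they can be compared column by column.
comparison <- rbind(
  "n = 100"  = summarize.sample(100,  mean = 10, sd = 1),
  "n = 1000" = summarize.sample(1000, mean = 10, sd = 1)
)
comparison
```

With the larger sample, the mean and median typically sit closer to 10, while the minimum and maximum tend to spread further out.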

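Let us visualize the quantile summaries of a simple data set. (No random seed is set in these notes, so x is a fresh draw on every run; that is why the numbers below do not match summary(x) above.)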

my.data<-x
fivenum(my.data)

[1] -3.09940576 -0.68541331 -0.03184644  0.65000801  2.72828859
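A small caveat: fivenum() reports Tukey's hinges, which are also what boxplot() draws as the edges of the box, while quantile() with its default settings interpolates a little differently. The sketch below uses a hypothetical eight-point vector v (not from the notes) on which the two visibly disagree.

```r
# Hypothetical toy data, chosen so the difference is easy to see.
v <- c(1, 2, 3, 4, 5, 6, 7, 8)

# Tukey's five-number summary: min, lower hinge, median, upper hinge, max.
five <- fivenum(v)
five        # 1.0 2.5 4.5 6.5 8.0

# Default quartiles from quantile() (its "type 7" interpolation).
quartiles <- quantile(v, c(0.25, 0.75))
quartiles   # 25%: 2.75, 75%: 6.25

# The numbers boxplot() uses; here they agree with fivenum()
# because v has no outliers.
box <- boxplot.stats(v)$stats
box
```

For large samples the two versions of the quartiles are usually very close, so lines drawn from quantile() land almost exactly on the edges of the box.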

par(mfrow = c(1, 2))
boxplot(my.data)
boxplot(my.data)
abline(h = min(my.data), col = "Blue")
abline(h = max(my.data), col = "Yellow")
abline(h = median(my.data), col = "Green")
abline(h = quantile(my.data, c(0.25, 0.75)), col = "Red")

In the code above, the line par(mfrow = c(1, 2)) creates a 1x2 grid for plots: it divides the plot area so that you see several plots on the same page rather than separately. Try changing the 1 and the 2 to something else!
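As one possible variation, the sketch below fills a 2x2 grid with histograms of standard normal samples of increasing size and then restores the previous layout. The seed, the sample sizes, and the names old.par and grid.in.use are assumptions made for illustration.

```r
# Assumption: a fixed seed (not in the original notes) for reproducibility.
set.seed(1)

# par() returns the previous settings, so they can be restored later.
old.par <- par(mfrow = c(2, 2))
grid.in.use <- par("mfrow")  # the layout now in effect: 2 rows, 2 columns

# Plots fill the grid row by row.
for (n in c(10, 100, 1000, 10000)) {
  hist(rnorm(n), main = paste("n =", n))
}

# Go back to one plot per page.
par(old.par)
```

Saving and restoring the old settings keeps later plots from unexpectedly landing in a leftover grid.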

Online resources that are extremely useful:

http://www.stat.cmu.edu/~cshalizi/rmarkdown/ and https://anaconda.org/anaconda/markdown

License

This document is created for Math 514, Spring 2021, at Illinois Tech. While the course materials are generally not to be distributed outside the course without permission of the instructor, this particular set of notes is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Sonja Petrović, Associate Professor of Applied Mathematics, College of Computing, Illinois Tech. Homepage, Email.