Topic 1.1 - Descriptive statistics
Background
What is the purpose of these notes?
- Provide a few small examples of descriptive statistics;
- See bar plots, histograms, boxplots;
- Frequency distribution - visualization.
What is the format of this document?
(I am including these notes one more time.)
This document was created using R Markdown. You can read more about it here and check out a cheat sheet here, which will guide you through installing RStudio, and from there the moment you create a new .Rmd document, it will be a working template to start from. If you are used to using LaTeX, no worries: it can be embedded into Markdown, with overall simpler formatting. I hope you find this useful!
Loading required packages
To run some of the commands below, you need to load ggplot2:
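The loading chunk itself does not appear in these notes. A minimal sketch, assuming the package is loaded in the usual way with library(); the check for whether it is installed is an addition for convenience:

```r
# Load ggplot2 if it is installed; otherwise print a hint instead of failing.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
} else {
  message('ggplot2 is not installed; run install.packages("ggplot2") first.')
}
```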
Example of a small data set
In R, the function c() stands for “combine” and is used to create a data vector of numbers. Think of it simply as an ordered list of values.
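As a small illustration of c(), the sketch below builds a short vector and inspects it. The name v is hypothetical and chosen only for this example.

```r
# Combine three numbers into a numeric vector (the name v is illustrative only)
v <- c(2.2, 4.1, 3.5)
length(v)   # number of elements
v[2]        # second element; R indexes from 1
c(v, 4.5)   # c() can also append values to an existing vector
```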
Example from a book on car battery data
Note that we can manually enter data, in a list, and store it under the name car.batteries. (Of course this is not how you’ll be loading the data in practice!)
car.batteries<-c(
2.2,4.1,3.5, 4.5, 3.2, 3.7, 3.0, 2.6, 3.4, 1.6, 3.1, 3.3, 3.8, 3.1, 4.7, 3.7, 2.5,
4.3, 3.4, 3.6, 2.9, 3.3, 3.9, 3.1, 3.3, 3.1, 3.7, 4.4, 3.2, 4.1, 1.9, 3.4, 4.7,
3.8, 3.2, 2.6, 3.9, 3.0, 4.2, 3.5
)
Let’s plot the histogram of the above!
hist(car.batteries,freq=FALSE,ylim = c(0,1),xlim=c(1,6),main = paste("Histogram of..."),col="blue")
lines(density(car.batteries),col="red") # estimated distribution
Note that we can change the options… study the code. Note the figure width can be controlled, as well as the limits on the x and y axes, the main title, and the color.
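One more option worth trying is breaks, which sets the class intervals of the histogram. The sketch below is only an illustration: the intervals of width 0.5 from 1.5 to 5 are an assumed choice, not taken from these notes, and plot = FALSE makes hist() return the counts instead of drawing. The name freq.table is hypothetical.

```r
# Same data as above, repeated so the snippet runs on its own
car.batteries <- c(
  2.2, 4.1, 3.5, 4.5, 3.2, 3.7, 3.0, 2.6, 3.4, 1.6, 3.1, 3.3, 3.8, 3.1, 4.7, 3.7, 2.5,
  4.3, 3.4, 3.6, 2.9, 3.3, 3.9, 3.1, 3.3, 3.1, 3.7, 4.4, 3.2, 4.1, 1.9, 3.4, 4.7,
  3.8, 3.2, 2.6, 3.9, 3.0, 4.2, 3.5
)

# Class intervals of width 0.5 (an assumed choice, not from the notes)
h <- hist(car.batteries, breaks = seq(1.5, 5, by = 0.5), plot = FALSE)

# Frequency distribution: one row per class interval
freq.table <- data.frame(
  lower    = head(h$breaks, -1),
  upper    = h$breaks[-1],
  count    = h$counts,
  relative = h$counts / length(car.batteries)
)
print(freq.table)
```

The relative column is the proportion of observations in each interval, so it sums to 1.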
Using descriptive statistics to classify ‘shape of data’
As n grows, the histograms may change. Here are examples of various normal distributions (you don’t need to know what that really means yet).
What descriptive statistics would you use to summarize this kind of data?
par(mfrow = c(1,3)) # aligns figures in a grid! in this case, a 1x3 grid.
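# sample the data:
sample.size <- 100 # assumed value; the original definition is not shown in these notes
x <- rnorm(n = sample.size)
y <- rnorm(n = sample.size, mean = 250, sd = 300)
z <- rnorm(n = sample.size, mean = 10, sd = 1)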
# plot the histograms:
hist(x)
hist(y)
hist(z)
Let us introduce the “five-number summary” + mean:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
-2.4472 -0.7830 -0.1021 -0.1202  0.5223  2.4056
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 -529.3    58.8   219.1   242.1   417.6  1298.7
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  7.676   9.468   9.984   9.957  10.646  12.138
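Output of this form comes from R's summary() function. The chunk that produced it is not shown, so the sketch below rests on assumptions: it draws fresh samples with a hypothetical sample size of 100 and an arbitrary seed, and its numbers will differ from those printed above.

```r
set.seed(1)          # arbitrary seed, only for reproducibility
sample.size <- 100   # assumed; not stated in the notes

x <- rnorm(n = sample.size)
y <- rnorm(n = sample.size, mean = 250, sd = 300)
z <- rnorm(n = sample.size, mean = 10, sd = 1)

summary(x)   # min, quartiles, median, mean, max
summary(y)
summary(z)
fivenum(x)   # Tukey's five-number summary, without the mean
```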
What happens when we change the sample size to \(n=1000\) and re-run the same commands as above?
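One way to explore this question is to put two sample sizes side by side. This is a sketch with an arbitrary seed; the smaller size of 100 is an assumption, since the notes do not state the original value.

```r
set.seed(2)  # arbitrary seed, only for reproducibility
for (n in c(100, 1000)) {   # 100 is an assumed starting size
  z <- rnorm(n = n, mean = 10, sd = 1)
  cat("n =", n, "\n")
  print(summary(z))
}
```

Re-running with different seeds gives a feel for how much the summaries vary at each sample size.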
Let us visualize the quantile summaries of a simple data set:
[1] -3.09940576 -0.68541331 -0.03184644 0.65000801 2.72828859
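The five numbers printed above have the shape of quantile() output (minimum, quartiles, maximum), but the chunk that defined my.data is not shown. A minimal sketch, assuming my.data is a standard normal sample; the size 1000 and the seed are hypothetical, so the values will not match those above:

```r
set.seed(514)            # arbitrary seed
my.data <- rnorm(1000)   # assumed definition, not from the notes
q <- quantile(my.data)   # 0%, 25%, 50%, 75%, 100% quantiles
unname(q)                # print without the percentage labels
```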
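par(mfrow = c(1, 2)) # 1x2 grid: plain boxplot on the left, annotated one on the right
boxplot(my.data)
boxplot(my.data)
abline(h = min(my.data), col = "Blue") # minimum
abline(h = max(my.data), col = "Yellow") # maximum
abline(h = median(my.data), col = "Green") # median
abline(h = quantile(my.data, c(0.25, 0.75)), col = "Red") # first and third quartiles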
In the code above, the line par(mfrow = c(1, 2)) creates a grid of size 1x2 for plots; it divides the plot area into a grid so you see several plots on the same page as opposed to separately. Try changing the 1 and the 2 to something else!
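For instance, a 2x2 grid could look like the sketch below. The data are hypothetical samples drawn just for the illustration; saving the value returned by par() lets you restore the single-plot layout afterwards.

```r
set.seed(3)                      # arbitrary seed
old.par <- par(mfrow = c(2, 2))  # 2x2 grid; par() returns the previous settings
hist(rnorm(100), main = "n = 100")
hist(rnorm(1000), main = "n = 1000")
boxplot(rnorm(100), main = "n = 100")
boxplot(rnorm(1000), main = "n = 1000")
par(old.par)                     # back to one plot per page
```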
Online resources that are extremely useful:
http://www.stat.cmu.edu/~cshalizi/rmarkdown/ and https://anaconda.org/anaconda/markdown
License
This document is created for Math 514, Spring 2021, at Illinois Tech. While the course materials are generally not to be distributed outside the course without permission of the instructor, this particular set of notes is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.