Topic 1 Interlude: R - Basic cleaning, loops, and alternatives

Context

What is the purpose of these notes?

Provide a few small examples of functions in R;
Provide html/Markdown with several lines of R code you can use to practice writing functions.

Agenda

A common data cleaning task
For/while loops to iterate over data
Helpful variants of map, mutate and summarize

Package loading

library(tidyverse)
Cars93 <- MASS::Cars93  # For Cars93 data again

A common problem: messy data

One of the most common problems you’ll encounter when importing manually-entered data is inconsistent data types within columns
For a simple example, let’s look at TVhours column in a messy version of the survey data from the following link http://www.andrew.cmu.edu/user/achoulde/94842/data/survey_data2019_messy.csv:

survey.messy <- read.csv("http://www.andrew.cmu.edu/user/achoulde/94842/data/survey_data2019_messy.csv", 
                         header=TRUE)
survey.messy$TVhours

 [1] "20"        "6"         "10"        "2"         "none"      "10"       
 [7] "15"        "3"         "0"         "0"         "5"         "2"        
[13] "10"        "40"        "zero"      "5"         "3"         "20"       
[19] "0"         "10"        "2"         "10"        "8"         "8"        
[25] "2"         "12"        "5"         "6"         "4"         "4"        
[31] "0"         "5"         "0"         "0"         "0"         "4"        
[37] "2"         "3"         "14"        "0"         "3"         "7"        
[43] "0"         "3"         "7"         "4"         "5"         "1.5"      
[49] "4"         "approx 10" "0"         "0"         "4"         "2"        
[55] "0"         "0"         "1"         "0"         "2"         "2"        
[61] "0"         "0.5"       "6ish"      "3"         "6"         "1"        
[67] "5"

What’s happening?

str(survey.messy)

'data.frame':   67 obs. of  6 variables:
 $ Program        : chr  "PPM" "MISM" "MISM" "PPM" ...
 $ PriorExp       : chr  "Some experience" "Some experience" "Some experience" "Some experience" ...
 $ Rexperience    : chr  "Basic competence" "Never used" "Never used" "Basic competence" ...
 $ OperatingSystem: chr  "Windows" "Mac OS X" "Mac OS X" "Mac OS X" ...
 $ TVhours        : chr  "20" "6" "10" "2" ...
 $ Editor         : chr  "Microsoft Word" "Microsoft Word" "Microsoft Word" "R Markdown" ...

Several of the entries have non-numeric values in them (they contain strings)
As a result, TVhours is being imported as factor

A look at the TVhours column

survey.messy$TVhours

 [1] "20"        "6"         "10"        "2"         "none"      "10"       
 [7] "15"        "3"         "0"         "0"         "5"         "2"        
[13] "10"        "40"        "zero"      "5"         "3"         "20"       
[19] "0"         "10"        "2"         "10"        "8"         "8"        
[25] "2"         "12"        "5"         "6"         "4"         "4"        
[31] "0"         "5"         "0"         "0"         "0"         "4"        
[37] "2"         "3"         "14"        "0"         "3"         "7"        
[43] "0"         "3"         "7"         "4"         "5"         "1.5"      
[49] "4"         "approx 10" "0"         "0"         "4"         "2"        
[55] "0"         "0"         "1"         "0"         "2"         "2"        
[61] "0"         "0.5"       "6ish"      "3"         "6"         "1"        
[67] "5"

Attempt at a fix

What if we just try to cast it back to numeric?

tv.hours.messy <- survey.messy$TVhours
tv.hours.messy

 [1] "20"        "6"         "10"        "2"         "none"      "10"       
 [7] "15"        "3"         "0"         "0"         "5"         "2"        
[13] "10"        "40"        "zero"      "5"         "3"         "20"       
[19] "0"         "10"        "2"         "10"        "8"         "8"        
[25] "2"         "12"        "5"         "6"         "4"         "4"        
[31] "0"         "5"         "0"         "0"         "0"         "4"        
[37] "2"         "3"         "14"        "0"         "3"         "7"        
[43] "0"         "3"         "7"         "4"         "5"         "1.5"      
[49] "4"         "approx 10" "0"         "0"         "4"         "2"        
[55] "0"         "0"         "1"         "0"         "2"         "2"        
[61] "0"         "0.5"       "6ish"      "3"         "6"         "1"        
[67] "5"

as.numeric(tv.hours.messy)

Warning: NAs introduced by coercion

 [1] 20.0  6.0 10.0  2.0   NA 10.0 15.0  3.0  0.0  0.0  5.0  2.0 10.0 40.0   NA
[16]  5.0  3.0 20.0  0.0 10.0  2.0 10.0  8.0  8.0  2.0 12.0  5.0  6.0  4.0  4.0
[31]  0.0  5.0  0.0  0.0  0.0  4.0  2.0  3.0 14.0  0.0  3.0  7.0  0.0  3.0  7.0
[46]  4.0  5.0  1.5  4.0   NA  0.0  0.0  4.0  2.0  0.0  0.0  1.0  0.0  2.0  2.0
[61]  0.0  0.5   NA  3.0  6.0  1.0  5.0

That didn’t work…

tv.hours.messy
as.numeric(tv.hours.messy)

 [1] "20"   "6"    "10"   "2"    "none" "10"   "15"   "3"    "0"    "0"   
[11] "5"    "2"    "10"   "40"   "zero" "5"    "3"    "20"   "0"    "10"  
[21] "2"    "10"   "8"    "8"    "2"    "12"   "5"    "6"    "4"    "4"   
[31] "0"    "5"    "0"    "0"    "0"    "4"    "2"    "3"    "14"   "0"

Warning in head(as.numeric(tv.hours.messy), 40): NAs introduced by coercion

 [1] 20  6 10  2 NA 10 15  3  0  0  5  2 10 40 NA  5  3 20  0 10  2 10  8  8  2
[26] 12  5  6  4  4  0  5  0  0  0  4  2  3 14  0

This just converted all the values into the integer-coded levels of the factor
Not what we wanted!

Something that does work

Consider the following simple example

num.vec <- c(3.1, 2.5)
as.factor(num.vec)

[1] 3.1 2.5
Levels: 2.5 3.1

as.numeric(as.factor(num.vec))

[1] 2 1

as.numeric(as.character(as.factor(num.vec)))

[1] 3.1 2.5

If we take a number that’s being coded as a factor and first turn it into a character string, then converting the string to a numeric gets back the number

Back to the corrupted TVhours column

as.character(tv.hours.messy)

 [1] "20"        "6"         "10"        "2"         "none"      "10"       
 [7] "15"        "3"         "0"         "0"         "5"         "2"        
[13] "10"        "40"        "zero"      "5"         "3"         "20"       
[19] "0"         "10"        "2"         "10"        "8"         "8"        
[25] "2"         "12"        "5"         "6"         "4"         "4"        
[31] "0"         "5"         "0"         "0"         "0"         "4"        
[37] "2"         "3"         "14"        "0"         "3"         "7"        
[43] "0"         "3"         "7"         "4"         "5"         "1.5"      
[49] "4"         "approx 10" "0"         "0"         "4"         "2"        
[55] "0"         "0"         "1"         "0"         "2"         "2"        
[61] "0"         "0.5"       "6ish"      "3"         "6"         "1"        
[67] "5"

as.numeric(as.character(tv.hours.messy))

Warning: NAs introduced by coercion

 [1] 20.0  6.0 10.0  2.0   NA 10.0 15.0  3.0  0.0  0.0  5.0  2.0 10.0 40.0   NA
[16]  5.0  3.0 20.0  0.0 10.0  2.0 10.0  8.0  8.0  2.0 12.0  5.0  6.0  4.0  4.0
[31]  0.0  5.0  0.0  0.0  0.0  4.0  2.0  3.0 14.0  0.0  3.0  7.0  0.0  3.0  7.0
[46]  4.0  5.0  1.5  4.0   NA  0.0  0.0  4.0  2.0  0.0  0.0  1.0  0.0  2.0  2.0
[61]  0.0  0.5   NA  3.0  6.0  1.0  5.0

typeof(as.numeric(as.character(tv.hours.messy)))  # Success!! (Almost...)

Warning in typeof(as.numeric(as.character(tv.hours.messy))): NAs introduced by
coercion

[1] "double"

A small improvement

All the corrupted cells now appear as NA, which is R’s missing indicator
We can do a little better by cleaning up the vector once we get it to character form

tv.hours.strings <- as.character(tv.hours.messy)
tv.hours.strings

 [1] "20"        "6"         "10"        "2"         "none"      "10"       
 [7] "15"        "3"         "0"         "0"         "5"         "2"        
[13] "10"        "40"        "zero"      "5"         "3"         "20"       
[19] "0"         "10"        "2"         "10"        "8"         "8"        
[25] "2"         "12"        "5"         "6"         "4"         "4"        
[31] "0"         "5"         "0"         "0"         "0"         "4"        
[37] "2"         "3"         "14"        "0"         "3"         "7"        
[43] "0"         "3"         "7"         "4"         "5"         "1.5"      
[49] "4"         "approx 10" "0"         "0"         "4"         "2"        
[55] "0"         "0"         "1"         "0"         "2"         "2"        
[61] "0"         "0.5"       "6ish"      "3"         "6"         "1"        
[67] "5"

Deleting non-numeric (or .) characters

tv.hours.strings

 [1] "20"        "6"         "10"        "2"         "none"      "10"       
 [7] "15"        "3"         "0"         "0"         "5"         "2"        
[13] "10"        "40"        "zero"      "5"         "3"         "20"       
[19] "0"         "10"        "2"         "10"        "8"         "8"        
[25] "2"         "12"        "5"         "6"         "4"         "4"        
[31] "0"         "5"         "0"         "0"         "0"         "4"        
[37] "2"         "3"         "14"        "0"         "3"         "7"        
[43] "0"         "3"         "7"         "4"         "5"         "1.5"      
[49] "4"         "approx 10" "0"         "0"         "4"         "2"        
[55] "0"         "0"         "1"         "0"         "2"         "2"        
[61] "0"         "0.5"       "6ish"      "3"         "6"         "1"        
[67] "5"

# Use gsub() to replace everything except digits and '.' with a blank ""
gsub("[^0-9.]", "", tv.hours.strings)

 [1] "20"  "6"   "10"  "2"   ""    "10"  "15"  "3"   "0"   "0"   "5"   "2"  
[13] "10"  "40"  ""    "5"   "3"   "20"  "0"   "10"  "2"   "10"  "8"   "8"  
[25] "2"   "12"  "5"   "6"   "4"   "4"   "0"   "5"   "0"   "0"   "0"   "4"  
[37] "2"   "3"   "14"  "0"   "3"   "7"   "0"   "3"   "7"   "4"   "5"   "1.5"
[49] "4"   "10"  "0"   "0"   "4"   "2"   "0"   "0"   "1"   "0"   "2"   "2"  
[61] "0"   "0.5" "6"   "3"   "6"   "1"   "5"

The final product

tv.hours.messy[1:30]

 [1] "20"   "6"    "10"   "2"    "none" "10"   "15"   "3"    "0"    "0"   
[11] "5"    "2"    "10"   "40"   "zero" "5"    "3"    "20"   "0"    "10"  
[21] "2"    "10"   "8"    "8"    "2"    "12"   "5"    "6"    "4"    "4"

tv.hours.clean <- as.numeric(gsub("[^0-9.]", "", tv.hours.strings))
tv.hours.clean

 [1] 20.0  6.0 10.0  2.0   NA 10.0 15.0  3.0  0.0  0.0  5.0  2.0 10.0 40.0   NA
[16]  5.0  3.0 20.0  0.0 10.0  2.0 10.0  8.0  8.0  2.0 12.0  5.0  6.0  4.0  4.0
[31]  0.0  5.0  0.0  0.0  0.0  4.0  2.0  3.0 14.0  0.0  3.0  7.0  0.0  3.0  7.0
[46]  4.0  5.0  1.5  4.0 10.0  0.0  0.0  4.0  2.0  0.0  0.0  1.0  0.0  2.0  2.0
[61]  0.0  0.5  6.0  3.0  6.0  1.0  5.0

As a last step, we should go through and figure out if any of the NA values should really be 0.
- This step is not shown here.

Rebuilding our data

survey <- mutate(survey.messy, TVhours = tv.hours.clean)
str(survey)

'data.frame':   67 obs. of  6 variables:
 $ Program        : chr  "PPM" "MISM" "MISM" "PPM" ...
 $ PriorExp       : chr  "Some experience" "Some experience" "Some experience" "Some experience" ...
 $ Rexperience    : chr  "Basic competence" "Never used" "Never used" "Basic competence" ...
 $ OperatingSystem: chr  "Windows" "Mac OS X" "Mac OS X" "Mac OS X" ...
 $ TVhours        : num  20 6 10 2 NA 10 15 3 0 0 ...
 $ Editor         : chr  "Microsoft Word" "Microsoft Word" "Microsoft Word" "R Markdown" ...

Success!

A different approach

We can also handle this problem by setting stringsAsFactors = FALSE when importing our data.

survey.meayssy <- read.csv("http://www.andrew.cmu.edu/user/achoulde/94842/data/survey_data2019_messy.csv", 
                         header=TRUE, stringsAsFactors=FALSE)
str(survey.messy)

'data.frame':   67 obs. of  6 variables:
 $ Program        : chr  "PPM" "MISM" "MISM" "PPM" ...
 $ PriorExp       : chr  "Some experience" "Some experience" "Some experience" "Some experience" ...
 $ Rexperience    : chr  "Basic competence" "Never used" "Never used" "Basic competence" ...
 $ OperatingSystem: chr  "Windows" "Mac OS X" "Mac OS X" "Mac OS X" ...
 $ TVhours        : chr  "20" "6" "10" "2" ...
 $ Editor         : chr  "Microsoft Word" "Microsoft Word" "Microsoft Word" "R Markdown" ...

Now everything is a character instead of a factor

One-line cleanup

Let’s clean up the TVhours column and cast it to numeric all in one command

survey <- mutate(survey.messy, 
        TVhours = as.numeric(gsub("[^0-9.]", "", TVhours)))
str(survey)

'data.frame':   67 obs. of  6 variables:
 $ Program        : chr  "PPM" "MISM" "MISM" "PPM" ...
 $ PriorExp       : chr  "Some experience" "Some experience" "Some experience" "Some experience" ...
 $ Rexperience    : chr  "Basic competence" "Never used" "Never used" "Basic competence" ...
 $ OperatingSystem: chr  "Windows" "Mac OS X" "Mac OS X" "Mac OS X" ...
 $ TVhours        : num  20 6 10 2 NA 10 15 3 0 0 ...
 $ Editor         : chr  "Microsoft Word" "Microsoft Word" "Microsoft Word" "R Markdown" ...

What about all those other `character` variables?

table(survey[["Program"]])


 MISM Other   PPM 
    8    17    42

table(as.factor(survey[["Program"]]))


 MISM Other   PPM 
    8    17    42

Having factors coded as characters may be OK for many parts of our analysis

If we wanted to, here’s how we could fix things

mutate_if(.tbl, .predicate, .funs) applies a functions .funs to the elements of .tbl for which the predicate (condition) .predicate holds.

Here is how we can use mutate_if to convert all character columns to factors.

survey <- survey %>% mutate_if(is.character, as.factor)
# Outcome:
str(survey)

'data.frame':   67 obs. of  6 variables:
 $ Program        : Factor w/ 3 levels "MISM","Other",..: 3 1 1 3 3 3 3 3 3 3 ...
 $ PriorExp       : Factor w/ 3 levels "Extensive experience",..: 3 3 3 3 3 3 2 3 3 3 ...
 $ Rexperience    : Factor w/ 4 levels "Basic competence",..: 1 4 4 1 3 1 3 1 3 4 ...
 $ OperatingSystem: Factor w/ 2 levels "Mac OS X","Windows": 2 1 1 1 2 1 2 2 2 2 ...
 $ TVhours        : num  20 6 10 2 NA 10 15 3 0 0 ...
 $ Editor         : Factor w/ 5 levels "Jupyter Notebook",..: 3 3 3 4 3 3 3 3 3 3 ...

Success!

Another common problem

In various homework assignments, you’ll learn how to wrangle with another common problem
When data is entered manually, misspellings and case changes are very common
E.g., a column showing treatment information may look like,

treatment <- as.factor(c("dialysis", "Ventilation", "Dialysis", "dialysis", "none", "None", "nnone", "dyalysis", "dialysis", "ventilation", "none"))
summary(treatment)

   dialysis    Dialysis    dyalysis       nnone        none        None 
          3           1           1           1           2           1 
ventilation Ventilation 
          1           1

summary(treatment)

   dialysis    Dialysis    dyalysis       nnone        none        None 
          3           1           1           1           2           1 
ventilation Ventilation 
          1           1

This factor has 8 levels even though it should have 3 (dialysis, ventilation, none)
We can fix many of the typos by running spellcheck in Excel before importing data, or by changing the values on a case-by-case basis later
There’s a faster way to fix just the capitalization issue (this will be an exercise in one of the HW sets)

What are all these `map<*>` functions?

These are all efficient ways of applying a function to margins of an array or elements of a list
Before we talk about them in detail, we need to understand their more cumbersome but more general alternative: loops
loops are ways of iterating over data
The map<*> functions and their <*>apply base-R ancestors can be thought of as good alternatives to loops

`for` loops

For loops: a pair of examples

for(i in 1:4) {
  print(i)
}

[1] 1
[1] 2
[1] 3
[1] 4

phrase <- "Good Night,"
for(word in c("and", "Good", "Luck")) {
  phrase <- paste(phrase, word)
  print(phrase)
}

[1] "Good Night, and"
[1] "Good Night, and Good"
[1] "Good Night, and Good Luck"

For loops: syntax

A for loop executes a chunk of code for every value of an index variable in an index set

The basic syntax takes the form

for(index.variable in index.set) {
  code to be repeated at every value of index.variable
}

The index set is often a vector of integers, but can be more general

Example

index.set <- list(name="Michael", weight=185, is.male=TRUE) # a list
for(i in index.set) {
  print(c(i, typeof(i)))
}

[1] "Michael"   "character"
[1] "185"    "double"
[1] "TRUE"    "logical"

Example: Calculate sum of each column

fake.data <- matrix(rnorm(500), ncol=5) # create fake 100 x 5 data set
head(fake.data,2) # print first two rows

          [,1]       [,2]      [,3]      [,4]       [,5]
[1,] -1.146664 -0.7225478 0.3190678 0.8936148 -0.1715746
[2,] -1.534036 -0.2740636 1.4878076 1.0369705  1.6525823

col.sums <- numeric(ncol(fake.data)) # variable to store running column sums
for(i in 1:nrow(fake.data)) {
  col.sums <- col.sums + fake.data[i,] # add ith observation to the sum
}
col.sums

[1]   2.497672   6.612204 -27.731727  -7.397220 -14.098159

colSums(fake.data) # A better approach (see also colMeans())

[1]   2.497672   6.612204 -27.731727  -7.397220 -14.098159

`while` loops

while loops repeat a chunk of code while the specified condition remains true

day <- 1
num.days <- 365
while(day <= num.days) {
  day <- day + 1
}

We won’t really be using while loops in this class
Just be aware that they exist, and that they may become useful to you at some point in your analytics career

Loop alternatives

Command	Description
`apply(X, MARGIN, FUN)`	Obtain a vector/array/list by applying `FUN` along the specified `MARGIN` of an array or matrix `X`
`map(.x, .f, ...)`	Obtain a list by applying `.f` to every element of a list or atomic vector `.x`
`map_<type>(.x, .f, ...)`	For `<type>` given by `lgl` (logical), `int` (integer), `dbl` (double) or `chr` (character), return a vector of this type obtained by applying `.f` to each element of `.x`
`map_at(.x, .at, .f)`	Obtain a list by applying `.f` to the elements of `.x` specified by name or index given in `.at`
`map_if(.x, .p, .f)`	Obtain a list `.f` to the elements of `.x` specified by `.p` (a predicate function, or a logical vector)
`mutate_all/_at/_if`	Mutate all variables, specified (at) variables, or those selected by a predicate (if)
`summarize_all/_at/_if`	Summarize all variables, specified variables, or those selected by a predicate (if)

These take practice to get used to, but make analysis easier to debug and less prone to error when used effectively
The best way to learn them is by looking at a bunch of examples. The end of each help file contains some examples.

Example: apply()

colMeans(fake.data)

[1]  0.02497672  0.06612204 -0.27731727 -0.07397220 -0.14098159

apply(fake.data, MARGIN=2, FUN=mean)

[1]  0.02497672  0.06612204 -0.27731727 -0.07397220 -0.14098159

# MARGIN = 1 for rows, 2 for columns
# Function that calculates proportion
# of vector indices that are > 0
propPositive <- function(x) mean(x > 0)
apply(fake.data, MARGIN=2, FUN=propPositive)

[1] 0.44 0.49 0.42 0.47 0.39

Example: map, map_()

map(survey, is.factor) # Returns a list

$Program
[1] TRUE

$PriorExp
[1] TRUE

$Rexperience
[1] TRUE

$OperatingSystem
[1] TRUE

$TVhours
[1] FALSE

$Editor
[1] TRUE

map_lgl(survey, is.factor) # Returns a logical vector with named elements

        Program        PriorExp     Rexperience OperatingSystem         TVhours 
           TRUE            TRUE            TRUE            TRUE           FALSE 
         Editor 
           TRUE

Example: apply(), map(), map_()

apply(cars, 2, FUN=mean) # Data frames are arrays

speed  dist 
15.40 42.98

map(cars, mean) # Data frames are also lists

$speed
[1] 15.4

$dist
[1] 42.98

map_dbl(cars, mean) # map output as a double vector

speed  dist 
15.40 42.98

Example: mutate_if

Let’s convert all factor variables in Cars93 to lowercase

head(Cars93$Type)

[1] Small   Midsize Compact Midsize Midsize Midsize
Levels: Compact Large Midsize Small Sporty Van

Cars93.lower <- mutate_if(Cars93, is.factor, tolower)
head(Cars93.lower$Type)

[1] "small"   "midsize" "compact" "midsize" "midsize" "midsize"

Note: this operation is performed “in place”, replacing prior versions of the variables

Example: mutate_if, adding instead of replacing columns

If you pass the functions in as a list with named elements, those names get appended to create modified versions of variables instead of replacing existing variables

Cars93.lower <- mutate_if(Cars93, is.factor, 
                          list(lower = tolower))
head(Cars93.lower$Type)

[1] Small   Midsize Compact Midsize Midsize Midsize
Levels: Compact Large Midsize Small Sporty Van

head(Cars93.lower$Type_lower)

[1] "small"   "midsize" "compact" "midsize" "midsize" "midsize"

Example: mutate_at

Let’s convert from MPG to KPML but this time using mutate_at

Cars93.metric <- Cars93 %>% 
  mutate_at(c("MPG.city", "MPG.highway"), 
            list(KMPL = ~ 0.425 * .x))
tail(colnames(Cars93.metric))

[1] "Luggage.room"     "Weight"           "Origin"           "Make"            
[5] "MPG.city_KMPL"    "MPG.highway_KMPL"

Here, ~ 0.425 * .x is an example of specifying a “lambda” (anonymous) function. It is permitted short-hand for

function(.x){0.425 * .x}

Example: summarize_if

Let’s get the mean of every numeric column in Cars93

Cars93 %>% summarize_if(is.numeric, mean)

  Min.Price    Price Max.Price MPG.city MPG.highway EngineSize Horsepower
1  17.12581 19.50968  21.89892 22.36559    29.08602   2.667742    143.828
       RPM Rev.per.mile Fuel.tank.capacity Passengers   Length Wheelbase
1 5280.645     2332.204           16.66452   5.086022 183.2043  103.9462
     Width Turn.circle Rear.seat.room Luggage.room   Weight
1 69.37634    38.95699             NA           NA 3072.903

Cars93 %>% summarize_if(is.numeric, list(mean = mean), na.rm=TRUE)

  Min.Price_mean Price_mean Max.Price_mean MPG.city_mean MPG.highway_mean
1       17.12581   19.50968       21.89892      22.36559         29.08602
  EngineSize_mean Horsepower_mean RPM_mean Rev.per.mile_mean
1        2.667742         143.828 5280.645          2332.204
  Fuel.tank.capacity_mean Passengers_mean Length_mean Wheelbase_mean Width_mean
1                16.66452        5.086022    183.2043       103.9462   69.37634
  Turn.circle_mean Rear.seat.room_mean Luggage.room_mean Weight_mean
1         38.95699            27.82967          13.89024    3072.903

Example: summarize_at

Let’s get the average fuel economy of all vehicles, grouped by their Type

We’ll use the helper function contains() to indicate that we want all variables that contain MPG.

Cars93 %>%
  group_by(Type) %>%
  summarize_at(c("MPG.city", "MPG.highway"), mean)

# A tibble: 6 x 3
  Type    MPG.city MPG.highway
* <fct>      <dbl>       <dbl>
1 Compact     22.7        29.9
2 Large       18.4        26.7
3 Midsize     19.5        26.7
4 Small       29.9        35.5
5 Sporty      21.8        28.8
6 Van         17          21.9

Another approach

We might learn about a bunch of helper functions like contains() and starts_with(). Here’s one way of performing the previous operation with the help of these functions, and appending _mean to the resulting output.

Cars93 %>%
  group_by(Type) %>%
  summarize_at(vars(contains("MPG")), list(mean = mean))

# A tibble: 6 x 3
  Type    MPG.city_mean MPG.highway_mean
* <fct>           <dbl>            <dbl>
1 Compact          22.7             29.9
2 Large            18.4             26.7
3 Midsize          19.5             26.7
4 Small            29.9             35.5
5 Sporty           21.8             28.8
6 Van              17               21.9

More than one grouping variable

Cars93 %>%
  group_by(Origin, AirBags) %>%
  summarize_at(vars(contains("MPG")), list(mean = mean))

# A tibble: 6 x 4
# Groups:   Origin [2]
  Origin  AirBags            MPG.city_mean MPG.highway_mean
  <fct>   <fct>                      <dbl>            <dbl>
1 USA     Driver & Passenger          19               27.2
2 USA     Driver only                 20.2             27.5
3 USA     None                        23.1             29.6
4 non-USA Driver & Passenger          20.3             27  
5 non-USA Driver only                 23.2             29.4
6 non-USA None                        25.9             32

License

This document is created for Math 514, Spring 2021, at Illinois Tech. While the course materials are generally not to be distributed outside the course without permission of the instructor, this particular set of notes is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

This set of notes is adapted from Prof. Alexandra Chouldechova at CMU, under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Sonja Petrović, Associate Professor of Applied Mathematics, College of Computing, Illinios Tech. Homepage, Email.↩︎

Topic 1 Interlude: R - Basic cleaning, loops, and alternatives