Topic 1 Interlude: R - Basic cleaning, loops, and alternatives


Context

What is the purpose of these notes?

  1. Provide a few small examples of functions in R;
  2. Provide html/Markdown with several lines of R code you can use to practice writing functions.

Agenda

  • A common data cleaning task
  • For/while loops to iterate over data
  • Helpful variants of map, mutate and summarize

Package loading

library(tidyverse)
Cars93 <- MASS::Cars93  # For Cars93 data again

A common problem: messy data

survey.messy <- read.csv("http://www.andrew.cmu.edu/user/achoulde/94842/data/survey_data2019_messy.csv", 
                         header=TRUE)
survey.messy$TVhours
 [1] "20"        "6"         "10"        "2"         "none"      "10"       
 [7] "15"        "3"         "0"         "0"         "5"         "2"        
[13] "10"        "40"        "zero"      "5"         "3"         "20"       
[19] "0"         "10"        "2"         "10"        "8"         "8"        
[25] "2"         "12"        "5"         "6"         "4"         "4"        
[31] "0"         "5"         "0"         "0"         "0"         "4"        
[37] "2"         "3"         "14"        "0"         "3"         "7"        
[43] "0"         "3"         "7"         "4"         "5"         "1.5"      
[49] "4"         "approx 10" "0"         "0"         "4"         "2"        
[55] "0"         "0"         "1"         "0"         "2"         "2"        
[61] "0"         "0.5"       "6ish"      "3"         "6"         "1"        
[67] "5"        

What’s happening?

str(survey.messy)
'data.frame':   67 obs. of  6 variables:
 $ Program        : chr  "PPM" "MISM" "MISM" "PPM" ...
 $ PriorExp       : chr  "Some experience" "Some experience" "Some experience" "Some experience" ...
 $ Rexperience    : chr  "Basic competence" "Never used" "Never used" "Basic competence" ...
 $ OperatingSystem: chr  "Windows" "Mac OS X" "Mac OS X" "Mac OS X" ...
 $ TVhours        : chr  "20" "6" "10" "2" ...
 $ Editor         : chr  "Microsoft Word" "Microsoft Word" "Microsoft Word" "R Markdown" ...
  • Several of the entries have non-numeric values in them (they contain strings)

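  • As a result, TVhours is imported as a character vector (chr) rather than as numbers. (In R versions before 4.0, where read.csv() defaulted to stringsAsFactors = TRUE, it would have been imported as a factor.)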
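
One way to see exactly which entries block a numeric import is to attempt the conversion and list the values that come back as NA. The sketch below uses a small hypothetical vector (tv.sample, not the survey data itself) so that it runs without downloading anything.

```r
# Hypothetical sample resembling the TVhours column
tv.sample <- c("20", "6", "none", "approx 10", "1.5", "6ish")

# Attempt the conversion; suppressWarnings() hides the expected coercion warning
tv.parsed <- suppressWarnings(as.numeric(tv.sample))

# Entries that could not be parsed as numbers
bad.entries <- tv.sample[is.na(tv.parsed)]
bad.entries
```

Listing the offending strings first makes it easier to decide how each one should be handled before any cleaning is applied.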

A look at the TVhours column

survey.messy$TVhours
 [1] "20"        "6"         "10"        "2"         "none"      "10"       
 [7] "15"        "3"         "0"         "0"         "5"         "2"        
[13] "10"        "40"        "zero"      "5"         "3"         "20"       
[19] "0"         "10"        "2"         "10"        "8"         "8"        
[25] "2"         "12"        "5"         "6"         "4"         "4"        
[31] "0"         "5"         "0"         "0"         "0"         "4"        
[37] "2"         "3"         "14"        "0"         "3"         "7"        
[43] "0"         "3"         "7"         "4"         "5"         "1.5"      
[49] "4"         "approx 10" "0"         "0"         "4"         "2"        
[55] "0"         "0"         "1"         "0"         "2"         "2"        
[61] "0"         "0.5"       "6ish"      "3"         "6"         "1"        
[67] "5"        

Attempt at a fix

  • What if we just try to cast it back to numeric?
tv.hours.messy <- survey.messy$TVhours
tv.hours.messy
 [1] "20"        "6"         "10"        "2"         "none"      "10"       
 [7] "15"        "3"         "0"         "0"         "5"         "2"        
[13] "10"        "40"        "zero"      "5"         "3"         "20"       
[19] "0"         "10"        "2"         "10"        "8"         "8"        
[25] "2"         "12"        "5"         "6"         "4"         "4"        
[31] "0"         "5"         "0"         "0"         "0"         "4"        
[37] "2"         "3"         "14"        "0"         "3"         "7"        
[43] "0"         "3"         "7"         "4"         "5"         "1.5"      
[49] "4"         "approx 10" "0"         "0"         "4"         "2"        
[55] "0"         "0"         "1"         "0"         "2"         "2"        
[61] "0"         "0.5"       "6ish"      "3"         "6"         "1"        
[67] "5"        
as.numeric(tv.hours.messy)
Warning: NAs introduced by coercion
 [1] 20.0  6.0 10.0  2.0   NA 10.0 15.0  3.0  0.0  0.0  5.0  2.0 10.0 40.0   NA
[16]  5.0  3.0 20.0  0.0 10.0  2.0 10.0  8.0  8.0  2.0 12.0  5.0  6.0  4.0  4.0
[31]  0.0  5.0  0.0  0.0  0.0  4.0  2.0  3.0 14.0  0.0  3.0  7.0  0.0  3.0  7.0
[46]  4.0  5.0  1.5  4.0   NA  0.0  0.0  4.0  2.0  0.0  0.0  1.0  0.0  2.0  2.0
[61]  0.0  0.5   NA  3.0  6.0  1.0  5.0

That didn’t work…

head(tv.hours.messy, 40)
head(as.numeric(tv.hours.messy), 40)
 [1] "20"   "6"    "10"   "2"    "none" "10"   "15"   "3"    "0"    "0"   
[11] "5"    "2"    "10"   "40"   "zero" "5"    "3"    "20"   "0"    "10"  
[21] "2"    "10"   "8"    "8"    "2"    "12"   "5"    "6"    "4"    "4"   
[31] "0"    "5"    "0"    "0"    "0"    "4"    "2"    "3"    "14"   "0"   
Warning in head(as.numeric(tv.hours.messy), 40): NAs introduced by coercion
 [1] 20  6 10  2 NA 10 15  3  0  0  5  2 10 40 NA  5  3 20  0 10  2 10  8  8  2
[26] 12  5  6  4  4  0  5  0  0  0  4  2  3 14  0
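  • Every entry that is not a clean number has become NA, so those responses are lost

  • If TVhours had been imported as a factor (the default before R 4.0), as.numeric() would instead have returned the integer codes of the factor levels

  • Not what we wanted!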

Something that does work

  • Consider the following simple example
num.vec <- c(3.1, 2.5)
as.factor(num.vec)
[1] 3.1 2.5
Levels: 2.5 3.1
as.numeric(as.factor(num.vec))
[1] 2 1
as.numeric(as.character(as.factor(num.vec)))
[1] 3.1 2.5

If a number is stored as a factor, first convert it to a character string; converting that string to numeric then recovers the original number.
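
The same round trip matters whenever a numeric column really is stored as a factor, for example when a file is read with stringsAsFactors = TRUE. Here is a minimal sketch using a hypothetical factor hours.fct; the as.numeric(levels(f))[f] form is the variant mentioned in R's help page for factor.

```r
# Hypothetical factor whose labels are numbers
hours.fct <- factor(c("10", "2.5", "10", "40"))

# Direct conversion returns the integer level codes, not the values
codes <- as.numeric(hours.fct)

# Going through character recovers the values
via.char <- as.numeric(as.character(hours.fct))

# Equivalent: convert the levels once, then index by the factor
via.levels <- as.numeric(levels(hours.fct))[hours.fct]

codes
via.char
via.levels
```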

Back to the corrupted TVhours column

as.character(tv.hours.messy)
 [1] "20"        "6"         "10"        "2"         "none"      "10"       
 [7] "15"        "3"         "0"         "0"         "5"         "2"        
[13] "10"        "40"        "zero"      "5"         "3"         "20"       
[19] "0"         "10"        "2"         "10"        "8"         "8"        
[25] "2"         "12"        "5"         "6"         "4"         "4"        
[31] "0"         "5"         "0"         "0"         "0"         "4"        
[37] "2"         "3"         "14"        "0"         "3"         "7"        
[43] "0"         "3"         "7"         "4"         "5"         "1.5"      
[49] "4"         "approx 10" "0"         "0"         "4"         "2"        
[55] "0"         "0"         "1"         "0"         "2"         "2"        
[61] "0"         "0.5"       "6ish"      "3"         "6"         "1"        
[67] "5"        
as.numeric(as.character(tv.hours.messy))
Warning: NAs introduced by coercion
 [1] 20.0  6.0 10.0  2.0   NA 10.0 15.0  3.0  0.0  0.0  5.0  2.0 10.0 40.0   NA
[16]  5.0  3.0 20.0  0.0 10.0  2.0 10.0  8.0  8.0  2.0 12.0  5.0  6.0  4.0  4.0
[31]  0.0  5.0  0.0  0.0  0.0  4.0  2.0  3.0 14.0  0.0  3.0  7.0  0.0  3.0  7.0
[46]  4.0  5.0  1.5  4.0   NA  0.0  0.0  4.0  2.0  0.0  0.0  1.0  0.0  2.0  2.0
[61]  0.0  0.5   NA  3.0  6.0  1.0  5.0
typeof(as.numeric(as.character(tv.hours.messy)))  # Success!! (Almost...)
Warning in typeof(as.numeric(as.character(tv.hours.messy))): NAs introduced by
coercion
[1] "double"

A small improvement

  • All the corrupted cells now appear as NA, which is R’s missing indicator

  • We can do a little better by cleaning up the vector once we get it to character form

tv.hours.strings <- as.character(tv.hours.messy)
tv.hours.strings
 [1] "20"        "6"         "10"        "2"         "none"      "10"       
 [7] "15"        "3"         "0"         "0"         "5"         "2"        
[13] "10"        "40"        "zero"      "5"         "3"         "20"       
[19] "0"         "10"        "2"         "10"        "8"         "8"        
[25] "2"         "12"        "5"         "6"         "4"         "4"        
[31] "0"         "5"         "0"         "0"         "0"         "4"        
[37] "2"         "3"         "14"        "0"         "3"         "7"        
[43] "0"         "3"         "7"         "4"         "5"         "1.5"      
[49] "4"         "approx 10" "0"         "0"         "4"         "2"        
[55] "0"         "0"         "1"         "0"         "2"         "2"        
[61] "0"         "0.5"       "6ish"      "3"         "6"         "1"        
[67] "5"        

Deleting everything except digits and .

tv.hours.strings
 [1] "20"        "6"         "10"        "2"         "none"      "10"       
 [7] "15"        "3"         "0"         "0"         "5"         "2"        
[13] "10"        "40"        "zero"      "5"         "3"         "20"       
[19] "0"         "10"        "2"         "10"        "8"         "8"        
[25] "2"         "12"        "5"         "6"         "4"         "4"        
[31] "0"         "5"         "0"         "0"         "0"         "4"        
[37] "2"         "3"         "14"        "0"         "3"         "7"        
[43] "0"         "3"         "7"         "4"         "5"         "1.5"      
[49] "4"         "approx 10" "0"         "0"         "4"         "2"        
[55] "0"         "0"         "1"         "0"         "2"         "2"        
[61] "0"         "0.5"       "6ish"      "3"         "6"         "1"        
[67] "5"        
# Use gsub() to replace everything except digits and '.' with a blank ""
gsub("[^0-9.]", "", tv.hours.strings) 
 [1] "20"  "6"   "10"  "2"   ""    "10"  "15"  "3"   "0"   "0"   "5"   "2"  
[13] "10"  "40"  ""    "5"   "3"   "20"  "0"   "10"  "2"   "10"  "8"   "8"  
[25] "2"   "12"  "5"   "6"   "4"   "4"   "0"   "5"   "0"   "0"   "0"   "4"  
[37] "2"   "3"   "14"  "0"   "3"   "7"   "0"   "3"   "7"   "4"   "5"   "1.5"
[49] "4"   "10"  "0"   "0"   "4"   "2"   "0"   "0"   "1"   "0"   "2"   "2"  
[61] "0"   "0.5" "6"   "3"   "6"   "1"   "5"  
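
The pattern "[^0-9.]" is blunt, so it is worth checking how it treats less tidy entries. The sketch below uses hypothetical responses (none of them appear in the survey data) to show three cases where the stripped string is either misleading or still not a number.

```r
# Hypothetical responses that the simple pattern handles poorly
tricky <- c("2-3", "1.5 hrs.", "-2", "10")

# Same pattern as above: keep only digits and '.'
stripped <- gsub("[^0-9.]", "", tricky)
stripped

# "2-3" collapses to "23", "1.5 hrs." keeps a trailing '.', and "-2" loses its sign
converted <- suppressWarnings(as.numeric(stripped))
converted
```

For a column like TVhours, where ranges and negative values are not expected, the simple pattern may be good enough, but the result should still be inspected.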

The final product

tv.hours.messy[1:30]
 [1] "20"   "6"    "10"   "2"    "none" "10"   "15"   "3"    "0"    "0"   
[11] "5"    "2"    "10"   "40"   "zero" "5"    "3"    "20"   "0"    "10"  
[21] "2"    "10"   "8"    "8"    "2"    "12"   "5"    "6"    "4"    "4"   
tv.hours.clean <- as.numeric(gsub("[^0-9.]", "", tv.hours.strings))
tv.hours.clean
 [1] 20.0  6.0 10.0  2.0   NA 10.0 15.0  3.0  0.0  0.0  5.0  2.0 10.0 40.0   NA
[16]  5.0  3.0 20.0  0.0 10.0  2.0 10.0  8.0  8.0  2.0 12.0  5.0  6.0  4.0  4.0
[31]  0.0  5.0  0.0  0.0  0.0  4.0  2.0  3.0 14.0  0.0  3.0  7.0  0.0  3.0  7.0
[46]  4.0  5.0  1.5  4.0 10.0  0.0  0.0  4.0  2.0  0.0  0.0  1.0  0.0  2.0  2.0
[61]  0.0  0.5  6.0  3.0  6.0  1.0  5.0
  • As a last step, we should go through and figure out if any of the NA values should really be 0.
    • This step is not shown here.
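
One possible way to carry out that last step is to recode the words that plainly mean zero before stripping characters. This is only a sketch: the vector tv.raw and the list zero.words are hypothetical, and whether "none" really means 0 hours is a judgment about the data rather than something the code can decide.

```r
# Hypothetical raw responses, including words that mean "no TV"
tv.raw <- c("20", "none", "zero", "approx 10", "6ish", "1.5", "None ")

# Assumed spellings that should count as 0 (compared after trimming and lowercasing)
zero.words <- c("none", "zero")
is.zero <- tolower(trimws(tv.raw)) %in% zero.words

# Replace those entries with "0", then clean and convert as before
tv.recoded <- ifelse(is.zero, "0", tv.raw)
tv.fixed <- as.numeric(gsub("[^0-9.]", "", tv.recoded))
tv.fixed
```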

Rebuilding our data

survey <- mutate(survey.messy, TVhours = tv.hours.clean)
str(survey)
'data.frame':   67 obs. of  6 variables:
 $ Program        : chr  "PPM" "MISM" "MISM" "PPM" ...
 $ PriorExp       : chr  "Some experience" "Some experience" "Some experience" "Some experience" ...
 $ Rexperience    : chr  "Basic competence" "Never used" "Never used" "Basic competence" ...
 $ OperatingSystem: chr  "Windows" "Mac OS X" "Mac OS X" "Mac OS X" ...
 $ TVhours        : num  20 6 10 2 NA 10 15 3 0 0 ...
 $ Editor         : chr  "Microsoft Word" "Microsoft Word" "Microsoft Word" "R Markdown" ...
  • Success!

A different approach

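  • Before R 4.0, read.csv() converted strings to factors by default; we could avoid that by setting stringsAsFactors = FALSE when importing our data. (Since R 4.0 this is the default, so the argument below is redundant but harmless.)
survey.messy <- read.csv("http://www.andrew.cmu.edu/user/achoulde/94842/data/survey_data2019_messy.csv", 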
                         header=TRUE, stringsAsFactors=FALSE)
str(survey.messy)
'data.frame':   67 obs. of  6 variables:
 $ Program        : chr  "PPM" "MISM" "MISM" "PPM" ...
 $ PriorExp       : chr  "Some experience" "Some experience" "Some experience" "Some experience" ...
 $ Rexperience    : chr  "Basic competence" "Never used" "Never used" "Basic competence" ...
 $ OperatingSystem: chr  "Windows" "Mac OS X" "Mac OS X" "Mac OS X" ...
 $ TVhours        : chr  "20" "6" "10" "2" ...
 $ Editor         : chr  "Microsoft Word" "Microsoft Word" "Microsoft Word" "R Markdown" ...
  • Now everything is a character instead of a factor

One-line cleanup

  • Let’s clean up the TVhours column and cast it to numeric all in one command
survey <- mutate(survey.messy, 
        TVhours = as.numeric(gsub("[^0-9.]", "", TVhours)))
str(survey)
'data.frame':   67 obs. of  6 variables:
 $ Program        : chr  "PPM" "MISM" "MISM" "PPM" ...
 $ PriorExp       : chr  "Some experience" "Some experience" "Some experience" "Some experience" ...
 $ Rexperience    : chr  "Basic competence" "Never used" "Never used" "Basic competence" ...
 $ OperatingSystem: chr  "Windows" "Mac OS X" "Mac OS X" "Mac OS X" ...
 $ TVhours        : num  20 6 10 2 NA 10 15 3 0 0 ...
 $ Editor         : chr  "Microsoft Word" "Microsoft Word" "Microsoft Word" "R Markdown" ...

What about all those other character variables?

table(survey[["Program"]])

 MISM Other   PPM 
    8    17    42 
table(as.factor(survey[["Program"]]))

 MISM Other   PPM 
    8    17    42 
  • Having factors coded as characters may be OK for many parts of our analysis

If we wanted to, here’s how we could fix things

mutate_if(.tbl, .predicate, .funs) applies the function(s) in .funs to the columns of .tbl for which the predicate (condition) .predicate holds.

Here is how we can use mutate_if to convert all character columns to factors.

survey <- survey %>% mutate_if(is.character, as.factor)
# Outcome:
str(survey)
'data.frame':   67 obs. of  6 variables:
 $ Program        : Factor w/ 3 levels "MISM","Other",..: 3 1 1 3 3 3 3 3 3 3 ...
 $ PriorExp       : Factor w/ 3 levels "Extensive experience",..: 3 3 3 3 3 3 2 3 3 3 ...
 $ Rexperience    : Factor w/ 4 levels "Basic competence",..: 1 4 4 1 3 1 3 1 3 4 ...
 $ OperatingSystem: Factor w/ 2 levels "Mac OS X","Windows": 2 1 1 1 2 1 2 2 2 2 ...
 $ TVhours        : num  20 6 10 2 NA 10 15 3 0 0 ...
 $ Editor         : Factor w/ 5 levels "Jupyter Notebook",..: 3 3 3 4 3 3 3 3 3 3 ...
  • Success!
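
In dplyr 1.0.0 and later, the scoped verbs (mutate_if, mutate_at, mutate_all and their summarize counterparts) are marked as superseded in favor of across(). They still work, but the same conversion can be sketched with across() and where(). The data frame toy.survey below is a hypothetical stand-in for the survey data.

```r
library(dplyr)

# Hypothetical stand-in for the survey data
toy.survey <- data.frame(
  Program = c("PPM", "MISM", "PPM"),
  Editor  = c("R Markdown", "Microsoft Word", "R Markdown"),
  TVhours = c(20, 6, 10),
  stringsAsFactors = FALSE
)

# Convert every character column to a factor; numeric columns are untouched
toy.factors <- toy.survey %>%
  mutate(across(where(is.character), as.factor))

str(toy.factors)
```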

Another common problem

  • In various homework assignments, you’ll learn how to deal with another common problem

  • When data is entered manually, misspellings and case changes are very common

  • E.g., a column showing treatment information may look like,

treatment <- as.factor(c("dialysis", "Ventilation", "Dialysis", "dialysis", "none", "None", "nnone", "dyalysis", "dialysis", "ventilation", "none"))
summary(treatment)
   dialysis    Dialysis    dyalysis       nnone        none        None 
          3           1           1           1           2           1 
ventilation Ventilation 
          1           1 

  • This factor has 8 levels even though it should have 3 (dialysis, ventilation, none)

  • We can fix many of the typos by running spellcheck in Excel before importing data, or by changing the values on a case-by-case basis later

  • There’s a faster way to fix just the capitalization issue (this will be an exercise in one of the HW sets)

What are all these map<*> functions?

  • These are all efficient ways of applying a function to margins of an array or elements of a list

  • Before we talk about them in detail, we need to understand their more cumbersome but more general alternative: loops

  • Loops are ways of iterating over data

  • The map<*> functions and their <*>apply base-R ancestors can be thought of as good alternatives to loops

for loops

For loops: a pair of examples

for(i in 1:4) {
  print(i)
}
[1] 1
[1] 2
[1] 3
[1] 4
phrase <- "Good Night,"
for(word in c("and", "Good", "Luck")) {
  phrase <- paste(phrase, word)
  print(phrase)
}
[1] "Good Night, and"
[1] "Good Night, and Good"
[1] "Good Night, and Good Luck"

For loops: syntax

A for loop executes a chunk of code for every value of an index variable in an index set

  • The basic syntax takes the form
for(index.variable in index.set) {
  # code to be repeated at every value of index.variable
}
  • The index set is often a vector of integers, but can be more general

Example

index.set <- list(name="Michael", weight=185, is.male=TRUE) # a list
for(i in index.set) {
  print(c(i, typeof(i)))
}
[1] "Michael"   "character"
[1] "185"    "double"
[1] "TRUE"    "logical"
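
Note that c(i, typeof(i)) coerces both pieces to character, which is why 185 and TRUE are printed in quotes above. When the element names are needed as well, a common pattern is to loop over positions with seq_along(). A minimal sketch, reusing the same list; the vector types is introduced here only for illustration.

```r
index.set <- list(name = "Michael", weight = 185, is.male = TRUE)

# Pre-allocate a named character vector to hold the type of each element
types <- character(length(index.set))
names(types) <- names(index.set)

# Loop over positions so that both the name and the value are available
for (i in seq_along(index.set)) {
  types[i] <- typeof(index.set[[i]])
  cat(names(index.set)[i], "has type", types[i], "\n")
}

types
```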

Example: Calculate sum of each column

fake.data <- matrix(rnorm(500), ncol=5) # create fake 100 x 5 data set
head(fake.data,2) # print first two rows
          [,1]       [,2]      [,3]      [,4]       [,5]
[1,] -1.146664 -0.7225478 0.3190678 0.8936148 -0.1715746
[2,] -1.534036 -0.2740636 1.4878076 1.0369705  1.6525823
col.sums <- numeric(ncol(fake.data)) # variable to store running column sums
for(i in 1:nrow(fake.data)) {
  col.sums <- col.sums + fake.data[i,] # add ith observation to the sum
}
col.sums
[1]   2.497672   6.612204 -27.731727  -7.397220 -14.098159
colSums(fake.data) # A better approach (see also colMeans())
[1]   2.497672   6.612204 -27.731727  -7.397220 -14.098159

while loops

  • while loops repeat a chunk of code while the specified condition remains true
day <- 1
num.days <- 365
while(day <= num.days) {
  day <- day + 1
}
  • We won’t really be using while loops in this class

  • Just be aware that they exist, and that they may become useful to you at some point in your analytics career
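
A while loop is most natural when the number of iterations is not known in advance. As a hypothetical illustration (the balance, rate, and target below are made up), the following sketch counts how many years of 5% growth it takes for a balance to double.

```r
balance <- 100   # hypothetical starting balance
rate <- 0.05     # hypothetical yearly growth rate
years <- 0

# Keep compounding until the balance has at least doubled
while (balance < 200) {
  balance <- balance * (1 + rate)
  years <- years + 1
}

years
```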

Loop alternatives

Command | Description
apply(X, MARGIN, FUN) | Obtain a vector/array/list by applying FUN along the specified MARGIN of an array or matrix X
map(.x, .f, ...) | Obtain a list by applying .f to every element of a list or atomic vector .x
map_<type>(.x, .f, ...) | For <type> given by lgl (logical), int (integer), dbl (double) or chr (character), return a vector of this type obtained by applying .f to each element of .x
map_at(.x, .at, .f) | Obtain a list by applying .f to the elements of .x specified by name or index given in .at
map_if(.x, .p, .f) | Obtain a list by applying .f to the elements of .x selected by .p (a predicate function or a logical vector)
mutate_all/_at/_if | Mutate all variables, specified (at) variables, or those selected by a predicate (if)
summarize_all/_at/_if | Summarize all variables, specified (at) variables, or those selected by a predicate (if)
  • These take practice to get used to, but make analysis easier to debug and less prone to error when used effectively

  • The best way to learn them is by looking at a bunch of examples. The end of each help file contains some examples.
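
Since map_if() and map_at() do not appear in the examples that follow, here is a small sketch of both. The list person is hypothetical (the height value is made up), and 0.4536 is the approximate number of kilograms in a pound.

```r
library(purrr)

# Hypothetical list with mixed element types
person <- list(name = "Michael", weight = 185, height = 70)

# map_if(): apply the function only to elements for which the predicate is TRUE
doubled <- map_if(person, is.numeric, ~ .x * 2)

# map_at(): apply the function only to the named element(s)
in.kg <- map_at(person, "weight", ~ .x * 0.4536)

str(doubled)
str(in.kg)
```

In both cases the elements that are not selected are returned unchanged, and the result is always a list.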

Example: apply()

colMeans(fake.data)
[1]  0.02497672  0.06612204 -0.27731727 -0.07397220 -0.14098159
apply(fake.data, MARGIN=2, FUN=mean) 
[1]  0.02497672  0.06612204 -0.27731727 -0.07397220 -0.14098159
# MARGIN = 1 for rows, 2 for columns
# Function that calculates proportion
# of vector indices that are > 0
propPositive <- function(x) mean(x > 0)
apply(fake.data, MARGIN=2, FUN=propPositive) 
[1] 0.44 0.49 0.42 0.47 0.39
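
To make the MARGIN argument concrete, the following sketch applies sum() to the rows and then to the columns of a small hypothetical matrix m, and compares the results with rowSums() and colSums().

```r
# 2 x 3 matrix filled column by column: columns are (1,2), (3,4), (5,6)
m <- matrix(1:6, nrow = 2)

row.totals <- apply(m, MARGIN = 1, FUN = sum)  # one value per row
col.totals <- apply(m, MARGIN = 2, FUN = sum)  # one value per column

row.totals
col.totals

# The dedicated functions give the same answers
all(row.totals == rowSums(m))
all(col.totals == colSums(m))
```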

Example: map, map_()

map(survey, is.factor) # Returns a list
$Program
[1] TRUE

$PriorExp
[1] TRUE

$Rexperience
[1] TRUE

$OperatingSystem
[1] TRUE

$TVhours
[1] FALSE

$Editor
[1] TRUE
map_lgl(survey, is.factor) # Returns a logical vector with named elements
        Program        PriorExp     Rexperience OperatingSystem         TVhours 
           TRUE            TRUE            TRUE            TRUE           FALSE 
         Editor 
           TRUE 

Example: apply(), map(), map_()

apply(cars, 2, FUN=mean) # apply() coerces the data frame to a matrix first
speed  dist 
15.40 42.98 
map(cars, mean) # Data frames are also lists
$speed
[1] 15.4

$dist
[1] 42.98
map_dbl(cars, mean) # map output as a double vector
speed  dist 
15.40 42.98 
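
For comparison, here is roughly what the same computation looks like as an explicit for loop, next to a base-R alternative, vapply(). The names col.means and base.means are introduced here only for illustration.

```r
# Pre-allocate a named numeric vector, one slot per column of cars
col.means <- numeric(ncol(cars))
names(col.means) <- names(cars)

# Fill it in one column at a time
for (j in seq_along(cars)) {
  col.means[j] <- mean(cars[[j]])
}
col.means

# Base-R counterpart of map_dbl(): declares that each result is one number
base.means <- vapply(cars, mean, numeric(1))
base.means
```

The loop needs three steps (allocate, iterate, store) where the map_dbl() and vapply() calls need one, which is the main reason the loop alternatives tend to be easier to read and debug.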

Example: mutate_if

Let’s convert all factor variables in Cars93 to lowercase

head(Cars93$Type)
[1] Small   Midsize Compact Midsize Midsize Midsize
Levels: Compact Large Midsize Small Sporty Van
Cars93.lower <- mutate_if(Cars93, is.factor, tolower)
head(Cars93.lower$Type)
[1] "small"   "midsize" "compact" "midsize" "midsize" "midsize"

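Note: the converted columns replace the originals in the returned data frame (Cars93 itself is unchanged). Because tolower() returns character vectors, the affected columns are now character rather than factor.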
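
If the lowercase columns should remain factors, one option is to wrap the result in factor(). The sketch below uses a small hypothetical data frame toy.cars rather than Cars93, so it runs without the MASS package.

```r
library(dplyr)

# Hypothetical miniature version of the Cars93 columns used above
toy.cars <- data.frame(
  Type  = factor(c("Small", "Midsize", "Small")),
  Price = c(15.9, 33.9, 29.1)
)

# tolower() alone turns the factor column into a character column
as.char <- mutate_if(toy.cars, is.factor, tolower)

# Wrapping the result in factor() keeps it a factor with lowercase levels
as.fct <- mutate_if(toy.cars, is.factor, ~ factor(tolower(.x)))

class(as.char$Type)
levels(as.fct$Type)
levels(toy.cars$Type)  # the original data frame is not modified
```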

Example: mutate_if, adding instead of replacing columns

If you pass the functions in as a list with named elements, those names are appended to the original column names, so new columns are created instead of replacing the existing ones.

Cars93.lower <- mutate_if(Cars93, is.factor, 
                          list(lower = tolower))
head(Cars93.lower$Type)
[1] Small   Midsize Compact Midsize Midsize Midsize
Levels: Compact Large Midsize Small Sporty Van
head(Cars93.lower$Type_lower)
[1] "small"   "midsize" "compact" "midsize" "midsize" "midsize"

Example: mutate_at

Let’s convert from MPG to KMPL (kilometers per liter), this time using mutate_at

Cars93.metric <- Cars93 %>% 
  mutate_at(c("MPG.city", "MPG.highway"), 
            list(KMPL = ~ 0.425 * .x))
tail(colnames(Cars93.metric))
[1] "Luggage.room"     "Weight"           "Origin"           "Make"            
[5] "MPG.city_KMPL"    "MPG.highway_KMPL"

Here, ~ 0.425 * .x is an example of specifying a “lambda” (anonymous) function. It is a permitted shorthand for

function(.x){0.425 * .x}
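
A quick way to convince yourself that the two forms are interchangeable is to pass each of them to map_dbl(). The input vector mpg below is hypothetical.

```r
library(purrr)

mpg <- c(20, 30)  # hypothetical fuel economy values in miles per gallon

# Formula shorthand
kmpl.lambda <- map_dbl(mpg, ~ 0.425 * .x)

# Equivalent ordinary anonymous function
kmpl.function <- map_dbl(mpg, function(.x) { 0.425 * .x })

kmpl.lambda
kmpl.function
```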

Example: summarize_if

Let’s get the mean of every numeric column in Cars93

Cars93 %>% summarize_if(is.numeric, mean)
  Min.Price    Price Max.Price MPG.city MPG.highway EngineSize Horsepower
1  17.12581 19.50968  21.89892 22.36559    29.08602   2.667742    143.828
       RPM Rev.per.mile Fuel.tank.capacity Passengers   Length Wheelbase
1 5280.645     2332.204           16.66452   5.086022 183.2043  103.9462
     Width Turn.circle Rear.seat.room Luggage.room   Weight
1 69.37634    38.95699             NA           NA 3072.903
Cars93 %>% summarize_if(is.numeric, list(mean = mean), na.rm=TRUE)
  Min.Price_mean Price_mean Max.Price_mean MPG.city_mean MPG.highway_mean
1       17.12581   19.50968       21.89892      22.36559         29.08602
  EngineSize_mean Horsepower_mean RPM_mean Rev.per.mile_mean
1        2.667742         143.828 5280.645          2332.204
  Fuel.tank.capacity_mean Passengers_mean Length_mean Wheelbase_mean Width_mean
1                16.66452        5.086022    183.2043       103.9462   69.37634
  Turn.circle_mean Rear.seat.room_mean Luggage.room_mean Weight_mean
1         38.95699            27.82967          13.89024    3072.903
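
In the first call above, Rear.seat.room and Luggage.room come back as NA because those columns contain missing values and mean() returns NA unless na.rm = TRUE. Extra arguments given after the function are passed along to it. A minimal sketch with a hypothetical data frame toy:

```r
library(dplyr)

# Hypothetical data with one missing value
toy <- data.frame(room = c(26, NA, 30), weight = c(2700, 3500, 3100))

# Without na.rm, any column containing NA has mean NA
with.na <- toy %>% summarize_if(is.numeric, mean)

# na.rm = TRUE is forwarded to mean(), which then skips missing values
no.na <- toy %>% summarize_if(is.numeric, mean, na.rm = TRUE)

with.na
no.na
```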

Example: summarize_at

Let’s get the average fuel economy of all vehicles, grouped by their Type

Here we name the two MPG columns explicitly in a character vector.

Cars93 %>%
  group_by(Type) %>%
  summarize_at(c("MPG.city", "MPG.highway"), mean)
# A tibble: 6 x 3
  Type    MPG.city MPG.highway
* <fct>      <dbl>       <dbl>
1 Compact     22.7        29.9
2 Large       18.4        26.7
3 Midsize     19.5        26.7
4 Small       29.9        35.5
5 Sporty      21.8        28.8
6 Van         17          21.9

Another approach

dplyr also provides selection helpers such as contains() and starts_with(). Here’s one way of performing the previous operation with contains(), this time appending _mean to the names of the resulting columns.

Cars93 %>%
  group_by(Type) %>%
  summarize_at(vars(contains("MPG")), list(mean = mean))
# A tibble: 6 x 3
  Type    MPG.city_mean MPG.highway_mean
* <fct>           <dbl>            <dbl>
1 Compact          22.7             29.9
2 Large            18.4             26.7
3 Midsize          19.5             26.7
4 Small            29.9             35.5
5 Sporty           21.8             28.8
6 Van              17               21.9

More than one grouping variable

Cars93 %>%
  group_by(Origin, AirBags) %>%
  summarize_at(vars(contains("MPG")), list(mean = mean))
# A tibble: 6 x 4
# Groups:   Origin [2]
  Origin  AirBags            MPG.city_mean MPG.highway_mean
  <fct>   <fct>                      <dbl>            <dbl>
1 USA     Driver & Passenger          19               27.2
2 USA     Driver only                 20.2             27.5
3 USA     None                        23.1             29.6
4 non-USA Driver & Passenger          20.3             27  
5 non-USA Driver only                 23.2             29.4
6 non-USA None                        25.9             32  
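
The line "# Groups: Origin [2]" in the output shows that the result is still grouped by Origin: summarizing drops only the last grouping variable. If later steps should treat the result as an ordinary table, it can be ungrouped. A sketch with a hypothetical data frame toy, using summarize() with .groups = "drop_last" (available in dplyr 1.0.0 and later) to make that default behavior explicit:

```r
library(dplyr)

# Hypothetical miniature version of the Cars93 columns used above
toy <- data.frame(
  Origin   = c("USA", "USA", "USA", "non-USA"),
  AirBags  = c("None", "None", "Driver only", "None"),
  MPG.city = c(20, 24, 19, 30)
)

# Summarizing peels off the last grouping variable (AirBags)
by.both <- toy %>%
  group_by(Origin, AirBags) %>%
  summarize(MPG.city_mean = mean(MPG.city), .groups = "drop_last")

group_vars(by.both)   # still grouped by Origin

# Remove the remaining grouping
flat <- ungroup(by.both)
group_vars(flat)
```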

License

This document is created for Math 514, Spring 2021, at Illinois Tech. While the course materials are generally not to be distributed outside the course without permission of the instructor, this particular set of notes is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

This set of notes is adapted from Prof. Alexandra Chouldechova at CMU, under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.


  1. Sonja Petrović, Associate Professor of Applied Mathematics, College of Computing, Illinois Tech.