Topic 1 Interlude: R - Basic cleaning, loops, and alternatives
Context
What is the purpose of these notes?
- Provide a few small examples of functions in R;
 - Provide html/Markdown with several lines of 
Rcode you can use to practice writing functions. 
Agenda
- A common data cleaning task
 - For/while loops to iterate over data
 - Helpful variants of 
map,mutateandsummarize 
A common problem: messy data
One of the most common problems you’ll encounter when importing manually-entered data is inconsistent data types within columns
For a simple example, let’s look at
TVhourscolumn in a messy version of the survey data from the following link http://www.andrew.cmu.edu/user/achoulde/94842/data/survey_data2019_messy.csv:
survey.messy <- read.csv("http://www.andrew.cmu.edu/user/achoulde/94842/data/survey_data2019_messy.csv", 
                         header=TRUE)
survey.messy$TVhours [1] "20"        "6"         "10"        "2"         "none"      "10"       
 [7] "15"        "3"         "0"         "0"         "5"         "2"        
[13] "10"        "40"        "zero"      "5"         "3"         "20"       
[19] "0"         "10"        "2"         "10"        "8"         "8"        
[25] "2"         "12"        "5"         "6"         "4"         "4"        
[31] "0"         "5"         "0"         "0"         "0"         "4"        
[37] "2"         "3"         "14"        "0"         "3"         "7"        
[43] "0"         "3"         "7"         "4"         "5"         "1.5"      
[49] "4"         "approx 10" "0"         "0"         "4"         "2"        
[55] "0"         "0"         "1"         "0"         "2"         "2"        
[61] "0"         "0.5"       "6ish"      "3"         "6"         "1"        
[67] "5"        
What’s happening?
'data.frame':   67 obs. of  6 variables:
 $ Program        : chr  "PPM" "MISM" "MISM" "PPM" ...
 $ PriorExp       : chr  "Some experience" "Some experience" "Some experience" "Some experience" ...
 $ Rexperience    : chr  "Basic competence" "Never used" "Never used" "Basic competence" ...
 $ OperatingSystem: chr  "Windows" "Mac OS X" "Mac OS X" "Mac OS X" ...
 $ TVhours        : chr  "20" "6" "10" "2" ...
 $ Editor         : chr  "Microsoft Word" "Microsoft Word" "Microsoft Word" "R Markdown" ...
Several of the entries have non-numeric values in them (they contain strings)
As a result,
TVhoursis being imported asfactor
A look at the TVhours column
 [1] "20"        "6"         "10"        "2"         "none"      "10"       
 [7] "15"        "3"         "0"         "0"         "5"         "2"        
[13] "10"        "40"        "zero"      "5"         "3"         "20"       
[19] "0"         "10"        "2"         "10"        "8"         "8"        
[25] "2"         "12"        "5"         "6"         "4"         "4"        
[31] "0"         "5"         "0"         "0"         "0"         "4"        
[37] "2"         "3"         "14"        "0"         "3"         "7"        
[43] "0"         "3"         "7"         "4"         "5"         "1.5"      
[49] "4"         "approx 10" "0"         "0"         "4"         "2"        
[55] "0"         "0"         "1"         "0"         "2"         "2"        
[61] "0"         "0.5"       "6ish"      "3"         "6"         "1"        
[67] "5"        
Attempt at a fix
- What if we just try to cast it back to numeric?
 
 [1] "20"        "6"         "10"        "2"         "none"      "10"       
 [7] "15"        "3"         "0"         "0"         "5"         "2"        
[13] "10"        "40"        "zero"      "5"         "3"         "20"       
[19] "0"         "10"        "2"         "10"        "8"         "8"        
[25] "2"         "12"        "5"         "6"         "4"         "4"        
[31] "0"         "5"         "0"         "0"         "0"         "4"        
[37] "2"         "3"         "14"        "0"         "3"         "7"        
[43] "0"         "3"         "7"         "4"         "5"         "1.5"      
[49] "4"         "approx 10" "0"         "0"         "4"         "2"        
[55] "0"         "0"         "1"         "0"         "2"         "2"        
[61] "0"         "0.5"       "6ish"      "3"         "6"         "1"        
[67] "5"        
Warning: NAs introduced by coercion
 [1] 20.0  6.0 10.0  2.0   NA 10.0 15.0  3.0  0.0  0.0  5.0  2.0 10.0 40.0   NA
[16]  5.0  3.0 20.0  0.0 10.0  2.0 10.0  8.0  8.0  2.0 12.0  5.0  6.0  4.0  4.0
[31]  0.0  5.0  0.0  0.0  0.0  4.0  2.0  3.0 14.0  0.0  3.0  7.0  0.0  3.0  7.0
[46]  4.0  5.0  1.5  4.0   NA  0.0  0.0  4.0  2.0  0.0  0.0  1.0  0.0  2.0  2.0
[61]  0.0  0.5   NA  3.0  6.0  1.0  5.0
That didn’t work…
 [1] "20"   "6"    "10"   "2"    "none" "10"   "15"   "3"    "0"    "0"   
[11] "5"    "2"    "10"   "40"   "zero" "5"    "3"    "20"   "0"    "10"  
[21] "2"    "10"   "8"    "8"    "2"    "12"   "5"    "6"    "4"    "4"   
[31] "0"    "5"    "0"    "0"    "0"    "4"    "2"    "3"    "14"   "0"   
Warning in head(as.numeric(tv.hours.messy), 40): NAs introduced by coercion
 [1] 20  6 10  2 NA 10 15  3  0  0  5  2 10 40 NA  5  3 20  0 10  2 10  8  8  2
[26] 12  5  6  4  4  0  5  0  0  0  4  2  3 14  0
This just converted all the values into the integer-coded levels of the factor
Not what we wanted!
Something that does work
- Consider the following simple example
 
[1] 3.1 2.5
Levels: 2.5 3.1
[1] 2 1
[1] 3.1 2.5
If we take a number that’s being coded as a factor and first turn it into a
characterstring, then converting the string to a numeric gets back the number
Back to the corrupted TVhours column
 [1] "20"        "6"         "10"        "2"         "none"      "10"       
 [7] "15"        "3"         "0"         "0"         "5"         "2"        
[13] "10"        "40"        "zero"      "5"         "3"         "20"       
[19] "0"         "10"        "2"         "10"        "8"         "8"        
[25] "2"         "12"        "5"         "6"         "4"         "4"        
[31] "0"         "5"         "0"         "0"         "0"         "4"        
[37] "2"         "3"         "14"        "0"         "3"         "7"        
[43] "0"         "3"         "7"         "4"         "5"         "1.5"      
[49] "4"         "approx 10" "0"         "0"         "4"         "2"        
[55] "0"         "0"         "1"         "0"         "2"         "2"        
[61] "0"         "0.5"       "6ish"      "3"         "6"         "1"        
[67] "5"        
Warning: NAs introduced by coercion
 [1] 20.0  6.0 10.0  2.0   NA 10.0 15.0  3.0  0.0  0.0  5.0  2.0 10.0 40.0   NA
[16]  5.0  3.0 20.0  0.0 10.0  2.0 10.0  8.0  8.0  2.0 12.0  5.0  6.0  4.0  4.0
[31]  0.0  5.0  0.0  0.0  0.0  4.0  2.0  3.0 14.0  0.0  3.0  7.0  0.0  3.0  7.0
[46]  4.0  5.0  1.5  4.0   NA  0.0  0.0  4.0  2.0  0.0  0.0  1.0  0.0  2.0  2.0
[61]  0.0  0.5   NA  3.0  6.0  1.0  5.0
Warning in typeof(as.numeric(as.character(tv.hours.messy))): NAs introduced by
coercion
[1] "double"
A small improvement
All the corrupted cells now appear as
NA, which is R’s missing indicatorWe can do a little better by cleaning up the vector once we get it to
characterform
 [1] "20"        "6"         "10"        "2"         "none"      "10"       
 [7] "15"        "3"         "0"         "0"         "5"         "2"        
[13] "10"        "40"        "zero"      "5"         "3"         "20"       
[19] "0"         "10"        "2"         "10"        "8"         "8"        
[25] "2"         "12"        "5"         "6"         "4"         "4"        
[31] "0"         "5"         "0"         "0"         "0"         "4"        
[37] "2"         "3"         "14"        "0"         "3"         "7"        
[43] "0"         "3"         "7"         "4"         "5"         "1.5"      
[49] "4"         "approx 10" "0"         "0"         "4"         "2"        
[55] "0"         "0"         "1"         "0"         "2"         "2"        
[61] "0"         "0.5"       "6ish"      "3"         "6"         "1"        
[67] "5"        
Deleting non-numeric (or .) characters
 [1] "20"        "6"         "10"        "2"         "none"      "10"       
 [7] "15"        "3"         "0"         "0"         "5"         "2"        
[13] "10"        "40"        "zero"      "5"         "3"         "20"       
[19] "0"         "10"        "2"         "10"        "8"         "8"        
[25] "2"         "12"        "5"         "6"         "4"         "4"        
[31] "0"         "5"         "0"         "0"         "0"         "4"        
[37] "2"         "3"         "14"        "0"         "3"         "7"        
[43] "0"         "3"         "7"         "4"         "5"         "1.5"      
[49] "4"         "approx 10" "0"         "0"         "4"         "2"        
[55] "0"         "0"         "1"         "0"         "2"         "2"        
[61] "0"         "0.5"       "6ish"      "3"         "6"         "1"        
[67] "5"        
# Use gsub() to replace everything except digits and '.' with a blank ""
gsub("[^0-9.]", "", tv.hours.strings)  [1] "20"  "6"   "10"  "2"   ""    "10"  "15"  "3"   "0"   "0"   "5"   "2"  
[13] "10"  "40"  ""    "5"   "3"   "20"  "0"   "10"  "2"   "10"  "8"   "8"  
[25] "2"   "12"  "5"   "6"   "4"   "4"   "0"   "5"   "0"   "0"   "0"   "4"  
[37] "2"   "3"   "14"  "0"   "3"   "7"   "0"   "3"   "7"   "4"   "5"   "1.5"
[49] "4"   "10"  "0"   "0"   "4"   "2"   "0"   "0"   "1"   "0"   "2"   "2"  
[61] "0"   "0.5" "6"   "3"   "6"   "1"   "5"  
The final product
 [1] "20"   "6"    "10"   "2"    "none" "10"   "15"   "3"    "0"    "0"   
[11] "5"    "2"    "10"   "40"   "zero" "5"    "3"    "20"   "0"    "10"  
[21] "2"    "10"   "8"    "8"    "2"    "12"   "5"    "6"    "4"    "4"   
 [1] 20.0  6.0 10.0  2.0   NA 10.0 15.0  3.0  0.0  0.0  5.0  2.0 10.0 40.0   NA
[16]  5.0  3.0 20.0  0.0 10.0  2.0 10.0  8.0  8.0  2.0 12.0  5.0  6.0  4.0  4.0
[31]  0.0  5.0  0.0  0.0  0.0  4.0  2.0  3.0 14.0  0.0  3.0  7.0  0.0  3.0  7.0
[46]  4.0  5.0  1.5  4.0 10.0  0.0  0.0  4.0  2.0  0.0  0.0  1.0  0.0  2.0  2.0
[61]  0.0  0.5  6.0  3.0  6.0  1.0  5.0
- As a last step, we should go through and figure out if any of the 
NAvalues should really be0.- This step is not shown here.
 
 
Rebuilding our data
'data.frame':   67 obs. of  6 variables:
 $ Program        : chr  "PPM" "MISM" "MISM" "PPM" ...
 $ PriorExp       : chr  "Some experience" "Some experience" "Some experience" "Some experience" ...
 $ Rexperience    : chr  "Basic competence" "Never used" "Never used" "Basic competence" ...
 $ OperatingSystem: chr  "Windows" "Mac OS X" "Mac OS X" "Mac OS X" ...
 $ TVhours        : num  20 6 10 2 NA 10 15 3 0 0 ...
 $ Editor         : chr  "Microsoft Word" "Microsoft Word" "Microsoft Word" "R Markdown" ...
- Success!
 
A different approach
- We can also handle this problem by setting 
stringsAsFactors = FALSEwhen importing our data. 
survey.meayssy <- read.csv("http://www.andrew.cmu.edu/user/achoulde/94842/data/survey_data2019_messy.csv", 
                         header=TRUE, stringsAsFactors=FALSE)
str(survey.messy)'data.frame':   67 obs. of  6 variables:
 $ Program        : chr  "PPM" "MISM" "MISM" "PPM" ...
 $ PriorExp       : chr  "Some experience" "Some experience" "Some experience" "Some experience" ...
 $ Rexperience    : chr  "Basic competence" "Never used" "Never used" "Basic competence" ...
 $ OperatingSystem: chr  "Windows" "Mac OS X" "Mac OS X" "Mac OS X" ...
 $ TVhours        : chr  "20" "6" "10" "2" ...
 $ Editor         : chr  "Microsoft Word" "Microsoft Word" "Microsoft Word" "R Markdown" ...
- Now everything is a 
characterinstead of afactor 
One-line cleanup
- Let’s clean up the 
TVhourscolumn and cast it to numeric all in one command 
'data.frame':   67 obs. of  6 variables:
 $ Program        : chr  "PPM" "MISM" "MISM" "PPM" ...
 $ PriorExp       : chr  "Some experience" "Some experience" "Some experience" "Some experience" ...
 $ Rexperience    : chr  "Basic competence" "Never used" "Never used" "Basic competence" ...
 $ OperatingSystem: chr  "Windows" "Mac OS X" "Mac OS X" "Mac OS X" ...
 $ TVhours        : num  20 6 10 2 NA 10 15 3 0 0 ...
 $ Editor         : chr  "Microsoft Word" "Microsoft Word" "Microsoft Word" "R Markdown" ...
What about all those other character variables?
 MISM Other   PPM 
    8    17    42 
 MISM Other   PPM 
    8    17    42 
- Having factors coded as characters may be OK for many parts of our analysis
 
If we wanted to, here’s how we could fix things
mutate_if(.tbl, .predicate, .funs) applies a functions .funs to the elements of .tbl for which the predicate (condition) .predicate holds.
Here is how we can use mutate_if to convert all character columns to factors.
'data.frame':   67 obs. of  6 variables:
 $ Program        : Factor w/ 3 levels "MISM","Other",..: 3 1 1 3 3 3 3 3 3 3 ...
 $ PriorExp       : Factor w/ 3 levels "Extensive experience",..: 3 3 3 3 3 3 2 3 3 3 ...
 $ Rexperience    : Factor w/ 4 levels "Basic competence",..: 1 4 4 1 3 1 3 1 3 4 ...
 $ OperatingSystem: Factor w/ 2 levels "Mac OS X","Windows": 2 1 1 1 2 1 2 2 2 2 ...
 $ TVhours        : num  20 6 10 2 NA 10 15 3 0 0 ...
 $ Editor         : Factor w/ 5 levels "Jupyter Notebook",..: 3 3 3 4 3 3 3 3 3 3 ...
- Success!
 
Another common problem
In various homework assignments, you’ll learn how to wrangle with another common problem
When data is entered manually, misspellings and case changes are very common
E.g., a column showing treatment information may look like,
treatment <- as.factor(c("dialysis", "Ventilation", "Dialysis", "dialysis", "none", "None", "nnone", "dyalysis", "dialysis", "ventilation", "none"))
summary(treatment)   dialysis    Dialysis    dyalysis       nnone        none        None 
          3           1           1           1           2           1 
ventilation Ventilation 
          1           1 
   dialysis    Dialysis    dyalysis       nnone        none        None 
          3           1           1           1           2           1 
ventilation Ventilation 
          1           1 
This factor has 8 levels even though it should have 3 (dialysis, ventilation, none)
We can fix many of the typos by running spellcheck in Excel before importing data, or by changing the values on a case-by-case basis later
There’s a faster way to fix just the capitalization issue (this will be an exercise in one of the HW sets)
What are all these map<*> functions?
These are all efficient ways of applying a function to margins of an array or elements of a list
Before we talk about them in detail, we need to understand their more cumbersome but more general alternative: loops
loops are ways of iterating over data
The
map<*>functions and their<*>applybase-R ancestors can be thought of as good alternatives to loops
for loops
For loops: a pair of examples
[1] 1
[1] 2
[1] 3
[1] 4
phrase <- "Good Night,"
for(word in c("and", "Good", "Luck")) {
  phrase <- paste(phrase, word)
  print(phrase)
}[1] "Good Night, and"
[1] "Good Night, and Good"
[1] "Good Night, and Good Luck"
For loops: syntax
A for loop executes a chunk of code for every value of an index variable in an index set
- The basic syntax takes the form
 
- The index set is often a vector of integers, but can be more general
 
Example
index.set <- list(name="Michael", weight=185, is.male=TRUE) # a list
for(i in index.set) {
  print(c(i, typeof(i)))
}[1] "Michael"   "character"
[1] "185"    "double"
[1] "TRUE"    "logical"
Example: Calculate sum of each column
fake.data <- matrix(rnorm(500), ncol=5) # create fake 100 x 5 data set
head(fake.data,2) # print first two rows          [,1]       [,2]      [,3]      [,4]       [,5]
[1,] -1.146664 -0.7225478 0.3190678 0.8936148 -0.1715746
[2,] -1.534036 -0.2740636 1.4878076 1.0369705  1.6525823
col.sums <- numeric(ncol(fake.data)) # variable to store running column sums
for(i in 1:nrow(fake.data)) {
  col.sums <- col.sums + fake.data[i,] # add ith observation to the sum
}
col.sums[1]   2.497672   6.612204 -27.731727  -7.397220 -14.098159
[1]   2.497672   6.612204 -27.731727  -7.397220 -14.098159
while loops
- while loops repeat a chunk of code while the specified condition remains true
 
We won’t really be using while loops in this class
Just be aware that they exist, and that they may become useful to you at some point in your analytics career
Loop alternatives
| Command | Description | 
|---|---|
apply(X, MARGIN, FUN) | 
Obtain a vector/array/list by applying FUN along the specified MARGIN of an array or matrix X | 
map(.x, .f, ...) | 
Obtain a list by applying .f to every element of a list or atomic vector .x | 
map_<type>(.x, .f, ...) | 
For <type> given by lgl (logical), int (integer), dbl (double) or chr (character), return a vector of this type obtained by applying .f to each element of .x | 
map_at(.x, .at, .f) | 
Obtain a list by applying .f to the elements of .x specified by name or index given in .at | 
map_if(.x, .p, .f) | 
Obtain a list .f to the elements of .x specified by .p (a predicate function, or a logical vector) | 
mutate_all/_at/_if | 
Mutate all variables, specified (at) variables, or those selected by a predicate (if) | 
summarize_all/_at/_if | 
Summarize all variables, specified variables, or those selected by a predicate (if) | 
These take practice to get used to, but make analysis easier to debug and less prone to error when used effectively
The best way to learn them is by looking at a bunch of examples. The end of each help file contains some examples.
Example: apply()
[1]  0.02497672  0.06612204 -0.27731727 -0.07397220 -0.14098159
[1]  0.02497672  0.06612204 -0.27731727 -0.07397220 -0.14098159
# MARGIN = 1 for rows, 2 for columns
# Function that calculates proportion
# of vector indices that are > 0
propPositive <- function(x) mean(x > 0)
apply(fake.data, MARGIN=2, FUN=propPositive) [1] 0.44 0.49 0.42 0.47 0.39
Example: map, map_() 
$Program
[1] TRUE
$PriorExp
[1] TRUE
$Rexperience
[1] TRUE
$OperatingSystem
[1] TRUE
$TVhours
[1] FALSE
$Editor
[1] TRUE
        Program        PriorExp     Rexperience OperatingSystem         TVhours 
           TRUE            TRUE            TRUE            TRUE           FALSE 
         Editor 
           TRUE 
Example: apply(), map(), map_() 
speed  dist 
15.40 42.98 
$speed
[1] 15.4
$dist
[1] 42.98
speed  dist 
15.40 42.98 
Example: mutate_if
Let’s convert all factor variables in Cars93 to lowercase
[1] Small   Midsize Compact Midsize Midsize Midsize
Levels: Compact Large Midsize Small Sporty Van
[1] "small"   "midsize" "compact" "midsize" "midsize" "midsize"
Note: this operation is performed “in place”, replacing prior versions of the variables
Example: mutate_if, adding instead of replacing columns
If you pass the functions in as a list with named elements, those names get appended to create modified versions of variables instead of replacing existing variables
[1] Small   Midsize Compact Midsize Midsize Midsize
Levels: Compact Large Midsize Small Sporty Van
[1] "small"   "midsize" "compact" "midsize" "midsize" "midsize"
Example: mutate_at
Let’s convert from MPG to KPML but this time using mutate_at
Cars93.metric <- Cars93 %>% 
  mutate_at(c("MPG.city", "MPG.highway"), 
            list(KMPL = ~ 0.425 * .x))
tail(colnames(Cars93.metric))[1] "Luggage.room"     "Weight"           "Origin"           "Make"            
[5] "MPG.city_KMPL"    "MPG.highway_KMPL"
Here, ~ 0.425 * .x is an example of specifying a “lambda” (anonymous) function. It is permitted short-hand for
Example: summarize_if
Let’s get the mean of every numeric column in Cars93
  Min.Price    Price Max.Price MPG.city MPG.highway EngineSize Horsepower
1  17.12581 19.50968  21.89892 22.36559    29.08602   2.667742    143.828
       RPM Rev.per.mile Fuel.tank.capacity Passengers   Length Wheelbase
1 5280.645     2332.204           16.66452   5.086022 183.2043  103.9462
     Width Turn.circle Rear.seat.room Luggage.room   Weight
1 69.37634    38.95699             NA           NA 3072.903
  Min.Price_mean Price_mean Max.Price_mean MPG.city_mean MPG.highway_mean
1       17.12581   19.50968       21.89892      22.36559         29.08602
  EngineSize_mean Horsepower_mean RPM_mean Rev.per.mile_mean
1        2.667742         143.828 5280.645          2332.204
  Fuel.tank.capacity_mean Passengers_mean Length_mean Wheelbase_mean Width_mean
1                16.66452        5.086022    183.2043       103.9462   69.37634
  Turn.circle_mean Rear.seat.room_mean Luggage.room_mean Weight_mean
1         38.95699            27.82967          13.89024    3072.903
Example: summarize_at
Let’s get the average fuel economy of all vehicles, grouped by their Type
We’ll use the helper function contains() to indicate that we want all variables that contain MPG.
# A tibble: 6 x 3
  Type    MPG.city MPG.highway
* <fct>      <dbl>       <dbl>
1 Compact     22.7        29.9
2 Large       18.4        26.7
3 Midsize     19.5        26.7
4 Small       29.9        35.5
5 Sporty      21.8        28.8
6 Van         17          21.9
Another approach
We might learn about a bunch of helper functions like contains() and starts_with(). Here’s one way of performing the previous operation with the help of these functions, and appending _mean to the resulting output.
# A tibble: 6 x 3
  Type    MPG.city_mean MPG.highway_mean
* <fct>           <dbl>            <dbl>
1 Compact          22.7             29.9
2 Large            18.4             26.7
3 Midsize          19.5             26.7
4 Small            29.9             35.5
5 Sporty           21.8             28.8
6 Van              17               21.9
More than one grouping variable
# A tibble: 6 x 4
# Groups:   Origin [2]
  Origin  AirBags            MPG.city_mean MPG.highway_mean
  <fct>   <fct>                      <dbl>            <dbl>
1 USA     Driver & Passenger          19               27.2
2 USA     Driver only                 20.2             27.5
3 USA     None                        23.1             29.6
4 non-USA Driver & Passenger          20.3             27  
5 non-USA Driver only                 23.2             29.4
6 non-USA None                        25.9             32  
License
This document is created for Math 514, Spring 2021, at Illinois Tech. While the course materials are generally not to be distributed outside the course without permission of the instructor, this particular set of notes is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
This set of notes is adapted from Prof. Alexandra Chouldechova at CMU, under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.