Topic 1 Interlude: R - Basic cleaning, loops, and alternatives
Context
What is the purpose of these notes?
- Provide a few small examples of functions in R;
- Provide html/Markdown with several lines of
R
code you can use to practice writing functions.
Agenda
- A common data cleaning task
- For/while loops to iterate over data
- Helpful variants of
map
,mutate
andsummarize
A common problem: messy data
One of the most common problems you’ll encounter when importing manually-entered data is inconsistent data types within columns
For a simple example, let’s look at
TVhours
column in a messy version of the survey data from the following link http://www.andrew.cmu.edu/user/achoulde/94842/data/survey_data2019_messy.csv:
survey.messy <- read.csv("http://www.andrew.cmu.edu/user/achoulde/94842/data/survey_data2019_messy.csv",
header=TRUE)
survey.messy$TVhours
[1] "20" "6" "10" "2" "none" "10"
[7] "15" "3" "0" "0" "5" "2"
[13] "10" "40" "zero" "5" "3" "20"
[19] "0" "10" "2" "10" "8" "8"
[25] "2" "12" "5" "6" "4" "4"
[31] "0" "5" "0" "0" "0" "4"
[37] "2" "3" "14" "0" "3" "7"
[43] "0" "3" "7" "4" "5" "1.5"
[49] "4" "approx 10" "0" "0" "4" "2"
[55] "0" "0" "1" "0" "2" "2"
[61] "0" "0.5" "6ish" "3" "6" "1"
[67] "5"
What’s happening?
'data.frame': 67 obs. of 6 variables:
$ Program : chr "PPM" "MISM" "MISM" "PPM" ...
$ PriorExp : chr "Some experience" "Some experience" "Some experience" "Some experience" ...
$ Rexperience : chr "Basic competence" "Never used" "Never used" "Basic competence" ...
$ OperatingSystem: chr "Windows" "Mac OS X" "Mac OS X" "Mac OS X" ...
$ TVhours : chr "20" "6" "10" "2" ...
$ Editor : chr "Microsoft Word" "Microsoft Word" "Microsoft Word" "R Markdown" ...
Several of the entries have non-numeric values in them (they contain strings)
As a result,
TVhours
is being imported asfactor
A look at the TVhours column
[1] "20" "6" "10" "2" "none" "10"
[7] "15" "3" "0" "0" "5" "2"
[13] "10" "40" "zero" "5" "3" "20"
[19] "0" "10" "2" "10" "8" "8"
[25] "2" "12" "5" "6" "4" "4"
[31] "0" "5" "0" "0" "0" "4"
[37] "2" "3" "14" "0" "3" "7"
[43] "0" "3" "7" "4" "5" "1.5"
[49] "4" "approx 10" "0" "0" "4" "2"
[55] "0" "0" "1" "0" "2" "2"
[61] "0" "0.5" "6ish" "3" "6" "1"
[67] "5"
Attempt at a fix
- What if we just try to cast it back to numeric?
[1] "20" "6" "10" "2" "none" "10"
[7] "15" "3" "0" "0" "5" "2"
[13] "10" "40" "zero" "5" "3" "20"
[19] "0" "10" "2" "10" "8" "8"
[25] "2" "12" "5" "6" "4" "4"
[31] "0" "5" "0" "0" "0" "4"
[37] "2" "3" "14" "0" "3" "7"
[43] "0" "3" "7" "4" "5" "1.5"
[49] "4" "approx 10" "0" "0" "4" "2"
[55] "0" "0" "1" "0" "2" "2"
[61] "0" "0.5" "6ish" "3" "6" "1"
[67] "5"
Warning: NAs introduced by coercion
[1] 20.0 6.0 10.0 2.0 NA 10.0 15.0 3.0 0.0 0.0 5.0 2.0 10.0 40.0 NA
[16] 5.0 3.0 20.0 0.0 10.0 2.0 10.0 8.0 8.0 2.0 12.0 5.0 6.0 4.0 4.0
[31] 0.0 5.0 0.0 0.0 0.0 4.0 2.0 3.0 14.0 0.0 3.0 7.0 0.0 3.0 7.0
[46] 4.0 5.0 1.5 4.0 NA 0.0 0.0 4.0 2.0 0.0 0.0 1.0 0.0 2.0 2.0
[61] 0.0 0.5 NA 3.0 6.0 1.0 5.0
That didn’t work…
[1] "20" "6" "10" "2" "none" "10" "15" "3" "0" "0"
[11] "5" "2" "10" "40" "zero" "5" "3" "20" "0" "10"
[21] "2" "10" "8" "8" "2" "12" "5" "6" "4" "4"
[31] "0" "5" "0" "0" "0" "4" "2" "3" "14" "0"
Warning in head(as.numeric(tv.hours.messy), 40): NAs introduced by coercion
[1] 20 6 10 2 NA 10 15 3 0 0 5 2 10 40 NA 5 3 20 0 10 2 10 8 8 2
[26] 12 5 6 4 4 0 5 0 0 0 4 2 3 14 0
This just converted all the values into the integer-coded levels of the factor
Not what we wanted!
Something that does work
- Consider the following simple example
[1] 3.1 2.5
Levels: 2.5 3.1
[1] 2 1
[1] 3.1 2.5
If we take a number that’s being coded as a factor and first turn it into a
character
string, then converting the string to a numeric gets back the number
Back to the corrupted TVhours column
[1] "20" "6" "10" "2" "none" "10"
[7] "15" "3" "0" "0" "5" "2"
[13] "10" "40" "zero" "5" "3" "20"
[19] "0" "10" "2" "10" "8" "8"
[25] "2" "12" "5" "6" "4" "4"
[31] "0" "5" "0" "0" "0" "4"
[37] "2" "3" "14" "0" "3" "7"
[43] "0" "3" "7" "4" "5" "1.5"
[49] "4" "approx 10" "0" "0" "4" "2"
[55] "0" "0" "1" "0" "2" "2"
[61] "0" "0.5" "6ish" "3" "6" "1"
[67] "5"
Warning: NAs introduced by coercion
[1] 20.0 6.0 10.0 2.0 NA 10.0 15.0 3.0 0.0 0.0 5.0 2.0 10.0 40.0 NA
[16] 5.0 3.0 20.0 0.0 10.0 2.0 10.0 8.0 8.0 2.0 12.0 5.0 6.0 4.0 4.0
[31] 0.0 5.0 0.0 0.0 0.0 4.0 2.0 3.0 14.0 0.0 3.0 7.0 0.0 3.0 7.0
[46] 4.0 5.0 1.5 4.0 NA 0.0 0.0 4.0 2.0 0.0 0.0 1.0 0.0 2.0 2.0
[61] 0.0 0.5 NA 3.0 6.0 1.0 5.0
Warning in typeof(as.numeric(as.character(tv.hours.messy))): NAs introduced by
coercion
[1] "double"
A small improvement
All the corrupted cells now appear as
NA
, which is R’s missing indicatorWe can do a little better by cleaning up the vector once we get it to
character
form
[1] "20" "6" "10" "2" "none" "10"
[7] "15" "3" "0" "0" "5" "2"
[13] "10" "40" "zero" "5" "3" "20"
[19] "0" "10" "2" "10" "8" "8"
[25] "2" "12" "5" "6" "4" "4"
[31] "0" "5" "0" "0" "0" "4"
[37] "2" "3" "14" "0" "3" "7"
[43] "0" "3" "7" "4" "5" "1.5"
[49] "4" "approx 10" "0" "0" "4" "2"
[55] "0" "0" "1" "0" "2" "2"
[61] "0" "0.5" "6ish" "3" "6" "1"
[67] "5"
Deleting non-numeric (or .) characters
[1] "20" "6" "10" "2" "none" "10"
[7] "15" "3" "0" "0" "5" "2"
[13] "10" "40" "zero" "5" "3" "20"
[19] "0" "10" "2" "10" "8" "8"
[25] "2" "12" "5" "6" "4" "4"
[31] "0" "5" "0" "0" "0" "4"
[37] "2" "3" "14" "0" "3" "7"
[43] "0" "3" "7" "4" "5" "1.5"
[49] "4" "approx 10" "0" "0" "4" "2"
[55] "0" "0" "1" "0" "2" "2"
[61] "0" "0.5" "6ish" "3" "6" "1"
[67] "5"
# Use gsub() to replace everything except digits and '.' with a blank ""
gsub("[^0-9.]", "", tv.hours.strings)
[1] "20" "6" "10" "2" "" "10" "15" "3" "0" "0" "5" "2"
[13] "10" "40" "" "5" "3" "20" "0" "10" "2" "10" "8" "8"
[25] "2" "12" "5" "6" "4" "4" "0" "5" "0" "0" "0" "4"
[37] "2" "3" "14" "0" "3" "7" "0" "3" "7" "4" "5" "1.5"
[49] "4" "10" "0" "0" "4" "2" "0" "0" "1" "0" "2" "2"
[61] "0" "0.5" "6" "3" "6" "1" "5"
The final product
[1] "20" "6" "10" "2" "none" "10" "15" "3" "0" "0"
[11] "5" "2" "10" "40" "zero" "5" "3" "20" "0" "10"
[21] "2" "10" "8" "8" "2" "12" "5" "6" "4" "4"
[1] 20.0 6.0 10.0 2.0 NA 10.0 15.0 3.0 0.0 0.0 5.0 2.0 10.0 40.0 NA
[16] 5.0 3.0 20.0 0.0 10.0 2.0 10.0 8.0 8.0 2.0 12.0 5.0 6.0 4.0 4.0
[31] 0.0 5.0 0.0 0.0 0.0 4.0 2.0 3.0 14.0 0.0 3.0 7.0 0.0 3.0 7.0
[46] 4.0 5.0 1.5 4.0 10.0 0.0 0.0 4.0 2.0 0.0 0.0 1.0 0.0 2.0 2.0
[61] 0.0 0.5 6.0 3.0 6.0 1.0 5.0
- As a last step, we should go through and figure out if any of the
NA
values should really be0
.- This step is not shown here.
Rebuilding our data
'data.frame': 67 obs. of 6 variables:
$ Program : chr "PPM" "MISM" "MISM" "PPM" ...
$ PriorExp : chr "Some experience" "Some experience" "Some experience" "Some experience" ...
$ Rexperience : chr "Basic competence" "Never used" "Never used" "Basic competence" ...
$ OperatingSystem: chr "Windows" "Mac OS X" "Mac OS X" "Mac OS X" ...
$ TVhours : num 20 6 10 2 NA 10 15 3 0 0 ...
$ Editor : chr "Microsoft Word" "Microsoft Word" "Microsoft Word" "R Markdown" ...
- Success!
A different approach
- We can also handle this problem by setting
stringsAsFactors = FALSE
when importing our data.
survey.meayssy <- read.csv("http://www.andrew.cmu.edu/user/achoulde/94842/data/survey_data2019_messy.csv",
header=TRUE, stringsAsFactors=FALSE)
str(survey.messy)
'data.frame': 67 obs. of 6 variables:
$ Program : chr "PPM" "MISM" "MISM" "PPM" ...
$ PriorExp : chr "Some experience" "Some experience" "Some experience" "Some experience" ...
$ Rexperience : chr "Basic competence" "Never used" "Never used" "Basic competence" ...
$ OperatingSystem: chr "Windows" "Mac OS X" "Mac OS X" "Mac OS X" ...
$ TVhours : chr "20" "6" "10" "2" ...
$ Editor : chr "Microsoft Word" "Microsoft Word" "Microsoft Word" "R Markdown" ...
- Now everything is a
character
instead of afactor
One-line cleanup
- Let’s clean up the
TVhours
column and cast it to numeric all in one command
'data.frame': 67 obs. of 6 variables:
$ Program : chr "PPM" "MISM" "MISM" "PPM" ...
$ PriorExp : chr "Some experience" "Some experience" "Some experience" "Some experience" ...
$ Rexperience : chr "Basic competence" "Never used" "Never used" "Basic competence" ...
$ OperatingSystem: chr "Windows" "Mac OS X" "Mac OS X" "Mac OS X" ...
$ TVhours : num 20 6 10 2 NA 10 15 3 0 0 ...
$ Editor : chr "Microsoft Word" "Microsoft Word" "Microsoft Word" "R Markdown" ...
What about all those other character
variables?
MISM Other PPM
8 17 42
MISM Other PPM
8 17 42
- Having factors coded as characters may be OK for many parts of our analysis
If we wanted to, here’s how we could fix things
mutate_if(.tbl, .predicate, .funs)
applies a functions .funs
to the elements of .tbl
for which the predicate (condition) .predicate
holds.
Here is how we can use mutate_if
to convert all character columns to factors.
'data.frame': 67 obs. of 6 variables:
$ Program : Factor w/ 3 levels "MISM","Other",..: 3 1 1 3 3 3 3 3 3 3 ...
$ PriorExp : Factor w/ 3 levels "Extensive experience",..: 3 3 3 3 3 3 2 3 3 3 ...
$ Rexperience : Factor w/ 4 levels "Basic competence",..: 1 4 4 1 3 1 3 1 3 4 ...
$ OperatingSystem: Factor w/ 2 levels "Mac OS X","Windows": 2 1 1 1 2 1 2 2 2 2 ...
$ TVhours : num 20 6 10 2 NA 10 15 3 0 0 ...
$ Editor : Factor w/ 5 levels "Jupyter Notebook",..: 3 3 3 4 3 3 3 3 3 3 ...
- Success!
Another common problem
In various homework assignments, you’ll learn how to wrangle with another common problem
When data is entered manually, misspellings and case changes are very common
E.g., a column showing treatment information may look like,
treatment <- as.factor(c("dialysis", "Ventilation", "Dialysis", "dialysis", "none", "None", "nnone", "dyalysis", "dialysis", "ventilation", "none"))
summary(treatment)
dialysis Dialysis dyalysis nnone none None
3 1 1 1 2 1
ventilation Ventilation
1 1
dialysis Dialysis dyalysis nnone none None
3 1 1 1 2 1
ventilation Ventilation
1 1
This factor has 8 levels even though it should have 3 (dialysis, ventilation, none)
We can fix many of the typos by running spellcheck in Excel before importing data, or by changing the values on a case-by-case basis later
There’s a faster way to fix just the capitalization issue (this will be an exercise in one of the HW sets)
What are all these map<*>
functions?
These are all efficient ways of applying a function to margins of an array or elements of a list
Before we talk about them in detail, we need to understand their more cumbersome but more general alternative: loops
loops are ways of iterating over data
The
map<*>
functions and their<*>apply
base-R ancestors can be thought of as good alternatives to loops
for
loops
For loops: a pair of examples
[1] 1
[1] 2
[1] 3
[1] 4
phrase <- "Good Night,"
for(word in c("and", "Good", "Luck")) {
phrase <- paste(phrase, word)
print(phrase)
}
[1] "Good Night, and"
[1] "Good Night, and Good"
[1] "Good Night, and Good Luck"
For loops: syntax
A for loop executes a chunk of code for every value of an index variable in an index set
- The basic syntax takes the form
- The index set is often a vector of integers, but can be more general
Example
index.set <- list(name="Michael", weight=185, is.male=TRUE) # a list
for(i in index.set) {
print(c(i, typeof(i)))
}
[1] "Michael" "character"
[1] "185" "double"
[1] "TRUE" "logical"
Example: Calculate sum of each column
fake.data <- matrix(rnorm(500), ncol=5) # create fake 100 x 5 data set
head(fake.data,2) # print first two rows
[,1] [,2] [,3] [,4] [,5]
[1,] -1.146664 -0.7225478 0.3190678 0.8936148 -0.1715746
[2,] -1.534036 -0.2740636 1.4878076 1.0369705 1.6525823
col.sums <- numeric(ncol(fake.data)) # variable to store running column sums
for(i in 1:nrow(fake.data)) {
col.sums <- col.sums + fake.data[i,] # add ith observation to the sum
}
col.sums
[1] 2.497672 6.612204 -27.731727 -7.397220 -14.098159
[1] 2.497672 6.612204 -27.731727 -7.397220 -14.098159
while
loops
- while loops repeat a chunk of code while the specified condition remains true
We won’t really be using while loops in this class
Just be aware that they exist, and that they may become useful to you at some point in your analytics career
Loop alternatives
Command | Description |
---|---|
apply(X, MARGIN, FUN) |
Obtain a vector/array/list by applying FUN along the specified MARGIN of an array or matrix X |
map(.x, .f, ...) |
Obtain a list by applying .f to every element of a list or atomic vector .x |
map_<type>(.x, .f, ...) |
For <type> given by lgl (logical), int (integer), dbl (double) or chr (character), return a vector of this type obtained by applying .f to each element of .x |
map_at(.x, .at, .f) |
Obtain a list by applying .f to the elements of .x specified by name or index given in .at |
map_if(.x, .p, .f) |
Obtain a list .f to the elements of .x specified by .p (a predicate function, or a logical vector) |
mutate_all/_at/_if |
Mutate all variables, specified (at) variables, or those selected by a predicate (if) |
summarize_all/_at/_if |
Summarize all variables, specified variables, or those selected by a predicate (if) |
These take practice to get used to, but make analysis easier to debug and less prone to error when used effectively
The best way to learn them is by looking at a bunch of examples. The end of each help file contains some examples.
Example: apply()
[1] 0.02497672 0.06612204 -0.27731727 -0.07397220 -0.14098159
[1] 0.02497672 0.06612204 -0.27731727 -0.07397220 -0.14098159
# MARGIN = 1 for rows, 2 for columns
# Function that calculates proportion
# of vector indices that are > 0
propPositive <- function(x) mean(x > 0)
apply(fake.data, MARGIN=2, FUN=propPositive)
[1] 0.44 0.49 0.42 0.47 0.39
Example: map, map_()
$Program
[1] TRUE
$PriorExp
[1] TRUE
$Rexperience
[1] TRUE
$OperatingSystem
[1] TRUE
$TVhours
[1] FALSE
$Editor
[1] TRUE
Program PriorExp Rexperience OperatingSystem TVhours
TRUE TRUE TRUE TRUE FALSE
Editor
TRUE
Example: apply(), map(), map_()
speed dist
15.40 42.98
$speed
[1] 15.4
$dist
[1] 42.98
speed dist
15.40 42.98
Example: mutate_if
Let’s convert all factor variables in Cars93 to lowercase
[1] Small Midsize Compact Midsize Midsize Midsize
Levels: Compact Large Midsize Small Sporty Van
[1] "small" "midsize" "compact" "midsize" "midsize" "midsize"
Note: this operation is performed “in place”, replacing prior versions of the variables
Example: mutate_if, adding instead of replacing columns
If you pass the functions in as a list with named elements, those names get appended to create modified versions of variables instead of replacing existing variables
[1] Small Midsize Compact Midsize Midsize Midsize
Levels: Compact Large Midsize Small Sporty Van
[1] "small" "midsize" "compact" "midsize" "midsize" "midsize"
Example: mutate_at
Let’s convert from MPG to KPML but this time using mutate_at
Cars93.metric <- Cars93 %>%
mutate_at(c("MPG.city", "MPG.highway"),
list(KMPL = ~ 0.425 * .x))
tail(colnames(Cars93.metric))
[1] "Luggage.room" "Weight" "Origin" "Make"
[5] "MPG.city_KMPL" "MPG.highway_KMPL"
Here, ~ 0.425 * .x
is an example of specifying a “lambda” (anonymous) function. It is permitted short-hand for
Example: summarize_if
Let’s get the mean of every numeric column in Cars93
Min.Price Price Max.Price MPG.city MPG.highway EngineSize Horsepower
1 17.12581 19.50968 21.89892 22.36559 29.08602 2.667742 143.828
RPM Rev.per.mile Fuel.tank.capacity Passengers Length Wheelbase
1 5280.645 2332.204 16.66452 5.086022 183.2043 103.9462
Width Turn.circle Rear.seat.room Luggage.room Weight
1 69.37634 38.95699 NA NA 3072.903
Min.Price_mean Price_mean Max.Price_mean MPG.city_mean MPG.highway_mean
1 17.12581 19.50968 21.89892 22.36559 29.08602
EngineSize_mean Horsepower_mean RPM_mean Rev.per.mile_mean
1 2.667742 143.828 5280.645 2332.204
Fuel.tank.capacity_mean Passengers_mean Length_mean Wheelbase_mean Width_mean
1 16.66452 5.086022 183.2043 103.9462 69.37634
Turn.circle_mean Rear.seat.room_mean Luggage.room_mean Weight_mean
1 38.95699 27.82967 13.89024 3072.903
Example: summarize_at
Let’s get the average fuel economy of all vehicles, grouped by their Type
We’ll use the helper function contains()
to indicate that we want all variables that contain MPG.
# A tibble: 6 x 3
Type MPG.city MPG.highway
* <fct> <dbl> <dbl>
1 Compact 22.7 29.9
2 Large 18.4 26.7
3 Midsize 19.5 26.7
4 Small 29.9 35.5
5 Sporty 21.8 28.8
6 Van 17 21.9
Another approach
We might learn about a bunch of helper functions like contains()
and starts_with()
. Here’s one way of performing the previous operation with the help of these functions, and appending _mean
to the resulting output.
# A tibble: 6 x 3
Type MPG.city_mean MPG.highway_mean
* <fct> <dbl> <dbl>
1 Compact 22.7 29.9
2 Large 18.4 26.7
3 Midsize 19.5 26.7
4 Small 29.9 35.5
5 Sporty 21.8 28.8
6 Van 17 21.9
More than one grouping variable
# A tibble: 6 x 4
# Groups: Origin [2]
Origin AirBags MPG.city_mean MPG.highway_mean
<fct> <fct> <dbl> <dbl>
1 USA Driver & Passenger 19 27.2
2 USA Driver only 20.2 27.5
3 USA None 23.1 29.6
4 non-USA Driver & Passenger 20.3 27
5 non-USA Driver only 23.2 29.4
6 non-USA None 25.9 32
License
This document is created for Math 514, Spring 2021, at Illinois Tech. While the course materials are generally not to be distributed outside the course without permission of the instructor, this particular set of notes is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
This set of notes is adapted from Prof. Alexandra Chouldechova at CMU, under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.