Data Challenge Lab Home

Vector functions [program]

(Builds on: Manipulation basics, Function basics)
(Leads to: purrr inside mutate)

Vector functions take a vector as input and produce a vector of the same length as output. This is very helpful when working with vectors. For example, instead of taking the log of each element of the vector x, you can just call log10() once:

x <- c(5, 2, 1)

log10(x)
#> [1] 0.69897 0.30103 0.00000

The simple mathematical operators are also vector functions:

y <- c(1, 2, 4)

x + y
#> [1] 6 4 5

x * y
#> [1] 5 4 4

In contrast, functions that can only take a length one input and produce a length one output are called scalar functions.

As you’ll see in the next section, the distinction between scalar and vector functions is important when working with tibbles.

Temperature recommendations

A common way to create a scalar function is by using an if-else statement. For example, you might write the following function that tells you what to do based on the temperature outside:

recommendation_1 <- function(x) {
  if (x >= 90) {
    "locate air conditioning"
  } else if (x >= 60) {
    "go outside"
  } else if (x >= 30) {
    "wear a jacket"
  } else if (x >= 0) {
    "wear multiple jackets"
  } else {
    "move"
  }
}

This works well when applied to single values:

recommendation_1(92)
#> [1] "locate air conditioning"

recommendation_1(34)
#> [1] "wear a jacket"

recommendation_1(-15)
#> [1] "move"

but fails when applied to a vector with more than one element:

temps <- c(1, 55, 101)

recommendation_1(temps)
#> Warning in if (x >= 90) {: the condition has length > 1 and only the first
#> element will be used
#> Warning in if (x >= 60) {: the condition has length > 1 and only the first
#> element will be used
#> Warning in if (x >= 30) {: the condition has length > 1 and only the first
#> element will be used
#> Warning in if (x >= 0) {: the condition has length > 1 and only the first
#> element will be used
#> [1] "wear multiple jackets"

if only works with one element at a time and can’t handle an entire vector. When you give recommendation_1() a vector, it only processes the first element of that vector, which is why recommendation_1() only tells us what to do if it’s 1 degree outside.

Vector functions and mutate()

mutate() creates a value for each row in a tibble. If you want, you can manually give mutate() a vector with a value for each row:

set.seed(523)

df <- tibble(
  temperature = sample(x = -15:110, size = 10, replace = TRUE)
)

df %>% 
  mutate(new_column = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))
#> # A tibble: 10 x 2
#>   temperature new_column
#>         <int>      <dbl>
#> 1           6          1
#> 2         106          2
#> 3          80          3
#> 4          93          4
#> 5          48          5
#> # … with 5 more rows

You can also give mutate() a single value:

df %>% 
  mutate(one_value = 1)
#> # A tibble: 10 x 2
#>   temperature one_value
#>         <int>     <dbl>
#> 1           6         1
#> 2         106         1
#> 3          80         1
#> 4          93         1
#> 5          48         1
#> # … with 5 more rows

and it will repeat that value for each row in the tibble. However, if you try to give mutate() a vector with a length other than 1 or nrow(df), you’ll get an error:

df %>% 
  mutate(two_values = c(1, 2))
#> Error in mutate_impl(.data, dots): Column `two_values` must be length 10 (the number of rows) or one, not 2

As you know well by now, you’ll often create new columns by applying functions to existing ones:

fahrenheit_to_celcius <- function(degrees_fahrenheit) {
  (degrees_fahrenheit - 32) * (5 / 9)
}

df %>% 
  mutate(temperature_celcius = fahrenheit_to_celcius(temperature))
#> # A tibble: 10 x 2
#>   temperature temperature_celcius
#>         <int>               <dbl>
#> 1           6              -14.4 
#> 2         106               41.1 
#> 3          80               26.7 
#> 4          93               33.9 
#> 5          48                8.89
#> # … with 5 more rows

When you pass temperature to fahrenheit_to_celcius(), you pass the entire temperature column, which, as you learned earlier, is a vector. Because mathematical operations are vectorized, fahrenheit_to_celcius() returns a vector of the same length and mutate() successfully creates a new column.

You can probably predict now what will happen if we try to use our scalar function, recommendation_1(), in the same way:

df %>% 
  mutate(recommendation = recommendation_1(temperature))
#> Warning in if (x >= 90) {: the condition has length > 1 and only the first
#> element will be used
#> Warning in if (x >= 60) {: the condition has length > 1 and only the first
#> element will be used
#> Warning in if (x >= 30) {: the condition has length > 1 and only the first
#> element will be used
#> Warning in if (x >= 0) {: the condition has length > 1 and only the first
#> element will be used
#> # A tibble: 10 x 2
#>   temperature recommendation       
#>         <int> <chr>                
#> 1           6 wear multiple jackets
#> 2         106 wear multiple jackets
#> 3          80 wear multiple jackets
#> 4          93 wear multiple jackets
#> 5          48 wear multiple jackets
#> # … with 5 more rows

mutate() passes the entire temperature vector to recommendation_1(), which can’t handle a vector and so only processes the first element of temperature. However, because of how mutate() behaves when given a single value, the recommendation for the first temperature is copied for every single row, which isn’t very helpful.

Vectorizing if-else statements

There are several ways to vectorize recommendation_1() so that it gives an accurate recommendation for each temperature in df.

First, there’s a vectorized if-else function called if_else():

x <- c(1, 3, 4)

if_else(x == 4, true = "four", false = "not four")
#> [1] "not four" "not four" "four"

However, in order to rewrite recommendation_1() using if_else(), we’d need to nest if_else() repeatedly and the function would become difficult to read. Another vector function, case_when(), is a better option:

recommendation_2 <- function(x) {
  case_when(
    x >= 90 ~ "locate air conditioning",
    x >= 60 ~ "go outside",
    x >= 30 ~ "wear a jacket",
    x >= 0  ~ "wear multiple jackets",
    TRUE    ~ "move"
  )
}

recommendation_2(temps)
#> [1] "wear multiple jackets"   "wear a jacket"          
#> [3] "locate air conditioning"

df %>% 
  mutate(recommendation = recommendation_2(temperature))
#> # A tibble: 10 x 2
#>   temperature recommendation         
#>         <int> <chr>                  
#> 1           6 wear multiple jackets  
#> 2         106 locate air conditioning
#> 3          80 go outside             
#> 4          93 locate air conditioning
#> 5          48 wear a jacket          
#> # … with 5 more rows

For other helpful vector functions, take a look at the “Vector Functions” section of the dplyr cheat sheet.