Variables in R can be of different types. For example, we need to distinguish numbers from character strings and tables from simple lists of numbers. The function class helps us determine what type of object we have:

a <- 2
class(a)
#> [1] "numeric"

The following table lists the three most important basic datatypes:

Datatype	Examples
logical	T, F, TRUE, FALSE, 1==1, 5>2, ...
numeric	2.65, pi, 5, 10/3, ...
character	"a", "apples", "1*6-5", "5", "pi", "TRUE", ...
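
As a quick check of the table above, the following sketch calls class on one value of each type. The particular values are only illustrative.

```r
# One or two example values per basic datatype from the table
class(TRUE)
#> [1] "logical"
class(5 > 2)      # a comparison also yields a logical
#> [1] "logical"
class(10 / 3)
#> [1] "numeric"
class("5")        # the quotes make this a character string, not a number
#> [1] "character"
```

Note how "5" and "TRUE" are characters as soon as they are wrapped in quotes, whatever they look like.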

Advanced: R has other basic datatypes as well, such as complex for complex numbers and raw for raw bytes. These two will not be used in this course.

Note that there is no limit to the number of datatypes, because more complex datatypes such as data frames and lists are built from collections of more basic datatypes. But no matter how complex the datatype, at its base you will always find the basic datatypes stated above.

To work efficiently in R, it is important to learn the different types of variables and what we can do with these.

Data frames

Up to now, the variables we have defined are just one number. This is not very useful for storing data. The most common way of storing a dataset in R is in a data frame. Conceptually, we can think of a data frame as a table with rows representing observations and the different variables reported for each observation defining the columns. Data frames are particularly useful for datasets because we can combine different data types into one object.
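
To make the idea concrete, here is a minimal sketch of a data frame built by hand. The object fruit and its columns are hypothetical toy data, not part of the course material; stringsAsFactors = FALSE is set so that the text column stays a character vector in every R version.

```r
# Hypothetical toy data: three observations (rows), three variables (columns)
fruit <- data.frame(
  name = c("apple", "banana", "cherry"),   # character
  price = c(0.50, 0.25, 3.00),             # numeric
  in_stock = c(TRUE, FALSE, TRUE),         # logical
  stringsAsFactors = FALSE
)
class(fruit)
#> [1] "data.frame"
dim(fruit)   # number of rows, then number of columns
#> [1] 3 3
```

Each column has its own type, which is what lets one object hold a whole dataset.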

A large proportion of data analysis challenges start with data stored in a data frame. For example, we stored the data for our motivating example in a data frame. You can access this dataset by loading the dslabs library and loading the murders dataset using the data function:

library(dslabs)
data(murders)

To see that this is in fact a data frame, we type:

class(murders)
#> [1] "data.frame"

Examining an object

The function str is useful for finding out more about the structure of an object:

str(murders)
#> 'data.frame':    51 obs. of  5 variables:
#> $ state : chr "Alabama" "Alaska" "Arizona" "Arkansas" ...
#> $ abb : chr "AL" "AK" "AZ" "AR" ...
#> $ region : Factor w/ 4 levels "Northeast","South",..: 2 4 4 2 4 4 1 2 2
#>    2 ...
#> $ population: num 4779736 710231 6392017 2915918 37253956 ...
#> $ total : num 135 19 232 93 1257 ...

This tells us much more about the object. We see that the table has 51 rows (50 states plus DC) and five variables. We can show the first six lines using the function head:

head(murders)
#>        state abb region population total
#> 1    Alabama  AL  South    4779736   135
#> 2     Alaska  AK   West     710231    19
#> 3    Arizona  AZ   West    6392017   232
#> 4   Arkansas  AR  South    2915918    93
#> 5 California  CA   West   37253956  1257
#> 6   Colorado  CO   West    5029196    65

In this dataset, each state is considered an observation and five variables are reported for each state.

Before we go any further in answering our original question about different states, let’s learn more about the components of this object.

The accessor: `$`

For our analysis, we will need to access the different variables represented by columns included in this data frame. To do this, we use the accessor operator $ in the following way:

murders$population
#>  [1]  4779736   710231  6392017  2915918 37253956  5029196  3574097
#>  [8]   897934   601723 19687653  9920000  1360301  1567582 12830632
#> [15]  6483802  3046355  2853118  4339367  4533372  1328361  5773552
#> [22]  6547629  9883640  5303925  2967297  5988927   989415  1826341
#> [29]  2700551  1316470  8791894  2059179 19378102  9535483   672591
#> [36] 11536504  3751351  3831074 12702379  1052567  4625364   814180
#> [43]  6346105 25145561  2763885   625741  8001024  6724540  1852994
#> [50]  5686986   563626

But how did we know to use population? Previously, by applying the function str to the object murders, we revealed the names for each of the five variables stored in this table. We can quickly access the variable names using:

names(murders)
#> [1] "state"      "abb"        "region"     "population" "total"

It is important to know that the order of the entries in murders$population preserves the order of the rows in our data table. This will later permit us to manipulate one variable based on the results of another. For example, we will be able to order the state names by the number of murders.
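
As a small sketch of why this matters, the code below orders names by a count. The data frame toy is a hypothetical stand-in for a few rows of a table like murders, and the function order is used here ahead of its proper introduction.

```r
# Hypothetical toy data standing in for a table like murders
toy <- data.frame(
  state = c("A", "B", "C"),
  total = c(135, 19, 232),
  stringsAsFactors = FALSE
)
# order() returns the row positions that sort total from low to high
ind <- order(toy$total)
ind
#> [1] 2 1 3
# Because entries line up across columns, the same positions reorder the names
toy$state[ind]
#> [1] "B" "A" "C"
```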

Tip: R comes with a very nice auto-complete functionality that saves us the trouble of typing out all the names. Try typing murders$p then hitting the tab key on your keyboard. This functionality and many other useful auto-complete features are available when working in RStudio.

Vectors: numerics, characters, and logical

The object murders$population is not one number but several. We call these types of objects vectors. A single number, character or any other object is technically a vector of length 1, but in general we use the term vectors to refer to objects with several entries. The function length tells you how many entries are in the vector:

pop <- murders$population
length(pop)
#> [1] 51

This particular vector is numeric since population sizes are numbers:

class(pop)
#> [1] "numeric"

In a numeric vector, every entry must be a number.

To store character strings, vectors can also be of class character. For example, the state names are characters:

class(murders$state)
#> [1] "character"

As with numeric vectors, all entries in a character vector need to be a character.
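
One consequence of this rule can be seen by trying to mix types. In the sketch below, the object mixed is a hypothetical example; R converts all entries to a common type rather than keeping a number next to a string.

```r
# Mixing numbers and a string: every entry is converted to character
mixed <- c(1, "a", 3)
mixed
#> [1] "1" "a" "3"
class(mixed)
#> [1] "character"
```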

Another important type of vector is the logical vector. Its entries must be either TRUE or FALSE.

z <- 3 == 2
z
#> [1] FALSE
class(z)
#> [1] "logical"

Here the == is a relational operator asking if 3 is equal to 2. In R, if you just use one =, you actually assign a variable, but if you use two == you test for equality.
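
The difference between the two can be sketched as follows, with x as an arbitrary example variable:

```r
x = 3      # one =: assigns the value 3 to x (same effect as x <- 3)
x == 3     # two ==: asks whether x is equal to 3
#> [1] TRUE
x == 2
#> [1] FALSE
```

Typing one = when a comparison was intended silently overwrites the variable, so it is worth double-checking.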

You can see the other relational operators by typing:

?Comparison

In future sections, you will see how useful relational operators can be.

We discuss more important features of vectors after the next set of exercises.

Advanced: Mathematically, the values in pop are integers and there is an integer class in R. However, by default, numbers are assigned class numeric even when they are round integers. For example, class(1) returns numeric. You can turn them into class integer with the as.integer() function or by adding an L like this: 1L. Note the class by typing: class(1L)
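
The calls mentioned in this note can be tried directly:

```r
class(1)
#> [1] "numeric"
class(1L)
#> [1] "integer"
class(as.integer(3))
#> [1] "integer"
# 1 and 1L hold the same value even though their classes differ
1 == 1L
#> [1] TRUE
```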

Factors

In the murders dataset, we might expect the region to also be a character vector. However, it is not:

class(murders$region)
#> [1] "factor"

It is a factor. Factors are useful for storing categorical data. We can see that there are only 4 regions by using the levels function:

levels(murders$region)
#> [1] "Northeast"     "South"         "North Central" "West"

In the background, R stores these levels as integers and keeps a map to keep track of the labels. This is more memory efficient than storing all the characters.
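
The integer codes can be inspected with as.integer. The factor size below is a hypothetical toy example, not taken from the murders dataset.

```r
# Hypothetical toy factor
size <- factor(c("small", "large", "small", "medium"))
levels(size)        # the labels, in alphabetical order by default
#> [1] "large"  "medium" "small"
as.integer(size)    # the integer code stored for each entry
#> [1] 3 1 3 2
```

Each entry is stored as the position of its label among the levels, so "small" becomes 3 and "large" becomes 1.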

Note that the levels have an order that is different from the order of appearance in the factor object. The default in R is for the levels to follow alphabetical order. However, often we want the levels to follow a different order. You can specify an order through the levels argument when creating the factor with the factor function. For example, in the murders dataset regions are ordered from east to west. The function reorder lets us change the order of the levels of a factor variable based on a summary computed on a numeric vector. We will demonstrate this with a simple example, and will see more advanced ones in the Data Visualization part of the book.
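
Before turning to reorder, here is a minimal sketch of the levels argument mentioned above. The factor size is again a hypothetical toy example.

```r
# Specify the order of the levels explicitly when creating the factor
size <- factor(c("small", "large", "small", "medium"),
               levels = c("small", "medium", "large"))
levels(size)
#> [1] "small"  "medium" "large"
```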

Suppose we want the levels of region ordered by the total number of murders rather than alphabetically. If there are values associated with each level, we can use reorder and specify a data summary to determine the order. The following code takes the sum of the total murders in each region and reorders the factor according to these sums.

region <- murders$region
value <- murders$total
region <- reorder(region, value, FUN = sum)
levels(region)
#> [1] "Northeast"     "North Central" "West"          "South"

The new order agrees with the fact that the Northeast has the fewest murders and the South has the most.

Warning: Factors can be a source of confusion since sometimes they behave like characters and sometimes they do not. As a result, confusing factors and characters is a common source of bugs.
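
A typical instance of such a bug, sketched with a hypothetical factor f whose labels look like numbers:

```r
# Hypothetical factor whose labels happen to be digits
f <- factor(c("10", "20", "10"))
as.numeric(f)                  # returns the integer codes, not 10 and 20
#> [1] 1 2 1
as.numeric(as.character(f))    # converting to character first recovers the numbers
#> [1] 10 20 10
```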

Lists

Data frames are a special case of lists. We will cover lists in more detail later, but know that they are useful because you can store any combination of different types. Below is an example of a list we created for you:

record
#> $name
#> [1] "John Doe"
#>
#> $student_id
#> [1] 1234
#>
#> $grades
#> [1] 95 82 91 97 93
#>
#> $final_grade
#> [1] "A"
class(record)
#> [1] "list"

As with data frames, you can extract the components of a list with the accessor $:

record$student_id
#> [1] 1234

We can also use double square brackets ([[) like this:

record[["student_id"]]
#> [1] 1234

You should get used to the fact that in R, there are often several ways to do the same thing, such as accessing entries.
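
The list record is not defined in this section. One way such a list could be built is sketched below; the values are copied from the printed output above, and the call itself is an assumption about how the object was created.

```r
# A sketch of how a list like record could be created (assumed, not the original code)
record <- list(name = "John Doe",
               student_id = 1234,
               grades = c(95, 82, 91, 97, 93),
               final_grade = "A")
names(record)
#> [1] "name"        "student_id"  "grades"      "final_grade"
# Both accessors return the same entry
identical(record$student_id, record[["student_id"]])
#> [1] TRUE
```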

You might also encounter lists without variable names.

record2
#> [[1]]
#> [1] "John Doe"
#>
#> [[2]]
#> [1] 1234

If a list does not have names, you cannot extract the elements with $, but you can still use the brackets method and instead of providing the variable name, you provide the list index, like this:

record2[[1]]
#> [1] "John Doe"

We won’t be using lists until later, but you might encounter one in your own exploration of R. For this reason, we show you some basics here.
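
For completeness, a list without names such as record2 could be created as in the following sketch. As before, the call is an assumption based on the printed output.

```r
# A sketch of an unnamed list (assumed, not the original code)
record2 <- list("John Doe", 1234)
names(record2)    # there are no names to use with $
#> NULL
record2[[2]]      # so entries are accessed by position
#> [1] 1234
```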

Matrices

Matrices are another type of object that is common in R. Matrices are similar to data frames in that they are two-dimensional: they have rows and columns. However, like numeric, character and logical vectors, entries in matrices have to be all the same type. For this reason data frames are much more useful for storing data, since we can have characters, factors, and numbers in them.
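
The same-type restriction can be seen in a small sketch. The objects m and d are hypothetical: binding a numeric column and a character column into a matrix turns everything into characters, while a data frame keeps each column's type.

```r
# In a matrix, mixing types converts every entry to character
m <- cbind(c(1, 2), c("a", "b"))
class(m[1, 1])
#> [1] "character"
# In a data frame, each column keeps its own type
d <- data.frame(num = c(1, 2), chr = c("a", "b"), stringsAsFactors = FALSE)
class(d$num)
#> [1] "numeric"
class(d$chr)
#> [1] "character"
```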

Yet matrices have a major advantage over data frames: we can perform matrix algebra operations, a powerful type of mathematical technique. We do not describe these operations in this book, but much of what happens in the background when you perform a data analysis involves matrices. Matrices are covered in more detail in Chapter 33.1 of the course by Prof Irizarry¹, but we describe them briefly here since some of the functions we will learn return matrices.

We can define a matrix using the matrix function. We need to specify the number of rows and columns.

mat <- matrix(1:12, 4, 3)
mat
#>      [,1] [,2] [,3]
#> [1,]    1    5    9
#> [2,]    2    6   10
#> [3,]    3    7   11
#> [4,]    4    8   12
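
Note that the matrix above is filled column by column. If a row-by-row fill is wanted, the byrow argument of matrix can be set, as in this sketch (the name mat_by_row is only illustrative):

```r
# Fill the matrix row by row instead of column by column
mat_by_row <- matrix(1:12, 4, 3, byrow = TRUE)
mat_by_row
#>      [,1] [,2] [,3]
#> [1,]    1    2    3
#> [2,]    4    5    6
#> [3,]    7    8    9
#> [4,]   10   11   12
```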

You can access specific entries in a matrix using square brackets ([). If you want the second row, third column, you use:

mat[2, 3]
#> [1] 10

If you want the entire second row, you leave the column spot empty:

mat[2, ]
#> [1]  2  6 10

Notice that this returns a vector, not a matrix.

Similarly, if you want the entire third column, you leave the row spot empty:

mat[, 3]
#> [1]  9 10 11 12

This is also a vector, not a matrix.
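
If a one-row or one-column result should remain a matrix, the drop argument of the bracket operator can be set to FALSE. A minimal sketch, with row2 as an illustrative name:

```r
mat <- matrix(1:12, 4, 3)
# drop = FALSE keeps the two-dimensional structure
row2 <- mat[2, , drop = FALSE]
row2
#>      [,1] [,2] [,3]
#> [1,]    2    6   10
dim(row2)
#> [1] 1 3
is.matrix(mat[2, ])   # without drop = FALSE the result is a plain vector
#> [1] FALSE
```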

You can access more than one column or more than one row if you like. This will give you a new matrix.

mat[, 2:3]
#>      [,1] [,2]
#> [1,]    5    9
#> [2,]    6   10
#> [3,]    7   11
#> [4,]    8   12

You can subset both rows and columns:

mat[1:2, 2:3]
#>      [,1] [,2]
#> [1,]    5    9
#> [2,]    6   10

We can convert matrices into data frames using the function as.data.frame:

as.data.frame(mat)
#>   V1 V2 V3
#> 1  1  5  9
#> 2  2  6 10
#> 3  3  7 11
#> 4  4  8 12

You can also use single square brackets ([) to access rows and columns of a data frame:

data("murders")
murders[25, 1]
#> [1] "Mississippi"
murders[2:3, ]
#>     state abb region population total
#> 2  Alaska  AK   West     710231    19
#> 3 Arizona  AZ   West    6392017   232
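
Columns of a data frame can also be selected by name inside the brackets. The sketch below uses a hypothetical toy data frame rather than murders so that it runs without any extra package.

```r
# Hypothetical toy data standing in for a table like murders
toy <- data.frame(
  state = c("A", "B", "C"),
  total = c(135, 19, 232),
  stringsAsFactors = FALSE
)
toy[2, "total"]    # second row, column named total
#> [1] 19
# Three equivalent ways of extracting a whole column
identical(toy[, "state"], toy$state)
#> [1] TRUE
identical(toy[["state"]], toy$state)
#> [1] TRUE
```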