Before we get started with the motivating dataset, we need to cover the very basics of R.
Suppose a high school student asks us for help solving several quadratic equations of the form \(ax^2+bx+c = 0\). The quadratic formula gives us the solutions:
\(\frac{-b - \sqrt{b^2 - 4ac}}{2a}\,\, \mbox{ and } \frac{-b + \sqrt{b^2 - 4ac}}{2a}\) which of course change depending on the values of \(a\), \(b\), and \(c\). One advantage of programming languages is that we can define variables and write expressions with these variables, similar to how we do so in math, but obtain a numeric solution. We will write out general code for the quadratic equation below, but if we are asked to solve \(x^2 + x -1 = 0\), then we define:
a <- 1
b <- 1
c <- -1
which stores the values for later use. We use <- to assign values to
the variables.
We can also assign values using = instead of <-, but we recommend not using = to avoid confusion.
Copy and paste the code above into your console to define the three variables. Note that R does not print anything when we make this assignment. This means the objects were defined successfully. Had you made a mistake, you would have received an error message.
To see the value stored in a variable, we simply ask R to evaluate a
and it shows the stored value:
a
#> [1] 1
A more explicit way to ask R to show us the value stored in a is using
print like this:
print(a)
#> [1] 1
We use the term object to describe anything that is stored in R. Variables are examples, but objects can also be more complicated entities such as functions, which are described later.
As we define objects in the console, we are actually changing the workspace. You can see all the variables saved in your workspace by typing:
ls()
#> [1] "a"        "b"        "c"        "dat"      "img_path" "murders"
In RStudio, the Environment tab shows the values of the objects in the workspace.

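We should see a, b, and c, along with any other objects already
defined in the session (the output above also lists dat, img_path, and
murders). If you try to recover the value of a variable that is not in
your workspace, you receive an error. For example, if you type x you
will receive the following message: Error: object 'x' not found.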
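If it is unclear whether a name is already defined, one option is to check before using it. Below is a minimal sketch using the base R function exists. The name undefined_object_example is hypothetical, chosen only because it is unlikely to be in any workspace:

```r
# Define one object so there is something to find
a <- 1

# exists() takes the object name as a character string
a_is_defined <- exists("a")
missing_is_defined <- exists("undefined_object_example")

a_is_defined        # TRUE, because a was just defined
missing_is_defined  # FALSE, assuming no object has this name
```

Unlike typing the name directly, exists returns FALSE rather than stopping with an error.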
Now since these values are saved in variables, to obtain a solution to our equation, we use the quadratic formula:
(-b + sqrt(b^2 - 4*a*c) ) / ( 2*a )
#> [1] 0.618
(-b - sqrt(b^2 - 4*a*c) ) / ( 2*a )
#> [1] -1.62
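As a sanity check, we can substitute each solution back into \(ax^2+bx+c\) and confirm the result is essentially zero. This is only a sketch. The names root_1, root_2, check_1, and check_2 are illustrative, and the values are compared with a small tolerance because computer arithmetic is not exact:

```r
# Coefficients of x^2 + x - 1 = 0
a <- 1
b <- 1
c <- -1

# The two solutions given by the quadratic formula
root_1 <- (-b + sqrt(b^2 - 4*a*c)) / (2*a)
root_2 <- (-b - sqrt(b^2 - 4*a*c)) / (2*a)

# Substitute each solution back into the left-hand side of the equation
check_1 <- a*root_1^2 + b*root_1 + c
check_2 <- a*root_2^2 + b*root_2 + c

# Both should be within rounding error of zero
abs(check_1) < 1e-8
abs(check_2) < 1e-8
```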
Once you define variables, the data analysis process can usually be described as a series of functions applied to the data. R includes several predefined functions and most of the analysis pipelines we construct make extensive use of these.
We already used the install.packages, library, and ls functions.
We also used the function sqrt to solve the quadratic equation above.
There are many more prebuilt functions and even more can be added
through packages. These functions do not appear in the workspace because
you did not define them, but they are available for immediate use.
In general, we need to use parentheses to evaluate a function. If you
type ls, the function is not evaluated and instead R shows you the
code that defines the function. If you type ls() the function is
evaluated and, as seen above, we see objects in the workspace.
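The difference between the function itself and the result of evaluating it can be checked directly. In this small sketch, workspace_names is a name we introduce for illustration:

```r
# Without parentheses, ls refers to the function object itself
is.function(ls)
class(ls)

# With parentheses, the function is evaluated and returns the object names
workspace_names <- ls()
is.character(workspace_names)
```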
Unlike ls, most functions require one or more arguments. Below is an
example of how we assign an object to the argument of the function
log. Remember that we earlier defined a to be 1:
log(8)
#> [1] 2.08
log(a)
#> [1] 0
You can find out what the function expects and what it does by reviewing
the very useful manuals included in R. You can get help by using the
help function like this:
help("log")
For most functions, we can also use this shorthand:
?log
The help page will show you what arguments the function is expecting.
For example, log takes the arguments x and base. Some arguments
are required and others are optional. You can determine which arguments
are optional by noting in the help document that a default value is
assigned with =. Defining these is optional. For example, the base of
the function log defaults to base = exp(1), making log the natural
log by default.
If you want a quick look at the arguments without opening the help system, you can type:
args(log)
#> function (x, base = exp(1))
#> NULL
You can change the default values by simply assigning another object:
log(8, base = 2)
#> [1] 3
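To confirm that leaving out base is the same as passing the default, we can compare the two calls. A short sketch follows. The names natural_log and explicit_log are ours, and all.equal is used so that the comparison tolerates tiny rounding differences:

```r
# Omitting base uses the default, exp(1)
natural_log <- log(8)

# Passing the default explicitly should give the same answer
explicit_log <- log(8, base = exp(1))
all.equal(natural_log, explicit_log)

# A different base changes the result: 10^2 is 100
log(100, base = 10)
```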
Note that we have not been specifying the argument x as such:
log(x = 8, base = 2)
#> [1] 3
The above code works, but we can save ourselves some typing: if no
argument name is used, R assumes you are entering arguments in the order
shown in the help file or by args. So by not using the names, it
assumes the arguments are x followed by base:
log(8,2)
#> [1] 3
If using the arguments’ names, then we can include them in whatever order we want:
log(base = 2, x = 8)
#> [1] 3
To specify arguments, we must use =, and cannot use <-.
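One way to see why <- does not work for this purpose: inside a function call, R treats it as an ordinary assignment rather than as a way of naming an argument. The sketch below illustrates this. The names my_value and result are hypothetical:

```r
# With =, the names refer to arguments of log and nothing new is created
log(base = 2, x = 8)

# With <-, R first creates an object called my_value in the workspace,
# then passes its value by position, here as the first argument x
result <- log(my_value <- 8, base = 2)

result    # 3, the log base 2 of 8
my_value  # 8, an object that now exists in the workspace
```

The call still runs, but it leaves behind an object we did not intend to create.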
There are some exceptions to the rule that functions need the parentheses to be evaluated. Among these, the most commonly used are the arithmetic and relational operators. For example:
2 ^ 3
#> [1] 8
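These operators are still functions behind the scenes. As a sketch, wrapping the operator in backticks lets us call it with parentheses like any other function. The result names below are ours:

```r
# Operator notation
2 ^ 3

# The same operators called as regular functions
power_result <- `^`(2, 3)
sum_result <- `+`(2, 3)
greater_result <- `>`(2, 3)

power_result    # 8
sum_result      # 5
greater_result  # FALSE, since 2 is not greater than 3
```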
You can see the arithmetic operators by typing:
help("+")
or
?"+"
and the relational operators by typing:
help(">")
or
?">"
There are several datasets that are included for users to practice and test out functions. You can see all the available datasets by typing:
data()
This shows you the object name for these datasets. These datasets are objects that can be used by simply typing the name. For example, if you type:
co2
R will show you Mauna Loa atmospheric CO2 concentration data.
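Because the full dataset is long, it can help to inspect it with a few functions instead of printing all of it. The sketch below assumes the datasets package that ships with R is attached, which is the default:

```r
# What kind of object is it?
class(co2)

# How many observations does it hold?
length(co2)

# Show only the first few values
head(co2)
```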
Other prebuilt objects are mathematical quantities, such as the constant \(\pi\) and \(\infty\):
pi
#> [1] 3.14
Inf+1
#> [1] Inf
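A few more expressions show how Inf behaves in arithmetic. This short sketch uses only base R:

```r
1 / 0           # Inf
-1 / 0          # -Inf
is.finite(pi)   # TRUE
is.finite(Inf)  # FALSE
Inf - Inf       # NaN, because the difference is undefined
```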
We have used the letters a, b, and c as variable names, but
variable names can be almost anything. Some basic rules in R are that
variable names have to start with a letter, can’t contain spaces, and
should not reuse names that are predefined in R. For example, don’t
name one of your variables install.packages by typing something like
install.packages <- 2.
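To see what goes wrong when a predefined name is reused, here is a sketch using the constant pi. The names pi_before, pi_masked, and pi_after are ours, introduced only to record each step:

```r
pi_before <- pi  # the built-in value

pi <- 3          # our own object named pi now hides the built-in one
pi_masked <- pi

rm(pi)           # removing our object makes the built-in value visible again
pi_after <- pi

pi_masked                       # 3
identical(pi_before, pi_after)  # TRUE
```

The built-in value is never destroyed, but any code run while our own pi exists would silently use 3.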
A nice convention to follow is to use meaningful words that describe what is stored, use only lower case, and use underscores as a substitute for spaces. For the quadratic equations, we could use something like this:
solution_1 <- (-b + sqrt(b^2 - 4*a*c)) / (2*a)
solution_2 <- (-b - sqrt(b^2 - 4*a*c)) / (2*a)
For more advice, we highly recommend studying Hadley Wickham’s style guide.
Values remain in the workspace until you end your session or erase them
with the function rm. But workspaces also can be saved for later use.
In fact, when you quit R, the program asks you if you want to save your
workspace. If you do save it, the next time you start R, the program
will restore the workspace.
We actually recommend against saving the workspace this way because, as
you start working on different projects, it will become harder to keep
track of what is saved. Instead, we recommend you assign the workspace a
specific name. You can do this by using the function save or
save.image. To load, use the function load. When saving a workspace,
we recommend the suffix rda or RData. In RStudio, you can also do
this by opening the Session menu and choosing Save Workspace
As. You can later load it using the Load Workspace option in the
same menu. You can read the help pages on save, save.image, and
load to learn more.
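A minimal sketch of saving and restoring named objects follows. The file name quadratic_example.rda is hypothetical, and we write it to R's temporary directory so the example does not leave files in your project:

```r
# Objects we want to keep
a <- 1
b <- 1
c <- -1

# Choose a file name with the rda suffix (hypothetical name, temporary folder)
workspace_file <- file.path(tempdir(), "quadratic_example.rda")

# save stores only the objects we list; save.image would store everything
save(a, b, c, file = workspace_file)

# Remove the objects, then restore them from the file
rm(a, b, c)
load(workspace_file)

a
b
c
```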
To solve another equation such as \(3x^2 + 2x - 1 = 0\), we can copy and paste the code above and then redefine the variables and recompute the solution:
a <- 3
b <- 2
c <- -1
(-b + sqrt(b^2 - 4*a*c)) / (2*a)
(-b - sqrt(b^2 - 4*a*c)) / (2*a)
By creating and saving a script with the code above, we would not need to retype everything each time and could instead simply change the values assigned to the variables. Try writing the script above into an editor and notice how easy it is to change the values and receive an answer.
If a line of R code starts with the symbol #, it is not evaluated. We
can use this to write reminders of why we wrote particular code. For
example, in the script above we could add:
## Code to compute solutions to quadratic equation of the form ax^2 + bx + c = 0
## define the variables
a <- 3
b <- 2
c <- -1
## now compute the solution
(-b + sqrt(b^2 - 4*a*c)) / (2*a)
(-b - sqrt(b^2 - 4*a*c)) / (2*a)
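Once a script is saved in a file, one way to run all of it at once is the base R function source. The sketch below writes a version of the script to a temporary file and then runs it. The file name quadratic_example.R and the names solution_1 and solution_2 are our choices for illustration:

```r
# Write the script to a file (hypothetical name, temporary folder)
script_file <- file.path(tempdir(), "quadratic_example.R")
writeLines(c(
  "## define the variables",
  "a <- 3",
  "b <- 2",
  "c <- -1",
  "## now compute the solutions",
  "solution_1 <- (-b + sqrt(b^2 - 4*a*c)) / (2*a)",
  "solution_2 <- (-b - sqrt(b^2 - 4*a*c)) / (2*a)"
), script_file)

# Run every line of the script in order
source(script_file)

solution_1  # 1/3
solution_2  # -1
```

Storing the results in named objects, rather than only printing them, makes them available after the script finishes.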