Getting Started in R and RStudio

# ==============================
# Lab 1: Getting Started in R and RStudio
#  - Steven V. Miller (EH 1903)
# ==============================
#    \
#      \
#        \
#     /\-/\
#    /a a  \                     _
#   =\ Y  =/-~~~~~~-,___________/ )
#     ‛^--‛          ___________/ /
#      \           /
#      ||  |---‛\  \
#     (_(__|   ((__|

Notice the pound sign/hashtag (#) starting every line? That starts a comment. Lots of programming languages have comment tags. In HTML, for example, it’s <!-- comment -->. In CSS, it’s /* comment */. In LaTeX, it’s a percentage sign (%). In R, it’s this character. Make liberal use of it, as it allows you to leave comments to yourself and explain what you’re doing. Every line that starts with it is ignored when R executes the script. Sometimes you want that, especially when you’re having to explain yourself to your future self (or to new students).

If you’re using RStudio, might I recommend the following cosmetic change? Go to Tools -> Global Options and, in the pop-up, select “Pane Layout”. Rearrange it so that “Source” is top left, “Console” is top right, and the files/plots/packages/etc. pane is the bottom right. Thereafter: apply the changes. You should see something like what I have. The bottom left pane, the one with the Environment and History tabs, can be minimized by pressing the minimize button near its top right (it’ll look like a minus sign). This isn’t mandatory, but it makes the best use of space for how you’ll end up using RStudio. Now, let’s get started.

The syllabus prompted you to install some R packages, and the hope is you already have. This particular function will effectively force you to install the packages if you have not already. Let it also be a preview of what a function looks like in the R programming language. I won’t belabor the specifics of what each line of this function is doing, but I want you to use your mouse/trackpad and click on the first line (assuming you are using RStudio). Wait for your cursor to blink. Then, for you Mac users, hit Cmd-Enter. For you Windows or Linux users: Ctrl-Enter. Congratulations! You just defined your first function in R and loaded your first “object” in R. If you re-open the “Environment” pane, you’ll see it listed there as a user-defined function.

if_not_install <- function(packages) {
  # identify which of the requested packages are not yet installed...
  new_pack <- packages[!(packages %in% installed.packages()[,"Package"])]
  # ...and install just those, if there are any
  if(length(new_pack)) install.packages(new_pack)
}

This hasn’t done anything yet. It’s just defined a function that will do something once you’ve run it. Let’s run it now. Same as before, move your cursor over the piece of code you see below. Click on it. Now hit Cmd-Enter or Ctrl-Enter on your keyboard.

if_not_install(c("tidyverse","stevedata","stevemisc", 
                 "stevethemes", "stevetemplates"))

In my case: this did nothing. Ideally in your case it did nothing too. That would be because you already have these packages installed. If you don’t have one or more of these packages installed, it will install them. I’m going to load {tidyverse} because I’m going to use it downstream in this script.

library(tidyverse)
#> ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
#> ✔ dplyr     1.1.2     ✔ readr     2.1.4
#> ✔ forcats   1.0.0     ✔ stringr   1.5.0
#> ✔ ggplot2   3.5.0     ✔ tibble    3.2.1
#> ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
#> ✔ purrr     1.0.2     
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag()    masks stats::lag()
#> ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Get Acclimated in R

Now that you’ve done that, let’s get a general sense of where you are in an R session.

Current Working Directory

First, let’s identify the current working directory. You should always know where you are in your file system; this happens to be where I am, given the location of this script.

getwd()
#> [1] "/home/steve/Dropbox/teaching/eh1903-ir3/2/scripts"

Of note: by default, R’s working directory is the system’s “home” directory. This is somewhat straightforward in Unix-derivative systems, where there is an outright “home” directory. Assume your username is “steve”: in Linux, your home directory will be “/home/steve”, and on a Mac it will be “/Users/steve”. Windows users will invariably have something clumsy like “C:/Users/steve/Documents”. Notice the forward slashes. R, like almost everything else in the world, uses forward slashes; the backslashes you may know from Windows owe to Windows’ derivation from DOS.
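For what it’s worth, if you want R to work somewhere else, setwd() changes the working directory for the current session. Here’s a minimal sketch; the path below is hypothetical, so substitute a directory that actually exists on your computer.

setwd("/home/steve/Dropbox/teaching") # hypothetical path; use one that exists for you
getwd() # verify the change took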

Create “Objects” in the Environment

One thing you’ll need to get comfortable doing in most statistical programming applications, but certainly this one, is creating objects in your working environment. R has very few built-in, so-called “internal” objects. For example, here’s one.

pi
#> [1] 3.141593

Most things you have to create and assign yourself. For example, let’s assume we wanted to create a character vector of the Nordic countries using their three-character ISO codes. It would be something like this.

c("SWE", "FIN", "NOR", "ISL", "DNK")
#> [1] "SWE" "FIN" "NOR" "ISL" "DNK"

If we run this, we get a character vector corresponding to those countries’ ISO codes. However, it’s basically lost to history. It doesn’t exist in the environment because we did not assign it to anything. In R, you have to “assign” the objects you create to something you name if you want to keep returning to it. It’d go something like this.

Norden <- c("SWE", "FIN", "NOR", "ISL", "DNK")
Norden
#> [1] "SWE" "FIN" "NOR" "ISL" "DNK"

You can go nuts here with object assignment and the world is truly your oyster. You have multiple assignment mechanisms here too. In many basic applications, something like this is equivalent.

c("SWE", "FIN", "NOR", "ISL", "DNK") -> Norden
Norden = c("SWE", "FIN", "NOR", "ISL", "DNK")

FWIW, the equal sign is one you want to be careful with, as it’s not how R wants you to think about assignment. I encourage you to get comfortable with arrow assignment, using <- or ->.

Some caution, though. First, don’t create objects with really complex names. Calling them back requires getting every character right in the console or script. Why inconvenience yourself? Second, R comes with some default objects that are kinda important and that you can seriously ruin things with downstream. TRUE and FALSE are reserved words you cannot overwrite, but their shorthands T and F are ordinary objects that you can (and DO NOT want to) overwrite. pi is another one you should not overwrite, and data is a function that serves a specific purpose (even if you probably won’t be using it a whole lot). You can, however, assign some built-in objects to new objects.

this_Is_a_long_AND_WEIRD_objEct_name_and_yOu_shoUld_not_do_this <- 5
pi # notice there are a few built-in functions/objects
#> [1] 3.141593
d <- pi # you can assign one built-in object to a new object.
# pi <- 3.14 # don't do this....

If you do something dumb (like overwrite T with something), all hope is not lost. Remove the object in question (e.g. rm(T)). Alternatively, restart R and you’ll reclaim the built-in behavior of whatever you overwrote.
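Here’s rm() in action on a harmless throwaway object, just so you can see what it does.

x <- 5      # create a throwaway object...
rm(x)       # ...and remove it from the environment
exists("x") # is there still an object named "x" anywhere?
#> [1] FALSE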

Load Data

Problem sets and lab scripts will lean on data I make available in {stevedata}, or have available for you in Athena. However, you may often find that you want to download a data set from somewhere else and load it into R. Example data sets would be stuff like the European Values Survey, the European Social Survey, or the Varieties of Democracy project, or whatever else. You can do this any number of ways, and it will depend on the file format you downloaded. Here are some commands you’ll want to learn for these circumstances:

read_csv()           # read comma-separated values (.csv) files, by way of {readr}
haven::read_dta()    # read Stata (.dta) files
haven::read_spss()   # read SPSS (.sav) files
readxl::read_excel() # read Excel (.xls, .xlsx) files

Notice that functions like read_dta(), read_spss(), and read_excel() come from two packages, {haven} and {readxl}, that are installed as part of the {tidyverse} but are not loaded by library(tidyverse). Under these conditions, you can avoid directly loading a library into a session by referencing it first and grabbing the function you want from within it, separated by two colons (::). Basically, haven::read_dta() could be interpreted as a command saying “using the {haven} library, grab the read_dta() command in it”.

These wrappers are also flexible with files on the internet. For example, this will work. Just remember to assign what you load to an object.

# Note: hypothetical data
# Source: https://stats.oarc.ucla.edu/stata/dae/ordered-logistic-regression/
Apply <- haven::read_dta("https://stats.idre.ucla.edu/stat/data/ologit.dta")

Because we loaded these data and assigned them to an object, we can ask for the object using default methods available in R and look at what we just loaded.

Apply
#> # A tibble: 400 × 4
#>    apply               pared public   gpa
#>    <dbl+lbl>           <dbl>  <dbl> <dbl>
#>  1 2 [very likely]         0      0  3.26
#>  2 1 [somewhat likely]     1      0  3.21
#>  3 0 [unlikely]            1      1  3.94
#>  4 1 [somewhat likely]     0      0  2.81
#>  5 1 [somewhat likely]     0      0  2.53
#>  6 0 [unlikely]            0      1  2.59
#>  7 1 [somewhat likely]     0      0  2.56
#>  8 1 [somewhat likely]     0      0  2.73
#>  9 0 [unlikely]            0      0  3   
#> 10 1 [somewhat likely]     1      0  3.5 
#> # ℹ 390 more rows

The “tibble” output tells us something about our data. We can observe that there are 400 observations (or rows, if you will) and four columns in the data. Incidentally, because we loaded a Stata .dta file, there is value label information attached to the columns. So, the apply variable is just 0, 1, and 2, but 0 means “unlikely”, 1 means “somewhat likely”, and 2 means “very likely”. That’s situationally useful, especially for beginners, but we’re going to ignore it for now.
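If you ever do want those labels as actual values, {haven} has an as_factor() function that converts a labelled column into a factor. A quick sketch, using the same :: trick from above:

haven::as_factor(Apply$apply)[1:5] # convert the labelled values to a factor; peek at the first five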

There are other ways to find the dimension of the data set (i.e. rows and columns). For example, you can ask for the dimensions of the object itself.

dim(Apply)
#> [1] 400   4

Convention is rows-columns, so the first element in this numeric vector tells us there are 400 rows and the second tells us there are four columns. You can also do this.

nrow(Apply)
#> [1] 400
ncol(Apply)
#> [1] 4

Learn Some Important R/“Tidy” Functions

I want to spend most of our time in this lab session teaching you some basic commands you should know to do basically anything in R. These are so-called “tidy” verbs. We’ll be using the Apply data that we just loaded from the internet, and I want to dedicate the bulk of this section to learning some core functions that are part of the {tidyverse}. My introduction here will inevitably be incomplete because there’s only so much I can teach within the limited time I have. That said, I’m going to focus on the following functions available in the {tidyverse} that totally rethink base R. These are the “pipe” (%>%), glimpse() and summary(), select(), grouping (whether with group_by() or the newer .by argument), summarize(), mutate(), and filter().

The Pipe (%>%)

I want to start with the pipe because I think of it as the most important function in the {tidyverse}. The pipe—represented as %>%—allows you to chain together a series of functions. The pipe is especially useful if you’re recoding data and you want to make sure you got everything the way you wanted (and correct) before assigning the data to another object. You can chain together a lot of {tidyverse} commands with pipes, but we’ll keep our introduction here rather minimal because I want to use it to teach about some other things.
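To make this concrete, here’s a trivial sketch. Each step hands its output to the next function in the chain.

Apply %>%       # start with the data...
  na.omit() %>% # ...drop incomplete rows (a no-op here; these data have no missingness)...
  nrow()        # ...and then count the rows
#> [1] 400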

glimpse() and summary()

glimpse() and summary() will get you basic descriptions of your data. Personally, I find summary() more informative than glimpse(), though glimpse() is useful if your data have a lot of variables and you want to just peek into the data without spamming the R console with output.

Notice, here, the introduction of the pipe (%>%). In the commands below, Apply %>% glimpse() is equivalent to glimpse(Apply), but I like to lean more on pipes than perhaps others would. My workflow starts with (data) objects, applies various functions to them, and assigns them to objects. I think you’ll get a lot of mileage thinking that same way too.

Apply %>% glimpse() # notice the pipe
#> Rows: 400
#> Columns: 4
#> $ apply  <dbl+lbl> 2, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 2, 1, 0, 0, 0, 0, 2, 1,…
#> $ pared  <dbl> 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, …
#> $ public <dbl> 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, …
#> $ gpa    <dbl> 3.26, 3.21, 3.94, 2.81, 2.53, 2.59, 2.56, 2.73, 3.00, 3.50, 3.6…
Apply %>% summary()
#>      apply          pared            public            gpa       
#>  Min.   :0.00   Min.   :0.0000   Min.   :0.0000   Min.   :1.900  
#>  1st Qu.:0.00   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:2.720  
#>  Median :0.00   Median :0.0000   Median :0.0000   Median :2.990  
#>  Mean   :0.55   Mean   :0.1575   Mean   :0.1425   Mean   :2.999  
#>  3rd Qu.:1.00   3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:3.270  
#>  Max.   :2.00   Max.   :1.0000   Max.   :1.0000   Max.   :4.000

Of note: the summary function (alternatively summary(Apply)) gives you basic descriptive statistics. You can see the mean and median, which are the statistics of central tendency we routinely care about. In these hypothetical data, we can see that the mean of the public variable is .1425. Because this is a dummy variable, that mean tells us that 14.25% of the observations are 1. We can also see that the mean GPA is 2.999 and the median is 2.99. The proximity of the two gives us some preliminary insight that there isn’t a major distribution/skew problem here.
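You can verify that proportion yourself by grabbing the column directly and taking its mean.

mean(Apply$public) # the mean of a 0/1 dummy is the proportion of 1s
#> [1] 0.1425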

select()

select() is useful for basic (but important) data management. You can use it to grab (or omit) columns from data. For example, let’s say I wanted to grab all the columns in the data. I could do that with the following command.

Apply %>% select(everything())  # grab everything
#> # A tibble: 400 × 4
#>    apply               pared public   gpa
#>    <dbl+lbl>           <dbl>  <dbl> <dbl>
#>  1 2 [very likely]         0      0  3.26
#>  2 1 [somewhat likely]     1      0  3.21
#>  3 0 [unlikely]            1      1  3.94
#>  4 1 [somewhat likely]     0      0  2.81
#>  5 1 [somewhat likely]     0      0  2.53
#>  6 0 [unlikely]            0      1  2.59
#>  7 1 [somewhat likely]     0      0  2.56
#>  8 1 [somewhat likely]     0      0  2.73
#>  9 0 [unlikely]            0      0  3   
#> 10 1 [somewhat likely]     1      0  3.5 
#> # ℹ 390 more rows

Do note this is kind of a redundant command. You could just as well spit the entire data into the console and it would’ve done the same thing. Still, here’s what it looks like if I wanted everything except the public variable.

Apply %>% select(-public) # grab everything, but drop the public variable.
#> # A tibble: 400 × 3
#>    apply               pared   gpa
#>    <dbl+lbl>           <dbl> <dbl>
#>  1 2 [very likely]         0  3.26
#>  2 1 [somewhat likely]     1  3.21
#>  3 0 [unlikely]            1  3.94
#>  4 1 [somewhat likely]     0  2.81
#>  5 1 [somewhat likely]     0  2.53
#>  6 0 [unlikely]            0  2.59
#>  7 1 [somewhat likely]     0  2.56
#>  8 1 [somewhat likely]     0  2.73
#>  9 0 [unlikely]            0  3   
#> 10 1 [somewhat likely]     1  3.5 
#> # ℹ 390 more rows

Here’s a more typical case. Assume you’re working with a large data object and you just want a handful of things. In this case, we have these four variables, but we want just the first three columns and want to drop everything else. Here’s how we’d do that in the select() function, again with some assistance from the pipe.

Apply %>% select(apply:public) # grab just these three columns.
#> # A tibble: 400 × 3
#>    apply               pared public
#>    <dbl+lbl>           <dbl>  <dbl>
#>  1 2 [very likely]         0      0
#>  2 1 [somewhat likely]     1      0
#>  3 0 [unlikely]            1      1
#>  4 1 [somewhat likely]     0      0
#>  5 1 [somewhat likely]     0      0
#>  6 0 [unlikely]            0      1
#>  7 1 [somewhat likely]     0      0
#>  8 1 [somewhat likely]     0      0
#>  9 0 [unlikely]            0      0
#> 10 1 [somewhat likely]     1      0
#> # ℹ 390 more rows
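You can also just name the columns you want, which is probably the most common use of select().

Apply %>% select(apply, gpa) # grab just these two columns, by name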

Grouped functions using .by arguments

I think the pipe is probably the most important function in the {tidyverse}, even as a critical reader might note that the pipe is 1) a port from another package ({magrittr}) and 2) now part of base R in a slightly different form (the native |> operator). Thus, the critical reader (and probably me, depending on my mood) may note that grouped functions/arguments are probably the most important component of the {tidyverse}. It used to be group_by() that did this, but now most functions in the {tidyverse} have .by arguments that are arguably more efficient for this purpose. Basically, grouping the data, whether with the older group_by() or the .by argument, allows you to “split” the data into various subsets, “apply” various functions to them, and “combine” them into one output. You might see that “split-apply-combine” terminology as you learn more about the {tidyverse} and its development.

Here, let’s do a simple exercise with slice(). slice() lets you index rows by integer locations, which can be useful for peeking into the data or curating it (by doing something like removing duplicate observations). In this simple case, we’re going to slice the data by the first observation at each level of the apply variable.

# Notice we can chain some pipes together
Apply %>%
  # Get me the first observation, by group.
  slice(1, .by=apply)
#> # A tibble: 3 × 4
#>   apply               pared public   gpa
#>   <dbl+lbl>           <dbl>  <dbl> <dbl>
#> 1 2 [very likely]         0      0  3.26
#> 2 1 [somewhat likely]     1      0  3.21
#> 3 0 [unlikely]            1      1  3.94

# This is the older way of doing it. It still works, but the presentation of what
# it did slightly differs.

Apply %>%
  group_by(apply) %>%
  slice(1) %>%
  ungroup() # practice safe group_by()
#> # A tibble: 3 × 4
#>   apply               pared public   gpa
#>   <dbl+lbl>           <dbl>  <dbl> <dbl>
#> 1 0 [unlikely]            1      1  3.94
#> 2 1 [somewhat likely]     1      0  3.21
#> 3 2 [very likely]         0      0  3.26

Compare the above outputs with the full data.

Apply
#> # A tibble: 400 × 4
#>    apply               pared public   gpa
#>    <dbl+lbl>           <dbl>  <dbl> <dbl>
#>  1 2 [very likely]         0      0  3.26
#>  2 1 [somewhat likely]     1      0  3.21
#>  3 0 [unlikely]            1      1  3.94
#>  4 1 [somewhat likely]     0      0  2.81
#>  5 1 [somewhat likely]     0      0  2.53
#>  6 0 [unlikely]            0      1  2.59
#>  7 1 [somewhat likely]     0      0  2.56
#>  8 1 [somewhat likely]     0      0  2.73
#>  9 0 [unlikely]            0      0  3   
#> 10 1 [somewhat likely]     1      0  3.5 
#> # ℹ 390 more rows

If you don’t group the data first, slice(1) will just return the first observation in the data set.

Apply %>%
  # Get me the first observation for each value of the apply variable
  slice(1) # womp womp. Forgot to use the .by argument
#> # A tibble: 1 × 4
#>   apply           pared public   gpa
#>   <dbl+lbl>       <dbl>  <dbl> <dbl>
#> 1 2 [very likely]     0      0  3.26

I offer one caveat here. If you’re using the group_by() approach and applying a group-specific function that you need just once, it’s generally advisable to make ungroup() the next function in your pipe chain. As you build together chains/pipes, the intermediate output will advise you of any “groups” you’ve declared in your data. Don’t lose track of those.
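Here’s what that looks like if you forget. Notice the “Groups” line in the output; any subsequent function in the chain would still operate within those groups.

Apply %>%
  group_by(apply) %>%
  slice(1) # note: no ungroup()
#> # A tibble: 3 × 4
#> # Groups:   apply [3]
#>   apply               pared public   gpa
#>   <dbl+lbl>           <dbl>  <dbl> <dbl>
#> 1 0 [unlikely]            1      1  3.94
#> 2 1 [somewhat likely]     1      0  3.21
#> 3 2 [very likely]         0      0  3.26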

summarize()

summarize() creates condensed summaries of your data, for whatever it is that you want. Here, for example, is a kind of dumb way of seeing how many observations are in the data. nrow(Apply) works just as well, but alas…

Apply %>%
  # How many observations are in the data?
  summarize(n = n())
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1   400

# How many observations are there by levels of the apply variable?

Apply %>%
  summarize(n = n(), .by=apply)
#> # A tibble: 3 × 2
#>   apply                   n
#>   <dbl+lbl>           <int>
#> 1 2 [very likely]        40
#> 2 1 [somewhat likely]   140
#> 3 0 [unlikely]          220

# What you did, indirectly here, was find the mode of the apply variable. This is
# the most frequently occurring value in a variable, which is really only of interest
# to unordered-categorical or ordered-categorical variables. Statisticians really
# don't care about the mode, and it's why there is no real built-in function in base
# R that says "Here's the mode." You have to get it indirectly.
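If you ever want the mode more directly, you can chain together what you just learned: count the values, then keep the most frequent one. Here’s a sketch using two other {tidyverse} functions, count() and slice_max().

Apply %>%
  count(apply) %>%    # shorthand for summarize(n = n(), .by = apply)
  slice_max(n, n = 1) # keep the row(s) with the largest count (here: 0 [unlikely], n = 220)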

More importantly, summarize() works wonderfully on grouped data. For example, let’s get the mean and median GPA by each value of apply (.by = apply).

Apply %>%
  # Give me the average GPA by each value of `apply`
  summarize(mean_gpa = mean(gpa, na.rm = TRUE),
            median_gpa = median(gpa, na.rm = TRUE),
            .by = apply)
#> # A tibble: 3 × 3
#>   apply               mean_gpa median_gpa
#>   <dbl+lbl>              <dbl>      <dbl>
#> 1 2 [very likely]         3.15       3.13
#> 2 1 [somewhat likely]     3.03       3.02
#> 3 0 [unlikely]            2.95       2.98

One downside (or feature, depending on your perspective) to summarize() is that it condenses the data and discards everything that isn’t necessary for creating the condensed output. In the case above, we asked only for the mean and median GPA by each value of how likely the respondent is to apply for graduate school, so that’s all we got back.

mutate()

mutate() is probably the most important {tidyverse} function for data management/recoding. It will allow you to create new columns while retaining the original dimensions of the data. Consider it the sister function to summarize(). But, where summarize() discards, mutate() retains.

Let’s do something simple with mutate(). For example, we can create a new variable that numbers the rows of the data frame, from the first to the last. We can think of this as a kind of identifier variable. The first row is the first respondent, the second row is the second respondent, and so on.

Apply %>%
  mutate(id = 1:n()) %>%
  select(id, everything()) -> Apply

Again, the world is your oyster here. We can also recode that apply variable to be a dummy variable that equals 1 if and only if the respondent is very likely to apply to grad school.

Apply %>%
  mutate(vlapply = ifelse(apply == 2, 1, 0)) -> Apply

We can use the distinct() function (also in {tidyverse}) to see that it worked.

Apply %>% distinct(vlapply, apply)
#> # A tibble: 3 × 2
#>   vlapply apply              
#>     <dbl> <dbl+lbl>          
#> 1       1 2 [very likely]    
#> 2       0 1 [somewhat likely]
#> 3       0 0 [unlikely]
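For recodes with more than two categories, chaining ifelse() statements gets clunky in a hurry. The {tidyverse} has a case_when() function that scales better. Here’s a sketch; the apply_chr name is just something I made up for illustration.

Apply %>%
  mutate(apply_chr = case_when(
    apply == 0 ~ "unlikely",        # hypothetical character version of apply
    apply == 1 ~ "somewhat likely",
    apply == 2 ~ "very likely"
  )) %>%
  distinct(apply, apply_chr) # again, check that it worked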

filter()

filter() is a great diagnostic tool for subsetting your data to look at particular observations. Notice one little thing, especially if you’re new to programming. Double-equal signs (==) are for making logical statements, whereas single-equal signs (=) are for object assignment or column creation. If you’re using filter(), you’re probably wanting to find cases where something equals something (==), is greater than something (>), is greater than or equal to something (>=), is less than something (<), or is less than or equal to something (<=).
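Here’s a quick sketch of a few of these in action.

Apply %>% filter(pared == 1)          # just the rows where pared equals 1
Apply %>% filter(gpa >= 3.5)          # rows where GPA is 3.5 or higher
Apply %>% filter(apply == 2, gpa < 3) # conditions separated by commas are combined as "and"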

One benefit of the ID variable we created, though, is that it tells us exactly which respondents we’re looking at when we subset the data. For example, we can find the highest GPA at each value of how likely the respondent is to apply to grad school.

Apply %>%
  filter(gpa == max(gpa),
         .by = apply)
#> # A tibble: 3 × 6
#>      id apply               pared public   gpa vlapply
#>   <int> <dbl+lbl>           <dbl>  <dbl> <dbl>   <dbl>
#> 1     3 0 [unlikely]            1      1  3.94       0
#> 2    13 2 [very likely]         0      1  3.90       1
#> 3    90 1 [somewhat likely]     0      0  4          0

This tells us, for example, that the third respondent in the data incidentally has the highest GPA among those who say they are unlikely to apply for grad school.