An Intro to R, RStudio, and {tidyverse}
This lab script offers what I think is a gentle introduction to R and RStudio. It will try to acclimate students to R as a programming language and to RStudio as an IDE for the R programming language. There will be a few recurring themes here that are subtle but deceptively critical. 1) You really must know where you are on your computer without reference to icons you can push (i.e. know your working directory and the path to it). 2) You can push any number of buttons in RStudio, but everything is still a command in a terminal/console. Pay careful attention to that information as it’s communicated to you.
# ==============================
# Lab 1: Getting Started in R and RStudio
# - Steven V. Miller (EH 1903)
# ==============================
# \
# \
# \
# /\-/\
# /a a \ _
# =\ Y =/-~~~~~~-,___________/ )
# ‛^--‛ ___________/ /
# \ /
# || |---‛\ \
# (_(__| ((__|
Notice the pound sign/hashtag (#) starting every line? That starts a comment. Lots of programming languages have comment tags. In HTML, for example, it’s a bracket of <!-- comment -->. In CSS, it’s /* comment */. In LaTeX, it’s a percentage sign (%). In R, it’s this character. Make liberal use of this as it allows you to make comments to yourself and explain what you’re doing. Every line that starts with it is ignored in execution by R. Sometimes you want that, especially when you’re having to explain yourself to your future self (or to new students).
If you’re using RStudio, might I recommend the following cosmetic change to RStudio. Go to Tools -> Global Options and, in the pop-up, select “Pane Layout”. Rearrange it so that “Source” is top left, “Console” is top right, and the files/plots/packages/etc. pane is bottom right. Thereafter: apply the changes. You should see something like what I have. In the bottom left pane you see, which is the one that has the Environment and History tabs, minimize the pane by pressing that minimize button you see near the top right of that bottom left pane (it’ll look like a minus sign). This isn’t mandatory, but it makes the best use of space for how you’ll end up using RStudio. Now, let’s get started.
The syllabus prompted you to install some R packages, and the hope is you have already. This particular function will effectively force you to install the packages if you have not already. Let it also be a preview of what a function looks like in the R programming language. I won’t belabor the specifics of what each line of this function is doing, but I want you to use your mouse/trackpad and click on the first line (assuming you are using RStudio). Wait for your cursor to blink. Then, for you Mac users, hit Cmd-Enter. For you Windows or Linux users: Ctrl-Enter. Congratulations! You just defined your first function in R and loaded your first “object” in R. If you re-open the “Environment” pane, you’ll see it listed there as a user-defined function.
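if_not_install <- function(packages) {
  # Keep only the requested packages that are not already installed
  new_pack <- packages[!(packages %in% installed.packages()[,"Package"])]
  # Install them, but only if at least one is missing
  if(length(new_pack)) install.packages(new_pack)
}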
This hasn’t done anything yet. It’s just defined a function that will do something once you’ve run it. Let’s run it now. Same as before, move your cursor over the piece of code you see below. Click on it. Now hit Cmd-Enter or Ctrl-Enter on your keyboard.
if_not_install(c("tidyverse","stevedata","stevemisc",
"stevethemes", "stevetemplates"))
In my case: this did nothing. Ideally in your case it did nothing too. That would be because you already have these packages installed. If you don’t have one or more of these packages installed, it will install them. I’m going to load {tidyverse} because I’m going to use it downstream in this script. In anything you do, whether for a problem set in this course or for your own projects, you’ll typically be loading your libraries like this at the top of your script. Be mindful of that too when you’re working interactively in a given lab session with me, but then need to do an assignment where I have to assume you’re starting from scratch. Be explicit; load your libraries, and typically at the very top of your script.
library(tidyverse)
#> ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
#> ✔ dplyr 1.1.4 ✔ readr 2.1.4
#> ✔ forcats 1.0.0 ✔ stringr 1.5.0
#> ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
#> ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
#> ✔ purrr 1.0.1
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag() masks stats::lag()
#> ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
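For the curious, the heart of if_not_install() is a single subsetting step. Here is a minimal sketch of that step. Note that wanted and have are made-up vectors of made-up package names I'm using for illustration, so nothing gets installed when you run it.

```r
# Hypothetical stand-ins: what we asked for, and what is already installed
wanted <- c("pkgA", "pkgB", "pkgC")
have   <- c("pkgA", "pkgC", "pkgZ")

# %in% asks, element by element, "is this one already in `have`?"
wanted %in% have

# ! flips TRUE/FALSE, and the square brackets keep only the TRUE cases
missing_pkgs <- wanted[!(wanted %in% have)]
missing_pkgs

# length() is 0 when nothing is missing, and if() treats 0 as FALSE
length(missing_pkgs)
```

In the real function, installed.packages()[,"Package"] plays the role of have.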
Get Acclimated in R
Now that you’ve done that, let’s get a general sense of where you are in an R session.
Current Working Directory
First, let’s start with identifying the current working directory. You should know where you are and this happens to be where I am, given the location of this script.
getwd()
#> [1] "/home/steve/Dropbox/teaching/eh1903-ir3/2/scripts"
Of note: by default, R’s working directory is the system’s “home” directory. This is somewhat straightforward in Unix-derivative systems, where there is an outright “home” directory. Assume your username is “steve”, then, in Linux, your home directory will be “/home/steve”. In Mac, I think it’s something like “/Users/steve”. Windows users will invariably have something clumsy like “C:/Users/steve/Documents”. Notice the forward slashes. R, like everything else in the world, uses forward slashes. The backslashes owe to Windows’ derivation from DOS.
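Here is a small sketch of how you might build and inspect a path without clicking on anything. The folder and file names (data, eu2019.csv) are placeholders of my choosing and probably don't exist on your computer; swap in your own.

```r
# Where am I right now?
here_i_am <- getwd()
here_i_am

# file.path() glues pieces together with forward slashes
rel_path <- file.path("data", "eu2019.csv")   # placeholder names
rel_path

# The same file, spelled out in full from the working directory
full_path <- file.path(here_i_am, rel_path)
full_path

# Does it actually exist? (Probably FALSE for a made-up name.)
file.exists(full_path)
```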
Create “Objects” in the Environment
One thing you’ll need to get comfortable doing in most statistical programming applications, but certainly this one, is creating objects in your working environment. R has very few built-in, so-called “internal” objects. For example, here’s one.
pi
#> [1] 3.141593
Most things you have to create and assign yourself. For example, let’s assume we wanted to create a character vector of the Nordic countries using their three-character ISO codes. It would be something like this.
c("SWE", "FIN", "NOR", "ISL", "DNK")
#> [1] "SWE" "FIN" "NOR" "ISL" "DNK"
If we run this, we get a character vector corresponding to those countries’ ISO codes. However, it’s basically lost to history. It doesn’t exist in the environment because we did not assign it to anything. In R, you have to “assign” the objects you create to something you name if you want to keep returning to it. It’d go something like this.
Norden <- c("SWE", "FIN", "NOR", "ISL", "DNK")
Norden
#> [1] "SWE" "FIN" "NOR" "ISL" "DNK"
You can go nuts here with object assignment and the world is truly your oyster. You have multiple assignment mechanisms here too. In many basic applications, something like this is equivalent.
c("SWE", "FIN", "NOR", "ISL", "DNK") -> Norden
Norden = c("SWE", "FIN", "NOR", "ISL", "DNK")
FWIW, the equal sign is one you want to be careful with as it’s not how R wants to think or encourage you to think about assignment. I encourage you to get comfortable with arrow assignment, using <- or ->.
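If you want to convince yourself that the three assignment styles land in the same place, here is a quick sketch. The object names (n1, n2, n3) are throwaway names I made up for the comparison.

```r
n1 <- c("SWE", "FIN", "NOR", "ISL", "DNK")    # leftward arrow
c("SWE", "FIN", "NOR", "ISL", "DNK") -> n2    # rightward arrow
n3 = c("SWE", "FIN", "NOR", "ISL", "DNK")     # equal sign

# All three objects are the same thing
identical(n1, n2)
identical(n1, n3)

# Where = and <- part ways: inside a function call, = names an argument.
# It does not create an object called x in your environment.
mean(x = c(1, 2, 3))
```

That last point is one reason the arrow is the safer habit: an arrow always means assignment, whereas the equal sign means different things in different places.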
Some caution, though. First, don’t create objects with really complex names. To call them back requires getting every character right in the console or script. Why inconvenience yourself? Second, R comes with some default objects that are kinda important, and overwriting them can seriously ruin things downstream. I don’t know off the top of my head all the default objects in R, but there are some important ones like T and F (the shorthand for TRUE and FALSE) that you DO NOT want to overwrite. TRUE and FALSE themselves are reserved words, so R will refuse to let you assign to them. pi is another one you should not overwrite, and data is a function that serves a specific purpose (even if you probably won’t be using it a whole lot). You can, however, assign some built-in objects to new objects.
this_Is_a_long_AND_WEIRD_objEct_name_and_yOu_shoUld_not_do_this <- 5
pi # notice there are a few built-in functions/objects
#> [1] 3.141593
d <- pi # you can assign one built-in object to a new object.
pi <- 3.14 # don't do this....
If you do something dumb (like overwrite T with something), all hope is not lost. Remove the object in question (e.g. rm(T)). Restarting R will also reclaim any built-in object that you overwrote.
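Here is a short sketch of why removing the object works. When you "overwrite" a built-in object, you are really creating your own copy that sits in front of the original. The original never went anywhere.

```r
pi <- 3.14   # creates a copy in your environment that masks the built-in
pi

base::pi     # the original is still sitting in the base package

rm(pi)       # remove your copy...
pi           # ...and the built-in value is visible again
```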
Load Data
Problem sets and lab scripts will lean on data I make available in {stevedata}, or have available for you in Athena. However, you may often find that you want to download a data set from somewhere else and load it into R. Example data sets would be stuff like European Values Survey, European Social Survey, or Varieties of Democracy, or whatever else. You can do this any number of ways, and it will depend on the file format you downloaded. Here are some commands you’ll want to learn for these circumstances:
haven::read_dta(): for loading Stata .dta files
haven::read_spss(): for loading SPSS binaries (typically .sav files)
read_csv(): for loading comma-separated values (CSV) files
readxl::read_excel(): for loading MS Excel spreadsheets
read_tsv(): for tab-separated values (TSV) files
readRDS(): for R serialized data frames, which are awesome for file compression/speed
Notice that functions like read_dta(), read_spss(), and read_excel() require some other packages that I didn’t mention. However, these other packages/libraries are part of the {tidyverse} and are just not loaded directly with it. Under these conditions, you can avoid directly loading a library into a session by referencing it first and grabbing the function you want from within it separated by two colons (::). Basically, haven::read_dta() could be interpreted as a command saying “using the {haven} library, grab the read_dta() command in it”.
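You can try the double-colon trick without installing anything. As a sketch, I'm using {tools} here only because it ships with every R installation but is typically not attached when a session starts; the file name is just a string, not a file that has to exist.

```r
# Is {tools} attached to the session? Typically not.
"package:tools" %in% search()

# Reach into it for one function anyway, without calling library(tools)
tools::file_ext("eu2019.dta")
```

The same pattern is what makes haven::read_dta() work without a library(haven) call at the top of the script.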
These wrappers are also flexible with files on the internet. For example, this will work. Just remember to assign them to an object.
Data <- haven::read_dta("http://svmiller.com/extdata/eu2019.dta")
# Data <- readRDS(url("http://svmiller.com/extdata/eu2019.rds"))
# ^ this will work too, but readRDS() requires url() for wrapping the location of the file.
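If you want to practice the "read it, then assign it" habit without downloading anything, here is a minimal sketch of a round trip with R's serialized format. The toy data frame and the temporary file are both made up for the illustration.

```r
# A tiny, made-up data frame
toy <- data.frame(iso3c = c("SWE", "FIN", "DNK"),
                  nordic = c(1, 1, 1))

tmp <- tempfile(fileext = ".rds")   # a throwaway file location
saveRDS(toy, tmp)                   # write the object to disk

toy_again <- readRDS(tmp)           # read it back and ASSIGN it
identical(toy, toy_again)

unlink(tmp)                         # tidy up the temporary file
```

Had we run readRDS(tmp) without the assignment, the data would have printed to the console and been lost to history, same as the unassigned character vector earlier.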
As a quick aside, I want you to use this as an opportunity to be sure you’ve read a recent guide I put on my blog about how to use the {WDI} package to access World Bank Open Data. I won’t belabor these data in too great detail, but these are all European Union states in 2019 by various metrics. These are income inequality (gini), FDI net inflows as a percentage of GDP (fdipgdp), exports as a percentage of GDP (exppgdp), the real effective exchange rate (reer), tax revenue as a percentage of GDP (taxrevpgdp), GDP in constant 2015 USD (gdp), and population size (pop). The wp variable communicates whether the European Union state was in the Warsaw Pact or not. Former republics of the Soviet Union (e.g. Estonia) and Poland, for example, would both be 1s. France and the United Kingdom would both be 0. Because we loaded these data and assigned them to an object, we can ask for them using default methods available in R and look at what we just loaded.
Data
#> # A tibble: 28 × 12
#> country iso2c iso3c year wp gini fdipgdp exppgdp reer taxrevpgdp
#> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Austria AT AUT 2019 0 30.2 -2.85 55.9 102. 25.7
#> 2 Belgium BE BEL 2019 0 27.2 -1.97 83.0 99.8 22.6
#> 3 Bulgaria BG BGR 2019 1 40.3 3.24 64.2 102. 20.6
#> 4 Croatia HR HRV 2019 0 28.9 6.47 50.5 94.1 22.0
#> 5 Cyprus CY CYP 2019 0 31.2 202. 75.8 87.5 23.1
#> 6 Czechia CZ CZE 2019 1 25.3 4.19 72.1 99.6 14.5
#> 7 Denmark DK DNK 2019 0 27.7 -1.10 59.0 95.6 34.9
#> 8 Estonia EE EST 2019 1 30.8 9.77 72.0 NA 20.8
#> 9 Finland FI FIN 2019 0 27.7 6.10 40.6 97.0 20.8
#> 10 France FR FRA 2019 0 31.2 1.96 32.9 93.9 24.6
#> # ℹ 18 more rows
#> # ℹ 2 more variables: gdp <dbl>, pop <dbl>
The “tibble” output tells us something about our data. We can observe that there are 28 observations (or rows, if you will) and that there are 12 columns in the data.
There are other ways to find the dimension of the data set (i.e. rows and columns). For example, you can ask for the dimensions of the object itself.
dim(Data)
#> [1] 28 12
Convention is rows-columns, so the first element in this numeric vector tells us there are 28 rows and the second one tells us there are 12 columns. You can also do this.
nrow(Data)
#> [1] 28
ncol(Data)
#> [1] 12
Learn Some Important R/“Tidy” Functions
I want to spend most of our time in this lab session teaching you some basic commands you should know to do basically anything in R. These are so-called “tidy” verbs. We’ll be using the EU data that we just loaded from the internet. I want to dedicate the bulk of this section to learning some core functions that are part of the {tidyverse}. My introduction here will inevitably be incomplete because there’s only so much I can teach within the limited time I have. That said, I’m going to focus on the following functions available in the {tidyverse} that totally rethink base R. These are the “pipe” (%>%), glimpse() and summary(), select(), summarize(), mutate(), and filter(). Most of these, certainly the important ones, have a .by argument that will also get special attention.
The Pipe (%>%)
I want to start with the pipe because I think of it as the most important function in the {tidyverse}. The pipe, represented as %>%, allows you to chain together a series of functions. Its innovation fundamentally changed R’s default behavior, which otherwise wants to go line-by-line or work inside out for nested functions. The pipe instead allows you to think and do things in a more intuitive way. Rather than work inside out, or copy-paste functions, the pipe gives you the flexibility to think left-to-right, and top-to-bottom (for reasons you’ll see soon). The pipe is especially useful if you’re recoding data and you want to make sure you got everything the way you wanted (and correct) before assigning the data to another object. You can chain together a lot of {tidyverse} commands with pipes, but we’ll keep our introduction here rather minimal because I want to use it to teach about some other things.
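To make "inside out" versus "left-to-right" concrete, here is a minimal sketch with a made-up numeric vector. I load {magrittr} directly so the example stands on its own; library(tidyverse) gives you the same pipe. The last line uses base R's own pipe, which assumes R 4.1 or later.

```r
library(magrittr)   # where %>% comes from

x <- c(1, 4, 9, 16)   # made-up numbers

# Nested: you have to read from the inside out
nested <- round(mean(sqrt(x)), 1)

# Piped: take x, then square root, then average, then round
piped <- x %>% sqrt() %>% mean() %>% round(1)

# Base R's native pipe reads the same way
native <- x |> sqrt() |> mean() |> round(1)

c(nested, piped, native)
```

All three say the same thing. The piped versions just say it in the order you'd describe it out loud.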
glimpse() and summary()
glimpse() and summary() will get you basic descriptions of your data. Personally, I find summary() more informative than glimpse() though glimpse() is useful if your data have a lot of variables and you want to just peek into the data without spamming the R console with output.
Notice, here, the introduction of the pipe (%>%). In the commands below, Data %>% glimpse() is equivalent to glimpse(Data), but I like to lean more on pipes than perhaps others would. My workflow starts with (data) objects, applies various functions to them, and assigns them to objects. I think you’ll get a lot of mileage thinking that same way too.
Data %>% glimpse() # notice the pipe
#> Rows: 28
#> Columns: 12
#> $ country <chr> "Austria", "Belgium", "Bulgaria", "Croatia", "Cyprus", "Cze…
#> $ iso2c <chr> "AT", "BE", "BG", "HR", "CY", "CZ", "DK", "EE", "FI", "FR",…
#> $ iso3c <chr> "AUT", "BEL", "BGR", "HRV", "CYP", "CZE", "DNK", "EST", "FI…
#> $ year <dbl> 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019,…
#> $ wp <dbl> 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0,…
#> $ gini <dbl> 30.2, 27.2, 40.3, 28.9, 31.2, 25.3, 27.7, 30.8, 27.7, 31.2,…
#> $ fdipgdp <dbl> -2.852556, -1.972530, 3.236477, 6.472772, 201.684997, 4.187…
#> $ exppgdp <dbl> 55.89581, 82.96743, 64.18229, 50.48924, 75.75887, 72.12485,…
#> $ reer <dbl> 101.73053, 99.79796, 102.37821, 94.10345, 87.47805, 99.5891…
#> $ taxrevpgdp <dbl> 25.67596, 22.59782, 20.55721, 21.97302, 23.07736, 14.52457,…
#> $ gdp <dbl> 4.133653e+11, 4.940997e+11, 5.724854e+10, 5.782393e+10, 2.5…
#> $ pop <dbl> 8879920, 11488980, 6975761, 3949390, 1286671, 10671870, 581…
Data %>% summary()
#> country iso2c iso3c year
#> Length:28 Length:28 Length:28 Min. :2019
#> Class :character Class :character Class :character 1st Qu.:2019
#> Mode :character Mode :character Mode :character Median :2019
#> Mean :2019
#> 3rd Qu.:2019
#> Max. :2019
#>
#> wp gini fdipgdp exppgdp
#> Min. :0.0000 Min. :23.20 Min. :-13.674 Min. : 30.95
#> 1st Qu.:0.0000 1st Qu.:28.88 1st Qu.: 1.950 1st Qu.: 41.94
#> Median :0.0000 Median :30.90 Median : 3.196 Median : 60.74
#> Mean :0.3214 Mean :30.91 Mean : 27.520 Mean : 68.23
#> 3rd Qu.:1.0000 3rd Qu.:33.38 3rd Qu.: 6.291 3rd Qu.: 81.89
#> Max. :1.0000 Max. :40.30 Max. :252.920 Max. :206.41
#>
#> reer taxrevpgdp gdp pop
#> Min. : 85.71 Min. :11.21 Min. :1.488e+10 Min. : 504062
#> 1st Qu.: 92.37 1st Qu.:18.83 1st Qu.:5.768e+10 1st Qu.: 3660577
#> Median : 95.94 Median :22.13 Median :2.201e+11 Median : 9325530
#> Mean : 95.56 Mean :21.59 Mean :6.441e+11 Mean :18362127
#> 3rd Qu.: 99.60 3rd Qu.:24.50 3rd Qu.:5.543e+11 3rd Qu.:17851568
#> Max. :103.25 Max. :34.94 Max. :3.673e+12 Max. :83092962
#> NA's :3
Of note: notice the summary function (alternatively summary(Data)) gives you basic descriptive statistics. You can see the mean and median, which are routinely statistics of central tendency that we care about. Notice the wp variable, which is binary and communicates whether a European Union state was previously in (or covered by) the Warsaw Pact. Here, the median is 0 (which tells you most European states weren’t previously in the Warsaw Pact) but the mean tells you about 32.14% of the European Union in 2019 was previously in the Warsaw Pact.
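The reason the mean of a 0/1 variable reads as a proportion is just arithmetic: summing 0s and 1s counts the 1s. Here is a small sketch that rebuilds a vector like wp from its counts (19 zeros and 9 ones, which is the breakdown the data report later in this script) rather than from the data themselves.

```r
# A stand-in for the wp column: 19 zeros and 9 ones
wp <- rep(c(0, 1), times = c(19, 9))

length(wp)    # number of observations
sum(wp)       # adding 0s and 1s just counts the 1s
median(wp)    # the middle value is a 0 because 0s are the majority
mean(wp)      # the count of 1s divided by the number of observations
9/28          # the same thing, by hand
```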
select()
select() is useful for basic (but important) data management. You can use it to grab (or omit) columns from data. For example, let’s say I wanted to grab all the columns in the data. I could do that with the following command.
Data %>% select(everything()) # grab everything
#> # A tibble: 28 × 12
#> country iso2c iso3c year wp gini fdipgdp exppgdp reer taxrevpgdp
#> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Austria AT AUT 2019 0 30.2 -2.85 55.9 102. 25.7
#> 2 Belgium BE BEL 2019 0 27.2 -1.97 83.0 99.8 22.6
#> 3 Bulgaria BG BGR 2019 1 40.3 3.24 64.2 102. 20.6
#> 4 Croatia HR HRV 2019 0 28.9 6.47 50.5 94.1 22.0
#> 5 Cyprus CY CYP 2019 0 31.2 202. 75.8 87.5 23.1
#> 6 Czechia CZ CZE 2019 1 25.3 4.19 72.1 99.6 14.5
#> 7 Denmark DK DNK 2019 0 27.7 -1.10 59.0 95.6 34.9
#> 8 Estonia EE EST 2019 1 30.8 9.77 72.0 NA 20.8
#> 9 Finland FI FIN 2019 0 27.7 6.10 40.6 97.0 20.8
#> 10 France FR FRA 2019 0 31.2 1.96 32.9 93.9 24.6
#> # ℹ 18 more rows
#> # ℹ 2 more variables: gdp <dbl>, pop <dbl>
Do note this is kind of a redundant command. You could just as well spit the entire data into the console and it would’ve done the same thing. Still, here’s if I wanted everything except the two-character ISO code. I’m more of a three-character guy myself.
Data %>% select(-iso2c) # grab everything, but drop the iso2c variable.
#> # A tibble: 28 × 11
#> country iso3c year wp gini fdipgdp exppgdp reer taxrevpgdp gdp
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Austria AUT 2019 0 30.2 -2.85 55.9 102. 25.7 4.13e11
#> 2 Belgium BEL 2019 0 27.2 -1.97 83.0 99.8 22.6 4.94e11
#> 3 Bulgaria BGR 2019 1 40.3 3.24 64.2 102. 20.6 5.72e10
#> 4 Croatia HRV 2019 0 28.9 6.47 50.5 94.1 22.0 5.78e10
#> 5 Cyprus CYP 2019 0 31.2 202. 75.8 87.5 23.1 2.52e10
#> 6 Czechia CZE 2019 1 25.3 4.19 72.1 99.6 14.5 2.17e11
#> 7 Denmark DNK 2019 0 27.7 -1.10 59.0 95.6 34.9 3.32e11
#> 8 Estonia EST 2019 1 30.8 9.77 72.0 NA 20.8 2.73e10
#> 9 Finland FIN 2019 0 27.7 6.10 40.6 97.0 20.8 2.53e11
#> 10 France FRA 2019 0 31.2 1.96 32.9 93.9 24.6 2.61e12
#> # ℹ 18 more rows
#> # ℹ 1 more variable: pop <dbl>
Here’s a more typical case. Assume you’re working with a large data object and you just want a handful of things. In this case, we have these variables, but we want just the identifier variables and the gini column for income inequality. We want to drop everything else. Here’s how we’d do that in the select() function, again with some assistance from the pipe.
Data %>% select(country:gini) # grab country, gini, and everything in between them.
#> # A tibble: 28 × 6
#> country iso2c iso3c year wp gini
#> <chr> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 Austria AT AUT 2019 0 30.2
#> 2 Belgium BE BEL 2019 0 27.2
#> 3 Bulgaria BG BGR 2019 1 40.3
#> 4 Croatia HR HRV 2019 0 28.9
#> 5 Cyprus CY CYP 2019 0 31.2
#> 6 Czechia CZ CZE 2019 1 25.3
#> 7 Denmark DK DNK 2019 0 27.7
#> 8 Estonia EE EST 2019 1 30.8
#> 9 Finland FI FIN 2019 0 27.7
#> 10 France FR FRA 2019 0 31.2
#> # ℹ 18 more rows
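There are a few other ways to ask select() for columns that you may find handy. Here is a sketch on a tiny, made-up tibble (toy, with placeholder numbers in the gini column) so the output is small enough to read at a glance. starts_with() is one of several "helper" functions that select() understands.

```r
library(dplyr)

# A made-up tibble with placeholder values
toy <- tibble(country = c("Sweden", "Finland"),
              iso2c   = c("SE", "FI"),
              iso3c   = c("SWE", "FIN"),
              gini    = c(1, 2))

by_name  <- toy %>% select(country, gini)          # name the columns outright
by_help  <- toy %>% select(starts_with("iso"))     # grab columns by prefix
drop_iso <- toy %>% select(-starts_with("iso"))    # or drop columns by prefix

names(by_name)
names(by_help)
names(drop_iso)
```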
Grouped functions using .by arguments
I think the pipe is probably the most important function in the {tidyverse} even as a critical reader might note that the pipe is 1) a port from another package ({magrittr}) and 2) now a part of base R under different notation (|>). Thus, the critical reader (and probably me, depending on my mood) may note that grouped functions/arguments serve as probably the most important component of the {tidyverse}. It used to be group_by() that did this, but now most functions in the {tidyverse} have .by arguments that are arguably more efficient for this purpose. Basically, grouping the data, either with the older group_by() or a .by argument, allows you to “split” the data into various subsets, “apply” various functions to them, and “combine” them into one output. You might see that terminology “split-apply-combine” as you learn more about the {tidyverse} and its development.
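Here is a minimal sketch of "split-apply-combine" done both ways on a made-up tibble (toy, grp, and val are names I invented for the illustration). It assumes {dplyr} 1.1.0 or later, which is when the .by argument arrived.

```r
library(dplyr)

# Made-up data: two groups with a few values each
toy <- tibble(grp = c("a", "a", "b", "b", "b"),
              val = c(1, 3, 10, 20, 30))

# The .by way: grouping lasts only for this one call
with_by <- toy %>%
  summarize(avg = mean(val), .by = grp)

# The group_by() way: group first, then summarize
with_group_by <- toy %>%
  group_by(grp) %>%
  summarize(avg = mean(val))

# Same numbers either way
all.equal(as.data.frame(with_by), as.data.frame(with_group_by))
```

One difference worth knowing: group_by() sorts the output by the grouping variable, while .by keeps groups in the order they first appear in the data. The two agree here only because "a" shows up before "b".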
Here, let’s do a simple exercise with slice(). slice() lets you index rows by integer locations (or through other means) and can be useful for peeking into the data or curating it (by doing something like removing duplicate observations). In this simple case, we’re going to slice the data by the first observation at each level of the wp variable. The wp variable communicates whether an observation was in the Warsaw Pact (or was a state covered by the Warsaw Pact by way of being a former republic of the Soviet Union).
# Notice we can chain some pipes together
Data %>%
# Get me the first observation, by group.
slice(1, .by=wp)
#> # A tibble: 2 × 12
#> country iso2c iso3c year wp gini fdipgdp exppgdp reer taxrevpgdp gdp
#> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Austria AT AUT 2019 0 30.2 -2.85 55.9 102. 25.7 4.13e11
#> 2 Bulgar… BG BGR 2019 1 40.3 3.24 64.2 102. 20.6 5.72e10
#> # ℹ 1 more variable: pop <dbl>
If you don’t group by the category first, slice(., 1) will just return the first observation in the data set.
Data %>%
# Get me the first observation for each value of the wp variable
slice(1) # womp womp. Forgot to use the .by argument
#> # A tibble: 1 × 12
#> country iso2c iso3c year wp gini fdipgdp exppgdp reer taxrevpgdp gdp
#> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Austria AT AUT 2019 0 30.2 -2.85 55.9 102. 25.7 4.13e11
#> # ℹ 1 more variable: pop <dbl>
I think slice() is a hidden gem and offer it the way I often use it (mostly by row indexing), but you can also use it as you would filter() later in the script. For example, here’s how you can use it to identify the highest GDP by levels of the wp variable. Given time constraints, I’m going to leave it to you to understand what’s happening here in more detail.
Data %>% slice(which(gdp == max(gdp)), .by=wp)
#> # A tibble: 2 × 12
#> country iso2c iso3c year wp gini fdipgdp exppgdp reer taxrevpgdp gdp
#> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Germany DE DEU 2019 0 31.8 1.91 42.4 95.5 11.2 3.67e12
#> 2 Poland PL POL 2019 1 28.8 3.15 52.6 92.4 17.1 5.78e11
#> # ℹ 1 more variable: pop <dbl>
filter() would be more efficient, but slice() can do some of that too.
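If you'd like a head start on unpacking the which(gdp == max(gdp)) part of that slice() call, here is a sketch in base R on a made-up vector of three numbers. slice() wants row positions, and which() is what turns a TRUE/FALSE test into positions.

```r
gdp <- c(3, 10, 7)   # made-up values, not the real data

max(gdp)                          # the largest value
gdp == max(gdp)                   # a TRUE/FALSE test for every element
position <- which(gdp == max(gdp))
position                          # the position of the TRUE
```

filter() skips that last step because it works with the TRUE/FALSE test directly.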
summarize()
summarize() creates condensed summaries of your data, for whatever it is that you want. Here, for example, is a kind of dumb way of seeing how many observations are in the data. nrow(Data) works just as well, but alas…
Data %>%
# How many observations are in the data?
summarize(n = n())
#> # A tibble: 1 × 1
#> n
#> <int>
#> 1 28
# How many observations are there by levels of the wp variable?
Data %>%
summarize(n = n(), .by=wp)
#> # A tibble: 2 × 2
#> wp n
#> <dbl> <int>
#> 1 0 19
#> 2 1 9
What you did, indirectly here, was find the mode of the wp variable. This is the most frequently occurring value in a variable, which is really only of interest for unordered-categorical or ordered-categorical variables. Statisticians really don’t care about the mode, and it’s why there is no real built-in function in base R that says “Here’s the mode.” You have to get it indirectly. More importantly, summarize() works wonderfully with the .by argument. For example, for the countries in the EU, by their former Warsaw Pact status, let’s identify the average GINI and the median exports as a % of GDP.
Data %>%
  # Give me the mean GINI and median Exports/GDP by each value of `wp`
  summarize(mean_gini = mean(gini, na.rm = TRUE),
            median_exppgdp = median(exppgdp, na.rm = TRUE),
            .by = wp)
#> # A tibble: 2 × 3
#> wp mean_gini median_exppgdp
#> <dbl> <dbl> <dbl>
#> 1 0 30.7 50.5
#> 2 1 31.4 72.0
This summary tells you that European Union states generally have the same level of income inequality whether they were previously in the Warsaw Pact or not, but exports are a larger share of GDP for states that were formerly in the Warsaw Pact compared to states that were not. That’s not terribly surprising to me, given what we know about the endowments of the former Warsaw Pact states relative to states in the EU that are more “Western”. One downside (or feature, depending on your perspective) to summarize() is that it condenses data and discards stuff that’s not necessary for creating the condensed output. In the case above, notice we didn’t ask for anything else about the data, other than the average GINI and exports/GDP by each value of the wp variable. Thus, we didn’t get anything else. Use it with that in mind.
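As for the mode mentioned earlier in this section, here is one way you might get it indirectly in base R. This is a sketch on a made-up character vector (answers), and it is only one of several ways to do it.

```r
# Made-up categorical responses
answers <- c("yes", "no", "yes", "maybe", "yes")

counts <- table(answers)   # count how often each value occurs
counts

# which.max() finds the biggest count; names() tells us which value it is
the_mode <- names(counts)[which.max(counts)]
the_mode
```

Be mindful that which.max() returns only the first winner, so a tie for most frequent value would go unreported.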
mutate()
mutate() is probably the most important {tidyverse} function for data management/recoding. It will allow you to create new columns while retaining the original dimensions of the data. Consider it the sister function to summarize(). But, where summarize() discards, mutate() retains.
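Here is a minimal sketch of that "discards" versus "retains" distinction, using the same calculation in both functions on a made-up tibble (toy, grp, and val are invented names). It assumes {dplyr} 1.1.0 or later for the .by argument.

```r
library(dplyr)

# Made-up data: three rows, two groups
toy <- tibble(grp = c("a", "a", "b"),
              val = c(1, 3, 10))

# summarize() collapses to one row per group and keeps only what you asked for
collapsed <- toy %>% summarize(avg = mean(val), .by = grp)

# mutate() keeps every row and column, and adds the group average alongside
retained <- toy %>% mutate(avg = mean(val), .by = grp)

dim(collapsed)
dim(retained)
retained
```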
Let’s do something simple with mutate(). For example, we can create a new variable for GDP per capita based on the information we have. We have GDP. We have population size. They are both in the same units. We just need to divide one by the other. Watch how we’d do that here.
Data %>%
mutate(gdppc = gdp/pop)
#> # A tibble: 28 × 13
#> country iso2c iso3c year wp gini fdipgdp exppgdp reer taxrevpgdp
#> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Austria AT AUT 2019 0 30.2 -2.85 55.9 102. 25.7
#> 2 Belgium BE BEL 2019 0 27.2 -1.97 83.0 99.8 22.6
#> 3 Bulgaria BG BGR 2019 1 40.3 3.24 64.2 102. 20.6
#> 4 Croatia HR HRV 2019 0 28.9 6.47 50.5 94.1 22.0
#> 5 Cyprus CY CYP 2019 0 31.2 202. 75.8 87.5 23.1
#> 6 Czechia CZ CZE 2019 1 25.3 4.19 72.1 99.6 14.5
#> 7 Denmark DK DNK 2019 0 27.7 -1.10 59.0 95.6 34.9
#> 8 Estonia EE EST 2019 1 30.8 9.77 72.0 NA 20.8
#> 9 Finland FI FIN 2019 0 27.7 6.10 40.6 97.0 20.8
#> 10 France FR FRA 2019 0 31.2 1.96 32.9 93.9 24.6
#> # ℹ 18 more rows
#> # ℹ 3 more variables: gdp <dbl>, pop <dbl>, gdppc <dbl>
Again, the world is your oyster here. We can also create another variable to identify Southern European countries of Portugal, Spain, Italy, and Greece. Looking ahead, you can see how this would also create a variable for something like the Nordic countries in the European Union, though you’d have to change a few things to make it work (the information is still there).
Data %>%
mutate(southeurope = ifelse(iso2c %in% c("GR", "IT", "PT", "ES"), 1, 0)) %>%
filter(southeurope == 1) # Did this work the way I wanted?
#> # A tibble: 4 × 13
#> country iso2c iso3c year wp gini fdipgdp exppgdp reer taxrevpgdp gdp
#> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Greece GR GRC 2019 0 33.1 2.41 39.6 88.7 25.9 2.06e11
#> 2 Italy IT ITA 2019 0 34.6 1.77 30.9 94.6 24.5 1.92e12
#> 3 Portug… PT PRT 2019 0 32.8 4.30 43.7 96.4 22.2 2.22e11
#> 4 Spain ES ESP 2019 0 34.3 2.16 34.7 95.9 13.7 1.33e12
#> # ℹ 2 more variables: pop <dbl>, southeurope <dbl>
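The ifelse() and %in% combination is doing the real work in that mutate() call, so here is a sketch of it outside the data, in base R. I'm using the Nordic codes from earlier in the script and a short, hand-picked vector of three-character codes (iso3c) as a stand-in for the real column.

```r
Norden <- c("SWE", "FIN", "NOR", "ISL", "DNK")

# A hand-picked stand-in for the iso3c column
iso3c <- c("AUT", "DNK", "FIN", "FRA", "SWE")

# Is each code one of the Nordic codes? TRUE/FALSE for every element.
iso3c %in% Norden

# ifelse(test, value if TRUE, value if FALSE)
nordic <- ifelse(iso3c %in% Norden, 1, 0)
nordic
```

Inside mutate(), the same ifelse() call would operate on the column of the same name in the data.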
We can save/assign our work as follows.
Data %>%
mutate(gdppc = gdp/pop,
southeurope = ifelse(iso2c %in% c("GR", "IT", "PT", "ES"), 1, 0)) -> Data
Data
#> # A tibble: 28 × 14
#> country iso2c iso3c year wp gini fdipgdp exppgdp reer taxrevpgdp
#> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Austria AT AUT 2019 0 30.2 -2.85 55.9 102. 25.7
#> 2 Belgium BE BEL 2019 0 27.2 -1.97 83.0 99.8 22.6
#> 3 Bulgaria BG BGR 2019 1 40.3 3.24 64.2 102. 20.6
#> 4 Croatia HR HRV 2019 0 28.9 6.47 50.5 94.1 22.0
#> 5 Cyprus CY CYP 2019 0 31.2 202. 75.8 87.5 23.1
#> 6 Czechia CZ CZE 2019 1 25.3 4.19 72.1 99.6 14.5
#> 7 Denmark DK DNK 2019 0 27.7 -1.10 59.0 95.6 34.9
#> 8 Estonia EE EST 2019 1 30.8 9.77 72.0 NA 20.8
#> 9 Finland FI FIN 2019 0 27.7 6.10 40.6 97.0 20.8
#> 10 France FR FRA 2019 0 31.2 1.96 32.9 93.9 24.6
#> # ℹ 18 more rows
#> # ℹ 4 more variables: gdp <dbl>, pop <dbl>, gdppc <dbl>, southeurope <dbl>
filter()
filter() is a great diagnostic tool for subsetting your data to look at particular observations. Notice one little thing, especially if you’re new to programming. The double-equal sign (==) is for making logical statements whereas the single-equal sign (=) is for object assignment or column creation. If you’re using filter(), you’re probably wanting to find cases where something equals something (==), is greater than something (>), is greater than or equal to something (>=), is less than something (<), or is less than or equal to something (<=).
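Here is a quick sketch of what those logical statements return, using a made-up vector in base R. I included a missing value on purpose, since some columns in our data (e.g. reer) have them.

```r
x <- c(1, 5, 10, NA)   # made-up values, with one missing

is_five      <- x == 5    # equal to 5?
at_least_five <- x >= 5   # greater than or equal to 5?
below_five   <- x < 5     # less than 5?

is_five
at_least_five
below_five
```

Every comparison with a missing value comes back as NA, which is neither TRUE nor FALSE. filter() keeps only the rows where the statement is TRUE, so rows with a missing value in the comparison get dropped.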
We can do something like find the highest GDP per capita of EU states by different values of the wp variable.
Data %>%
filter(gdppc == max(gdppc),
.by = wp)
#> # A tibble: 2 × 14
#> country iso2c iso3c year wp gini fdipgdp exppgdp reer taxrevpgdp gdp
#> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Estonia EE EST 2019 1 30.8 9.77 72.0 NA 20.8 2.73e10
#> 2 Luxemb… LU LUX 2019 0 34.2 253. 206. 99.6 26.5 6.66e10
#> # ℹ 3 more variables: pop <dbl>, gdppc <dbl>, southeurope <dbl>
Take out gdppc above and insert gdp and you’ll get the more elegant way of doing what the slice(which()) example did above. We can also see all the states that were previously in the Warsaw Pact.
Data %>% filter(wp == 1)
#> # A tibble: 9 × 14
#> country iso2c iso3c year wp gini fdipgdp exppgdp reer taxrevpgdp gdp
#> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Bulgar… BG BGR 2019 1 40.3 3.24 64.2 102. 20.6 5.72e10
#> 2 Czechia CZ CZE 2019 1 25.3 4.19 72.1 99.6 14.5 2.17e11
#> 3 Estonia EE EST 2019 1 30.8 9.77 72.0 NA 20.8 2.73e10
#> 4 Hungary HU HUN 2019 1 30 59.9 81.5 89.1 22.4 1.47e11
#> 5 Latvia LV LVA 2019 1 34.5 3.37 62.5 103. 22.1 2.93e10
#> 6 Lithua… LT LTU 2019 1 35.3 6.23 76.8 NA 19.9 4.90e10
#> 7 Poland PL POL 2019 1 28.8 3.15 52.6 92.4 17.1 5.78e11
#> 8 Romania RO ROU 2019 1 34.8 2.93 40.1 97.3 14.5 2.18e11
#> 9 Slovak… SK SVK 2019 1 23.2 2.15 91.8 101. 18.8 9.95e10
#> # ℹ 3 more variables: pop <dbl>, gdppc <dbl>, southeurope <dbl>