IRIII.2: Model Presentation

It’s never sufficient to spam R console output into a formal document. Instead, the student should invest energy into presenting important console output into a narrative format. In the context of the regression model, this is the regression table. {modelsummary} is going to be doing some heavy-lifting here with respect to a silly sample analysis I did around which students can pattern their final papers.

#     /\_/\           ___
#    = o_o =_______    \ \  -Model Presentation-
#     __^      __(  \.__) )
# (@)<_____>__(_____)____/

If it’s not installed, install it.

library(tidyverse)
#> ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
#> ✔ dplyr     1.1.4     ✔ readr     2.1.4
#> ✔ forcats   1.0.0     ✔ stringr   1.5.0
#> ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
#> ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
#> ✔ purrr     1.0.2     
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag()    masks stats::lag()
#> ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(modelsummary)
# library(kableExtra) # for extra formatting options, in this format.
options("modelsummary_factory_default" = "kableExtra")

^ Note: “tinytable” (c.f. {tinytable}) is now default, though previous versions of this were written around “kableExtra” (c.f. {kableExtra}). As you develop your skills here (and importantly move away from copy-pasting stuff into Word), you may want to more fully transition into {modelsummary}’s default settings. However, we should keep this tractable here. Also note that anyone viewing this on the public course website will note that there are additional formatting things we should be doing to make this presentable in that particular format. However, what’s offered here is fine for the intended format. We’ll be doing copy-pasting from RStudio into Word.

Load the data

WVS <- readRDS("~/Dropbox/teaching/eh1903-ir3/2/data/wv6-sweden-v20201117.rds")

^ Note: you have access to this, but I won’t know where you put it on your hard drive. You need to download it and load it.

Let’s grab the information I used in the example paper I wrote for you. Here’s where I have to emphasize you need to read the codebook. The codebook describes the variables included and tells you the basic information they are communicating. You have to read the codebook, whatever the codebook is.

WVS %>%
  select(
    #importance of democracy, justifiability of divorce
    V140, V205, 
    # age, sex, scale of incomes,
    V242, V240, V239, 
    # how often you pray, age completed education
    V146, V249) -> Data

^ read the codebook. It’ll tell you what you want.

Optional, but renames columns to be more informative. In most cases, you should seriously think about doing this because a lot of variable names in standing data sets are either vague or obscene to look at.

colnames(Data) <- c("impdem", "justdiv", "age", "sex", "inc", "pray", "educ")

Surveys typically make men to be 1 and women to be 2, but I’ve always hated this practice. This makes women to be 1 and men to be 0.

Data %>% mutate(sex = ifelse(sex == 2, 1, 0)) -> Data

Data
#> # A tibble: 1,206 × 7
#>    impdem justdiv   age   sex   inc  pray  educ
#>     <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1     10      10    38     1     5     8    31
#>  2     10       4    51     1     5     3    29
#>  3      8      10    70     0     7     3    23
#>  4      6       1    21     1     6     2     6
#>  5     10      10    65     1     5     6    21
#>  6     10      10    51     1     5     2    32
#>  7     10      10    48     1     5     8    38
#>  8     10      10    81     0     7     8    27
#>  9      8      10    19     0     4     8    18
#> 10     10      10    61     0     8     8    18
#> # ℹ 1,196 more rows

Let’s run the linear models I described in the paper.

# First model
M1 <- lm(impdem ~ justdiv, Data)
# Full model
M2 <- lm(impdem ~ justdiv + age + sex + inc + pray + educ, Data)

summary(M1)
#> 
#> Call:
#> lm(formula = impdem ~ justdiv, data = Data)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -8.4876  0.5124  0.5124  0.5993  1.2948 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  8.61826    0.15757  54.696  < 2e-16 ***
#> justdiv      0.08694    0.01807   4.811  1.7e-06 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 1.428 on 1161 degrees of freedom
#>   (43 observations deleted due to missingness)
#> Multiple R-squared:  0.01955,    Adjusted R-squared:  0.0187 
#> F-statistic: 23.15 on 1 and 1161 DF,  p-value: 1.697e-06
summary(M2)
#> 
#> Call:
#> lm(formula = impdem ~ justdiv + age + sex + inc + pray + educ, 
#>     data = Data)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -8.0004 -0.1078  0.3590  0.7373  2.0805 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  6.992661   0.267222  26.168  < 2e-16 ***
#> justdiv      0.095031   0.018053   5.264 1.70e-07 ***
#> age          0.018337   0.002236   8.202 6.59e-16 ***
#> sex          0.180965   0.084742   2.135  0.03294 *  
#> inc          0.056494   0.022731   2.485  0.01309 *  
#> pray        -0.006103   0.019008  -0.321  0.74820    
#> educ         0.013148   0.005088   2.584  0.00989 ** 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 1.365 on 1091 degrees of freedom
#>   (108 observations deleted due to missingness)
#> Multiple R-squared:  0.1064, Adjusted R-squared:  0.1015 
#> F-statistic: 21.65 on 6 and 1091 DF,  p-value: < 2.2e-16

Prepare a table of descriptive statistics of your data.

Papers make routine use of the table of descriptive statistics to communicate some basic features about the data. this is easily done in {modelsummary} with the datasummary_skim() function. The datasummary() function in the same package has more flexibility but this is easier to explain and its default format actually looks nicer.

Data %>% 
  # setNames(c("Imp. Dem.", "Justif. Divorce",
  #                      "Age", "Female", "Income", "Prayer Frequency",
  #                      "Age at End of Schooling")) %>%
  # na.omit %>%
  datasummary_skim(title = "Summary Statistics of the Data Used in this Analysis",
                   align = c("lcccccccc"),
                   histogram = TRUE,
                  # output = "tinytable",
                   notes = "You can put a footnote here.") 

	Unique	Missing Pct.	Mean	SD	Min	Median	Max
impdem	11	1	9.3	1.5	1.0	10.0	10.0
justdiv	11	3	8.4	2.3	1.0	10.0	10.0
age	68	0	47.3	19.4	18.0	47.0	85.0
sex	2	0	0.5	0.5	0.0	1.0	1.0
inc	11	3	5.4	1.8	1.0	5.0	10.0
pray	9	1	6.4	2.3	1.0	8.0	8.0
educ	54	2	24.1	8.3	5.0	22.0	83.0

Be mindful that the data I supplied here are all numeric and the data has only what I want to summarize. This will try to (want to) summarize everything, so be mindful what you are asking it to do.

The histogram = TRUE is nice but often doesn’t print well to documents (and it won’t in Word as I’m going to show you to do it). So, you can use this (as a beginner) to get an idea of what your data look like and what issues you may have, but you should disable it for use.

Now, we’re going to do a few things. First, we’re going to “uncomment” that setNames() comment, which is going to quickly rename our variables to be something that would be insane for an analysis but easy for formatting a table. Next, we’re going to disable that (histogram = FALSE) and knock off a “c” from the align argument.

Data %>% 
  setNames(c("Imp. Dem.", "Justif. Divorce",
                       "Age", "Female", "Income", "Prayer Frequency",
                       "Age at End of Schooling")) %>%
  # na.omit %>%
  datasummary_skim(title = "Summary Statistics of the Data Used in this Analysis",
                   align = c("lccccccc"),
                   histogram = F,
                   notes = "You can put a footnote here.") 
#> Warning: The `histogram` argument is deprecated. Use `fun_numeric` instead.

	Unique	Missing Pct.	Mean	SD	Min	Median	Max
Imp. Dem.	11	1	9.3	1.5	1.0	10.0	10.0
Justif. Divorce	11	3	8.4	2.3	1.0	10.0	10.0
Age	68	0	47.3	19.4	18.0	47.0	85.0
Female	2	0	0.5	0.5	0.0	1.0	1.0
Income	11	3	5.4	1.8	1.0	5.0	10.0
Prayer Frequency	9	1	6.4	2.3	1.0	8.0	8.0
Age at End of Schooling	54	2	24.1	8.3	5.0	22.0	83.0

Looks nice, right? Let’s copy-paste it into a Word document. Some cosmetic things you’ll have to do yourself (e.g. potential centering and what-not). All you need is Ctrl-A, Ctrl-C, Ctrl-V (Cmd equivalent for you Mac users).

Make a plot or two or three

For what it’s worth, this descriptive statistics table is pointing you to potential issues you can encounter in your data. In this particular example, I see several things I’ll want to consider. For one, I see an överliggare there who said they finished schooling at 83 (which, fam…). I see that there is huge problems of left-skew. Most Swedes maximally value democracy, are maximally permissive about divorce, and don’t pray at all (per the codebook). I can already anticipate these are going to be issues I should at least acknowledge because I can suspect they’re going to point me to problems in my linear model.

At the least, I can offer a visual display of these. A bar chart should suffice.

Data %>%
  select(impdem) %>%
  na.omit %>%
  ggplot(.,aes(factor(impdem))) +
  geom_bar(fill="#9bb2ce", alpha=.8, color='black') +
  geom_text(stat='count', aes(label=..count..), vjust=-.5) +
 # theme_steve(style = 'generic') +
  theme_minimal() +
  labs(caption = "Data: World Values Survey in Sweden (2011, Wave 6)",
       x = "Values of the Importance of Democracy",
       y = "Count of Observations") +
  scale_y_continuous(limits = c(0,1000))
#> Warning: The dot-dot notation (`..count..`) was deprecated in ggplot2 3.4.0.
#> ℹ Please use `after_stat(count)` instead.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.

Issues in the dependent variable will typically be the ones you should think about first and the most, but you can see these issues manifest elsewhere.

Data %>%
  select(justdiv) %>%
  na.omit %>%
  ggplot(.,aes(factor(justdiv))) +
  geom_bar(fill="#9bb2ce", alpha=.8, color='black') +
  geom_text(stat='count', aes(label=..count..), vjust=-.5) +
  # theme_steve(style = 'generic') +
  theme_minimal() +
  labs(caption = "Data: World Values Survey in Sweden (2011, Wave 6)",
       x = "Values of the Justifiability of Divorce",
       y = "Count of Observations") +
  scale_y_continuous(limits = c(0,750))

Now that you’ve created a graph that summarizes important features about your data, save it (in RStudio) to a PNG file. Then, in your Word document, grab it and move it in. You can also—if it pleases and sparkles—zoom into the plot, right-click, copy image, and paste it into your Word document. Choice is yours.

Create a regression table

To really impress me, you’ll need to have a regression table that summarizes the results, and that summary cannot (well, really, really should not) be a PrtScrn job. You should get comfortable with the modelsummary() function in R.

Its basic form looks something like this.

modelsummary(list(M1, M2))


(Intercept)	8.618	6.993
	(0.158)	(0.267)
justdiv	0.087	0.095
	(0.018)	(0.018)
age		0.018
		(0.002)
sex		0.181
		(0.085)
inc		0.056
		(0.023)
pray		-0.006
		(0.019)
educ		0.013
		(0.005)
Num.Obs.	1163	1098
R2	0.020	0.106
R2 Adj.	0.019	0.101
AIC	4132.9	3808.4
BIC	4148.1	3848.4
Log.Lik.	-2063.469	-1896.218
RMSE	1.43	1.36

Notice here that modelsummary() works best with list types, and lists are just super-flexible ways of corralling a diverse set of object types in R. Here, we have two regression summaries (M1, M2). We’re wrapping them in a list(). modelsummary() will do what it does with them.

There’s a lot we should really think about doing here. First, it may be useful to so-called “name” your regressions. In my sample paper, M1 is a simple bivariate linear model and M2 adds the control variables. I can name them within list() like this.

modelsummary(list("Bivariate Regression" = M1,
                  "Full Model" = M2))

	Bivariate Regression	Full Model
(Intercept)	8.618	6.993
	(0.158)	(0.267)
justdiv	0.087	0.095
	(0.018)	(0.018)
age		0.018
		(0.002)
sex		0.181
		(0.085)
inc		0.056
		(0.023)
pray		-0.006
		(0.019)
educ		0.013
		(0.005)
Num.Obs.	1163	1098
R2	0.020	0.106
R2 Adj.	0.019	0.101
AIC	4132.9	3808.4
BIC	4148.1	3848.4
Log.Lik.	-2063.469	-1896.218
RMSE	1.43	1.36

Next—and really important—thing I want to do is add asterisks to help me identify so-called statistical significance. There are some customization options here, but just add stars = TRUE here.

modelsummary(list("Bivariate Regression" = M1,
                  "Full Model" = M2),
             stars  = TRUE)

	Bivariate Regression	Full Model
(Intercept)	8.618***	6.993***
	(0.158)	(0.267)
justdiv	0.087***	0.095***
	(0.018)	(0.018)
age		0.018***
		(0.002)
sex		0.181*
		(0.085)
inc		0.056*
		(0.023)
pray		-0.006
		(0.019)
educ		0.013**
		(0.005)
Num.Obs.	1163	1098
R2	0.020	0.106
R2 Adj.	0.019	0.101
AIC	4132.9	3808.4
BIC	4148.1	3848.4
Log.Lik.	-2063.469	-1896.218
RMSE	1.43	1.36
p < 0.1, * p < 0.05, p < 0.01, * p < 0.001

Next, and also really important, thing I want to do is rename the variables to be something intuitive. This is the coef_map argument in modelsummary(). Of note: it will help you to build the simple version above so you know what you’re looking at. In coef_map, there’s a c() function that takes arguments effectively communicating that “what the thing is called in the model, on the left, is going to be called this new thing, on the right”. justdiv is my main independent variable, but “Justifiability of Divorce” is what I want to call it.

modelsummary(list("Bivariate Regression" = M1,
                  "Full Model" = M2),
             stars  = TRUE,
             coef_map = c("justdiv" = "Justifiability of Divorce",
                          "age" = "Age",
                          "educ" = "Age at End of Schooling",
                          "sex" = "Female",
                          "pray" = "Frequency of Prayer",
                          "inc" = "Scale of Incomes",
                          "(Intercept)" = "Intercept"))

	Bivariate Regression	Full Model
Justifiability of Divorce	0.087***	0.095***
	(0.018)	(0.018)
Age		0.018***
		(0.002)
Age at End of Schooling		0.013**
		(0.005)
Female		0.181*
		(0.085)
Frequency of Prayer		-0.006
		(0.019)
Scale of Incomes		0.056*
		(0.023)
Intercept	8.618***	6.993***
	(0.158)	(0.267)
Num.Obs.	1163	1098
R2	0.020	0.106
R2 Adj.	0.019	0.101
AIC	4132.9	3808.4
BIC	4148.1	3848.4
Log.Lik.	-2063.469	-1896.218
RMSE	1.43	1.36
p < 0.1, * p < 0.05, p < 0.01, * p < 0.001

Just as an FYI, this works in order you specify them (i.e. the justdiv variable is displayed first here) and if you don’t specify it, it gets dropped from the summary (i.e. check for typos!).

Finally, it’s worth clarifying that modelsummary() returns almost all goodness of fit statistic for the model (contingent on the model type) and there are often lots of these. You don’t need all of them. In the simple linear model, you really just want information about the number of observations at a minimum, and maybe one or both of the R-squares.

Use the gof_map argument to specify what you want. If you don’t know what they’re called, use broom::glance() to see what they’re called internally.

broom::glance(M2)
#> # A tibble: 1 × 12
#>   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
#>       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
#> 1     0.106         0.101  1.37      21.7 3.89e-24     6 -1896. 3808. 3848.
#> # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

# I want, in order, "nobs", "r.squared", and "adj.r.squared".
modelsummary(list("Bivariate Regression" = M1,
                  "Full Model" = M2),
             stars  = TRUE,
             coef_map = c("justdiv" = "Justifiability of Divorce",
                          "age" = "Age",
                          "educ" = "Age at End of Schooling",
                          "sex" = "Female",
                          "pray" = "Frequency of Prayer",
                          "inc" = "Scale of Incomes",
                          "(Intercept)" = "Intercept"),
             gof_map = c("nobs", "adj.r.squared", "r.squared"))

	Bivariate Regression	Full Model
Justifiability of Divorce	0.087***	0.095***
	(0.018)	(0.018)
Age		0.018***
		(0.002)
Age at End of Schooling		0.013**
		(0.005)
Female		0.181*
		(0.085)
Frequency of Prayer		-0.006
		(0.019)
Scale of Incomes		0.056*
		(0.023)
Intercept	8.618***	6.993***
	(0.158)	(0.267)
Num.Obs.	1163	1098
R2 Adj.	0.019	0.101
R2	0.020	0.106
p < 0.1, * p < 0.05, p < 0.01, * p < 0.001

Finally, you may want to add a caption that explains that this a regression table. You’ll want to call it something better than what I call it here.

modelsummary(list("Bivariate Regression" = M1,
                  "Full Model" = M2),
             stars  = TRUE,
             coef_map = c("justdiv" = "Justifiability of Divorce",
                          "age" = "Age",
                          "educ" = "Age at End of Schooling",
                          "sex" = "Female",
                          "pray" = "Frequency of Prayer",
                          "inc" = "Scale of Incomes",
                          "(Intercept)" = "Intercept"),
             gof_map = c("nobs", "adj.r.squared", "r.squared"),
             title = "Hi Mom!")

Hi Mom!
	Bivariate Regression	Full Model
Justifiability of Divorce	0.087***	0.095***
	(0.018)	(0.018)
Age		0.018***
		(0.002)
Age at End of Schooling		0.013**
		(0.005)
Female		0.181*
		(0.085)
Frequency of Prayer		-0.006
		(0.019)
Scale of Incomes		0.056*
		(0.023)
Intercept	8.618***	6.993***
	(0.158)	(0.267)
Num.Obs.	1163	1098
R2 Adj.	0.019	0.101
R2	0.020	0.106
p < 0.1, * p < 0.05, p < 0.01, * p < 0.001

Hi Mom!

Dope, let’s put it in our Word document. In RStudio, click on that viewer of the table. Then: Ctrl-A, Ctrl-C, Ctrl-V into your Word document. For you Mac users, this should be Cmd instead of Ctrl.