IRIII.2: Linear Models (Again)

This lab script teaches the implementation and interpretation of the basic linear model by reference to Braithwaite’s (2006) analysis of the geographic spread of militarized interstate disputes. Students should read that article to understand the context of what we’re doing. This lab script will just implement one of their basic models and unpack its contents.

#          ___
#      _.-|   |          |\__/,|   (`\
#     {   |   |          |o o  |__ _) ) # not again...
#      "-.|___|        _.( T   )  `  /
#       .--'-`-.     _((_ `^--' /_<  \
#     .+|______|__.-||__)`-'(((/  (((/
#
# Lab 4b: Linear Models (Again)

If it’s not installed, install it.

library(tidyverse)
#> ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
#> ✔ dplyr     1.1.4     ✔ readr     2.1.4
#> ✔ forcats   1.0.0     ✔ stringr   1.5.0
#> ✔ ggplot2   3.5.2     ✔ tibble    3.3.0
#> ✔ lubridate 1.9.4     ✔ tidyr     1.3.0
#> ✔ purrr     1.1.0     
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag()    masks stats::lag()
#> ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(stevethemes)
library(stevemisc)
#> 
#> Attaching package: 'stevemisc'
#> 
#> The following object is masked from 'package:lubridate':
#> 
#>     dst
#> 
#> The following object is masked from 'package:dplyr':
#> 
#>     tbl_df

Load the data

This code aims to augment what students should’ve already learned in the previous lesson. I’ll be repeating myself because repetition will certainly help the student’s learning, but I hope any terseness (for the moment) is met with some understanding since it’s a fair bit of work to create new lessons on the fly. No matter, I hope students have learned by now how to read data from the internet. I’ve uploaded the data for this session to Athena (in both forms) and to my website. You have two options here if you’re going the route of my website. One is the Stata file and the other is the R serialized data frame (.rds). Let’s go with the R serialized data frame for this session. If you go this route, you have to include url() in the readRDS() function. Observe:

Braith <- readRDS(url("https://svmiller.com/extdata/braithwaite2006gsmd.rds"))
# Braith <- haven::read_dta("https://svmiller.com/extdata/braithwaite2006gsmd.dta")
# ^ also works. You can also download a copy of the data fraom Athena.

This part won’t be super fun, but this is the price of doing business in this world. Ideally, you read the article in question to get an idea of what Alex is doing. Joshua Alley has these data in his Github repository of simple cross-sectional OLS data sets. The original data and analyses were conducted in Stata, so some forensics are necessary here based on the Stata .do file that Alex wrote. We’re going for a replication of Table II, so let’s identify what exactly he did and get a reduced version of the data based on Alex’ description and the Stata .do file. I’ll do it for you this session but, in the Master’s level, I’ll ask you to figure stuff out like this yourself. It’s good experience.

Before we begin, though, I welcome anyone in the class as we’re doing this to tell me what this article is about.

Briefly, we’re aiming to reproduce this model, presented below in Stata syntax.

# reg log_radius_area territory logsize  host_mt host_for water cwpceyrs bord_vital  host_resource   if final_hostile>0, robust;

Okay, let’s look at our data…

Braith
#> # A tibble: 296 × 26
#>    dispnum dyadid ccode1 ccode2  year logcap_ratio allies joint_democ incidents
#>      <dbl>  <dbl>  <dbl>  <dbl> <dbl>        <dbl>  <dbl>       <dbl>     <dbl>
#>  1    4327  91093     91     93  2001        0.144      1           1         1
#>  2    4085 437475    437    475  1993        1.92       1           0         1
#>  3    4114 670679    670    679  1994        1.93       1           0         8
#>  4    4191 630640    630    640  1996        0.211      0           0         1
#>  5    4228 365700    365    700  1998        3.71       0           0         3
#>  6    4001 800811    800    811  1994        1.97       0           0        23
#>  7    4225 731740    731    740  1999        1.54       0           0         3
#>  8    4064 710713    710    713  1995        2.47       0           0        16
#>  9    4338 365373    365    373  1999        3.69       1           0         1
#> 10    4086 355705    355    705  1996        0.362      0           0         1
#> # ℹ 286 more rows
#> # ℹ 17 more variables: territory <dbl>, sdd <dbl>, radius <dbl>,
#> #   final_hostile <dbl>, jointsize <dbl>, log_radius_area <dbl>,
#> #   log_sdd_area <dbl>, pop_dense <dbl>, water <dbl>, logdurat <dbl>,
#> #   cwpceyrs <dbl>, host_mt <dbl>, host_for <dbl>, host_resource <dbl>,
#> #   bord_vital <dbl>, log_jointsize <dbl>, `_est_model1` <dbl>
names(Braith)
#>  [1] "dispnum"         "dyadid"          "ccode1"          "ccode2"         
#>  [5] "year"            "logcap_ratio"    "allies"          "joint_democ"    
#>  [9] "incidents"       "territory"       "sdd"             "radius"         
#> [13] "final_hostile"   "jointsize"       "log_radius_area" "log_sdd_area"   
#> [17] "pop_dense"       "water"           "logdurat"        "cwpceyrs"       
#> [21] "host_mt"         "host_for"        "host_resource"   "bord_vital"     
#> [25] "log_jointsize"   "_est_model1"

This is a good lesson for those of you who may have some experience with Stata from another program. Stata will guess stuff on your behalf. R won’t. In this case, Stata can infer you meant log_jointsize in the data when you asked for logsize. R 100% won’t do that for you. Beyond that, reducing the raw data to a simple form for the sake of the analysis isn’t that hard.

Braith %>%
  filter(final_hostile > 0) %>%
  select(log_radius_area, territory, log_jointsize,
         host_for, host_mt, water, cwpceyrs, bord_vital, 
           host_resource) -> Data

Data
#> # A tibble: 296 × 9
#>    log_radius_area territory log_jointsize host_for host_mt water cwpceyrs
#>              <dbl>     <dbl>         <dbl>    <dbl>   <dbl> <dbl>    <dbl>
#>  1            4.36         0          12.4   0         0        0        0
#>  2            4.36         0          17.1   0         0        0        1
#>  3           10.7          1          14.7   0.0686   44.2      0        4
#>  4            4.36         0          14.7   1.38     58.2      0        7
#>  5           10.2          0          16.7   0        13.6      0        0
#>  6           11.4          1          13.5  34.7      27.8      0        6
#>  7            4.36         0          13.1   0         0        0        0
#>  8           13.5          1          16.1  22.5      56.0      0        0
#>  9            4.36         0          16.6  10.9      49.2      0        5
#> 10            4.36         0          14.9   1.16      7.71     0        5
#> # ℹ 286 more rows
#> # ℹ 2 more variables: bord_vital <dbl>, host_resource <dbl>

How did we do?

summary(M1 <- lm(log_radius_area ~ territory + host_resource + host_mt + 
                   host_for + water + bord_vital + log_jointsize + cwpceyrs, 
                 data = Data))
#> 
#> Call:
#> lm(formula = log_radius_area ~ territory + host_resource + host_mt + 
#>     host_for + water + bord_vital + log_jointsize + cwpceyrs, 
#>     data = Data)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -6.192 -2.479 -1.190  2.768  9.494 
#> 
#> Coefficients:
#>                Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)    4.496996   1.948533   2.308 0.021717 *  
#> territory      1.624430   0.412026   3.943 0.000101 ***
#> host_resource  0.984149   0.431537   2.281 0.023307 *  
#> host_mt        0.025871   0.007775   3.328 0.000990 ***
#> host_for      -0.012035   0.007177  -1.677 0.094620 .  
#> water         -0.600507   0.558129  -1.076 0.282863    
#> bord_vital    -0.809148   0.407514  -1.986 0.048032 *  
#> log_jointsize  0.135650   0.125150   1.084 0.279319    
#> cwpceyrs      -0.040498   0.012939  -3.130 0.001928 ** 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 3.395 on 287 degrees of freedom
#> Multiple R-squared:  0.1448, Adjusted R-squared:  0.121 
#> F-statistic: 6.074 on 8 and 287 DF,  p-value: 3.154e-07

Got it. Don’t worry about the difference in the test statistics as we’re not going with robust standard errors for the sake of this exrecise. I explain a bit about what those are here.

Unpacking the Model Output

R presents a summary of the residuals at the top, though our eyes should go right to the coefficients first (the extent to which we are interested in hypothesis-testing). In Alex’ case, he’s interested in explaining the geographic spread of militarized interstate disputes through a variety of factors. He’s interested first, it seems, in the pernicious effect of territorial disputes. He hypothesizes that disputes over the allocation of territory are more likely to be dispersed over a geographic area than disputes over any other issues. The coefficient for that territory variable is 1.624. The standard error is .412. Dividing the coefficient over the standard error for 287 degrees of freedom results in a test statistic of 3.943. If there were truly no difference between disputes over territory and disputes over any other issue, the probability we observed what we observed is .000101. Thus, we reject the null hypothesis as incompatible with the data we observed and suggest Alex’ hypothesized relationship is closer to what it truly is. Since the dependent variable is log-transformed, you’ll want to think of some creative ways to express this relationship. I discuss these here. For example:

Toggling a dispute from over some other issue to the allocation of territory multiplies the geographic spread (as measured in the radius from the center of the dispute to its furthest incident) by exp(1.624) (about 5.075).
Toggling a dispute from over some other issue to the allocation of territory results in a (exp(1.624) - 1)*100 (about 407.33) percent increase in the geographic spread of the dispute (as measured by the radius from the center of the dispute to its furthest incident).

Since the coefficient isn’t close to 0, I would advise against using the rule of thumb of multiplying the coefficient by 100. You could, however, do that with the forest variable (host_for). Alex doesn’t argue this point with any obvious emphasis, but his treatment on pp. 510-11 implies that forest would be impassable terrain. This should constrain the spread of the dispute and a negative relationship should emerge. He measures this by the percentage of the territory that is covered in forest. The coefficient for this is -.012 and is significant at the .10 level in this model without robust standard errors. Thus, you can communicate the relative change, percentage change, and percentage change approximations by the following ways:

exp(-0.012035)-1       # relative change
#> [1] -0.01196287
(exp(-0.012035)-1)*100 # percentage change
#> [1] -1.196287
-0.012035*100          # percentage change approximation
#> [1] -1.2035

You can do something similar for the mountainous terrain variable, even as this has a different effect than the one hypothesized by Alex.

exp(0.025871)-1       # relative change
#> [1] 0.02620856
(exp(0.025871)-1)*100 # percentage change
#> [1] 2.620856
0.025871*100          # percentage change approximation
#> [1] 2.5871

If this is a bit much for you at this level, that’s cool too. You can just note what’s significant and in what direction and say something like “a one-unit change in the percentage of the territory that is covered in mountains coincides with an estimated change of .025 in the logged value of the radius between the center of the dispute to the furthest incident.” I’d be remiss, though, if I didn’t give you things to consider.

We should also make an obligatory reference to the stuff included at the bottom of the output. Namely, the adjusted R-squared suggests the model accounts for about 12% of the variation in geographic spread. We’ve also comfortably beat the mean-only model (see: the F-stat).

Again, with the Diagnostics

Let’s inspect the model output as we did in the previous lesson. Remember: the fitted-residual plot is the most “bang for your buck” linear model diagnostic. It’s where you should start.

plot(M1, which = 1)

Oh boy… that’s not good. Let’s pretty it up bit, just for presentation and added exposure to the augment() function in {broom}.

broom::augment(M1) %>%
  ggplot(.,aes(.fitted, .resid)) +
  geom_point(pch = 21) +
  theme_steve(style='generic') +
  geom_hline(yintercept = 0, linetype="dashed", color="red") +
  geom_smooth(method = "loess")
#> `geom_smooth()` using formula = 'y ~ x'

Remember what you should want to see in a fitted-residual plot. Know this isn’t it. That said, here’s what you can discern this is telling you and what it might mean for things you’d want to do with your model.

The first thing that grabs your attention is the clear line. Generally, a line like this of any kind in your fitted-residual plot is suggesting some kind of discrete pattern in the DV. Let’s see this for ourselves.

Data %>% ggplot(.,aes(log_radius_area)) + 
  geom_histogram() +
  theme_steve(style='generic')
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Data %>% arrange(log_radius_area)
#> # A tibble: 296 × 9
#>    log_radius_area territory log_jointsize host_for host_mt water cwpceyrs
#>              <dbl>     <dbl>         <dbl>    <dbl>   <dbl> <dbl>    <dbl>
#>  1            3.83         1          11.8    75.1    47.0      0        2
#>  2            4.36         0          12.4     0       0        0        0
#>  3            4.36         0          17.1     0       0        0        1
#>  4            4.36         0          14.7     1.38   58.2      0        7
#>  5            4.36         0          13.1     0       0        0        0
#>  6            4.36         0          16.6    10.9    49.2      0        5
#>  7            4.36         0          14.9     1.16    7.71     0        5
#>  8            4.36         0          14.7     0       8.60     0        0
#>  9            4.36         0          12.7    34.0    49.8      1        1
#> 10            4.36         0          13.8    86.8     4.70     0        0
#> # ℹ 286 more rows
#> # ℹ 2 more variables: bord_vital <dbl>, host_resource <dbl>

When you see something like this, it’s suggesting you have some kind of “hurdle” component to the data-generating process. There is a separate process that determines that big bar you see and then a separate one determining the variation in the flatter distribution to the right of it. Braithwaite’s third footnote is pointing to his 2005 data article in International Interactions that might provide more context for why this value of 4.363605 occurs so much in his data. I do not at all expect you to know exactly what to do under these circumstances. That said, here’s about what I’d expect you to say: “the fitted-residual plot suggests a clear discrete pattern in the DV and the histogram suggests two separate data-generating processes. Both strongly imply the linear model is a questionable fit for the data.”

The second thing that grabs your attention (maybe?) is the disagreement between the LOESS smoother and the flat line at 0. I don’t know if it’s no. 2, but it’s what we’ll talk about next. When you see something like this, it’s suggesting some kind of non-linear relationship in the data. It comes with three major caveats though. The first is that the LOESS smoother will always come off the line at 0 near the tails of the fitted-residual plot. Thus, we really shouldn’t say something with any authority yet. The second is that we obviously have the discrete clumping we noted above. That might be an important part of this. The third is that we 100% won’t know where until we looked at something a bit more informative. linloess_plot() in {stevemisc} will do this.

linloess_plot(M1, pch=21) +
  theme_steve(style='generic')
#> Warning: attributes are not identical across measure variables; they will be
#> dropped
#> `geom_smooth()` using formula = 'y ~ x'
#> `geom_smooth()` using formula = 'y ~ x'

Binary IVs will never be the issue here. It’s the “vital border” variable that looks like it has some weirdness. Based on what I remember of Starr (2002) and how Alex describes this on p. 514, it would suggest that the 0s are probably a distinct phenomenon that should be focused on in some detail. Beyond that, it really seems like the fitted-residual plot’s LOESS smoother is really just drawing attention to the “hurdle” component of the DV.

The third thing you can tell from the fitted-residual plot is that the variance of the residuals are clearly fanning out as the fitted values increase. That’s as tell-tale a sign of heteroskedasticity as you’re going to see. It’s also why Alex estimated this model with robust standard errors. If I remember Stata correctly, this is type “HC1”.

lmtest::coeftest(M1,  sandwich::vcovHC(M1,type='HC1'))
#> 
#> t test of coefficients:
#> 
#>                 Estimate Std. Error t value  Pr(>|t|)    
#> (Intercept)    4.4969959  1.6694475  2.6937 0.0074823 ** 
#> territory      1.6244297  0.4240554  3.8307 0.0001570 ***
#> host_resource  0.9841490  0.4425418  2.2239 0.0269364 *  
#> host_mt        0.0258709  0.0072926  3.5475 0.0004540 ***
#> host_for      -0.0120354  0.0067672 -1.7785 0.0763824 .  
#> water         -0.6005071  0.5610364 -1.0704 0.2853596    
#> bord_vital    -0.8091477  0.3978571 -2.0338 0.0428950 *  
#> log_jointsize  0.1356500  0.1122895  1.2080 0.2280269    
#> cwpceyrs      -0.0404980  0.0103090 -3.9284 0.0001072 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

I don’t expect you to know what this does, but I’ve pointed you in the direction of what this does on my blog. Don’t sweat those details now. Just know it’s what a lot of practitioners do in the presence of non-constant error variance (heteroskedasticity).

Finally, we can check for whether the residuals are normally distributed. This assumption has no bearing for our test statistics or our line of best fit, but it’s nice to estimate a model that can reasonably fit the data.

plot(M1, which=2)

rd_plot(M1) + theme_steve(style='generic')

Again, a plot like this is wanting to alert us to the “hurdle” component of the DV. Rather than have normally distributed residuals, or something reasonably approximating it, we have two such peaks in our data. Non-normally distributed residuals strongly imply a non-normally distributed DV. We already knew we had that in this case.