I was basically done with this blog post when I came across Matti Vuorre’s post on the same exact topic. Matti goes into *all* the details, and really the present post can be seen as a brief account of all the cool things the probit-approach-to-SDT can do. I’m only posting this here because I really like my plots 🤷

Previously, we’ve seen that for data from a binary decision signal detection task, we can use a *probit binomial* regression model (like a logistic regression, but with a probit link function) to estimate the two main parameters of *signal detection theory* (SDT): the sensitivity and the bias.

In this post I would like to show how this idea can be extended to multiple response SDT tasks by using an *ordinal probit* regression model.

Imagine the following task: after being presented with 20 images of dogs, you are presented with 300 test images of dogs, and you have to decide for each dog whether it appeared in the training set (“Old”) or not (“New”).

In a binary decision task, you would simply indicate “New” or “Old”, but in this task you have multiple response options - from 1 to 6, with 1 = “Feels New” and 6 = “Feels Old”. We can call this scale a “feelings numbers” scale.

After going over all 300 photos, you have the following data:

`STD_data`

```
# A tibble: 12 × 3
   Truth Response        N
   <fct> <ord>       <dbl>
 1 New   Confidence1    35
 2 New   Confidence2    31
 3 New   Confidence3    26
 4 New   Confidence4    22
 5 New   Confidence5    19
 6 New   Confidence6    17
 7 Old   Confidence1    14
 8 Old   Confidence2    20
 9 Old   Confidence3    22
10 Old   Confidence4    27
11 Old   Confidence5    32
12 Old   Confidence6    35
```

where *N* is the number of responses for each combination of condition and response level.
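If you'd like to follow along at home, here is one way to reconstruct the table above (this construction is my own sketch; only the name `STD_data` and its contents come from the post):

```r
library(tibble)

# Rebuild the response-count table shown above; Response must be an
# *ordered* factor for the ordinal model used later.
STD_data <- tibble(
  Truth    = factor(rep(c("New", "Old"), each = 6)),
  Response = ordered(rep(paste0("Confidence", 1:6), times = 2)),
  N        = c(35, 31, 26, 22, 19, 17,
               14, 20, 22, 27, 32, 35)
)
```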

We can use Siegfried Macho’s R code to extract the SDT parameters. In this case, they are:

- *Sensitivity* - the distance between the two (latent) normal distributions. The further apart they are, the more “distinguishable” the Old and New images are from each other.
- *5 Thresholds* - one between each pair of consecutive possible responses. Perceived “stimulation” above a threshold leads to a response in the category above it.

(These will probably make sense when we present them visually below.)
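(As a quick refresher on the binary case: with a single threshold, the sensitivity and the criterion fall straight out of the probit transform. The rates below are made up for illustration, not taken from the data above.)

```r
# Toy binary-SDT computation: recover d' and the criterion from
# hit and false-alarm rates using the probit (qnorm).
hit_rate <- 0.80  # P(respond "Old" | Old)
fa_rate  <- 0.20  # P(respond "Old" | New)

d_prime   <- qnorm(hit_rate) - qnorm(fa_rate)  # distance between the latent distributions
criterion <- -qnorm(fa_rate)                   # threshold on the latent axis (Noise is N(0, 1))

round(c(d_prime = d_prime, criterion = criterion), 2)  # d' = 1.68, criterion = 0.84
```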

First, we’ll model this with classical SDT:

```
SDT_equal <- SDT.Estimate(
  data = STD_data[["N"]],
  test = TRUE,
  # We have 2 options: Old / New; we'll assume equal variance
  n = list(n.sdt = 2, restriction = "equalvar")
)

SDT.Statistics(SDT_equal)[["Free.parameters"]]
```

```
         Value    SE CFI-95(Lower) CFI-95(Upper)
Mean[2]  0.564 0.040         0.486         0.642
t-1     -0.744 0.034        -0.810        -0.678
t-2     -0.165 0.031        -0.226        -0.104
t-3      0.267 0.031         0.206         0.329
t-4      0.707 0.033         0.643         0.772
t-5      1.260 0.036         1.189         1.331
```

```
library(dplyr) # 1.1.0
library(tidyr) # 1.3.0
library(ordinal) # 2022.11.16
library(parameters) # 0.20.2
library(ggplot2) # 3.4.0
library(patchwork) # 1.1.2
```

We can also model these data with a probit cumulative ordinal model, predicting the *Response* from the *Truth*:

- The slope of *Truth* indicates the effect the true image identity had on the response pattern - this is the **sensitivity**.
- The intercepts of the ordinal model are the **thresholds**.

```
m_equal <- clm(Response ~ Truth,
  data = STD_data,
  weights = N,
  link = "probit"
)

parameters::model_parameters(m_equal) |>
  insight::print_html()
```

Model Summary

| Parameter | Coefficient | SE | 95% CI | z | p |
|---|---|---|---|---|---|
| **Intercept** | | | | | |
| Confidence1\|Confidence2 | -0.74 | 0.10 | (-0.94, -0.54) | -7.21 | < .001 |
| Confidence2\|Confidence3 | -0.16 | 0.10 | (-0.35, 0.02) | -1.72 | 0.085 |
| Confidence3\|Confidence4 | 0.27 | 0.10 | (0.08, 0.46) | 2.81 | 0.005 |
| Confidence4\|Confidence5 | 0.71 | 0.10 | (0.51, 0.90) | 7.08 | < .001 |
| Confidence5\|Confidence6 | 1.26 | 0.11 | (1.05, 1.48) | 11.38 | < .001 |
| **Location Parameters** | | | | | |
| Truth (Old) | 0.57 | 0.12 | (0.33, 0.81) | 4.65 | < .001 |

As we can see, the estimated values are identical!^{1}
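To see how these estimates map onto behavior, here is a small sketch (my own, using rounded values from the table above) of how the cumulative probit turns the thresholds and the *Truth* effect into predicted response probabilities:

```r
# Category probabilities in a cumulative probit model are differences of
# the normal CDF evaluated at the thresholds (rounded estimates from above).
thresholds <- c(-0.74, -0.16, 0.27, 0.71, 1.26)
eta_old <- 0.57  # latent mean shift for Truth = "Old"

p_new <- diff(pnorm(c(-Inf, thresholds, Inf)))                  # Truth = "New"
p_old <- diff(pnorm(c(-Inf, thresholds, Inf), mean = eta_old))  # Truth = "Old"

# Both rows are proper probability distributions over the 6 responses:
rowSums(rbind(p_new, p_old))  # both exactly 1
```

Note that, just as in the observed counts, “New” items are most likely to get a *Confidence1* response, and “Old” items a *Confidence6* response.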

The advantage of the probit ordinal model is that it is easier to build this model up:

- Add predictors of sensitivity (interactions with `Truth`)
- Add predictors of bias (main effects / intercept effects)
- Add random effects (with `ordinal::clmm()`)

(*See Matti’s post for actual examples!*)

```
mean2 <- coef(m_equal)[6]
Thresholds <- coef(m_equal)[1:5]

ggplot() +
  # Noise
  stat_function(aes(linetype = "Noise"), fun = dnorm,
                linewidth = 1) +
  # Noise + Signal
  stat_function(aes(linetype = "Noise + Signal"), fun = dnorm,
                args = list(mean = mean2),
                linewidth = 1) +
  # Thresholds
  geom_vline(aes(xintercept = Thresholds, color = names(Thresholds)),
             linewidth = 2) +
  scale_color_brewer("Threshold", type = "div", palette = 2,
                     labels = paste0(1:5, " | ", 2:6)) +
  labs(y = NULL, linetype = NULL, x = "Obs. signal") +
  expand_limits(x = c(-3, 3), y = 0.45) +
  theme_classic()
```

The standard model of SDT assumes that the *Noise* and the *Noise + Signal* distributions differ only in their mean; that is, that *N+S* is a shifted *N* distribution. This is almost always not true - in recognition data, the *N+S* distribution is typically more variable than the *N* distribution.

To deal with this, we can also estimate the variance of the *N+S* distribution.

First, with the classic SDT model:

```
SDT_unequal <- SDT.Estimate(
  data = STD_data[["N"]],
  test = TRUE,
  # We have 2 options: Old / New; not assuming equal variance
  n = list(n.sdt = 2, restriction = "no")
)

SDT.Statistics(SDT_unequal)[["Free.parameters"]]
```

```
            Value    SE CFI-95(Lower) CFI-95(Upper)
Mean[2]     0.552 0.041         0.473         0.632
Stddev[2]   0.960 0.035         0.891         1.029
t-1        -0.728 0.036        -0.799        -0.658
t-2        -0.159 0.032        -0.221        -0.097
t-3         0.266 0.031         0.205         0.327
t-4         0.696 0.034         0.629         0.763
t-5         1.235 0.043         1.151         1.318
```

And with a probit ordinal regression, allowing the latent scale to vary:

```
m_unequal <- clm(Response ~ Truth,
  scale = ~ Truth, # the scale of the latent distribution depends on Truth
  data = STD_data,
  weights = N,
  link = "probit"
)

parameters::model_parameters(m_unequal) |>
  insight::print_html()
```

Model Summary

| Parameter | Coefficient | SE | 95% CI | z | p |
|---|---|---|---|---|---|
| **Intercept** | | | | | |
| Confidence1\|Confidence2 | -0.72 | 0.11 | (-0.93, -0.50) | -6.51 | < .001 |
| Confidence2\|Confidence3 | -0.16 | 0.10 | (-0.34, 0.03) | -1.61 | 0.107 |
| Confidence3\|Confidence4 | 0.27 | 0.10 | (0.08, 0.45) | 2.80 | 0.005 |
| Confidence4\|Confidence5 | 0.69 | 0.10 | (0.49, 0.90) | 6.66 | < .001 |
| Confidence5\|Confidence6 | 1.23 | 0.13 | (0.98, 1.49) | 9.50 | < .001 |
| **Location Parameters** | | | | | |
| Truth (Old) | 0.55 | 0.12 | (0.31, 0.80) | 4.49 | < .001 |
| **Scale Parameters** | | | | | |
| Truth (Old) | -0.05 | 0.12 | (-0.29, 0.19) | -0.42 | 0.677 |

The scale parameter needs to be back-transformed to get the SD of the *N+S* distribution: exp(-0.05) ≈ 0.96, and so once again the estimated values are identical!

```
mean2 <- coef(m_unequal)[6]
sd2 <- exp(coef(m_unequal)[7])
Thresholds <- coef(m_unequal)[1:5]

ggplot() +
  # Noise
  stat_function(aes(linetype = "Noise"), fun = dnorm,
                linewidth = 1) +
  # Noise + Signal
  stat_function(aes(linetype = "Noise + Signal"), fun = dnorm,
                args = list(mean = mean2, sd = sd2),
                linewidth = 1) +
  # Thresholds
  geom_vline(aes(xintercept = Thresholds, color = names(Thresholds)),
             linewidth = 2) +
  scale_color_brewer("Threshold", type = "div", palette = 2,
                     labels = paste0(1:5, " | ", 2:6)) +
  labs(y = NULL, linetype = NULL, x = "Obs. signal") +
  expand_limits(x = c(-3, 3), y = 0.45) +
  theme_classic()
```

An additional check we can perform is whether the various responses are indeed the product of a single ROC curve. We do this by plotting the ROC curve on an inverse-normal scale (that is, converting the probabilities into normal quantiles). Points that fall on a straight line indicate that they are part of the same curve.

```
pred_table <- data.frame(Truth = c("Old", "New")) |>
  mutate(predict(m_unequal, newdata = pick(everything()), type = "prob")[[1]] |> as.data.frame()) |>
  tidyr::pivot_longer(starts_with("Confidence"), names_to = "Response") |>
  tidyr::pivot_wider(names_from = Truth)
```

```
ROC_table <- pred_table |>
  rows_append(data.frame(New = 0, Old = 0)) |>
  mutate(
    Sensitivity = lag(cumsum(New), default = 0),
    Specificity = rev(cumsum(rev(Old)))
  )

p_roc <- ggplot(ROC_table, aes(Sensitivity, Specificity)) +
  geom_line() +
  geom_abline(slope = 1, intercept = 1, linetype = "dashed") +
  geom_point(aes(color = ordered(Response)), size = 2) +
  expand_limits(x = c(0, 1), y = c(0, 1)) +
  scale_x_continuous(trans = "reverse") +
  scale_color_brewer("Threshold", type = "div", palette = 2,
                     labels = paste0(1:5, " | ", 2:6),
                     na.translate = FALSE) +
  labs(color = NULL) +
  theme_classic()

p_zroc <- ROC_table |>
  tidyr::drop_na(Response) |>
  ggplot(aes(qnorm(Sensitivity), qnorm(Specificity))) +
  geom_line() +
  geom_point(aes(color = ordered(Response)), size = 2) +
  expand_limits(x = c(0, 1), y = c(0, 1)) +
  scale_x_continuous(trans = "reverse") +
  scale_color_brewer("Threshold", type = "div", palette = 2,
                     labels = paste0(1:5, " | ", 2:6),
                     na.translate = FALSE) +
  labs(color = NULL, x = "Z(Sensitivity)", y = "Z(Specificity)") +
  theme_classic()

p_roc + p_zroc + plot_layout(guides = "collect")
```

`Warning: Removed 1 rows containing missing values (`geom_point()`).`
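(A side note, stated here as a general SDT fact rather than something taken from the post: in the standard zROC parameterization - z(hit rate) plotted against z(false-alarm rate) - the slope of the line equals the ratio of the latent standard deviations, sd(*N*) / sd(*N+S*). With the estimate of roughly 0.96 for the *N+S* SD above, we would expect a slope just above 1.)

```r
# Expected zROC slope implied by the unequal-variance fit above
# (the SD of the Noise distribution is fixed at 1).
sd_noise  <- 1
sd_signal <- 0.96  # rounded estimate from the model above

zroc_slope <- sd_noise / sd_signal
round(zroc_slope, 2)  # about 1.04
```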

Although the `clmm()` function allows for multilevel probit regression, it does not^{2} support a varying scale parameter.

Alas, we *must* use `brms`.

```
library(brms) # 2.18.0
library(ggdist) # 3.2.1.9000
library(tidybayes) # 3.0.3
```

In `brms` we will use the `cumulative()` family, which has a family parameter called `disc` - the *discrimination*, which is the inverse of the standard deviation of the latent distribution. I will set some weak priors on the mean and discrimination of the *N+S* distribution, and I will also fix the standard deviation of the *N* distribution to 1 (on the log scale, 0) using the `constant(0)` prior.

```
b_formula <- bf(Response | weights(N) ~ Truth,
                disc ~ 0 + Intercept + Truth)

b_priors <-
  set_prior("normal(0, 3)", coef = "TruthOld") +
  set_prior("normal(0, 1.5)", coef = "TruthOld", dpar = "disc") +
  set_prior("constant(0)", coef = "Intercept", dpar = "disc")

Bayes_mod <- brm(b_formula,
  prior = b_priors,
  family = cumulative(link = "probit", link_disc = "log"),
  data = STD_data,
  backend = "cmdstanr",
  refresh = 0
)
```

```
model_parameters(Bayes_mod, test = NULL) |>
  insight::print_html()
```

Model Summary

| Parameter | Median | 95% CI | Rhat | ESS |
|---|---|---|---|---|
| Intercept(1) | -0.74 | (-0.95, -0.53) | 1.002 | 2755.00 |
| Intercept(2) | -0.17 | (-0.36, 0.02) | 1.000 | 4651.00 |
| Intercept(3) | 0.27 | (0.08, 0.46) | 1.001 | 4694.00 |
| Intercept(4) | 0.70 | (0.50, 0.91) | 1.001 | 3453.00 |
| Intercept(5) | 1.26 | (1.01, 1.52) | 1.002 | 2442.00 |
| TruthOld | 0.56 | (0.32, 0.82) | 1.000 | 3252.00 |
| disc_Intercept | 0.00 | (0.00, 0.00) | | |
| disc_TruthOld | 0.02 | (-0.21, 0.23) | 1.003 | 2098.00 |

```
criteria <- gather_rvars(Bayes_mod, b_Intercept[Response])

signal_dist <- spread_draws(Bayes_mod, b_TruthOld, b_disc_TruthOld) |>
  # disc is the *inverse* of the latent SD, so back-transform accordingly
  mutate(sd_signal = 1 / exp(b_disc_TruthOld)) |>
  group_by(.draw) |>
  reframe(
    x = seq(-3, 3, length = 20),
    d = dnorm(x, mean = b_TruthOld, sd = sd_signal)
  ) |>
  ungroup() |>
  curve_interval(.along = x, .width = 0.9)

ggplot() +
  # Noise
  stat_function(aes(linetype = "Noise"), fun = dnorm) +
  # Noise + Signal
  geom_ribbon(aes(x = x, ymin = .lower, ymax = .upper),
              data = signal_dist,
              fill = "grey", alpha = 0.4) +
  geom_line(aes(x, d, linetype = "Noise + Signal"), data = signal_dist) +
  # Thresholds
  stat_slab(aes(xdist = .value, fill = ordered(Response)),
            color = "gray", alpha = 0.6, key_glyph = "polygon",
            data = criteria) +
  # Theme and scales
  scale_fill_brewer("Threshold", type = "div", palette = 2,
                    labels = paste0(1:5, " | ", 2:6),
                    na.translate = FALSE) +
  labs(color = NULL, linetype = NULL, x = "Obs. signal", y = NULL) +
  theme_classic()
```

This gives roughly the same results as `clm()`, but would allow for multilevel modeling of both the location and the scale of the latent variable.

1. Though the CIs are somewhat wider.↩︎

2. as of 2022-10-06↩︎

For some reason, the top Google results for *“ggplot2 pie chart”* show some very convoluted code to accomplish what should be easy:

- Make slices
- Add labels to the middle of those slices

Instead, let’s look at the easy way - with `position_stack()`!

We first need some data fit for a pie chart - a column for slice label, and a column for their (preferably relative) size.

```
d <- data.frame(
  Slices = c("Writing code", "Staring at plot", "Fixing code", "Enjoying plot") |> rep(2),
  Time = c(1, 5, 4, 2, 1, 1, 1, 2),
  When = c("Before reading this post", "After reading this post") |> rep(each = 4)
) |>
  transform(
    # Make time relative
    Time_relative = Time / ave(Time, When, FUN = sum),
    Slices = factor(Slices, levels = unique(Slices)),
    When = factor(When, levels = unique(When))
  )

d
```

```
#> Slices Time When Time_relative
#> 1 Writing code 1 Before reading this post 0.08333333
#> 2 Staring at plot 5 Before reading this post 0.41666667
#> 3 Fixing code 4 Before reading this post 0.33333333
#> 4 Enjoying plot 2 Before reading this post 0.16666667
#> 5 Writing code 1 After reading this post 0.20000000
#> 6 Staring at plot 1 After reading this post 0.20000000
#> 7 Fixing code 1 After reading this post 0.20000000
#> 8 Enjoying plot 2 After reading this post 0.40000000
```

```
library(ggplot2)

ggplot(d, aes(x = 1, y = Time_relative, fill = Slices)) +
  facet_grid(cols = vars(When)) +
  # Make pie
  coord_polar(theta = "y") +
  # Add the *stacked* columns
  geom_col(position = position_stack(reverse = TRUE),
           color = "tan3", linewidth = 3, show.legend = FALSE) +
  # Add labels to the *stacked* position,
  # in the middle of the column (vjust = 0.5)
  geom_text(aes(label = Slices),
            position = position_stack(vjust = 0.5, reverse = TRUE)) +
  # Make it a pizza pie!
  see::scale_fill_pizza_d() +
  theme_void() +
  labs(title = "Relative time spent building piecharts with ggplot2")
```

*Best served HOT!*

Let’s look at the following plot:

```
library(dplyr)

bfi <- psychTools::bfi %>%
  mutate(
    O = across(starts_with("O")) %>% rowMeans(na.rm = TRUE),
    C = across(starts_with("C")) %>% rowMeans(na.rm = TRUE),
    E = across(starts_with("E")) %>% rowMeans(na.rm = TRUE),
    A = across(starts_with("A")) %>% rowMeans(na.rm = TRUE),
    N = across(starts_with("N")) %>% rowMeans(na.rm = TRUE)
  ) %>%
  mutate(
    gender = factor(gender, labels = c("Man", "Woman")),
    education = factor(education, labels = c("HS", "finished HS", "some college", "college graduate", "graduate degree"))
  ) %>%
  select(gender, education, age, O:N) %>%
  tidyr::drop_na(education) %>%
  # multiply the data set
  sample_n(size = 10000, replace = TRUE) %>%
  # and add some noise
  mutate(across(O:N, \(x) x + rnorm(x, 0, sd(x))))
```

```
library(ggplot2)
theme_set(theme_bw())

base_plot <- ggplot(bfi, aes(age, O, color = education)) +
  facet_wrap(facets = vars(gender)) +
  coord_cartesian(ylim = c(1, 6)) +
  scale_color_viridis_d()

base_plot +
  geom_point(shape = 16, alpha = 0.1) +
  geom_smooth(se = FALSE)
```

This is a busy plot. It’s hard to see what each prediction line is doing because there are so many of them, and it’s hard to make out the scatter plot behind the lines because there are so many dots.

We might be tempted to pre-process the data in some way and pass it to each layer via the `data=` argument of `geom_point()`/`geom_smooth()`. However, that gets ugly fast (did I clean the outliers in all the datasets for this plot?).

Instead, we can take advantage of the fact that the `data=` argument can take a **function** that pre-processes the plot data before plotting:

> A `function` will be called with a single argument, the plot data. The return value must be a `data.frame`, and will be used as the layer data. A `function` can be created from a formula (e.g. `~ head(.x, 10)`).

Let’s look at two examples.

First, we can pass a filtering function (here as a formula) to only get the prediction lines for two categories.

```
library(dplyr)

base_plot +
  geom_point(shape = 16, alpha = 0.1) +
  geom_smooth(
    se = FALSE,
    # looking only at HS vs college graduates
    data = ~ filter(.x, education %in% c("finished HS", "college graduate"))
  )
```

Second, we can randomly select only some of the points (note that the prediction lines are still based on the *full* dataset).

```
base_plot +
  geom_point(
    shape = 16, alpha = 0.2,
    # sample 10% from each group of `education` + `gender`
    data = ~ group_by(.x, education, gender) %>% sample_frac(0.1)
  ) +
  geom_smooth(se = FALSE) +
  # Let the reader know what you've done
  labs(caption = "Scatter plot contains only 10% of the data for ease of viewing.
Prediction lines are based on the full dataset.")
```

You might also be interested in `gghighlight` for plotting all of the data while highlighting only some of it.

The ANOVA is part of a wider family of statistical procedures that includes ANCOVA (which incorporates continuous predictors) and Analysis of Deviance (which allows for non-continuous outcomes^{1}). This family of procedures all produce an ANOVA table (or an ANOVA-like table) which summarizes the relationship between the underlying model and the outcome by partitioning the variation in the outcome into components that can be uniquely attributed to different sources, according to the *law of total variance*. Essentially, each of the model’s terms is represented by a line in the ANOVA table, answering the question: *how much of the variation in the outcome can be attributed to the variation in this term?*^{2} Where applicable, each source of variance has an accompanying test statistic (often *F*), sometimes called the omnibus test, which indicates the significance of the variance attributable to that term, often accompanied by some measure of effect size.
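The partitioning idea can be seen in miniature with simulated data (my own sketch, not the data used in this post): for a linear model with an intercept, the total sum of squares splits exactly into a model part and a residual part.

```r
# Law-of-total-variance partition for a simple linear model:
# SS_total = SS_model + SS_residual.
set.seed(1)
x <- rnorm(50)
y <- 2 * x + rnorm(50)
fit <- lm(y ~ x)

SS_total    <- sum((y - mean(y))^2)
SS_model    <- sum((fitted(fit) - mean(y))^2)
SS_residual <- sum(residuals(fit)^2)

all.equal(SS_total, SS_model + SS_residual)
#> [1] TRUE
```

The various “types” of ANOVA tables discussed below are different ways of carving the model part up between the terms.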

In this post I will demonstrate the various types of ANOVA tables, how R does ANOVA (what the defaults are, and how to produce alternatives).

Along the way, I hope to illustrate the applicability of ANOVA tables to other types of models - besides the classical case of a maximal model (all main effects and interactions) with strictly categorical predictors and a continuous outcome.

Here are some assumptions I make about you, the reader, in this post:

- You’re familiar with the ideas of multi-factor ANOVAs (what a main effect is, what interactions are…).
- You know some `R` - how to fit a linear model, how to wrangle some data. ~~You are IID and normally distributed.~~

Let’s dive right in!

```
d <- tibble::tribble(
  ~id, ~group,  ~X, ~Z,       ~Rx, ~condition,     ~Y,
   1L,   "Gb", 102, 1L, "Placebo",       "Ca", 584.07,
   2L,   "Ga",  52, 1L, "Placebo",       "Cb", 790.29,
   3L,   "Gb", 134, 2L, "Dose100",       "Ca", 875.76,
   4L,   "Gb", 128, 3L, "Dose100",       "Cb", 848.37,
   5L,   "Ga",  78, 1L, "Dose250",       "Ca", 270.42,
   6L,   "Gb", 150, 2L, "Dose250",       "Cb", 999.87,
   7L,   "Ga",  73, 1L, "Placebo",       "Ca", 364.10,
   8L,   "Ga",  87, 7L, "Placebo",       "Cb", 420.84,
   9L,   "Gb", 115, 6L, "Dose100",       "Ca", 335.78,
  10L,   "Gb", 113, 4L, "Dose100",       "Cb", 627.00,
  11L,   "Gc", 148, 3L, "Dose250",       "Ca", 607.79,
  12L,   "Gc",  82, 3L, "Dose250",       "Cb", 329.32,
  13L,   "Ga", 139, 1L, "Placebo",       "Ca", 335.56,
  14L,   "Ga",  65, 2L, "Placebo",       "Cb", 669.04,
  15L,   "Gb", 139, 1L, "Dose100",       "Ca", 405.04,
  16L,   "Gc",  96, 1L, "Dose100",       "Cb", 367.15,
  17L,   "Gb",  50, 5L, "Dose250",       "Ca",  27.37,
  18L,   "Gc",  90, 2L, "Dose250",       "Cb", 468.69,
  19L,   "Ga",  90, 2L, "Placebo",       "Ca", 584.67,
  20L,   "Ga", 116, 2L, "Placebo",       "Cb", 277.71,
  21L,   "Gb",  78, 2L, "Dose100",       "Ca", 266.01,
  22L,   "Gb",  60, 1L, "Dose100",       "Cb",   0.04,
  23L,   "Gc", 112, 4L, "Dose250",       "Ca", 593.25,
  24L,   "Ga",  63, 4L, "Dose250",       "Cb", 512.26,
  25L,   "Ga",  89, 1L, "Placebo",       "Ca", 635.57,
  26L,   "Ga",  97, 2L, "Placebo",       "Cb", 468.69,
  27L,   "Gc",  76, 3L, "Dose100",       "Ca", 514.66,
  28L,   "Gb",  83, 1L, "Dose100",       "Cb", 264.87,
  29L,   "Gc",  84, 4L, "Dose250",       "Ca", 220.34,
  30L,   "Gb",  88, 1L, "Dose250",       "Cb", 216.54
)
```

```
d$id <- factor(d$id)
d$group <- factor(d$group)
d$Rx <- factor(d$Rx, levels = c("Placebo", "Dose100", "Dose250"))
d$condition <- factor(d$condition)
```

`dplyr::glimpse(d)`

```
#> Rows: 30
#> Columns: 7
#> $ id        <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
#> $ group     <fct> Gb, Ga, Gb, Gb, Ga, Gb, Ga, Ga, Gb, Gb, Gc, Gc, Ga, Ga, Gb, …
#> $ X         <dbl> 102, 52, 134, 128, 78, 150, 73, 87, 115, 113, 148, 82, …
#> $ Z         <int> 1, 1, 2, 3, 1, 2, 1, 7, 6, 4, 3, 3, 1, 2, 1, 1, 5, 2, 2, 2, …
#> $ Rx        <fct> Placebo, Placebo, Dose100, Dose100, Dose250, Dose250, Placeb…
#> $ condition <fct> Ca, Cb, Ca, Cb, Ca, Cb, Ca, Cb, Ca, Cb, Ca, Cb, Ca, Cb, Ca, …
#> $ Y         <dbl> 584.07, 790.29, 875.76, 848.37, 270.42, 999.87, 364.10,…
```

`m <- lm(Y ~ group + X, data = d)`

This is a multiple regression model with a covariate `X` and a 3-level factor `group`. We can summarize the results in a coefficient table (a.k.a. a “regression” table):

`summary(m)`

```
#>
#> Call:
#> lm(formula = Y ~ group + X, data = d)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -372.51 -166.47 -53.67 134.57 451.16
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 118.608 145.557 0.815 0.42256
#> groupGb -102.591 94.853 -1.082 0.28937
#> groupGc -92.384 107.302 -0.861 0.39712
#> X 4.241 1.504 2.820 0.00908 **
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 218.8 on 26 degrees of freedom
#> Multiple R-squared: 0.2383, Adjusted R-squared: 0.1504
#> F-statistic: 2.711 on 3 and 26 DF, p-value: 0.06556
```

Or we can produce an ANOVA table:

`anova(m)`

```
#> Analysis of Variance Table
#>
#> Response: Y
#> Df Sum Sq Mean Sq F value Pr(>F)
#> group 2 8783 4391 0.0918 0.912617
#> X 1 380471 380471 7.9503 0.009077 **
#> Residuals 26 1244265 47856
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

While the `group` term had 2 parameters in the coefficient table, it now has a single test with a `Df` of 2. This **omnibus** test can be thought of as representing the *total* significance of the two parameters combined!

By default, `R` calculates *type 1* sums of squares (SS) - also called *sequential SS*, because each term is attributed the portion of the variation in Y (represented by its SS) that has not yet been attributed to any of the *previous* terms! Thus, in our example, the effect of `X` represents only what `X` explains on top of what `group` has already explained - the variance attributed to `X` is strictly the variance that can be *uniquely* attributed to `X`, controlling for `group`. The effect of `group`, however, does *not* represent its unique contribution to Y’s variance, but instead its *total* contribution.

This means that although the following models have the same terms, they will produce different *type 1* ANOVA tables because those terms are *in a different order*:

`anova(lm(Y ~ group + X, data = d))`

```
#> Analysis of Variance Table
#>
#> Response: Y
#> Df Sum Sq Mean Sq F value Pr(>F)
#> group 2 8783 4391 0.0918 0.912617
#> X 1 380471 380471 7.9503 0.009077 **
#> Residuals 26 1244265 47856
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

`anova(lm(Y ~ X + group, data = d))`

```
#> Analysis of Variance Table
#>
#> Response: Y
#> Df Sum Sq Mean Sq F value Pr(>F)
#> X 1 325745 325745 6.8067 0.01486 *
#> group 2 63509 31754 0.6635 0.52353
#> Residuals 26 1244265 47856
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

We can recreate the ANOVA table above by building a sequence of models and comparing them (see [Judd, McClelland, & Ryan, 2017](https://doi.org/10.4324/9781315744131)):

```
m0 <- lm(Y ~ 1, data = d) # Intercept-only model
m1 <- lm(Y ~ group, data = d)
anova(m0, m1, m)
```

```
#> Analysis of Variance Table
#>
#> Model 1: Y ~ 1
#> Model 2: Y ~ group
#> Model 3: Y ~ group + X
#> Res.Df RSS Df Sum of Sq F Pr(>F)
#> 1 29 1633519
#> 2 27 1624737 2 8783 0.0918 0.912617
#> 3 26 1244265 1 380471 7.9503 0.009077 **
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

`anova(m) # Same SS values`

```
#> Analysis of Variance Table
#>
#> Response: Y
#> Df Sum Sq Mean Sq F value Pr(>F)
#> group 2 8783 4391 0.0918 0.912617
#> X 1 380471 380471 7.9503 0.009077 **
#> Residuals 26 1244265 47856
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

There is also *type 2 SS* - also called *simultaneous SS*, because each term is attributed the portion of the variation in Y that is not attributable to any of the other terms in the model - its unique contribution while controlling for the other terms. Type 2 SS can be obtained with the `Anova()` function from the {`car`} package:

`car::Anova(m, type = 2)`

```
#> Anova Table (Type II tests)
#>
#> Response: Y
#> Sum Sq Df F value Pr(>F)
#> group 63509 2 0.6635 0.523533
#> X 380471 1 7.9503 0.009077 **
#> Residuals 1244265 26
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

We can recreate the above ANOVA table by building two sequences of models, and comparing them:

```
m_sans_X <- lm(Y ~ group, data = d)
m_sans_group <- lm(Y ~ X, data = d)
anova(m_sans_group, m) # Same SS as the type 2 test for group
```

```
#> Analysis of Variance Table
#>
#> Model 1: Y ~ X
#> Model 2: Y ~ group + X
#> Res.Df RSS Df Sum of Sq F Pr(>F)
#> 1 28 1307774
#> 2 26 1244265 2 63509 0.6635 0.5235
```

`anova(m_sans_X, m) # Same SS as the type 2 test for X`

```
#> Analysis of Variance Table
#>
#> Model 1: Y ~ group
#> Model 2: Y ~ group + X
#> Res.Df RSS Df Sum of Sq F Pr(>F)
#> 1 27 1624737
#> 2 26 1244265 1 380471 7.9503 0.009077 **
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

Because the order of terms is usually of little importance, type 1 tests are rarely used in practice…

Unfortunately, they are R’s default…

Things get a bit more complicated when interactions are involved, as type 2 SS treat interactions differently than main effects:

`m_int <- lm(Y ~ group * X, data = d)`

Each main effect term is attributed with variance (its SS) that is unique to it and that is not attributable to any of the other main effects (simultaneously, as we’ve already seen) but *without* accounting for the variance attributable to interactions, while the SS of the interaction term represents its unique variance after accounting for the underlying main effects (sequentially). So we get the unique contribution of each main effect when controlling only for the other main effects, and the unique contribution of the interactions controlling for the already-included combined contribution of the main effects.

`car::Anova(m_int, type = 2)`

```
#> Anova Table (Type II tests)
#>
#> Response: Y
#> Sum Sq Df F value Pr(>F)
#> group 63509 2 1.3555 0.2768607
#> X 380471 1 16.2410 0.0004884 ***
#> group:X 682026 2 14.5566 7.246e-05 ***
#> Residuals 562240 24
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

We can again recreate this type 2 ANOVA table with model comparisons^{3}:

`anova(m_sans_group, m) # Same SS as the type 2 test for group`

```
#> Analysis of Variance Table
#>
#> Model 1: Y ~ X
#> Model 2: Y ~ group + X
#> Res.Df RSS Df Sum of Sq F Pr(>F)
#> 1 28 1307774
#> 2 26 1244265 2 63509 0.6635 0.5235
```

`anova(m_sans_X, m) # Same SS as the type 2 test for X`

```
#> Analysis of Variance Table
#>
#> Model 1: Y ~ group
#> Model 2: Y ~ group + X
#> Res.Df RSS Df Sum of Sq F Pr(>F)
#> 1 27 1624737
#> 2 26 1244265 1 380471 7.9503 0.009077 **
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

`anova(m, m_int) # Same SS as the type 2 test for group:X `

```
#> Analysis of Variance Table
#>
#> Model 1: Y ~ group + X
#> Model 2: Y ~ group * X
#> Res.Df RSS Df Sum of Sq F Pr(>F)
#> 1 26 1244265
#> 2 24 562240 2 682026 14.557 7.246e-05 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

In designs of higher order, each “order” is tested in a similar *simultaneous-sequential* manner. E.g., in a 3-way design, all main effects (1st order) are tested *simultaneously* (accounting for one another), **then** all 2-way interactions (2nd order) are tested *simultaneously* (accounting for the main effects *and* one another), and **then** the 3-way interaction is tested (accounting for all main effects and 2-way interactions).

There is another type of simultaneous SS - the type 3 test, which treats interactions and main effects equally: the SS for each main effect or interaction is calculated as its unique contribution that is not attributable to any of the other effects in the model - main effects or interactions. So the effect of `X` is its unique contribution while controlling both for `group` *and* for `group:X`!

However, remember that, as we saw previously, these methods in R actually produce omnibus tests for the combined effect of the *parameters of each term*. But in the `m_int` model, the parameters labeled `X`, `groupGb`, and `groupGc` no longer represent main effects - instead, they are parameters of simple (i.e., conditional) effects!

`summary(m_int)`

```
#>
#> Call:
#> lm(formula = Y ~ group * X, data = d)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -370.18 -67.15 33.68 111.35 172.14
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 833.130 173.749 4.795 6.99e-05 ***
#> groupGb -1308.898 233.218 -5.612 8.90e-06 ***
#> groupGc -752.718 307.747 -2.446 0.0222 *
#> X -4.041 1.942 -2.081 0.0482 *
#> groupGb:X 13.041 2.419 5.390 1.55e-05 ***
#> groupGc:X 7.731 3.178 2.432 0.0228 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 153.1 on 24 degrees of freedom
#> Multiple R-squared: 0.6558, Adjusted R-squared: 0.5841
#> F-statistic: 9.146 on 5 and 24 DF, p-value: 5.652e-05
```

- `X` is the slope of `X` when both dummy variables are fixed at 0 - that is, *when* `group = Ga` (`Ga` being the reference level).
- `groupGb` is the difference between `group = Ga` and `group = Gb` *when* `X = 0`.
- `groupGc` is the difference between `group = Ga` and `group = Gc` *when* `X = 0`.

(Pay attention to the “when” - this is what makes them conditional.)

We can see that changing the reference group changes the test for `X`:

```
d$group <- relevel(d$group, ref = "Gb")
m_int2 <- lm(Y ~ group * X, data = d)
car::Anova(m_int, type = 3)
```

```
#> Anova Table (Type III tests)
#>
#> Response: Y
#> Sum Sq Df F value Pr(>F)
#> (Intercept) 538630 1 22.9922 6.994e-05 ***
#> group 738108 2 15.7536 4.269e-05 ***
#> X 101495 1 4.3325 0.04823 *
#> group:X 682026 2 14.5566 7.246e-05 ***
#> Residuals 562240 24
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

`car::Anova(m_int2, type = 3)`

```
#> Anova Table (Type III tests)
#>
#> Response: Y
#> Sum Sq Df F value Pr(>F)
#> (Intercept) 219106 1 9.3528 0.005402 **
#> group 738108 2 15.7536 4.269e-05 ***
#> X 910646 1 38.8722 1.918e-06 ***
#> group:X 682026 2 14.5566 7.246e-05 ***
#> Residuals 562240 24
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

How can we resolve this?

By centering our predictors! Centering is transforming our data in such a way that 0 represents the overall mean. When this is done, conditioning on 0 is the same as conditioning on the overall mean = looking at the main effect!

For covariables this is easy enough:

`d$X_c <- d$X - mean(d$X) # or scale(d$X, center = TRUE, scale = FALSE)`

But how do we center a factor??

The answer is: use some type of orthogonal coding, for example `contr.sum()` (effects coding). This makes the coefficients harder to interpret^{4}, but we’re not looking at those anyway!

`contrasts(d$group) <- contr.sum`
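For reference, this is what sum-to-zero coding looks like for a 3-level factor - each column sums to zero, which is what makes the intercept the grand mean:

```r
# Sum (effects) coding for 3 levels: the columns sum to zero.
contr.sum(3)
#>   [,1] [,2]
#> 1    1    0
#> 2    0    1
#> 3   -1   -1
```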

Now when looking at type 3 tests, the main effects terms actually are main effects!

```
m_int3 <- lm(Y ~ group*X_c, data = d)
car::Anova(m_int3, type = 3)
```

```
#> Anova Table (Type III tests)
#>
#> Response: Y
#> Sum Sq Df F value Pr(>F)
#> (Intercept) 4743668 1 202.4902 3.401e-13 ***
#> group 19640 2 0.4192 0.66231
#> X_c 143772 1 6.1371 0.02067 *
#> group:X_c 682026 2 14.5566 7.246e-05 ***
#> Residuals 562240 24
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

Remember: **ABC - Always Be Centering** (your predictors) - type 3 ANOVA tables make little sense without centering.

Unfortunately, we can’t just build a model without an interaction term and use it to recreate the type 3 ANOVA. Instead, we need to actually build the model matrix (i.e., the design matrix), and drop the columns of each term in turn:

```
mm <- model.matrix(m_int2)
head(mm)
```

```
#> (Intercept) groupGa groupGc X groupGa:X groupGc:X
#> 1 1 0 0 102 0 0
#> 2 1 1 0 52 52 0
#> 3 1 0 0 134 0 0
#> 4 1 0 0 128 0 0
#> 5 1 1 0 78 78 0
#> 6 1 0 0 150 0 0
```

A type 3 test for a term is equal to the comparison between a model without the parameters associated with that term and the **full model**:

```
m_sans_group <- lm(Y ~ mm[,-(2:3)], data = d)
m_sans_X <- lm(Y ~ mm[,-4], data = d)
m_sans_int <- lm(Y ~ mm[,-(5:6)], data = d)
anova(m_sans_group, m_int2) # Same SS as type 3 for group
```

```
#> Analysis of Variance Table
#>
#> Model 1: Y ~ mm[, -(2:3)]
#> Model 2: Y ~ group * X
#> Res.Df RSS Df Sum of Sq F Pr(>F)
#> 1 26 1300348
#> 2 24 562240 2 738108 15.754 4.269e-05 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

`anova(m_sans_X, m_int2) # Same SS as type 3 for X`

```
#> Analysis of Variance Table
#>
#> Model 1: Y ~ mm[, -4]
#> Model 2: Y ~ group * X
#> Res.Df RSS Df Sum of Sq F Pr(>F)
#> 1 25 1472886
#> 2 24 562240 1 910646 38.872 1.918e-06 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

`anova(m_sans_int, m_int2) # Same SS as type 3 for group:X`

```
#> Analysis of Variance Table
#>
#> Model 1: Y ~ mm[, -(5:6)]
#> Model 2: Y ~ group * X
#> Res.Df RSS Df Sum of Sq F Pr(>F)
#> 1 26 1244265
#> 2 24 562240 2 682026 14.557 7.246e-05 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

Compare to:

`car::Anova(m_int2, type = 3)`

```
#> Anova Table (Type III tests)
#>
#> Response: Y
#> Sum Sq Df F value Pr(>F)
#> (Intercept) 219106 1 9.3528 0.005402 **
#> group 738108 2 15.7536 4.269e-05 ***
#> X 910646 1 38.8722 1.918e-06 ***
#> group:X 682026 2 14.5566 7.246e-05 ***
#> Residuals 562240 24
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

As mentioned above, the distinction between types 2 and 3 comes from how they estimate main effects in the presence of interactions.

Let’s look at the following factorial design:

```
m_factorial <- lm(Y ~ condition * group, data = d,
# Another way to specify effects coding:
contrasts = list(condition = contr.sum,
group = contr.sum))
car::Anova(m_factorial, type = 2)
```

```
#> Anova Table (Type II tests)
#>
#> Response: Y
#> Sum Sq Df F value Pr(>F)
#> condition 12048 1 0.1840 0.6718
#> group 7165 2 0.0547 0.9469
#> condition:group 41204 2 0.3146 0.7330
#> Residuals 1571485 24
```

`car::Anova(m_factorial, type = 3)`

```
#> Anova Table (Type III tests)
#>
#> Response: Y
#> Sum Sq Df F value Pr(>F)
#> (Intercept) 5858845 1 89.4773 1.439e-09 ***
#> condition 3452 1 0.0527 0.8203
#> group 8913 2 0.0681 0.9344
#> condition:group 41204 2 0.3146 0.7330
#> Residuals 1571485 24
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

But where do these differences between type 2 and 3 come from?

Type 2 SS looks at the SS between the means of `group` *across* the levels of `condition`: the marginal mean of each group is estimated as an average of its cell means, *weighted* by the cell sizes.

Type 3 SS, however, looks at the SS between the *unweighted* means of `group`: the marginal mean of group `a` is estimated as the simple average of its cell means, regardless of the cell sizes.
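Written out (a sketch of the standard formulas, where $\bar{Y}_{ab}$ is the mean and $n_{ab}$ the size of the cell at level $a$ of `group` and level $b$ of `condition`, and $B$ is the number of `condition` levels):

```latex
\text{Type 2:}\quad
\bar{Y}_{a\cdot} = \frac{\sum_{b=1}^{B} n_{ab}\,\bar{Y}_{ab}}{\sum_{b=1}^{B} n_{ab}}
\qquad
\text{Type 3:}\quad
\bar{Y}_{a\cdot} = \frac{1}{B}\sum_{b=1}^{B} \bar{Y}_{ab}
```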

This makes type 3 SS invariant to the cell frequencies!

But as we will soon see, this need not always be the case…

A lot has been said about type 2 vs type 3. I will not go into the weeds here, but it is important to note that

- Most statistical software (SAS, Stata, SPSS, …) defaults to type 3 SS with orthogonal factor coding (but covariables are *not* mean-centered by default in most cases; see Langsrud, 2003). This makes `R` the inconsistent one, as we’ve seen it defaults to type 1 SS and treatment coding.

- Often in factorial designs, any imbalance in the design is incidental, so it is often beneficial to have a method that is invariant to such imbalances. (Though this may not be true if the data is observational.)

- Coefficient tables give results that are analogous to type 3 SS when all terms are covariables:

```
m_covs <- lm(Y ~ X * Z, data = d)
car::Anova(m_covs, type = 2)
```

```
#> Anova Table (Type II tests)
#>
#> Response: Y
#> Sum Sq Df F value Pr(>F)
#> X 324676 1 6.7657 0.01513 *
#> Z 2430 1 0.0506 0.82373
#> X:Z 57640 1 1.2011 0.28315
#> Residuals 1247704 26
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

`car::Anova(m_covs, type = 3)`

```
#> Anova Table (Type III tests)
#>
#> Response: Y
#> Sum Sq Df F value Pr(>F)
#> (Intercept) 83130 1 1.7323 0.1996
#> X 5866 1 0.1222 0.7294
#> Z 59917 1 1.2486 0.2740
#> X:Z 57640 1 1.2011 0.2831
#> Residuals 1247704 26
```

`summary(m_covs) # same p-values as type 3`

```
#>
#> Call:
#> lm(formula = Y ~ X * Z, data = d)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -384.97 -153.72 -23.98 141.28 422.79
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 366.352 278.348 1.316 0.200
#> X 1.014 2.900 0.350 0.729
#> Z -112.681 100.843 -1.117 0.274
#> X:Z 1.175 1.072 1.096 0.283
#>
#> Residual standard error: 219.1 on 26 degrees of freedom
#> Multiple R-squared: 0.2362, Adjusted R-squared: 0.1481
#> F-statistic: 2.68 on 3 and 26 DF, p-value: 0.06772
```

The distinction between types 1, 2 and 3 SS is only relevant when there is some dependency between predictors (a.k.a. some collinearity). In our example, we can see that `group` and `X` are somewhat collinear (VIF / tolerance are not exactly 1):

`performance::check_collinearity(m)`

```
#> # Check for Multicollinearity
#>
#> Low Correlation
#>
#> Term VIF VIF 95% CI Increased SE Tolerance Tolerance 95% CI
#> group 1.08 [1.00, 5.73] 1.04 0.92 [0.17, 1.00]
#> X 1.08 [1.00, 5.73] 1.04 0.92 [0.17, 1.00]
```

In a factorial design, we might call this dependence / collinearity among our predictors an “unbalanced design” (the number of observations differs between cells), and when the predictors are completely independent we would call this a “balanced design” (equal number of observations in all cells).
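The “balanced = independent” claim can be checked directly: in a balanced design, the effect-coded model-matrix columns of different terms are orthogonal. A small sketch with two hypothetical factors `f1` and `f2`:

```r
# A balanced 2x3 design: every cell appears exactly once
dd <- expand.grid(f1 = factor(c("a", "b")),
                  f2 = factor(c("x", "y", "z")))
contrasts(dd$f1) <- contr.sum
contrasts(dd$f2) <- contr.sum

X <- model.matrix(~ f1 * f2, data = dd)
# Cross-products between columns of *different* terms are all zero:
crossprod(X)
```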

Let’s look at two examples:

We can see that `Rx` and `condition` are balanced:

`table(d$Rx, d$condition)`

```
#>
#> Ca Cb
#> Placebo 5 5
#> Dose100 5 5
#> Dose250 5 5
```

`chisq.test(d$Rx, d$condition)$statistic # Chisq is exactly 0`

```
#> X-squared
#> 0
```

And so type 1, 2 and 3 ANOVA tables are identical:

```
contrasts(d$Rx) <- contr.sum
contrasts(d$condition) <- contr.sum
m_balanced <- lm(Y ~ condition * Rx, data = d)
anova(m_balanced)
```

```
#> Analysis of Variance Table
#>
#> Response: Y
#> Df Sum Sq Mean Sq F value Pr(>F)
#> condition 1 13666 13666 0.2162 0.6461
#> Rx 2 41379 20690 0.3273 0.7240
#> condition:Rx 2 61444 30722 0.4860 0.6210
#> Residuals 24 1517030 63210
```

`car::Anova(m_balanced, type = 2)`

```
#> Anova Table (Type II tests)
#>
#> Response: Y
#> Sum Sq Df F value Pr(>F)
#> condition 13666 1 0.2162 0.6461
#> Rx 41379 2 0.3273 0.7240
#> condition:Rx 61444 2 0.4860 0.6210
#> Residuals 1517030 24
```

`car::Anova(m_balanced, type = 3)`

```
#> Anova Table (Type III tests)
#>
#> Response: Y
#> Sum Sq Df F value Pr(>F)
#> (Intercept) 6422803 1 101.6112 4.204e-10 ***
#> condition 13666 1 0.2162 0.6461
#> Rx 41379 2 0.3273 0.7240
#> condition:Rx 61444 2 0.4860 0.6210
#> Residuals 1517030 24
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

However, `condition` and `group` are NOT balanced:

`table(d$group, d$condition)`

```
#>
#> Ca Cb
#> Gb 6 6
#> Ga 5 6
#> Gc 4 3
```

`chisq.test(d$group, d$condition)$statistic # Chisq is NOT 0`

```
#> Warning in chisq.test(d$group, d$condition): Chi-squared approximation may be
#> incorrect
```

```
#> X-squared
#> 0.2337662
```

And so type 1, 2 and type 3 ANOVA tables are NOT identical (recall how type 2 and 3 estimate marginal means differently in the presence of interactions):

`anova(m_factorial)`

```
#> Analysis of Variance Table
#>
#> Response: Y
#> Df Sum Sq Mean Sq F value Pr(>F)
#> condition 1 13666 13666 0.2087 0.6519
#> group 2 7165 3582 0.0547 0.9469
#> condition:group 2 41204 20602 0.3146 0.7330
#> Residuals 24 1571485 65479
```

`car::Anova(m_factorial, type = 2)`

```
#> Anova Table (Type II tests)
#>
#> Response: Y
#> Sum Sq Df F value Pr(>F)
#> condition 12048 1 0.1840 0.6718
#> group 7165 2 0.0547 0.9469
#> condition:group 41204 2 0.3146 0.7330
#> Residuals 1571485 24
```

`car::Anova(m_factorial, type = 3)`

```
#> Anova Table (Type III tests)
#>
#> Response: Y
#> Sum Sq Df F value Pr(>F)
#> (Intercept) 5858845 1 89.4773 1.439e-09 ***
#> condition 3452 1 0.0527 0.8203
#> group 8913 2 0.0681 0.9344
#> condition:group 41204 2 0.3146 0.7330
#> Residuals 1571485 24
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

The larger the deviation from perfect balance (i.e., independence), the larger the differences between the types will be.

We’ve seen that types 1, 2 and 3 all attribute variance to the model’s terms by partialling out the model’s other terms in some way - sequentially, simultaneously, or some mix of both.

However, if the predictors in the model are independent (as in a balanced design - zero collinearity), then regardless of the order of their inclusion in the model, all of the variance attributable to some term A is unique to A - none of it is *also* attributable to any of the other terms in the model, and vice versa. There is nothing to partial out, so the order does not matter!

This also means that because the SS returned by both type 2 and 3 ANOVA tables represent each term’s uniquely attributable variation in the outcome, **when the design is not balanced / there is collinearity in the data, the SS in the ANOVA table will not sum to the total SS** - some overlapping (non-unique) variation is not represented in the table. This is where type 1 ANOVA tables shine: their sequential nature means the SS in the table *do* sum to the total SS.
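A quick way to see this in `R` (a sketch, assuming the `d` and `m_int2` objects from above are still around):

```r
a1 <- anova(m_int2)                # type 1 (sequential)
a3 <- car::Anova(m_int2, type = 3) # type 3 (unique)

# Type 1 SS (terms + residuals) decompose the total SS exactly:
sum(a1[["Sum Sq"]])
sum((d$Y - mean(d$Y))^2)  # total SS, for comparison

# Type 3 SS (terms + residuals) do not recover the total SS:
sum(a3[["Sum Sq"]][-1])   # [-1] drops the intercept row
```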

Let’s look at an extreme example of collinearity:

```
d2 <- MASS::mvrnorm(
n = 100,
mu = rep(0, 3),
Sigma = matrix(c(1, 0.99, 0.4,
0.99, 1, 0.41,
0.4, 0.41, 1), nrow = 3)
)
d2 <- data.frame(d2)
colnames(d2) <- c("X", "Z", "Y")
```

```
m_collinear <- lm(Y ~ X + Z, data = d2)
performance::check_collinearity(m_collinear)
```

```
#> # Check for Multicollinearity
#>
#> High Correlation
#>
#> Term VIF VIF 95% CI Increased SE Tolerance Tolerance 95% CI
#> X 47.02 [32.64, 67.95] 6.86 0.02 [0.01, 0.03]
#> Z 47.02 [32.64, 67.95] 6.86 0.02 [0.01, 0.03]
```

Looking at a type 1 ANOVA table, we can see that `X` accounts for a significant amount of variation in `Y`, but that `Z` does not *add* anything significant on top of `X`:

`anova(m_collinear)`

```
#> Analysis of Variance Table
#>
#> Response: Y
#> Df Sum Sq Mean Sq F value Pr(>F)
#> X 1 16.927 16.9265 22.664 6.743e-06 ***
#> Z 1 0.000 0.0000 0.000 0.9999
#> Residuals 97 72.444 0.7468
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

However, were we to look at a type 2 ANOVA table, we might get the impression that neither `X` nor `Z` contributes to the model:

`car::Anova(m_collinear, type = 2)`

```
#> Anova Table (Type II tests)
#>
#> Response: Y
#> Sum Sq Df F value Pr(>F)
#> X 0.360 1 0.4822 0.4891
#> Z 0.000 1 0.0000 0.9999
#> Residuals 72.444 97
```

This demonstrates the importance of *always* interpreting type 2 and 3 ANOVA tables in light of any collinearity that might exist between your predictors. Remember: *ABC - Always Be mindful of Collinearity* (okay, that one was a bit of a stretch).

We can use the `lm()` -> `car::Anova()` method to conduct a proper ANOVA on a maximal factorial design. However, making sure that our factors are orthogonally coded is a pain in the @$$.

```
d$group <- factor(d$group)
d$condition <- factor(d$condition)
contrasts(d$group) <- contr.sum
contrasts(d$condition) <- contr.sum
m_lm <- lm(Y ~ group * condition, d)
car::Anova(m_lm, type = 3)
```

```
#> Anova Table (Type III tests)
#>
#> Response: Y
#> Sum Sq Df F value Pr(>F)
#> (Intercept) 5858845 1 89.4773 1.439e-09 ***
#> group 8913 2 0.0681 0.9344
#> condition 3452 1 0.0527 0.8203
#> group:condition 41204 2 0.3146 0.7330
#> Residuals 1571485 24
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

Thankfully, we have the `afex` package, which turns all of that mess into something much more palatable:

`afex::aov_car(Y ~ group * condition + Error(id), data = d)`

```
#> Anova Table (Type 3 tests)
#>
#> Response: Y
#> Effect df MSE F ges p.value
#> 1 group 2, 24 65478.53 0.07 .006 .934
#> 2 condition 1, 24 65478.53 0.05 .002 .820
#> 3 group:condition 2, 24 65478.53 0.31 .026 .733
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '+' 0.1 ' ' 1
```

Much easier!

So far we’ve seen how ANOVA tables are applied to non-factorial linear OLS models. However, the idea of per-term omnibus tests can be extended to many other types of models. For models where SS cannot be calculated, analogous methods based on deviance or likelihood are used instead (read more in the `car::Anova()` docs). Here are some examples:

```
m_logistic <- glm(condition ~ group * X_c, data = d,
family = binomial())
car::Anova(m_logistic, type = 2)
```

```
#> Analysis of Deviance Table (Type II tests)
#>
#> Response: condition
#> LR Chisq Df Pr(>Chisq)
#> group 0.15679 2 0.9246
#> X_c 0.75134 1 0.3861
#> group:X_c 1.10225 2 0.5763
```

```
m_poisson <- glm(Z ~ group * X_c, data = d,
family = poisson())
car::Anova(m_poisson, type = 3)
```

```
#> Analysis of Deviance Table (Type III tests)
#>
#> Response: Z
#> LR Chisq Df Pr(>Chisq)
#> group 0.88074 2 0.6438
#> X_c 0.04370 1 0.8344
#> group:X_c 0.11357 2 0.9448
```

```
m_ordinal <- ordinal::clm(group ~ X_c * condition, data = d)
anova(m_ordinal)
```

```
#> Type I Analysis of Deviance Table with Wald chi-square tests
#>
#> Df Chisq Pr(>Chisq)
#> X_c 1 0.4469 0.5038
#> condition 1 0.1584 0.6907
#> X_c:condition 1 0.6129 0.4337
```

```
m_mixed <- lmerTest::lmer(Y ~ X_c * group + (group | Z), data = d)
anova(m_mixed, type = 2)
```

```
#> Type II Analysis of Variance Table with Satterthwaite's method
#> Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
#> X_c 290878 290878 1 23.645 14.4210 0.0008951 ***
#> group 7198 3599 2 9.520 0.1784 0.8393243
#> X_c:group 582071 291036 2 21.938 14.4288 0.0001001 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

Also for GLMMs:

```
m_mixed2 <- lme4::glmer(condition ~ X_c * group + (1 | Z), data = d,
family = binomial())
car::Anova(m_mixed2, type = 3)
```

```
#> Analysis of Deviance Table (Type III Wald chisquare tests)
#>
#> Response: condition
#> Chisq Df Pr(>Chisq)
#> (Intercept) 0.0886 1 0.7660
#> X_c 1.1917 1 0.2750
#> group 0.0829 2 0.9594
#> X_c:group 0.9972 2 0.6074
```

Although we might be more inclined to summarize our model with an ANOVA table when our model contains categorical predictors, especially when interactions are involved, we’ve seen that ANOVA tables do not require any special data (factorial, balanced, normal outcome), and can be used as an alternative to a coefficient table.

Regardless of how we summarize our model - with a coefficient table or with an ANOVA table, using type 1, 2 or 3 SS, with orthogonal or treatment coding, with centered or uncentered covariables - the underlying model is equivalent, and will produce the same estimated simple effects, marginal means and contrasts. That is, the method we use to summarize our model has no bearing on whatever follow-up analyses we may wish to carry out (using `emmeans`, of course! Check out the materials from my R course: **Analysis of Factorial Designs**).^{6}

I hope you now have a fuller grasp of what goes on behind the scenes when producing ANOVA tables, how the different types of ANOVA tables work, when they should be used, and how to interpret their results. ANOVA tables are a powerful tool that can be applied not only to factorial data coupled with an OLS model, but also to a wide variety of (generalized) linear (mixed) regression models.

via the generalized linear model framework↩︎

What we might consider “unexplained” variance is *also* attributed to some source; e.g., to variation between subjects.↩︎

While the SS are the same, the test statistics are different - this is because `car::Anova()` uses the total error term of the full model for all of the tests.↩︎

You might instead use `contr.helmert()`.↩︎

Honestly, coefficient tables also make little sense without centering.↩︎

It will also not alter the model’s predictions.↩︎

```
Agreement <- matrix(c(794, 150, 86,
12, 888, 34,
570, 333, 23), nrow = 3,
dimnames = list(Before = c("Agree", "Meh", "Disagree"),
After = c("Agree", "Meh", "Disagree")))
```

Our question is how many people changed their minds. Statistically, we might use `mcnemar.test()` and `effectsize::cohens_g()`, but here we will be focusing on visualizing the data with `ggplot2`.

We first need to re-structure this matrix into a data frame:

`(Agreement_df <- as.data.frame(as.table(Agreement)))`

```
#> Before After Freq
#> 1 Agree Agree 794
#> 2 Meh Agree 150
#> 3 Disagree Agree 86
#> 4 Agree Meh 12
#> 5 Meh Meh 888
#> 6 Disagree Meh 34
#> 7 Agree Disagree 570
#> 8 Meh Disagree 333
#> 9 Disagree Disagree 23
```

The basic plot is:

```
library(ggplot2)
theme_set(theme_bw())
ggplot(Agreement_df, aes(Before, Freq, fill = After)) +
geom_col(
position = "fill", width = 0.85,
color = "black", linewidth = 1
)
```

Simple enough.

What we want to do is mark the cells where people did not change their response - where `Before` is equal to `After` - with a different line type. We can do this by adding `linetype = Before == After` to the plot’s aesthetics. This *should* give the diagonal cells a different line-type compared to the other cells. Simple enough, no?

```
ggplot(Agreement_df, aes(Before, Freq, fill = After)) +
geom_col(
position = "fill", width = 0.85,
color = "black", linewidth = 1,
mapping = aes(linetype = Before == After) #<<<<<<<<<
)
```

What the hell happened?? The **order** of the cells has changed!

The first thing to understand is that we have some implicit grouping going on.

The group aesthetic is by default set to the interaction of all discrete variables in the plot. […] For most applications the grouping is set implicitly by mapping one or more discrete variables to `x`, `y`, `colour`, `fill`, `alpha`, `shape`, `size`, and/or `linetype`.

*From the ggplot2 manual on Aesthetics: grouping*

This means that our mapping of `fill` and `linetype` has been used to set the `group`ing of the cells.

The second thing to understand is the *order* in which these aesthetics are used for grouping:

- First, the layer-specific aesthetics are used (in our case, `linetype = Before == After`, which is in the `geom_col()` layer).
- Then (if `inherit.aes = TRUE`, which is the default) any global aesthetics are used (`fill = After`, which is set in the call to `ggplot()`).

This is why the order of the cells has changed: Cells were grouped first by the before-after equality, and only then by the type of “after” response.
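If you want to verify which grouping `ggplot2` actually derived, one option (a sketch) is to inspect the built plot data - its `group` column holds the implicit grouping:

```r
p <- ggplot(Agreement_df, aes(Before, Freq, fill = After)) +
  geom_col(position = "fill",
           mapping = aes(linetype = Before == After))

# Each row is one bar segment; `group` is the derived grouping index
head(ggplot_build(p)$data[[1]][, c("fill", "linetype", "group")])
```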

The fix is easy - we have to make sure the grouping aesthetics are specified in a way that makes `ggplot` pull them in the correct order; that is, first by “after”, and only then by the before-after equality.

Here are all the ways to do that:

We can explicitly set the `group` aesthetic using the `interaction()` function - but, to add insult to injury, this function must be supplied with the grouping variables in the *reverse* order (unless you set `lex.order = TRUE`):

```
ggplot(Agreement_df, aes(Before, Freq, fill = After)) +
geom_col(
position = "fill", width = 0.85,
color = "black", linewidth = 1,
mapping = aes(linetype = Before == After,
group = interaction(Before == After, After)) #<<<<<<<<<
)
```

```
ggplot(Agreement_df, aes(Before, Freq, fill = After)) +
geom_col(
position = "fill", width = 0.85,
color = "black", linewidth = 1,
mapping = aes(linetype = Before == After,
group = interaction(After, Before == After, #<<<<<<<<<
lex.order = TRUE)) #<<<<<<<<<
)
```

We can also keep using the implicit setting for the grouping, but set all of the relevant aesthetics globally:

```
# Set both in the global aesthetics:
ggplot(Agreement_df, aes(Before, Freq,
fill = After, linetype = Before == After)) +
geom_col(
position = "fill", width = 0.85,
color = "black", linewidth = 1
)
```

Or in the layer itself:

```
# Set both in the layer aesthetics:
ggplot(Agreement_df, aes(Before, Freq)) +
geom_col(
position = "fill", width = 0.85,
color = "black", linewidth = 1,
mapping = aes(fill = After, linetype = Before == After)
)
```

Note that even when setting them globally or in the layer, the *order* still matters:

```
ggplot(Agreement_df, aes(Before, Freq)) +
geom_col(
position = "fill", width = 0.85,
color = "black", linewidth = 1,
mapping = aes(linetype = Before == After, fill = After) # Wrong order
)
```

The location (global or by layer) and order of aesthetics matter. I didn’t know this, and I felt like I was losing my mind; I hope that by writing this post I will be able to spare you some precious keyboard banging and yelps of sorrow.

Code away!

You have two predictors in your model. One seems to have a stronger coefficient than the other. But is the difference significant?

Example: when predicting a worker’s salary, is the standardized coefficient of *number of extra hours* (`xtra_hours`) really larger than that of *number of compliments given to the boss* (`n_comps`)?

```
library(parameters)
library(effectsize)
data("hardlyworking", package = "effectsize")
hardlyworkingZ <- standardize(hardlyworking)
m <- lm(salary ~ xtra_hours + n_comps, data = hardlyworkingZ)
model_parameters(m)
```

```
#> Parameter | Coefficient | SE | 95% CI | t(497) | p
#> --------------------------------------------------------------------
#> (Intercept) | 2.76e-16 | 0.02 | [-0.03, 0.03] | 1.59e-14 | > .999
#> xtra hours | 0.81 | 0.02 | [ 0.78, 0.84] | 46.60 | < .001
#> n comps | 0.41 | 0.02 | [ 0.37, 0.44] | 23.51 | < .001
```

Here are ~~4~~ 6 methods to test coefficient equality in `R`.

*Based on this awesome tweet.*

Since, when the two slopes are equal (b1 = b2 = b), we have b1 × `xtra_hours` + b2 × `n_comps` = b × (`xtra_hours` + `n_comps`), we can essentially force this constraint on the coefficients by using a composite variable.

```
m0 <- lm(salary ~ I(xtra_hours + n_comps), data = hardlyworkingZ)
model_parameters(m0)
```

```
#> Parameter | Coefficient | SE | 95% CI | t(498) | p
#> -----------------------------------------------------------------------------
#> (Intercept) | 2.80e-16 | 0.02 | [-0.04, 0.04] | 1.31e-14 | > .999
#> xtra hours + n comps | 0.61 | 0.01 | [ 0.58, 0.64] | 41.09 | < .001
```

We can then compare how this model compares to our first model without this constraint:

`anova(m0, m)`

```
#> Analysis of Variance Table
#>
#> Model 1: salary ~ I(xtra_hours + n_comps)
#> Model 2: salary ~ xtra_hours + n_comps
#> Res.Df RSS Df Sum of Sq F Pr(>F)
#> 1 498 113.662
#> 2 497 74.942 1 38.72 256.78 < 2.2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

We can conclude that the unconstrained model is significantly better than the constrained model - meaning that the two coefficients are not equal.

**Edit** (Feb-17, 2021): Thanks to @joejps84 for pointing this out!

We can achieve the same thing in a single model. If we write the slope of `n_comps` as the slope of `xtra_hours` plus some change - b2 = b1 + Δ - then b1 × `xtra_hours` + b2 × `n_comps` = b1 × (`xtra_hours` + `n_comps`) + Δ × `n_comps`. So the slope of `n_comps` in such a model is the *difference* between the slopes of `n_comps` and `xtra_hours`:

```
m1 <- lm(salary ~ I(xtra_hours + n_comps) + n_comps, data = hardlyworkingZ)
model_parameters(m1)
```

```
#> Parameter | Coefficient | SE | 95% CI | t(497) | p
#> ------------------------------------------------------------------------------
#> (Intercept) | 2.56e-16 | 0.02 | [-0.03, 0.03] | 1.47e-14 | > .999
#> xtra hours + n comps | 0.81 | 0.02 | [ 0.78, 0.84] | 46.60 | < .001
#> n comps | -0.40 | 0.03 | [-0.45, -0.35] | -16.02 | < .001
```

This gives the exact same result (*t*² is the same as the *F*-value from the `anova()` test)!

According to Paternoster et al. (1998), we can compute a *t*-test to compare the coefficients:

```
b <- coef(m)
V <- vcov(m)
tibble::tibble(
diff_estim = b[2] - b[3],
diff_SE = sqrt(V[2, 2] + V[3, 3] - 2 * V[2, 3]),
t_stat = diff_estim / diff_SE,
df = df.residual(m),
p_value = 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)
)
```

```
#> # A tibble: 1 × 5
#> diff_estim diff_SE t_stat df p_value
#> <dbl> <dbl> <dbl> <int> <dbl>
#> 1 0.402 0.0251 16.0 497 6.83e-47
```

This gives the exact same results as the first method!

**Edit** (Feb-17, 2021): As @bmwiernik pointed out, this can also be done with some fancy matrix multiplication:

```
contr <- c(0, 1, -1)
tibble::tibble(
diff_estim = b %*% contr,
diff_SE = sqrt(contr %*% V %*% contr),
t_stat = diff_estim / diff_SE,
df = df.residual(m),
p_value = 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)
)
```

```
#> # A tibble: 1 × 5
#> diff_estim[,1] diff_SE[,1] t_stat[,1] df p_value[,1]
#> <dbl> <dbl> <dbl> <int> <dbl>
#> 1 0.402 0.0251 16.0 497 6.83e-47
```

All of the following solutions are essentially this method, wrapped in some nice functions.

`emmeans` <3

We can estimate the slopes from the model using `emmeans`, and then, with some trickery, compare them!

```
library(emmeans)
trends <- rbind(
emtrends(m, ~1, "xtra_hours"),
emtrends(m, ~1, "n_comps")
)
# clean up so it does not error later
trends@grid$`1` <- c("xtra_hours", "n_comps")
trends@misc$estName <- "trend"
trends
```

```
#> 1 trend SE df lower.CL upper.CL
#> xtra_hours 0.811 0.0174 497 0.772 0.850
#> n_comps 0.409 0.0174 497 0.370 0.448
#>
#> Confidence level used: 0.95
#> Conf-level adjustment: bonferroni method for 2 estimates
```

Compare them:

`pairs(trends)`

```
#> contrast estimate SE df t.ratio p.value
#> xtra_hours - n_comps 0.402 0.0251 497 16.024 <.0001
```

Once again - exact same results!

`lavaan`

The `lavaan` package for latent variable analysis and structural equation modeling allows for parameter constraining. So let’s do just that:

```
library(lavaan)
L <- sem("salary ~ xtra_hours + n_comps", data = hardlyworkingZ)
L0 <- sem("salary ~ a * xtra_hours + a * n_comps", data = hardlyworkingZ)
anova(L0, L)
```

```
#>
#> Chi-Squared Difference Test
#>
#> Df AIC BIC Chisq Chisq diff RMSEA Df diff Pr(>Chisq)
#> L 0 475.99 488.63 0.00
#> L0 1 682.25 690.68 208.26 208.26 0.64383 1 < 2.2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

This too yields similar results! (Only slightly different due to different estimation methods.)

We can also directly estimate the difference using `lavaan`’s “defined parameters”, defined with the `:=` operator:

```
L <- sem("salary ~ b1 * xtra_hours + b2 * n_comps
diff := b1 - b2",
data = hardlyworkingZ)
parameterEstimates(L, output = "text")[7,]
```

```
#>
#> Defined Parameters:
#> Estimate Std.Err z-value P(>|z|) ci.lower ci.upper
#> diff 0.402 0.025 16.073 0.000 0.353 0.451
```

Which, again, gives the same results.

`car`

**Edit** (Feb-17, 2021): Thanks to @DouglasKGAraujo for his suggestion!

Even more methods!

```
library(car)
linearHypothesis(m, c("xtra_hours - n_comps"))
```

```
#> Linear hypothesis test
#>
#> Hypothesis:
#> xtra_hours - n_comps = 0
#>
#> Model 1: restricted model
#> Model 2: salary ~ xtra_hours + n_comps
#>
#> Res.Df RSS Df Sum of Sq F Pr(>F)
#> 1 498 113.662
#> 2 497 74.942 1 38.72 256.78 < 2.2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

Which once again gives the same result!

`multcomp`

**Edit** (Feb-17, 2021): Thanks to Stefan Th. Gries for his suggestion!

Even more more methods!

```
library(multcomp)
summary(glht(m, matrix(c(0, 1, -1), nrow = 1)))
```

```
#>
#> Simultaneous Tests for General Linear Hypotheses
#>
#> Fit: lm(formula = salary ~ xtra_hours + n_comps, data = hardlyworkingZ)
#>
#> Linear Hypotheses:
#> Estimate Std. Error t value Pr(>|t|)
#> 1 == 0 0.40171 0.02507 16.02 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> (Adjusted p values reported -- single-step method)
```

Which once again again gives the same result!

That’s it really - these ~~4~~ 6 simple methods have wide applications to GL(M)M’s, SEM, and more.

Enjoy!

For our example, we will use some (fake) developmental-growth data - we have the height and weight of 500 individuals between the ages of 5 and 24. We can plot our tri-variate data like so:

This generally looks as one might expect - taller people weigh more, older people are taller and weigh more as well.

Let us fit an OLS linear model^{1} to predict weight from height and age:

```
fit <- lm(weight ~ height + age, data = data)
summary(fit)
```

```
#>
#> Call:
#> lm(formula = weight ~ height + age, data = data)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -23.8960 -3.9637 -0.0438 4.3013 25.8607
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -51.97123 5.27409 -9.854 <2e-16 ***
#> height 0.53458 0.05741 9.312 <2e-16 ***
#> age 2.47319 0.25311 9.771 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 7.009 on 497 degrees of freedom
#> Multiple R-squared: 0.7856, Adjusted R-squared: 0.7847
#> F-statistic: 910.4 on 2 and 497 DF, p-value: < 2.2e-16
```

Unsurprisingly, both `age` and `height` are positively related to `weight`.

Let’s explore our model - specifically, we want to see what weight we can expect for 3 individuals of the following heights: **120, 145, and 170cm**. We can do this with `emmeans`:

```
library(emmeans)
em_by_height <- emmeans(fit, ~ height, at = list(height = c(120, 145, 170)))
em_by_height
```

```
#> height emmean SE df lower.CL upper.CL
#> 120 49.0 1.536 497 46.0 52.0
#> 145 62.4 0.321 497 61.7 63.0
#> 170 75.7 1.402 497 73.0 78.5
#>
#> Confidence level used: 0.95
```

We can also add these expected values to the plot from above:

Something seems off - the expected value for a person 120cm tall seems too high, while the expected value for a person 170cm tall seems too low. Why is that? What’s going on?

First, let’s talk about the elephant in the room - there is obvious multicollinearity between our predictors! If we’re not sure, we can use `performance::check_collinearity()` to validate this:

`performance::check_collinearity(fit)`

```
#> # Check for Multicollinearity
#>
#> Moderate Correlation
#>
#> Term VIF VIF 95% CI Increased SE Tolerance Tolerance 95% CI
#> height 5.26 [4.51, 6.17] 2.29 0.19 [0.16, 0.22]
#> age 5.26 [4.51, 6.17] 2.29 0.19 [0.16, 0.22]
```

But how is this related to the funky estimates we got?

Well, this has to do with what we do with the *other* predictors when exploring a single predictor. In our case: what happens to `age` when we’re exploring the role of `height` in our model?

The common and almost default approach is to fix `age` at a constant. This is really what our model does in the first place: the coefficient of `height` represents the expected change in `weight` while `age` is fixed and not allowed to vary. Fixed at what constant? A natural candidate (and indeed `emmeans`’ default) is the mean. In our case the mean age is 14.9 years, so the expected values produced above are for three 14.9-year-olds of different heights. But is this data plausible? If I told you I saw a person who was 120cm tall, would you also assume they were 14.9 years old?

No, you would not. And that is exactly what covariance and multicollinearity mean - that some combinations of predictors are more likely than others.
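For intuition, here is what the default behavior amounts to "by hand" - a sketch assuming the `fit` and `data` objects defined in the post's code:

```r
# Reproduce emmeans' default: hold age fixed at its sample mean
# while height takes the values of interest.
nd <- data.frame(height = c(120, 145, 170),
                 age = mean(data$age)) # ~14.9 years for everyone
predict(fit, newdata = nd)
```

This is why the 120cm "person" gets an implausibly high expected weight - the model is predicting for a very short 14.9-year-old.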

So how do we get more reasonable expected weights?

Well, we can allow `age` to vary in our prediction. Vary how? Well, vary with height! So instead of asking *“what are the expected weights of 3 individuals that are 120, 145, and 170cm tall, all of the same age”*, we can ask *“what are the expected weights of 3 individuals that are 120, 145, and 170cm tall, of height-appropriate ages”*.
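The same idea can be sketched without any special machinery (again assuming the `fit` and `data` objects from the post's code): first predict a height-appropriate age from an auxiliary regression, then feed both predictors into the weight model.

```r
# Step 1: model the co-variation between the predictors themselves
age_fit <- lm(age ~ height, data = data)

# Step 2: build a grid where age varies with height...
nd <- data.frame(height = c(120, 145, 170))
nd$age <- predict(age_fit, newdata = nd)

# Step 3: ...and predict weight from both
predict(fit, newdata = nd)
```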

In `emmeans` this can be done by specifying a predictive formula in `cov.reduce`. For example:

```
(rg_multicov <- ref_grid(fit,
                         at = list(height = c(120, 145, 170)),
                         cov.reduce = list(age ~ height))) # This!
```

```
#> 'emmGrid' object with variables:
#> height = 120, 145, 170
#> age = (predicted by other variables)
```

`rg_multicov@grid`

```
#> height age .wgt.
#> 1 120 9.537649 1
#> 2 145 14.640867 1
#> 3 170 19.744085 1
```

`emmeans(rg_multicov, ~ height)`

```
#> height emmean SE df lower.CL upper.CL
#> 120 35.8 0.727 497 34.3 37.2
#> 145 61.8 0.315 497 61.1 62.4
#> 170 87.7 0.673 497 86.4 89.1
#>
#> Confidence level used: 0.95
```

We can see from the reference grid (`rg_multicov@grid`) that `age` is not fixed, but varies with `height`.

We can also achieve the same thing in a single call to `emmeans()`:

```
em_by_height2 <- emmeans(fit, ~ height,
                         at = list(height = c(120, 145, 170)),
                         cov.reduce = list(age ~ height))
em_by_height2
```

```
#> height emmean SE df lower.CL upper.CL
#> 120 35.8 0.727 497 34.3 37.2
#> 145 61.8 0.315 497 61.1 62.4
#> 170 87.7 0.673 497 86.4 89.1
#>
#> Confidence level used: 0.95
```

What we’ve done is essentially re-introduce multicollinearity into our estimates by allowing `age` to co-vary with `height`: each expected value is no longer just a function of `height`, but also a function of the expected `age` for that `height`^{2}.

If we plot *these* estimates, we get…

…which seems more reasonable.

Should you do this whenever you have multicollinearity in your data?

No - not always and not only!

You might want to do this whenever exploring one predictor while fixing the others to a constant produces uncommon *and unlikely* combinations of predictors. In fact, such a situation may arise even when measures of multicollinearity (such as the VIF) are not “too” high, as these usually only measure co-*linearity*, while other patterns of co-variation may be present in your data. This may be useful in **any instance of co-variation** among predictors.

Note also that producing predictions in this manner does not actually “solve” any of the difficulties of interpretation, or any of the oddities that arise in statistical suppression / confounding / mediation. (Extra) care should be taken when interpreting the parameters and the predictions of models with co-varying predictors. In our case, the question *“does weight vary more with age, or with height”* can’t be cleanly answered - what is the meaning of looking at the coefficient (or standardized coefficient) of one predictor while holding the other constant when we know they strongly co-vary?

```
knitr::opts_chunk$set(echo = FALSE,
                      message = FALSE,
                      comment = "#>")

library(emmeans)
library(ggplot2)
source('../msbblog_theme_pallets.R')
theme_set(theme_msbblog())

library(dplyr)
library(datawizard)

S <- diag(1, 3, 3)
S[1, 2] <- S[2, 1] <- 0.9 # add multicollinearity

set.seed(3)
data <- MASS::mvrnorm(500, rep(0, 3), S, empirical = TRUE) %>%
  data.frame() %>%
  rename(height = X1,
         age = X2,
         e = X3) %>%
  mutate(age = change_scale(age, to = c(5, 24)),
         height = change_scale(height, to = c(108, 189)),
         weight = 7 * (scale(height) + scale(age) + scale(age ^ 2 * e)) + 63)

base_plot <- ggplot(data, aes(height, weight, color = age)) +
  geom_point(shape = 16, alpha = 0.5) +
  scale_color_gradientn("Age [yrs]",
                        colours = msbblog_colors[c("blue", "purple", "red", "orange")],
                        values = c(0, 0.6, 1)) +
  labs(y = "Weight [Kg]",
       x = "Height [cm]")
base_plot

fit <- lm(weight ~ height + age, data = data)
summary(fit)

library(emmeans)
em_by_height <- emmeans(fit, ~ height, at = list(height = c(120, 145, 170)))
em_by_height

p_dat <- summary(em_by_height)
p_dat$age <- em_by_height@linfct[, "age"]
base_plot +
  geom_point(data = p_dat,
             aes(y = emmean, color = age),
             shape = 24, size = 2, stroke = 2, fill = "black")

performance::check_collinearity(fit)

(rg_multicov <- ref_grid(fit,
                         at = list(height = c(120, 145, 170)),
                         cov.reduce = list(age ~ height))) # This!
rg_multicov@grid
emmeans(rg_multicov, ~ height)

em_by_height2 <- emmeans(fit, ~ height,
                         at = list(height = c(120, 145, 170)),
                         cov.reduce = list(age ~ height))
em_by_height2

p_dat <- summary(em_by_height2)
p_dat$age <- em_by_height2@linfct[, "age"]
base_plot +
  geom_point(data = p_dat,
             aes(y = emmean, color = age),
             shape = 24, size = 2, stroke = 2, fill = "black")
```

Ignoring the obvious heteroscedasticity in the data.↩︎

Note that here the expected `age` is a *linear* function of `height`, but this can really take any form: `age ~ poly(height, 3)`, `age ~ log(height)`, etc.↩︎

Thus it is also related to the idea of *conditioning on a collider*, which is really “just” a case of co-varying predictors with a specific causal structure. Here we are conditioning the association between weight and height on age, which has a similar statistical confounding effect. For more, read this post and comments from 100%CI.↩︎

Who doesn’t love GLMs? The ingenious idea of taking a response-level variable (e.g., binary or count) and using some link-function magic to treat it as if it were our long-time friend, linear regression.

In the last few days, a preprint by McCabe *et al.* popped up in our twitter feed (recommended reading!) and re-focused our attention on the problem with interpreting effects and interactions within the GLM framework. McCabe *et al.* state in the abstract that:

“To date, typical practice in evaluating interaction effects in GLMs extends directly from linear approaches, in which the product term coefficient between variables of interest are used to provide evidence of an interaction effect. However, unlike linear models, interaction effects in GLMs are not equal to product terms between predictor variables and are instead a function of all predictors of a model.”

The what-now?

When fitting a GLM we think on the response level (the binary outcome or the counts that interest us); however, the model is fit (and often all statistical inference is done) on the latent level - in the realm of the link function.

For interactions (but as you will see soon, not only), this means that when we test the effect of `X1` at different levels of `X2`, we’re testing these effects **on the latent (e.g., logistic) level, which might not represent these effects on the response level!**

For example, both plots above represent corresponding predictions from the same interaction-model - on the left we have those predictions on the latent level, and on the right these have been transformed back to the response level (the probability). We can see that at the latent level, the effect of `X1` on y is very different at all levels of `X2`, but at the response level these differences can shrink or possibly disappear (e.g., red vs. blue lines), or get larger (red and blue vs. purple line).

This is true regardless of whether or not an interaction was included in the model! And in fact, even main effects on the latent level do not always correspond to the response level the way we would have intuitively imagined.
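To see why a constant latent effect need not be constant on the response level, consider this minimal sketch (not part of the original example) of a fixed 1-unit shift in log-odds applied at two different baselines:

```r
# The same 1-unit effect on the log-odds (latent) scale...
plogis(0 + 1) - plogis(0)   # ...near p = .5: a change of ~0.23 in probability
plogis(-4 + 1) - plogis(-4) # ...out in the tail: a change of only ~0.03
```

The latent effect is identical in both lines; only the baseline differs - and so the response-level effect differs.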

What follows are 3 methods for testing interactions in GLMs, using `emmeans`. Again, we highly recommend reading McCabe *et al.*’s original paper.

Let’s load up some packages:

```
library(emmeans) # 1.8.4.1
library(magrittr) # 2.0.3
```

The model used here is a logistic regression model, using data adapted from McCabe *et al.*, except we’re using a binomial outcome (see code for data generation at the end of the post):

```
y <- plogis(xb) < 0.01

model <- glm(y ~ x1 * female + x2,
             data = df,
             family = "binomial")
```

**`emmeans` for estimation / testing**

If you’re not yet familiar with `emmeans`, it is a package for estimating, testing, and plotting marginal and conditional means / effects from a variety of linear models, including GLMs.

So let’s answer the question:

Does the effect of sex (`female`) differ as a function of `x1`, and how does this interaction differ as a function of `x2`?

We will use the pick-a-point method for both continuous variables:

- For `x1`: -1, +1
- For `x2`: mean ± SD

```
emmeans(model, ~ x1 + female + x2,
        at = list(x1 = c(-1, 1)),
        cov.reduce = list(x2 = mean_sd)) %>%
  contrast(interaction = c("pairwise", "pairwise"),
           by = "x2")
```

```
x2 = -1.020:
x1_pairwise female_pairwise estimate SE df z.ratio p.value
(-1) - 1 0 - 1 -1.17 0.497 Inf -2.350 0.0188
x2 = -0.007:
x1_pairwise female_pairwise estimate SE df z.ratio p.value
(-1) - 1 0 - 1 -1.17 0.497 Inf -2.350 0.0188
x2 = 1.006:
x1_pairwise female_pairwise estimate SE df z.ratio p.value
(-1) - 1 0 - 1 -1.17 0.497 Inf -2.350 0.0188
Results are given on the log odds ratio (not the response) scale.
```

It seems that on the latent level the (estimated) difference of differences (the interaction) between `female` and `x1` is unaffected by which level of `x2` they are conditioned on. This makes sense - we did not model a 3-way interaction, so why should it? Everything is acting as expected.

Or is it? Well, that depends…

We can also try and answer the same question on the response level using the delta method (baked into `emmeans`). Here we have two options for defining an “effect”:

- An effect is a difference in probabilities.
- An effect is a ratio of probabilities.

For this, we just need to add `trans = "response"` in the call to `emmeans()`:

```
emmeans(model, ~ x1 + female + x2,
        at = list(x1 = c(-1, 1)),
        cov.reduce = list(x2 = mean_sd),
        trans = "response") %>%
  contrast(interaction = c("pairwise", "pairwise"),
           by = "x2")
```

```
x2 = -1.020:
x1_pairwise female_pairwise estimate SE df z.ratio p.value
(-1) - 1 0 - 1 -0.0265 0.0787 Inf -0.336 0.7365
x2 = -0.007:
x1_pairwise female_pairwise estimate SE df z.ratio p.value
(-1) - 1 0 - 1 0.0976 0.0470 Inf 2.075 0.0380
x2 = 1.006:
x1_pairwise female_pairwise estimate SE df z.ratio p.value
(-1) - 1 0 - 1 0.0371 0.0163 Inf 2.279 0.0227
```

```
# Difference of differences when x2 = -1.02
(0.69457 - 0.44178) - (0.32986 - 0.05059)
```

`[1] -0.02648`

It seems that on the response level, we get different results than on the latent level. And not only that, but even though the model did not include a 3-way interaction, the 2-way `female:x1` interaction is conditional on the level of `x2` - changing in size as a function of `x2` - and is not significant at low levels of `x2`!

We can also define an effect as a *ratio* of probabilities (also called *risk ratios*).

For this, we just need to add `trans = "log"` and `type = "response"` in the call to `emmeans()`:

```
emmeans(model, ~ x1 + female + x2,
        at = list(x1 = c(-1, 1)),
        cov.reduce = list(x2 = mean_sd),
        trans = "log",
        type = "response") %>%
  contrast(interaction = c("pairwise", "pairwise"),
           by = "x2")
```

```
x2 = -1.02:
x1_pairwise female_pairwise ratio SE df null z.ratio p.value
(-1) / 1 0 / 1 0.241 0.097 Inf 1 -3.534 0.0004
x2 = -0.007:
x1_pairwise female_pairwise ratio SE df null z.ratio p.value
(-1) / 1 0 / 1 0.268 0.124 Inf 1 -2.856 0.0043
x2 = 1.006:
x1_pairwise female_pairwise ratio SE df null z.ratio p.value
(-1) / 1 0 / 1 0.299 0.146 Inf 1 -2.480 0.0131
Tests are performed on the log scale
```

```
# Ratio of ratios when x2 = -1.02
(0.69457 / 0.44178) / (0.32986 / 0.05059)
```

`[1] 0.2411265`

It seems that even on the response level, different delta methods produce different results!^{1} Although we maintain the finding that the size of the 2-way `female:x1` interaction is conditional on the level of `x2`, here it *decreases* as a function of `x2` (but is significant across all (tested) values of `x2`)!

(Note: as we are interested in the slope of `x1`, we could have used `emtrends` instead of `emmeans`. See code at the end of the post for what that would look like.)

As stated above, this is not only an issue of interactions. For example, when looking at the response level, the effect of `x2` is itself conditional on the value of `x2`!

```
emmeans(model, ~ x2,
        cov.reduce = list(x2 = mean_sd),
        trans = "response") %>%
  contrast(method = "consec")
```

```
contrast estimate SE df z.ratio p.value
(x2-0.007) - (x2-1.02) -0.2361 0.02190 Inf -10.782 <.0001
x21.006 - (x2-0.007) -0.0965 0.00825 Inf -11.695 <.0001
Results are averaged over the levels of: female
P value adjustment: mvt method for 2 tests
```

Note how the estimated effect of `x2` is smaller for larger values of `x2`!
!

The growing popularity of GLMs (and GLMMs) in social research seems to come with another source of researcher degrees of freedom (and we all know how well that works for us)…

What should you do?

Honestly, we don’t know. Some of us feel that since the response variable is our variable of interest, that’s what we should be focusing on; some of us feel that with no common practice, we should stick to the latent level; some of us are agnostic (that covers all of us). We can’t recommend one approach, but we do think that when fitting and working with GLMs, this is a consideration one has to face.^{2} Good luck!

```
set.seed(1678)
b0 <- -3.8  # Intercept
b1 <- .35   # X1 Effect
b2 <- .9    # X2 Effect
b3 <- 1.1   # Sex covariate effect
b13 <- .2   # product term coefficient
n <- 1000   # Sample Size

mu <- rep(0, 2) # Specify means
# Specify covariance matrix
S <- matrix(c(1, .5, .5, 1),
            nrow = 2, ncol = 2)
sigma <- 1 # Level 1 error

# simulate our continuous predictors from a multivariate
# normal distribution
rawvars <- MASS::mvrnorm(n = n, mu = mu, Sigma = S)
cat <- rbinom(n = n, 1, .5)
id <- seq(1:n)
eij <- rep(rnorm(id, 0, sigma))

xb <- (b0) +
  (b1) * (rawvars[, 1]) +
  (b2) * (rawvars[, 2]) +
  (b3) * cat +
  b13 * cat * (rawvars[, 1]) +
  eij

df <- data.frame(x1 = rawvars[, 1],
                 x2 = rawvars[, 2],
                 female = cat)

y <- plogis(xb) < 0.01
model <- glm(y ~ x1 * female + x2,
             data = df,
             family = "binomial")
```

**`emtrends`**

Note that the inferential results (the *z* and *p* values) are similar (though not identical) to those obtained using `emmeans`.

```
# log(odds)
emtrends(model, ~ female + x2, "x1",
         cov.reduce = list(x2 = mean_sd)) %>%
  contrast(method = "pairwise", by = "x2")
```

```
x2 = -1.020:
contrast estimate SE df z.ratio p.value
female0 - female1 0.584 0.248 Inf 2.350 0.0188
x2 = -0.007:
contrast estimate SE df z.ratio p.value
female0 - female1 0.584 0.248 Inf 2.350 0.0188
x2 = 1.006:
contrast estimate SE df z.ratio p.value
female0 - female1 0.584 0.248 Inf 2.350 0.0188
```

```
# diffs
emtrends(model, ~ female + x2, "x1",
         cov.reduce = list(x2 = mean_sd),
         trans = "response") %>%
  contrast(method = "pairwise", by = "x2")
```

```
x2 = -1.020:
contrast estimate SE df z.ratio p.value
female0 - female1 0.0107 0.04117 Inf 0.259 0.7957
x2 = -0.007:
contrast estimate SE df z.ratio p.value
female0 - female1 -0.0542 0.02399 Inf -2.259 0.0239
x2 = 1.006:
contrast estimate SE df z.ratio p.value
female0 - female1 -0.0195 0.00793 Inf -2.457 0.0140
```

```
# ratios
emtrends(model, ~ female + x2, "x1",
         cov.reduce = list(x2 = mean_sd),
         trans = "log", type = "response") %>%
  contrast(method = "pairwise", by = "x2")
```

```
x2 = -1.020:
contrast estimate SE df z.ratio p.value
female0 - female1 0.727 0.207 Inf 3.509 0.0004
x2 = -0.007:
contrast estimate SE df z.ratio p.value
female0 - female1 0.663 0.233 Inf 2.848 0.0044
x2 = 1.006:
contrast estimate SE df z.ratio p.value
female0 - female1 0.605 0.244 Inf 2.476 0.0133
```

Note that for Poisson models with a “log” link function, this is the same as working on the latent level!↩︎

But hey, whatever you do - don’t model binary / count data with a linear model, okay?↩︎

Experimental psychology is moving away from repeated-measures-ANOVAs, and towards linear mixed models (LMM^{1}). LMMs have many advantages over rmANOVA, including (but not limited to):

- Analysis of single trial data (as opposed to aggregated means per condition).
- Specifying more than one random factor (typically crossed random intercepts of *subject* and *item*).
- The use of continuous variables as predictors.
- ~~Making you look like you know what you’re doing.~~
- ~~Defeating the un-dead / reviewer 2.~~
- The ability to specify custom models.^{2}

This post will focus on this last point. Specifically, why you should always include main effects when modeling interactions, and what happens if you don’t (**spooky**).

Say you’ve finally won that grant you submitted to study candy consumption during ghostly themed holidays. As part of your first study, you decide to measure the effects of *costume type* (scary / cute) and level of neighborhood *decor* (high / low levels of house decorations) on the total weight of collected candy (in Kg). A simple, yet informative 2-by-2 design.

Being the serious scientist you are, you have several hypotheses:

- **A main effect for *decor level*** - neighborhoods with more decorations will overall give out more candy.
- **No main effect for *costume*** - overall, children with cute and scary costumes will receive the same amount of candy (in Kg).
- **A *decor level* × *costume* interaction** - high *decor* neighborhoods will favor scary costumes, while low *decor* neighborhoods will favor cute costumes.

It would only make sense to specify your statistical model accordingly - after all, why shouldn’t your model represent your hypotheses?

In R, such a model is described as `candy_kg ~ decor + decor:costume`, instructing R to model `candy_kg` as a function of the effect for `decor` + the interaction `decor:costume`.

And so, you fit the model:

```
options(contrasts = c('contr.sum', 'contr.poly')) # set effects coding (just once)
fit <- aov(candy_kg ~ decor + decor:costume, data = spooky_data)
```

| Term | df | SS | MS | F | p-value | η² |
|---|---|---|---|---|---|---|
| decor | 1 | 30.00 | 30.00 | 23.64 | <0.001 | 0.10 |
| decor:costume | 2 | 120.00 | 60.00 | 47.28 | <0.001 | 0.40 |
| Residuals | 116 | 147.20 | 1.27 | | | |

As predicted, you find both a significant main effect for *decor* and the *decor* × *costume* interaction, with the interaction explaining 40% of the variance in collected candy weight. So far so good - your results reflect your hypotheses!

But then you plot your data, and to your horror you find…

It looks like there is no interaction *at all*! Your interaction was nothing more than a ghost! An apparition! How is this possible?? Where has all of variance explained by it gone???

In fact, had you fit the full model, you would have found:

`fit <- aov(candy_kg ~ decor * costume, data = spooky_data)`

| Term | df | SS | MS | F | p-value | η² |
|---|---|---|---|---|---|---|
| decor | 1 | 30.00 | 30.00 | 23.64 | <0.001 | 0.10 |
| costume | 1 | 120.00 | 120.00 | 94.56 | <0.001 | 0.40 |
| decor:costume | 1 | 0.00 | 0.00 | 0.00 | >0.999 | 0.00 |
| Residuals | 116 | 147.20 | 1.27 | | | |

The interaction actually explains 0% of the variance! And the effect of *costume* is the one that explains 40% of the variance!^{3} How could this be?? Have we angered Fisher’s spirit somehow?

What happened was that because we did not account for *costume* in our model, the variance explained by *costume* was swallowed by the *decor* × *costume* interaction!

_{If you find math too scary, feel free to skip to conclusion.}

Travel back to *Intro to Stats*, and recall that the interaction’s sum-of-squares - $SS_{A\times B}$ - is calculated as:

$$SS_{A\times B} = n\sum_{j}\sum_{k}\left(\bar{Y}_{jk} - \bar{Y}_{j\cdot} - \bar{Y}_{\cdot k} + \bar{Y}_{\cdot\cdot}\right)^2$$

This is a simplification of the following equation:

$$SS_{A\times B} = n\sum_{j}\sum_{k}\left[\bar{Y}_{jk} - \left(\bar{Y}_{\cdot\cdot} + a_j + b_k\right)\right]^2$$

Where $a_j = \bar{Y}_{j\cdot} - \bar{Y}_{\cdot\cdot}$ represents the main effect for $A$ and $b_k = \bar{Y}_{\cdot k} - \bar{Y}_{\cdot\cdot}$ represents the main effect for $B$. We can see that $SS_{A\times B}$ represents the **deviation from the additive model** - i.e., it is the degree by which the observed cells’ means deviate from what would be expected if there were only the two main effects.

When we exclude the main effect of $B$ from our model, we are telling our model that there is no need to estimate the main effect. That is, we set $b_k = 0$. The resulting $SS_{A\times B}$ is computed not as above, but as:

$$SS_{A\times B}^{*} = n\sum_{j}\sum_{k}\left[\bar{Y}_{jk} - \left(\bar{Y}_{\cdot\cdot} + a_j\right)\right]^2$$

This formula represents the degree by which the observed cells’ means deviate from what would be expected if there was only the main effect of $A$. But now if the cells’ means deviate in a way that would otherwise have been part of a main effect for $B$, the cells’ deviations from the main effect for $A$ will include the deviations that would otherwise have been accounted for by a main effect of $B$! This results in the main effect for $B$ essentially getting “pooled” into $SS_{A\times B}^{*}$. Furthermore, had you also excluded a main effect for $A$, this effect too would have been “pooled” into the so-called interaction.

In other words:

When we don’t estimate (model) main effects, we change the meaning of interactions - they no longer represent a deviation from the additive model.

Sure, you can specify a model with no main effects and only interactions, but in such a case the interactions no longer mean what we expect them to mean. If we want interactions to represent deviation from the additive model, *our model must also include the additive model!*
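We can also check the "pooling" numerically - a sketch assuming the `spooky_data` and effects coding from above - by comparing the ANOVA tables of the full and misspecified models:

```r
# In the misspecified model, the SS belonging to the costume main
# effect is absorbed into the "interaction" term:
fit_full <- aov(candy_kg ~ decor * costume, data = spooky_data)
fit_miss <- aov(candy_kg ~ decor + decor:costume, data = spooky_data)

summary(fit_full) # SS(costume) = 120, SS(decor:costume) = 0
summary(fit_miss) # SS(decor:costume) = 120 - the main effect, "pooled" in
```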

For simplicity’s sake, this example has focused on a simple 2-by-2 between subject design, but the conclusions drawn here are relevant for any design in which a factor interacts with or moderates the effect of another factor or continuous variable.

```
library(tidyverse)

M <- c(1, 2, 3, 4) + 1 # 4 means with no interaction
n <- 30
e <- scale(rnorm(n), center = TRUE, scale = FALSE) # some error

spooky_data <- expand.grid(decor = c("Low", "High"),
                           costume = c("Scary", "Cute")) %>%
  mutate(candy_kg = map(M, ~ .x + e)) %>%
  unnest(cols = candy_kg)
```

Or hierarchical linear models (HLM)… or mixed linear models (MLM)…↩︎

Whereas in an ANOVA with 4 factors you always have: four main effects + six 2-way interactions + four 3-way interactions + one 4-way interaction.↩︎

Note also that the total model df is the same for both models, indicating the same number of parameters overall have been estimated in both. E.g., while in the full model we have 3 parameters - one for each main effect + one for the interaction - in the misspecified model we have one for the main effect, and two for the interaction. That is, no matter how you tell the model to split the dfs, the number of parameters needed to model 4 cells will always be 3.↩︎

In this post I will demonstrate:

- How to model
**both**group-level processes and individual-level processes from individual-level data, using linear-mixed models. - When you shouldn’t worry about group-level processes differing from individual-level processes. (controlled experiments for the win!)

Let’s work with the now-classic typing speed example. We take a group of 5 typists, and measure the speed of their typing (words per minute), and the rate of typing errors (errors per 100-words). Looking at the data we might get something like this:

```
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'
```

As we can see, we have two sources of variation that can be used to explain or predict the rate of errors:

- Overall, faster typists make fewer mistakes (group-level pattern).

- When typing faster, typists make more mistakes (individual-level pattern).

We can model these using linear mixed models, but first we need to *split* our predictor (*speed*) into two variables, each representing a different source of variance - each typist’s average typing speed, and the deviation of each measurement from the typist’s overall mean:^{2}

```
library(dplyr)

data <- data %>%
  group_by(ID) %>%
  mutate(speed_M = mean(speed),
         speed_E = speed - speed_M) %>%
  ungroup()

head(data)
```

```
# A tibble: 6 × 5
ID speed errors speed_M speed_E
<int> <dbl> <dbl> <dbl> <dbl>
1 1 -0.773 -1.74 -0.188 -0.585
2 1 -0.144 -0.703 -0.188 0.0438
3 1 -0.686 -1.73 -0.188 -0.498
4 1 0.560 1.17 -0.188 0.748
5 1 0.214 0.316 -0.188 0.402
6 1 0.179 0.392 -0.188 0.367
```

Let’s fit a linear mixed model and see how we can detect both patterns correctly.

```
library(lmerTest)

fit <- lmer(errors ~ speed_M + speed_E + (1 + speed_E | ID),
            data = data)
```

```
Your model may suffer from singularity (see `?lme4::isSingular` and
`?performance::check_singularity`).
Some of the standard errors and confidence intervals of the random
effects parameters are probably not meaningful!
```

| Parameter | Coefficient | SE | CI | CI_low | CI_high | t | df_error | p | Effects | Group |
|---|---|---|---|---|---|---|---|---|---|---|
| (Intercept) | 0.000 | 0.620 | 0.95 | -1.225 | 1.225 | 0.000 | 143 | 1.000 | fixed | |
| speed_M | -1.600 | 0.693 | 0.95 | -2.969 | -0.231 | -2.309 | 143 | 0.022 | fixed | |
| speed_E | 1.400 | 0.119 | 0.95 | 1.165 | 1.635 | 11.762 | 143 | 0.000 | fixed | |
| SD (Intercept) | 1.379 | 0.568 | 0.95 | 0.615 | 3.093 | NA | NA | NA | random | ID |
| SD (speed_E) | 0.000 | 49027.294 | 0.95 | 0.000 | Inf | NA | NA | NA | random | ID |
| Cor (Intercept~speed_E) | 1.000 | 297941.433 | 0.95 | NaN | 1.000 | NA | NA | NA | random | ID |
| SD (Observations) | 0.717 | 0.043 | 0.95 | 0.637 | 0.806 | NA | NA | NA | random | Residual |

As we can see, the slope for `speed_M` is negative (-1.6), reflecting the group-level pattern where typists who are overall faster have fewer errors; whereas the slope for `speed_E` is positive (1.4), reflecting the individual-level pattern where faster typing leads to more errors.

Experiments!

Or to be more precise, when *we* control the values of the independent variable. Why is this so? Because when we control the values of the independent variable, it cannot be split into different sources of variance: there is either variance between subjects (the variable is manipulated in a between-subjects design) or there is variance within subjects (the variable is manipulated in a within-subjects design), but never both. Thus, although there can be huge heterogeneity in the way subjects present an effect, the average individual-level effect will be the same as the group-level effect (depending on the design).^{3}

and the short twitter discussion that followed.↩︎

Read more in: Hoffman, L. (2015). Time-varying predictors in models of within-person fluctuation. In

*Longitudinal analysis: Modeling within-person fluctuation and change*(pp. 327-392). Routledge.↩︎Ignoring any differences or artifacts that may arise from the differences in the design itself, such as order effects, etc.↩︎

`lmer`

```
library(lme4)
data(obk.long, package = "afex") # data from the afex package
fit_mixed <- lmer(value ~ treatment * gender * phase * hour + (1|id),
data = obk.long)
```

Note that I assume here the data is aggregated (one value per cell/subject), as it would be in a rmANOVA, and so it is sufficient to model only a random intercept.

For this step we will be using `emmeans`

to get the estimates of the pairwise differences between the treatment groups within each phase of the study:

```
library(emmeans)

# get the correct reference grid with the correct multivariate levels!
rg <- ref_grid(fit_mixed, mult.levs = rm_levels)
rg
```

```
'emmGrid' object with variables:
treatment = control, A, B
gender = F, M
phase = fup, post, pre
hour = 1, 2, 3, 4, 5
```

```
# get the expected means:
em_ <- emmeans(rg, ~ phase * treatment)
```

`NOTE: Results may be misleading due to involvement in interactions`

`em_`

```
phase treatment emmean SE df lower.CL upper.CL
fup control 4.33 0.603 13.2 3.03 5.64
post control 4.08 0.603 13.2 2.78 5.39
pre control 4.25 0.603 13.2 2.95 5.55
fup A 7.25 0.661 13.2 5.82 8.68
post A 6.50 0.661 13.2 5.07 7.93
pre A 5.00 0.661 13.2 3.57 6.43
fup B 7.29 0.505 13.2 6.20 8.38
post B 6.62 0.505 13.2 5.54 7.71
pre B 4.17 0.505 13.2 3.08 5.26
Results are averaged over the levels of: gender, hour
Degrees-of-freedom method: kenward-roger
Confidence level used: 0.95
```

```
# run pairwise tests between the treatment groups within each phase
c_ <- contrast(em_, "pairwise", by = 'phase')
c_
```

```
phase = fup:
contrast estimate SE df t.ratio p.value
control - A -2.9167 0.895 13.2 -3.259 0.0157
control - B -2.9583 0.787 13.2 -3.760 0.0061
A - B -0.0417 0.832 13.2 -0.050 0.9986
phase = post:
contrast estimate SE df t.ratio p.value
control - A -2.4167 0.895 13.2 -2.700 0.0445
control - B -2.5417 0.787 13.2 -3.230 0.0166
A - B -0.1250 0.832 13.2 -0.150 0.9876
phase = pre:
contrast estimate SE df t.ratio p.value
control - A -0.7500 0.895 13.2 -0.838 0.6869
control - B 0.0833 0.787 13.2 0.106 0.9938
A - B 0.8333 0.832 13.2 1.002 0.5885
Results are averaged over the levels of: gender, hour
Degrees-of-freedom method: kenward-roger
P value adjustment: tukey method for comparing a family of 3 estimates
```

```
# extract the estimates
est_names <- c("fup: control - A", "fup: control - B", "fup: A - B",
               "post: control - A", "post: control - B", "post: A - B",
               "pre: control - A", "pre: control - B", "pre: A - B")
est_values <- summary(c_)$estimate
names(est_values) <- est_names
est_values
```

```
fup: control - A fup: control - B fup: A - B post: control - A
-2.91666667 -2.95833333 -0.04166667 -2.41666667
post: control - B post: A - B pre: control - A pre: control - B
-2.54166667 -0.12500000 -0.75000000 0.08333333
pre: A - B
0.83333333
```

Now let’s wrap this all in a function that accepts the fitted model as an argument:

```
treatment_phase_contrasts <- function(mod){
  rg <- ref_grid(mod, mult.levs = rm_levels)

  # get the expected means:
  em_ <- emmeans(rg, ~ phase * treatment)

  # run pairwise tests between the treatment groups within each phase
  c_ <- contrast(em_, "pairwise", by = 'phase')

  # extract the estimates
  est_names <- c("fup: control - A", "fup: control - B", "fup: A - B",
                 "post: control - A", "post: control - B", "post: A - B",
                 "pre: control - A", "pre: control - B", "pre: A - B")
  est_values <- summary(c_)$estimate
  names(est_values) <- est_names
  est_values
}

# test it
treatment_phase_contrasts(fit_mixed)
```

`NOTE: Results may be misleading due to involvement in interactions`

```
fup: control - A fup: control - B fup: A - B post: control - A
-2.91666667 -2.95833333 -0.04166667 -2.41666667
post: control - B post: A - B pre: control - A pre: control - B
-2.54166667 -0.12500000 -0.75000000 0.08333333
pre: A - B
0.83333333
```

Finally, we will use `lme4::bootMer`

to get the bootstrapped estimates!

```
treatment_phase_results <-
  bootMer(fit_mixed, treatment_phase_contrasts, nsim = 50) # R = 599 at least
```

`NOTE: Results may be misleading due to involvement in interactions`

`summary(treatment_phase_results) # original vs. bootstrapped estimate (bootMed)`

```
Number of bootstrap replications R = 50
original bootBias bootSE bootMed
fup: control - A -2.916667 0.017263 0.77841 -2.801902
fup: control - B -2.958333 -0.017880 0.86119 -3.025705
fup: A - B -0.041667 -0.035143 0.98850 -0.066474
post: control - A -2.416667 0.031072 0.82654 -2.383370
post: control - B -2.541667 -0.024860 0.82351 -2.520263
post: A - B -0.125000 -0.055932 1.03670 -0.216929
pre: control - A -0.750000 -0.065397 0.73276 -0.851533
pre: control - B 0.083333 0.024664 0.78592 0.111930
pre: A - B 0.833333 0.090061 0.95015 0.994195
```

`confint(treatment_phase_results, type = "perc") # does include zero?`

```
2.5 % 97.5 %
fup: control - A -5.062951 -1.2782764
fup: control - B -4.985715 -1.0325666
fup: A - B -2.348035 2.1660820
post: control - A -4.451445 -0.5162071
post: control - B -4.840519 -1.1705024
post: A - B -2.349137 2.3025369
pre: control - A -2.427992 0.8830127
pre: control - B -1.915388 1.7159931
pre: A - B -1.530049 2.7527436
```

Results indicate that the Control group is lower than both treatment groups in the post and fup (follow-up) phases.

If we wanted p-values, we could use this little function (based on this demo):

```
boot_pvalues <- function(x, side = c(0, -1, 1)) {
  # Based on:
  # https://blogs.sas.com/content/iml/2011/11/02/how-to-compute-p-values-for-a-bootstrap-distribution.html
  side <- side[1]
  x <- as.data.frame(x$t)

  ps <- sapply(x, function(.x) {
    s <- na.omit(.x)
    s0 <- 0
    N <- length(s)

    if (side == 0) {
      min((1 + sum(s >= s0)) / (N + 1),
          (1 + sum(s <= s0)) / (N + 1)) * 2
    } else if (side < 0) {
      (1 + sum(s <= s0)) / (N + 1)
    } else if (side > 0) {
      (1 + sum(s >= s0)) / (N + 1)
    }
  })

  setNames(ps, colnames(x))
}

boot_pvalues(treatment_phase_results)
```

```
fup: control - A fup: control - B fup: A - B post: control - A
0.03921569 0.03921569 0.94117647 0.03921569
post: control - B post: A - B pre: control - A pre: control - B
0.03921569 0.74509804 0.23529412 0.94117647
pre: A - B
0.27450980
```

These p-values can then be passed to `p.adjust()` for the p-value adjustment method of your choosing.

I’ve demonstrated (again!) how to run permutation tests on main effects / interactions, with follow-up analysis using the bootstrap method. Using this code as a basis for any analysis you might have in mind gives you all the flexibility of `emmeans`, which supports many (many) models!

Here is a short simulation-proof for this equality.

For this simulation we will need the following packages:

```
# For plotting
library(ggplot2)
# For extracting SDT parameters
library(neuropsychology)
```

We will also need to make sure, for the logistic regression analysis, that our factors’ dummy coding is set to effects-coding - otherwise the intercept’s meaning will not correspond to the criterion (aka the *overall* response bias):

`options(contrasts = c('contr.sum', 'contr.poly'))`
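To see what this setting changes, compare the two coding schemes for a two-level factor (a quick base-R check):

```r
# Default dummy (treatment) coding: 0 / 1
contr.treatment(2)

# Effects (sum-to-zero) coding: +1 / -1
contr.sum(2)
```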

```
n <- 100L
B <- 100L
```

We’ll run 100 simulations with 100 trials each.

```
set.seed(1)

SDT_params <- function(state, resp) {
  tab <- table(state, resp)
  sdt_res <- dprime(
    n_hit  = tab[2, 2],
    n_miss = tab[2, 1],
    n_fa   = tab[1, 2],
    n_cr   = tab[1, 1]
  )
  c(sdt_res$dprime, sdt_res$c)
}

logistic_reg_params <- function(state, resp) {
  fit <- glm(resp ~ state, family = binomial())
  coef(fit)
}

# initialize
res <- data.frame(d_    = numeric(B),
                  c_    = numeric(B),
                  int   = numeric(B),
                  slope = numeric(B))

# Loop
for (b in seq_len(B)) {
  true_sensitivity <- rexp(1, 10)      # random
  true_criterion   <- runif(1, -1, 1)  # random

  # true state vector
  state_i <- rep(c(FALSE, TRUE), each = n / 2)

  # response vector
  Xn <- rnorm(n / 2)                          # noise dist
  Xs <- rnorm(n / 2, mean = true_sensitivity) # signal + noise dist
  X <- c(Xn, Xs)
  resp_i <- X > true_criterion

  # SDT params
  res[b, 1:2] <- SDT_params(state_i, resp_i)

  # logistic regression params
  res[b, 3:4] <- logistic_reg_params(state_i, resp_i)
}
```

SDT parameters are on a standardized normal scale, meaning they are scaled to a standard deviation of 1. However, the logistic distribution’s standard deviation is π/√3 ≈ 1.81. Thus, to convert the logistic regression’s parameters to the SDT’s, we need to scale both the *intercept* and the *slope* by √3/π to have them on the same scale as *d′* and *c*.^{1} Additionally,

- The slope must also be scaled by 2, because R’s default effects coding (±1) makes the difference between the signal and noise conditions twice the slope.

- The intercept must also be scaled by −1 - see the paper for the full rationale.
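Putting the pieces together, the conversion can be sketched with made-up coefficient values (these are not fitted values, and the sign of the slope term may flip depending on the factor’s level order):

```r
# Hypothetical logistic regression coefficients under effects (+1/-1) coding
b0 <- -0.35  # intercept
b1 <-  0.95  # slope

# Convert to the probit/SDT scale:
c_hat <- -b0 * sqrt(3) / pi      # criterion c
d_hat <- b1 * 2 * sqrt(3) / pi   # sensitivity d'
```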

The red-dashed line represents the expected regression line predicting the SDT parameters from their logistic counterparts:

(The blue line is the empirical regression line.)
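For comparison, with a `probit` link the coefficients land directly on the SDT scale. Here is a quick sketch on simulated data with hypothetical true parameter values; note it uses numeric 0/1 coding for *state* (unlike the effects coding above), so *d′* is simply the slope and *c* is minus the intercept:

```r
# Simulated SDT data with hypothetical true values: d' = 1.5, criterion = 0.3
set.seed(42)
state <- rep(c(0, 1), each = 500)     # 0 = noise, 1 = signal (numeric coding)
x <- rnorm(1000, mean = 1.5 * state)  # latent "evidence"
resp <- as.numeric(x > 0.3)           # respond "signal" above the criterion

fit <- glm(resp ~ state, family = binomial("probit"))
d_hat <- coef(fit)["state"]         # ~ d', no scaling needed
c_hat <- -coef(fit)["(Intercept)"]  # ~ criterion
```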

This equality could also be extended to multi-level designs with generalized linear mixed models (GLMM; see chapters 3 and 9 in *Modeling Psychophysical Data in R*)^{2}. I haven’t tried this myself, but I see no reason it wouldn’t be possible: one could model random effects per subject, and the moderating effect of some covariate on sensitivity could in theory be modeled by including its interaction with *state*; similarly, its moderating effect on the criterion could be modeled by including it as a main effect (moderating the intercept).

Note that there is no need for this scaling if the `probit` link is used instead of the `logit` link. This demo has focused on the `logit` link only because it is the more popular option (insert “old man yells at cloud”-gif here).↩︎

Thanks to Kenneth Knoblauch for reaching out!↩︎

In this post I will demonstrate how to run a permutation test ANOVA (easy!) and how to run bootstrap follow-up analysis (a bit more challenging) in a mixed design (both within- and between-subject factors) ANOVA. I’ve chosen to use a mixed design model in this demonstration for two reasons:

- I’ve never seen this done before.
- You can easily modify this code (change / skip some of these steps) to accommodate purely within- or purely between-subject designs.

Running a permutation test for your ANOVA in R is as easy as… running an ANOVA in R, but substituting `aov` with `aovperm` from the `permuco` package.

```
library(permuco)
data(obk.long, package = "afex") # data from the afex package
# permutation anova
fit_mixed_p <-
aovperm(value ~ treatment * gender * phase * hour + Error(id / (phase * hour)),
data = obk.long)
```

```
Warning in checkBalancedData(fixed_formula = formula_f, data = cbind(y, : The
data are not balanced, the results may not be exact.
```

`fit_mixed_p`

term | SSn | dfn | SSd | dfd | MSEn | MSEd | F | parametric P(>F) | resampled P(>F) |
---|---|---|---|---|---|---|---|---|---|
treatment | 179.73 | 2 | 228.06 | 10 | 89.87 | 22.81 | 3.94 | 0.055 | 0.055 |
gender | 83.45 | 1 | 228.06 | 10 | 83.45 | 22.81 | 3.66 | 0.085 | 0.082 |
treatment:gender | 130.24 | 2 | 228.06 | 10 | 65.12 | 22.81 | 2.86 | 0.104 | 0.104 |
phase | 129.51 | 2 | 80.28 | 20 | 64.76 | 4.01 | 16.13 | <0.001 | <0.001 |
treatment:phase | 77.89 | 4 | 80.28 | 20 | 19.47 | 4.01 | 4.85 | 0.007 | 0.009 |
gender:phase | 2.27 | 2 | 80.28 | 20 | 1.14 | 4.01 | 0.28 | 0.757 | 0.765 |
treatment:gender:phase | 10.22 | 4 | 80.28 | 20 | 2.56 | 4.01 | 0.64 | 0.642 | 0.641 |
hour | 104.29 | 4 | 62.50 | 40 | 26.07 | 1.56 | 16.69 | <0.001 | <0.001 |
treatment:hour | 1.17 | 8 | 62.50 | 40 | 0.15 | 1.56 | 0.09 | >0.999 | >0.999 |
gender:hour | 2.81 | 4 | 62.50 | 40 | 0.70 | 1.56 | 0.45 | 0.772 | 0.772 |
treatment:gender:hour | 7.76 | 8 | 62.50 | 40 | 0.97 | 1.56 | 0.62 | 0.755 | 0.755 |
phase:hour | 11.35 | 8 | 96.17 | 80 | 1.42 | 1.20 | 1.18 | 0.322 | 0.319 |
treatment:phase:hour | 6.64 | 16 | 96.17 | 80 | 0.42 | 1.20 | 0.35 | 0.990 | 0.990 |
gender:phase:hour | 8.96 | 8 | 96.17 | 80 | 1.12 | 1.20 | 0.93 | 0.496 | 0.498 |
treatment:gender:phase:hour | 14.15 | 16 | 96.17 | 80 | 0.88 | 1.20 | 0.74 | 0.750 | 0.753 |

The results of the permutation test suggest an interaction between Treatment (a between-subject factor) and Phase (a within-subject factor). To fully understand this interaction, we would like to conduct some sort of follow-up analysis (planned comparisons or post hoc tests). Usually I would pass the model to `emmeans` for any follow-ups, but here, due to our assumption violations, we need some sort of bootstrapping method.

`car`

and `emmeans`

For bootstrapping we will be using the `Boot` function from the `car` package. In the case of within-subject factors, this function requires that they be specified in a multivariate data structure. Let’s see how this is done.

```
library(dplyr)
library(tidyr)
obk_mixed_wide <- obk.long %>%
unite("cond", phase, hour) %>%
spread(cond, value)
head(obk_mixed_wide)
```

```
id treatment gender age fup_1 fup_2 fup_3 fup_4 fup_5 post_1 post_2 post_3
1 1 control M -4.75 2 3 2 4 4 3 2 5
2 2 control M -2.75 4 5 6 4 1 2 2 3
3 3 control M 1.25 7 6 9 7 6 4 5 7
4 4 control F 7.25 4 4 5 3 4 2 2 3
5 5 control F -5.75 4 3 6 4 3 6 7 8
6 6 A M 7.25 9 10 11 9 6 9 9 10
post_4 post_5 pre_1 pre_2 pre_3 pre_4 pre_5
1 3 2 1 2 4 2 1
2 5 3 4 4 5 3 4
3 5 4 5 6 5 7 7
4 5 3 5 4 7 5 4
5 6 3 3 4 6 4 3
6 8 9 7 8 7 9 9
```

This is not enough, as we *also* need our new columns (representing the different levels of the within subject factors) to be in a matrix column.

```
obk_mixed_matrixDV <- obk_mixed_wide %>%
select(id, age, treatment, gender)
obk_mixed_matrixDV$M <- obk_mixed_wide %>%
select(-id, -age, -treatment, -gender) %>%
as.matrix()
glimpse(obk_mixed_matrixDV)
```

```
Rows: 16
Columns: 5
$ id <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
$ age <dbl> -4.75, -2.75, 1.25, 7.25, -5.75, 7.25, 8.25, 2.25, 2.25, -7.…
$ treatment <fct> control, control, control, control, control, A, A, A, A, B, …
$ gender <fct> M, M, M, F, F, M, M, F, F, M, M, M, F, F, F, F
$ M <dbl[,15]> <matrix[16 x 15]>
```

`fit_mixed <- aov(M ~ treatment * gender, obk_mixed_matrixDV)`

Note that the left-hand side of the formula (the `M`) is a matrix representing all the within-subject factors and their levels, and the right-hand side of the formula (`treatment * gender`) includes only the between-subject factors.

For this step we will be using `emmeans`. But first, we need to create a list of the within-subject factors and their levels (I did say this was difficult - bear with me!). This list needs to correspond to the order of the multivariate column in our data, such that if there is more than one factor, the combinations of factor levels are used in `expand.grid` order. In our case:

`colnames(obk_mixed_matrixDV$M)`

```
[1] "fup_1" "fup_2" "fup_3" "fup_4" "fup_5" "post_1" "post_2" "post_3"
[9] "post_4" "post_5" "pre_1" "pre_2" "pre_3" "pre_4" "pre_5"
```

```
rm_levels <- list(hour = c("1", "2", "3", "4", "5"),
phase = c("fup", "post", "pre"))
```

Make sure you get the order of the variables and their levels correct - this will affect your results!
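As a sanity check, we can reproduce the expected column order from `rm_levels` with `expand.grid()` itself (the first factor varies fastest):

```r
# Rebuild the implied column order: hour varies fastest, then phase
grid <- expand.grid(hour  = c("1", "2", "3", "4", "5"),
                    phase = c("fup", "post", "pre"))
nm <- paste(grid$phase, grid$hour, sep = "_")
nm  # should match colnames(obk_mixed_matrixDV$M): "fup_1", "fup_2", ...
```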

Let’s use `emmeans` to get the estimates of the pairwise differences between the treatment groups within each phase of the study:

```
library(emmeans)
# get the correct reference grid with the correct multivariate levels!
rg <- ref_grid(fit_mixed, mult.levs = rm_levels)
rg
```

```
'emmGrid' object with variables:
treatment = control, A, B
gender = F, M
hour = multivariate response levels: 1, 2, 3, 4, 5
phase = multivariate response levels: fup, post, pre
```

```
# get the expected means:
em_ <- emmeans(rg, ~ phase * treatment)
em_
```

```
phase treatment emmean SE df lower.CL upper.CL
fup control 4.33 0.551 10 3.11 5.56
post control 4.08 0.628 10 2.68 5.48
pre control 4.25 0.766 10 2.54 5.96
fup A 7.25 0.604 10 5.90 8.60
post A 6.50 0.688 10 4.97 8.03
pre A 5.00 0.839 10 3.13 6.87
fup B 7.29 0.461 10 6.26 8.32
post B 6.62 0.525 10 5.45 7.80
pre B 4.17 0.641 10 2.74 5.59
Results are averaged over the levels of: gender, hour
Confidence level used: 0.95
```

```
# run pairwise tests between the treatment groups within each phase
c_ <- contrast(em_, "pairwise", by = 'phase')
c_
```

```
phase = fup:
contrast estimate SE df t.ratio p.value
control - A -2.9167 0.818 10 -3.568 0.0129
control - B -2.9583 0.719 10 -4.116 0.0054
A - B -0.0417 0.760 10 -0.055 0.9983
phase = post:
contrast estimate SE df t.ratio p.value
control - A -2.4167 0.931 10 -2.595 0.0634
control - B -2.5417 0.819 10 -3.105 0.0275
A - B -0.1250 0.865 10 -0.144 0.9886
phase = pre:
contrast estimate SE df t.ratio p.value
control - A -0.7500 1.136 10 -0.660 0.7911
control - B 0.0833 0.999 10 0.083 0.9962
A - B 0.8333 1.056 10 0.789 0.7177
Results are averaged over the levels of: gender, hour
P value adjustment: tukey method for comparing a family of 3 estimates
```

```
# extract the estimates
est_names <- c("fup: control - A", "fup: control - B", "fup: A - B",
"post: control - A", "post: control - B", "post: A - B",
"pre: control - A", "pre: control - B", "pre: A - B")
est_values <- summary(c_)$estimate
names(est_values) <- est_names
est_values
```

```
fup: control - A fup: control - B fup: A - B post: control - A
-2.91666667 -2.95833333 -0.04166667 -2.41666667
post: control - B post: A - B pre: control - A pre: control - B
-2.54166667 -0.12500000 -0.75000000 0.08333333
pre: A - B
0.83333333
```

Now let’s wrap this all in a function that accepts the fitted model as an argument:

```
treatment_phase_contrasts <- function(mod){
rg <- ref_grid(mod, mult.levs = rm_levels)
# get the expected means:
em_ <- emmeans(rg, ~ phase * treatment)
# run pairwise tests between the treatment groups within each phase
c_ <- contrast(em_, "pairwise", by = 'phase')
# extract the estimates
est_names <- c("fup: control - A", "fup: control - B", "fup: A - B",
"post: control - A", "post: control - B", "post: A - B",
"pre: control - A", "pre: control - B", "pre: A - B")
est_values <- summary(c_)$estimate
names(est_values) <- est_names
est_values
}
# test it
treatment_phase_contrasts(fit_mixed)
```

```
fup: control - A fup: control - B fup: A - B post: control - A
-2.91666667 -2.95833333 -0.04166667 -2.41666667
post: control - B post: A - B pre: control - A pre: control - B
-2.54166667 -0.12500000 -0.75000000 0.08333333
pre: A - B
0.83333333
```

Finally, we will use `car::Boot` to get the bootstrapped estimates!

```
library(car)
treatment_phase_results <-
Boot(fit_mixed, treatment_phase_contrasts, R = 50) # R = 599 at least
```

`Loading required namespace: boot`

`summary(treatment_phase_results) # original vs. bootstrapped estimate (bootMed)`

```
Number of bootstrap replications R = 31
original bootBias bootSE bootMed
fup: control - A -2.916667 0.044892 0.65137 -2.8333e+00
fup: control - B -2.958333 -0.026805 0.82950 -3.0000e+00
fup: A - B -0.041667 -0.071697 0.40960 -1.6667e-01
post: control - A -2.416667 -0.011444 0.74882 -2.5000e+00
post: control - B -2.541667 0.048310 0.94075 -2.4167e+00
post: A - B -0.125000 0.059754 0.64484 4.3374e-15
pre: control - A -0.750000 -0.129339 0.63190 -7.0000e-01
pre: control - B 0.083333 -0.099923 1.01857 9.1667e-02
pre: A - B 0.833333 0.029416 0.89102 8.3333e-01
```

`confint(treatment_phase_results, type = "perc") # does include zero?`

```
Bootstrap percent confidence intervals
2.5 % 97.5 %
fup: control - A -4.0000000 -1.8000000
fup: control - B -4.3571429 -1.4000000
fup: A - B -0.8571429 0.7083333
post: control - A -4.0000000 -1.3000000
post: control - B -4.0000000 -0.7500000
post: A - B -1.3809524 1.0000000
pre: control - A -2.0000000 0.7500000
pre: control - B -2.2500000 2.0416667
pre: A - B -0.9666667 2.2083333
```

Results indicate that the Control group is lower than both treatment groups in the post and fup (follow-up) phases.

If we wanted p-values, we could use this little function (based on this demo):

```
boot_pvalues <- function(x, side = c(0, -1, 1)) {
  # Based on:
  # https://blogs.sas.com/content/iml/2011/11/02/how-to-compute-p-values-for-a-bootstrap-distribution.html
  side <- side[1]
  x <- as.data.frame(x$t)

  ps <- sapply(x, function(.x) {
    s <- na.omit(.x)
    s0 <- 0
    N <- length(s)

    if (side == 0) {
      min((1 + sum(s >= s0)) / (N + 1),
          (1 + sum(s <= s0)) / (N + 1)) * 2
    } else if (side < 0) {
      (1 + sum(s <= s0)) / (N + 1)
    } else if (side > 0) {
      (1 + sum(s >= s0)) / (N + 1)
    }
  })

  setNames(ps, colnames(x))
}
boot_pvalues(treatment_phase_results)
```

```
fup: control - A fup: control - B fup: A - B post: control - A
0.0625 0.0625 0.6875 0.0625
post: control - B post: A - B pre: control - A pre: control - B
0.0625 0.9375 0.1250 0.9375
pre: A - B
0.3750
```

These p-values can then be passed to `p.adjust()` for the p-value adjustment method of your choosing.
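For example, applying a Holm adjustment to the p-values printed above (values copied from the output):

```r
# Bootstrap p-values from the output above
ps <- c(0.0625, 0.0625, 0.6875, 0.0625, 0.0625, 0.9375, 0.1250, 0.9375, 0.3750)

# Holm step-down adjustment for multiple comparisons
p.adjust(ps, method = "holm")
```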

I’ve demonstrated how to run permutation tests on main effects / interactions, with follow-up analysis using the bootstrap method. Using this code as a basis for any analysis you might have in mind gives you all the flexibility of `emmeans`, which supports many (many) models!

What is the probability of winning a Quidditch match without catching the snitch?

Quidditch is a fictional team sport played on broomsticks, devised by author J. K. Rowling, that is played by wizards in the fantasy world of Harry Potter. Like any sport, the game of Quidditch has many rules, but I will summarize here only the ones pertinent to our discussion:

- 10 points are gained by throwing the Quaffle through one of the opponent team’s three hoops.
- 150 points are gained by catching the Golden Snitch, which can only be caught by the Seeker.
- The game ends when the Golden Snitch is caught.
- The winning team is the one that has the most points at the end of the game.

Given these rules and game objectives, in order to estimate the probability of winning a match we must first estimate (1) how long it takes a seeker to catch the Snitch on average, which determines how long a game lasts and thus how much time is allotted for the teams to score goals, and (2) the rate of goal scoring.

Unfortunately, not many pro-Quidditch matches are described in the books, so this is quite hard to estimate. On the one hand, the only game that is described in its entirety is the 1994 Quidditch World Cup final, and it was extremely short: only 15 minutes or so; on the other hand, the longest game ever lasted 3 months. So the average is somewhere between 15 minutes and 130,000 minutes… great.

For the sake of our simulations, then, I will assume the average game lasts 90 minutes (assuming Rowling has soccer or rugby in mind). I will also assume that the length of a Quidditch game is exponentially distributed, since the probability of catching the Snitch remains constant (as opposed to it getting easier as the game progresses) - the Snitch’s ability to escape capture does not change (magic) and players are substituted throughout the game, so let’s assume an unchanging level of vigilance.

Together, the distribution of game lengths looks something like this:

This makes the 3-month-long game very unlikely (p ≈ 0), but the 15-minute game rather probable (p ≈ .15).
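These two probabilities fall straight out of R’s exponential distribution functions, assuming a mean game length of 90 minutes:

```r
# Under an exponential distribution with a mean of 90 minutes (rate = 1/90):
pexp(15, rate = 1 / 90)  # P(game ends within 15 minutes), ~0.15

# P(game lasts 3 months or more) - effectively zero
pexp(3 * 30 * 24 * 60, rate = 1 / 90, lower.tail = FALSE)
```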

Throughout the books, the final scores of several Quidditch games are mentioned, but since we don’t know their lengths, it is hard to estimate the rate of goals-per-minute (GPM). Going by the 15-minute world cup, in which the final score was 170-160, or 170-10 if only counting goal points, we can see that there were 18 goals in 15 minutes - an average of 9 goals per 15 minutes per team, or a GPM of 0.6 (0.07 for Bulgaria and 1.13 for Ireland). Though this is only one game, and a very short one at that, these numbers seem close to the averages seen in basketball, where GPM is around 40/48 = 0.83. This is also in line with the game’s depiction in the films as a fast sport, with many goals scored quickly and in succession.

I will also assume that this rate is constant throughout the game, regardless of the score (or the fatigue of the players, see above), and so will assume the number of goals follows a Poisson distribution with an average GPM of 0.8, making the GPM distribution look something like this:
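For a concrete feel of this assumption, here is what a Poisson model with GPM = 0.8 implies for a single team in a 90-minute game:

```r
# Expected number of goals for one team in a 90-minute game at 0.8 goals/min
lambda <- 0.8 * 90  # 72 goals on average

dpois(72, lambda)  # probability of scoring exactly 72 goals
ppois(59, lambda)  # probability of scoring fewer than 60 goals
```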


(If you’re not that into reading raw code, you can just skip to the conclusions at the end.)

```
set.seed(1994)

sim_quidditch <- function(mean_game_durA, mean_game_durB, GPM_A, GPM_B,
                          snitch_points = 150, goal_points = 10,
                          n_sims = 1e6) {
  # Sim game durations and snitch points ------------------------------------
  mean_game_dur_rateA <- 1 / mean_game_durA
  mean_game_dur_rateB <- 1 / mean_game_durB

  game_dursA <- rexp(n_sims, rate = mean_game_dur_rateA)
  game_dursB <- rexp(n_sims, rate = mean_game_dur_rateB)

  # Final game durations
  game_durs <- ifelse(game_dursA < game_dursB, game_dursA, game_dursB)

  # Who got the snitch (and the points)
  got_snitch <- game_dursA < game_dursB
  snitch_pointsA <- got_snitch * snitch_points
  snitch_pointsB <- (1 - got_snitch) * snitch_points

  # Sim goals and goal points -----------------------------------------------
  # Based on the length of the game
  goal_pointsA <- rpois(n_sims, GPM_A * floor(game_durs)) * goal_points
  goal_pointsB <- rpois(n_sims, GPM_B * floor(game_durs)) * goal_points

  # Find winner -------------------------------------------------------------
  total_pointA <- snitch_pointsA + goal_pointsA
  total_pointB <- snitch_pointsB + goal_pointsB

  who_won <- rep(NA_character_, n_sims)
  who_won[total_pointA > total_pointB] <- "A"
  who_won[total_pointA < total_pointB] <- "B"

  list(teamA_points = total_pointA,
       teamB_points = total_pointB,
       Winner = factor(who_won),
       Got_snitch = factor(ifelse(got_snitch, "A", "B")))
}
```

Let’s start by simulating 1 million games between two teams with equal GPMs and equally good seekers.

```
res <- sim_quidditch(mean_game_durA = 90, # in mins
mean_game_durB = 90,
GPM_A = 0.8,
GPM_B = 0.8)
with(res,{
table(Winner) /
length(Winner)
})
```

```
Winner
A B
0.497066 0.496060
```

Under these conditions, the probability of winning a game without catching the snitch is:

```
with(res, {
mean(Winner[!is.na(Winner)] != Got_snitch[!is.na(Winner)])
})
```

`[1] 0.03747158`

3.75%! Hmm, that does seem quite unfair… Though it does put Fred and George’s gambling odds at around 1:20!

How many games were determined by the snitch - i.e., games in which, had the snitch been caught by the other team, the result would have flipped?

```
with(res, {
(
sum(teamA_points[Winner == 'A'] - 150 < teamB_points[Winner == 'A'] + 150, na.rm = T) +
sum(teamA_points[Winner == 'B'] + 150 > teamB_points[Winner == 'B'] - 150, na.rm = T)
) / 1e6
})
```

`[1] 0.94918`

94.9%! Not the best of odds…

But just like in real sports, different teams have different strengths… Let’s see what happens when we create two teams who have an equal probability of winning, but differ in terms of their seekers and GPMs.

After some trial and error, I found that if we have a team with a seeker who catches the snitch within 67 minutes on average and a GPM of 0.71, versus a team with a seeker who catches the snitch within 113 minutes on average but with a GPM of 0.90, both teams have a 50% chance of winning.

```
res <- sim_quidditch(mean_game_durA = 67, # in mins
mean_game_durB = 113,
GPM_A = 0.71,
GPM_B = 0.90)
with(res,{
table(Winner)/length(Winner)
})
```

```
Winner
A B
0.497701 0.490796
```

So, what is the chance of winning without catching the snitch?

```
with(res,{
mean(Winner[!is.na(Winner)] != Got_snitch[!is.na(Winner)])
})
```

`[1] 0.1213732`

12.1%! Still pretty low… very disappointing. Let’s explore each team separately:

```
with(res,{
table(Winner, Got_snitch)
})
```

```
Got_snitch
Winner A B
A 497218 483
B 119494 371302
```

It seems that 99.9% of team A’s wins were due to their better seeker…

…but that 24.3% of team B’s wins were achieved without their weaker seeker catching the snitch!

How many games were determined by the snitch?

```
with(res, {
(
sum(teamA_points[Winner == 'A'] - 150 < teamB_points[Winner == 'A'] + 150, na.rm = T) +
sum(teamA_points[Winner == 'B'] + 150 > teamB_points[Winner == 'B'] - 150, na.rm = T)
) / 1e6
})
```

`[1] 0.901089`

90.1%!

- If you’ve caught the snitch, you’ve probably won.
- If you’ve won, you probably caught the snitch.
- If you’ve lost, catching the snitch would have probably changed that.

But… the majority of points are non-snitch related. And it seems unlikely that you would have a seeker twice as good as the opposing team’s, so compensating for a weaker seeker with a higher GPM seems very plausible given this simulation. So it’s probably best to invest in a better chaser who can bring up your GPM and a better keeper who can lower your opponents’ GPM, rather than in a seeker who can shave 5 minutes off their snitch-catching time…

Also, maybe someone should change the rules? Perhaps the snitch should only be worth 30 points?

In the blind auditions, candidates have 90 minutes to prepare a dessert, confection or other baked good, which is then sent to the show’s four judges for a blind taste test. The judges don’t see the candidate, and know nothing about him or her, until after they deliver their decision - A “pass” decision is signified by the judge granting the candidate a kitchen knife; a “fail” decision is signified by the judge not granting a knife. A candidate who receives a knife from at least three of the judges continues on to the next stage, the training camp.

*The 4 judges and host, from left to right: Yossi Shitrit, Erez Komarovsky, Miri Bohadana (host), Assaf Granit, and Moshik Roth*

I’ve watched all 4 episodes of the blind auditions (for purely academic reasons!), for a total of 31 candidates. For each dish, I recorded each of the four judges’ verdict (fail / pass).

Let’s load the data.^{1}

```
df <- read.csv('IJR.csv')
head(df)
```

```
Moshik Assaf Erez Yosi
1 1 1 1 1
2 1 0 0 0
3 1 1 1 0
4 1 0 0 0
5 0 1 1 1
6 1 1 1 1
```

We will need the following packages:

```
library(dplyr) # for manipulating the data
library(magrittr) # <3
library(psych) # for computing Cohen's Kappa
library(corrr) # for manipulating matrices
```

We can now use the `psych` package to compute Cohen’s Kappa (κ), a coefficient of inter-rater agreement for categorical items, such as the fail/pass categories we have here. κ ranges from -1 (full disagreement) through 0 (no pattern of agreement) to +1 (full agreement). Normally, κ is computed between two raters, and for the case of more than two raters, the mean across all pair-wise raters is used.

`cohen.kappa(df)`

```
Cohen Kappa (below the diagonal) and Weighted Kappa (above the diagonal)
For confidence intervals and detail print with all=TRUE
Moshik Assaf Erez Yosi
Moshik 1.00 0.34 0.17 0.29
Assaf 0.34 1.00 0.17 0.68
Erez 0.17 0.17 1.00 -0.16
Yosi 0.29 0.68 -0.16 1.00
Average Cohen kappa for all raters 0.25
Average weighted kappa for all raters 0.25
```

We can see that overall agreement is quite low (an average κ of .25) - surprisingly low for what might be expected from a group of pristine, accomplished, professional chefs and bakers in a blind taste test.

When examining the pair-wise coefficients, we can also see that Erez seems to be in lowest agreement with each of the other judges (and even in a slight disagreement with Yossi!). This might be because Erez is new on the show (this is his first season as judge), but it might also be because of the four judges, he is the only one who is actually a confectioner (the other 3 are restaurant chefs).

For curiosity’s sake, let’s also look at the κ coefficient between each judge’s rating and the total fail/pass decision, based on whether a dish got a “pass” from at least 3 judges.

```
df <- df %>%
mutate(PASS = as.numeric(rowSums(.) >= 3))
head(df)
```

```
Moshik Assaf Erez Yosi PASS
1 1 1 1 1 1
2 1 0 0 0 0
3 1 1 1 0 1
4 1 0 0 0 0
5 0 1 1 1 1
6 1 1 1 1 1
```

We can now use the wonderful new `corrr` package, which is intended for exploring correlations, but can also generally be used to manipulate any symmetric matrix in a tidy fashion.

```
cohen.kappa(df)$cohen.kappa %>%
as_cordf() %>%
focus(PASS)
```

```
# A tibble: 4 × 2
term PASS
<chr> <dbl>
1 Moshik 0.619
2 Assaf 0.746
3 Erez 0.159
4 Yosi 0.676
```

Perhaps unsurprisingly (to people familiar with previous seasons of the show), it seems that Assaf’s judgment of a dish is a good indicator of whether or not a candidate will continue on to the next stage. Also, once again we see that Erez is barely aligned with the other judges’ total decision.

Every man to his taste…

Even among the experts there is little agreement on what is fine cuisine and what is not worth a doggy bag. Having said that, if you still have your heart set on competing in Game of Chefs, it seems that you should at least appeal to Assaf’s palate.

Bon Appetit!

The data is available here↩︎

… or how I stopped trusting standardized coefficients.

Say you want to know how perceived importance of job security is related to feelings of job satisfaction. Let’s also say that you suspect that this relationship might differ between Gen-X-ers and millennials with their MTV and their hip-hop. So you conduct a survey, asking 100 Gen-X-ers and 100 millennials to rank how important they think job security is when job searching and their current job satisfaction. You then plug your data into a 2-step hierarchical regression to test the moderating effect of Generation on the effect of Job Security in your favorite stats program (If you’re a Gen-X-er, SPSS. If you’re a millennial, or a reasonable human being, R). You find:

Term | Estimate | SE | t-value | p-value |
---|---|---|---|---|
(Intercept) | 4.400 | 0.393 | 11.203 | <0.001 |
Group [Millennial] | 0.575 | 0.245 | 2.343 | 0.020 |
Job Security | 0.450 | 0.046 | 9.717 | <0.001 |

Term | Estimate | SE | t-value | p-value |
---|---|---|---|---|
(Intercept) | 3.20 | 0.527 | 6.076 | <0.001 |
Group [Millennial] | 2.45 | 0.613 | 3.994 | <0.001 |
Job Security | 0.60 | 0.064 | 9.391 | <0.001 |
Group [Millennial] * Job Security | -0.30 | 0.090 | -3.320 | 0.001 |

“Very interesting”, you think to yourself. But before drafting your Science submission, you decide to plot your data, just to make sure this trend isn’t somehow driven by outliers, and you find that it is not!


“This is great!” you confidently think to yourself, “The relationship between the perceived importance of job security and reported job satisfaction is weaker among millennials!”. But how much weaker? You extract the simple slopes for both groups:

Group | Simple Slope | SE | df | t-value | p-value |
---|---|---|---|---|---|
Gen-X-ers | 0.6 | 0.064 | 196 | 9.391 | <0.001 |
Millennials | 0.3 | 0.064 | 196 | 4.696 | <0.001 |

A slope of 0.6 for Gen-X-ers and of 0.3 for Millennials. You find it hard to interpret these slopes, so you do the next reasonable thing: you decide to standardize job security importance and job satisfaction.
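The standardization step itself is just `scale()` - a sketch with hypothetical variable names and simulated data:

```r
# Sketch of the standardization step (hypothetical variable names, simulated data)
set.seed(1)
d <- data.frame(job_security = rnorm(200, mean = 5, sd = 2))
d$satisfaction <- 4 + 0.45 * d$job_security + rnorm(200)

# Center and scale both variables to mean 0, SD 1
d$job_security_z <- as.numeric(scale(d$job_security))
d$satisfaction_z <- as.numeric(scale(d$satisfaction))
```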

Just to make sure nothing funny happened, you plot your data again:


As expected - your data is now scaled and centered! You compute the simple standardized slopes and find that:

Group | Simple Slope | SE | df | t-value | p-value |
---|---|---|---|---|---|
Gen-X-ers | 0.963 | 0.103 | 196 | 9.391 | <0.001 |
Millennials | 0.482 | 0.103 | 196 | 4.696 | <0.001 |

“Amazing,” you exclaim to yourself, “the relationship between importance of job security and job satisfaction is nearly perfect among Gen-X-ers, and is half as strong among millennials!” You open MS Word and start typing away…

But wait - looking at the plot above, it is clear that the *correlation* among Gen-X-ers is far from perfect! How is this possible? What’s really going on here?

The short answer is that the strength of the association (i.e., the correlation) does not equal the strength of the effect (i.e., the slope). Plotting the relationship for each group separately will shed some light on what’s actually going on:

As is now apparent, the strength of the correlation is equal between the groups - if we were to calculate the correlation between the two variables within each group, we would find that they are exactly equal (in fact, you can see that I made the scatter of points literally identical between the groups), with the only difference between the groups being the difference in scale (and center) of the two variables.^{1}

The slope, even the standardized slope, only measures the expected change in our outcome variable (e.g., Job Satisfaction) as a function of change in a predictor’s values (e.g., Perceived Importance of Job Security). If we’ve standardized our data, the only thing we’ve changed is that the slope now represents the expected change in an outcome’s standardized values as a function of a change in the predictor’s standardized values.

But why does this not yield what we expect? If in a simple regression analysis the standardized slope **is** the correlation (i.e., the strength of the relationship), why is that not the case here when measuring simple standardized slopes?

The reason is that when we standardize a variable, we standardize it *across* any other variables that might interact with it, while our question regards the standardized change in the outcome as a function of the predictor *within* different levels of a moderator variable.

The first thing we should do is understand what we are estimating and testing when fitting a moderation model: we are testing whether the size of variable X’s effect on variable Y is different within different levels of some other variable M (or, that the slope of Y on X is different within different levels of M). Nothing more. Nothing less. In other words, no matter how you slice it, regression can only answer questions regarding the size of an effect (What is the size of the effect? Does this size change under some conditions?). However, if we instead have questions regarding the strength of an association (i.e., the correlation) between X and Y (How strongly are they correlated? Does the correlation change under some conditions?), we are using the wrong tool.

A more appropriate tool for testing if the strength of association changes between levels of a moderator is to use Fisher’s r-to-z transformation to conduct a simple test comparing two correlations (read about the logic behind this transformation here, here, and here). This can be achieved in R using `psych::paired.r()`

… or if you’re a Gen-X-er, I guess you *can* do this in SAS or SPSS…
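For the curious, here is a minimal base-R sketch of the r-to-z machinery that `psych::paired.r()` wraps (the correlations and sample sizes below are made up):

```r
# Compare two independent correlations via Fisher's r-to-z transformation.
r1 <- .60; n1 <- 80    # correlation and sample size in group 1
r2 <- .30; n2 <- 90    # ...and in group 2
z1 <- atanh(r1)        # Fisher's r-to-z
z2 <- atanh(r2)
se <- sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
z  <- (z1 - z2) / se   # ~ standard normal under H0: rho1 == rho2
p  <- 2 * pnorm(-abs(z))
```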

**P.S.**: Above I demonstrated that standardized simple slopes are not the same as “simple” correlations in the context of a moderation model. Similarly, in a multiple regression analysis, standardized coefficients (*β*s) are *not* semi-partial (part) correlations. If you have questions regarding semi-partial correlations / strength of unique associations, you can:

- Convert a standardized coefficient into a semi-partial (part) correlation via *sr* = *β* / √VIF (or simply by using `ppcor::spcor()` in R).
- Determine if two unique associations differ in strength by way of a *dominance analysis* (read about it here and here).^{2}
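As a rough illustration (using the built-in `mtcars` data, and assuming the standard conversion *sr* = *β* / √VIF), the two quantities can be computed side by side in base R:

```r
# Standardized betas are NOT semi-partial correlations -
# but one can be converted to the other via sr = beta / sqrt(VIF).
dat   <- data.frame(scale(mtcars[, c("mpg", "wt", "hp")]))
betas <- coef(lm(mpg ~ wt + hp, data = dat))[-1]   # standardized betas

# VIF of each predictor = 1 / (1 - R^2 of that predictor on the others)
vif_wt <- 1 / (1 - summary(lm(wt ~ hp, data = dat))$r.squared)
vif_hp <- 1 / (1 - summary(lm(hp ~ wt, data = dat))$r.squared)

sr <- betas / sqrt(c(vif_wt, vif_hp))   # semi-partial (part) correlations
```

(`ppcor::spcor()` returns these same semi-partial correlations directly.)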

If this sounds like we violated the assumption of homoscedasticity, you’re absolutely right! In a previous post I wrote that we can overcome such a violation with a permutation test. In this case, this would not help, since the problem here is not with the statistical inference, but with the conclusions we draw from the fitted model parameters - we are assigning the wrong meaning to the numbers we observe. We’re using the wrong tool for the question we’re asking. This can only be solved by using the right tool.↩︎

In R, `lavaan` can be used to compare standardized coefficients between groups as well as between predictors using `standardizedSolution()`.↩︎

When you assume,

you make an ass out of you and me.

We all know the assumptions of linear regression. Or at least we think we do. I mean, we learned them at some point, right? Well - at the very least we definitely know that there **are** assumptions!

If you search online (or in your notes from *Linear Models for Undergrads*) you will find various lists of the assumptions of linear regression, that may look something like this:

- The outcome variable is linearly predicted by the predictors.
- The outcome is normally distributed.
- The predictor(s) are normally distributed.
- No or little multicollinearity.
- Multivariate Normality.
- The outcome is not restricted / truncated / censored in range.
- No auto-correlation.
- The mean of residuals is zero.
- Homoscedasticity of variance.
- The predictor(s) and residuals are uncorrelated.

My goal here is to explain what the assumptions of linear regression actually are, and demonstrate what they are not, in the hopes that if you ever get a note from Reviewer 2 saying:

… this variable is known to be non-normally distributed and extremely skewed. I suggest transforming it by taking log(X).

You can just send them a link to this blog post, and hopefully ~~shut them the hell up~~ come to a mutual appreciation of the beauty of statistics.

So, what are these elusive assumptions of linear regression? The actual assumptions of linear regression are:

- Your model is correct.
- Independence of residuals.
- Normality of residuals.
- Homoscedasticity of residuals.

These assumptions can be split into two categories based on the consequences of violating them:

- Assumptions regarding fitting of the model parameters (assumption 1).
- Assumptions regarding testing the significance of the fitted parameters (assumptions 2-4).

_{Sep, 2019: This section was edited to be more… correct.}

Your model is correct.

What does this actually mean? It means that the fitted model is a good estimate for the relationship between the predictors and the outcome variable. This seems like a rather silly assumption, since the whole idea of fitting a linear model is to see if the predictors do predict the outcome! And yet, I’m sure you already have a vague idea of when this assumption is not met.

For example, when the relationship between a predictor (*X*) and an outcome (*Y*) is not linear:

This relationship is non-linear - that is, the linear model of *Y* on *X* is incorrect. However, this does not mean that it is impossible to specify a linear model that will correctly estimate this non-linear relationship - we could, for instance, estimate the model *Y* = *b*_{0} + *b*_{1}*X* + *b*_{2}*X*^{2}. Though non-linear relationships can be modeled with linear regression, they still do require a *linear outcome variable*! A variable may be considered linear when a change of a certain size has the same meaning across the whole range of possible values of that variable. For example, a change from 3 to 6 should have the same meaning as a change from 21 to 24.^{1} This is crucial, as that is exactly what we are trying to model - differences (variance) in the outcome. When an outcome is non-linear (see example below) the interpretation of linear models is at best an approximation, and at worst simply nonsensical.
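A minimal sketch of this idea with simulated data - the model remains linear in its coefficients, even though the relationship between *X* and *Y* is not:

```r
# Fitting a quadratic relationship with a *linear* model,
# by adding X^2 as a predictor (simulated data).
set.seed(1)
X <- runif(100, -2, 2)
Y <- 1 + 2 * X + 3 * X^2 + rnorm(100, sd = 0.5)

fit_lin  <- lm(Y ~ X)            # misspecified: straight line only
fit_quad <- lm(Y ~ X + I(X^2))   # linear in the coefficients, not in X
coef(fit_quad)                   # should land near (1, 2, 3)
```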

The assumption of model-correctness is also violated by the addition or omission of variables in the model. For example, multicollinearity makes estimation of model parameters unstable, making the fitted model a bad estimate for the relationship between the predictors and the outcome variable. Likewise, by not including a mediator, moderator, or covariate in your model, the model will misrepresent (either over- or under-estimate) the strength of the relationship between *X* and *Y*.

Note that this assumption is **not** violated if your variables are restricted / truncated / censored - having such variables does hinder your model’s ability to properly predict values outside this range, but it may still be a good (or even the best and most true) model for the range at hand.

This is probably the most crucial assumption of linear regression, as there is no way to overcome its violation other than specifying the correct model. This might include adding or omitting variables in the model, or it might include using some non-linear model, such as Poisson or logistic regression (or others).

The bottom line here is that even if you find that the linear model is significant (even if it is “very significant”), that cannot and never will be enough to overcome the fact that the specified model - is wrong.

Assuming you correctly specified your model, you now come to the dreaded part of testing its (or its coefficients’) significance, using a *t*- or *F*-based null hypothesis significance test (NHST). These tests require that the following 3 assumptions hold true:

- Normality of residuals.

- Independence of residuals.

- Homoscedasticity of residuals.

But why? These assumptions are all needed since in significance tests we translate the ratio between the estimated parameter and its standard error to the probability of obtaining this ratio under the null hypothesis. This is done by looking up the cumulative probability of an estimated *t*-ratio under the *t* distribution. But the distribution of possible *t*-ratios only takes the shape of the *t*-distribution when the residuals are independent from one another and are normally distributed (the same applies for the *F*-distribution).
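As a sanity check (using the built-in `mtcars` data), the *t*-ratio that `summary(lm())` reports can be rebuilt by hand from the residuals alone:

```r
# The t-ratio is just b / SE(b), where SE(b) is computed from the
# residuals - not from the shape of the raw outcome or predictor.
fit <- lm(mpg ~ wt, data = mtcars)
e   <- residuals(fit)
s2  <- sum(e^2) / df.residual(fit)   # residual variance
se  <- sqrt(s2 / sum((mtcars$wt - mean(mtcars$wt))^2))
t_ratio <- coef(fit)[["wt"]] / se    # matches summary(fit)'s "t value"
```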

But do you know what does not need to be normally distributed? Literally anything else! You can see this quite clearly when looking at the equation for *b*’s standard error:

*SE*_{b} = √( [Σ*e*_{i}^{2} / (*n* − *p*)] × [1 / Σ(*x*_{i} − *x̄*)^{2}] )

The left factor of the square-rooted term estimates the variance of the residuals - the mean squared difference between the outcome and the predicted outcome. In other words, at no point in our modeling or NHST does the shape of the raw outcome or predictor(s) come into play (the right factor of the square-rooted term is simply a normalizing factor for the specific coefficient being tested). So for the love of Gauss, stop pointing out when variables are skewed, heavy-tailed, and the like. Only the shape of the residuals matters for this assumption to hold. For example, here we have a heavily skewed outcome variable:

This seems alarming - maybe you’d want to combat this skewness with a log-transform? But in fact, this is perfectly fine, as we can see when looking at the scatter plot:

Or at the distribution of residuals:

On the other hand, had you log-transformed the outcome to “combat” the skewness,

Oy vay… This is obviously an extreme example, but I make it to drive home the idea that robotically transforming skewed or heavy tailed variables is a bad habit. The only reason to ever transform a variable is if you think that by transforming the variable, you can properly represent its relationship with another variable in a linear model. For example, here we have a beautifully normally distributed outcome variable:

Yet when examining the relationship with a predictor, we find a non-linear relationship:

In this case, transforming *X* into log(*X*) is actually a good idea - and will give a linear model of this exponential relationship!
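A simulated sketch of this kind of linearizing transformation (the data and coefficients below are made up):

```r
# Y is a linear function of log(X), so lm(Y ~ log(X)) fits a straight
# line - while lm(Y ~ X) chases a curve it cannot capture.
set.seed(2)
X <- runif(200, 1, 100)
Y <- 2 + 1.5 * log(X) + rnorm(200, sd = 0.3)

fit_raw <- lm(Y ~ X)        # curved relationship, poorer linear fit
fit_log <- lm(Y ~ log(X))   # straight line on the transformed predictor
```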

Heed my warning: transformations **always** change the pattern of the relationship between the outcome and the predictors. Use transformations only when needed, and avoid using them when ~~someone points out that one of your predictors had high kurtosis~~ they are not called for.

*(Edit: Read and cite Adam M. Brickman’s excellent 2018 paper, and Andrew F. Hayes’s book (pp. 68-72) about this normalcy-fallacy).*

The independence of residuals is also needed for the proper translation of *t*- or *F*-ratios into probabilities, due to the underlying distribution. This is why auto-correlations in time-series data are bad if not properly modeled, as are dependent residuals due to repeated-measures designs (which are accounted for in repeated-measures / mixed-ANOVA / linear mixed models).

Where does homoscedasticity (equality of variance) come into play? If we look back at the equation for *b*’s standard error, we can see that we are using all of the residuals to estimate a single error variance. This is also apparent in the formal definition of the distribution of the residuals: *e*_{i} ~ *N*(0, σ^{2}) - we’re estimating the size of a single σ. Since we’re estimating a single error term, it would be inappropriate to do so based on heteroscedastic residuals (i.e., mixing residuals from differently scaled / shaped populations).

It is important to understand that if these assumptions are violated, they have absolutely no implications or effect on the correctness of your model. A linear model with heteroscedastic residuals is still a linear model (i.e., the conditional mean of *Y* increases linearly as a function of *X*):

The only thing that is affected by the violation of these assumptions (any of them) is the ability to conduct NHST based on *t*- or *F*-ratios.^{2} Put simply, the produced *p*-values will simply not represent the probability of obtaining the results under the null hypothesis^{3}. This can inflate false discovery rates, or extremely under-power your design (or the exact opposite, but let’s stay with the scare tactics for now).

So what can you do??

If you violate the assumption of independent residuals, improve your model - account for auto-correlations, or add a random intercept / slope to your model. Done!

If your residuals are not normally distributed or are heteroscedastic, you can use bootstrapping or permutation testing to conduct your NHST, and completely circumvent the need for these assumptions to hold!^{4}
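As a rough sketch, a percentile bootstrap for a slope needs only a few lines of base R (simulated data with deliberately skewed errors):

```r
# Percentile bootstrap CI for a regression slope - an inference route
# that does not lean on normal, homoscedastic residuals.
set.seed(3)
n <- 100
x <- rnorm(n)
y <- 0.5 * x + rexp(n)            # skewed errors, on purpose
dat <- data.frame(x, y)

boot_b <- replicate(2000, {
  i <- sample(n, replace = TRUE)  # resample rows with replacement
  coef(lm(y ~ x, data = dat[i, ]))[["x"]]
})
quantile(boot_b, c(.025, .975))   # percentile 95% CI for the slope
```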

It should be noted that ANOVA and simple *t*-tests are special cases of regression (or of linear mixed models for repeated-measures designs) where all the predictors are categorical, and thus everything said in this post can (and should) be applied to these analyses as well. In fact, the normalcy-fallacy seems absurd when thinking of a simple *t*-test - how can a categorical predictor even **be** normally distributed?

Teaching bad assumptions is bad. It is safe to assume that at least some published regression analyses violate some of the wrongfully taught incorrect assumptions. This means that while conducting these analyses at least some researchers, who held these false assumptions to be true, thought that they were doing something wrong or “statistically illegal” while they were actually being law abiding researchers. Who needs that kind of pressure?

Hopefully, after reading this, you are now equipped with the knowledge to conduct better analyses in the future, but most importantly, next time someone says that an assumption of regression is multivariate normality, you can snarkily reply *“Um, actually, only the residuals need to be normally distributed…”*.

Thanks to Tal Yatziv and Naama Yavor for reviewing this blog post before publication.

If this worried you, it should.↩︎

If you’re fitting non-linear models - say, one of the family of GLMs (such as logistic regression) - these assumptions do not apply, because the test statistics do not require them. These models have different assumptions (can you say “overdispersion”?), but I’ll let someone else review those…↩︎

Though it is true that violating these assumptions has only a nominal effect on type-I errors, the effect on type-II errors can be substantive, causing significant over-estimation of study power. In either case, even though inference is robust to these violations, exact *p*-values cannot be reliably interpreted under such conditions (a practice that should be avoided anyway).↩︎

There are methods of correcting for heteroscedasticity, but they have their own sets of assumptions that can be violated. Recursive much?↩︎

By following performance, both behaviorally and electrophysiologically using event-related potentials (ERPs), we found that the post-error effects usually measured on the N+1^{th} trial were also apparent on the N+2 through N+4^{th} trials, diminishing in magnitude, until being undetectable on the N+5^{th} trial.

This trend in the decay of post-error effects was found in post-error slowing (PES), visual N1 amplitude, and P3 amplitude, though the effect on the P3 component rapidly diminished (it was no longer evident by trial N+2), whereas the differences in the N1 component were still evident on the N+4^{th} trial following the erroneous response.


These results provide additional support for the bottleneck account for PES. The effects on both N1 and P3 suggest that both low- and high-order processes are triggered by the commission of an error. This is consistent with claims that the best explanation for PES can be achieved with a combination of early attentional and higher-order processing changes that occur after erroneous responses.

Free access link (until October 18, 2018)↩︎

`cheatR` is a mini package to help you find cheaters by comparing hand-ins. It was developed by Almog Simchon and me in response to students overheard bragging about how an assignment in a first-year undergrad course was “super easy” because “we all just copied from each other!” (though this would later turn out to be an exaggeration).

Our idea was to compare each hand-in to all other hand-ins and see the degree of overlap between them. This was achieved using the `ngram` R package to break each hand-in into a list of “phrases” and then to count how many times each phrase appeared across a pair of documents^{1}. Finally, the percent of non-unique phrases was calculated.
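This is not `cheatR`’s actual code, but the n-gram overlap idea can be sketched in a few lines of base R (the helper names here are made up):

```r
# Break a text into n-word phrases, then measure what proportion of
# one document's phrases also appear in another's.
ngrams <- function(txt, n = 3) {
  w <- strsplit(tolower(txt), "\\s+")[[1]]
  if (length(w) < n) return(character(0))
  vapply(seq_len(length(w) - n + 1),
         function(i) paste(w[i:(i + n - 1)], collapse = " "),
         character(1))
}

overlap <- function(a, b, n = 3) {
  ga <- ngrams(a, n)
  mean(ga %in% ngrams(b, n))  # proportion of a's phrases also found in b
}
```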

We then ran this algorithm across all ~300 hand-ins, and found that the ~~knuckle-headed~~ overheard student estimation of “we all just copied from each other” was an extreme exaggeration. Looking at the distribution of overlap, we can see that the vast majority of overlap was quite small (and even this small degree of overlap could be accounted for by the fact that most hand-ins contained the assignment instructions in them):

As is evident from this graph, there were some hand-ins with a 100% overlap! Zooming in to the 70-100% range, it becomes clearer that some students were mischievous!

Plotting the relations between this subgroup, it was apparent some students had become close friends over their first year…

Other than the cheating students receiving a failing grade on their assignments, I think we can say that the war on cheaters has escalated - and we can’t wait to see the new methods students will use for cheating next year!

If you also want to find cheaters, you can try `cheatR` for yourself by installing it and running it locally in R:

```
# install.packages("devtools")
devtools::install_github("mattansb/cheatR")
```

or you can try our shiny app!