Read in the SAT data (in wide form).

library(CMUtils)     # my personal utility package (see the note about access below)
library(reshape2)
library(knitr)
library(kableExtra)
d <- read.csv('./sat2.csv')   # wide form: one row per participant

We have a 2 (skill: math vs. verbal) x 2 (source: self-reported vs. actual-recorded) design, with both factors varying within participants (I’m ignoring sex for this illustration).
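For reference, here is how the four columns of the wide file map onto those four cells (the mapping follows from the plot labels and difference-score expressions used below), along with a quick peek at the data:

# One row per participant, with all four cells of the design:
#   SATM  - self-reported math      SATV  - self-reported verbal
#   RSATM - actual-recorded math    RSATV - actual-recorded verbal
head(d[, c('SATM', 'SATV', 'RSATM', 'RSATV')])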

Here are two different ways of coding the difference scores. The first ignores denominator issues completely, and just tacks the appropriate sign on each term. It is valid, but hard to interpret (we’ll do it later).

d <- within(d, {
  avg1 <- SATM + SATV + RSATM + RSATV       # sum of all four scores
  skill1 <- SATM - SATV + RSATM - RSATV     # math minus verbal, both sources
  source1 <- SATM + SATV - RSATM - RSATV    # self-reported minus actual, both skills
  sxs <- SATM - SATV - RSATM + RSATV        # skill-by-source interaction
})

Here’s a second way of coding the differences. This way counts up the number of terms being averaged (added1) in each expression, and divides by that number. This transformation gets us back to the original metric of the variable, but notice that the divisor is different for each effect.

d <- within(d, {
  avg2 <- (SATM + SATV + RSATM + RSATV) / 4
  skill2 <- (SATM - SATV + RSATM - RSATV) / 2
  source2 <- (SATM + SATV - RSATM - RSATV) / 2
  sxs <- SATM - SATV - RSATM + RSATV # no division here! Weird thing, but it's
  # true. Note that this makes it the same as above, so I'm not renaming the 
  # variable.
})
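If you want to convince yourself that the division really does put these back on the original metric, here is a quick algebraic sanity check (my own addition, not part of the original analysis): each rescaled score should equal a per-person average minus a per-person average.

# skill2 should be each person's math average minus verbal average
with(d, all.equal(skill2, (SATM + RSATM) / 2 - (SATV + RSATV) / 2))
# source2 should be each person's self-reported average minus actual average
with(d, all.equal(source2, (SATM + SATV) / 2 - (RSATM + RSATV) / 2))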

Ok, let’s also plot the means. I’m using functions here that are part of a package I wrote for myself, so ask me about getting access if you’re interested. I’m multiplying by ten because these scores were (for some reason) divided by 10 in this dataset.

with(d, {
  vioplotCM(
    SATM * 10,
    SATV * 10,
    RSATM * 10,
    RSATV * 10,
    names = rep(c('math\n', 'verbal\n'), 2),
    main = "SAT Scores by Skill and Source",
    ylab = "SAT Score",
    labelMeans = T
  )
})
## Loading required package: sm
## Package 'sm', version 2.2-5.6: type help(sm) for summary information
mtext(c('Self-Reported', 'Actual Recorded'), at = c(1.5, 3.5), side = 1,
      line = 3)

Skill Difference

Ok, let’s work through the mean differences we are interested in. First, we want to know about the skill difference. We need to average the two math scores, average the two verbal scores, and then take the difference:

\[(575.62 + 578.29) / 2 - (521.74 + 511.47) / 2 = 60.35\]
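That same number can be recovered directly from the data; here is a quick check (not in the original output), remembering that the stored scores are at one tenth of the usual SAT scale:

# average of the two math means minus average of the two verbal means,
# rescaled to the usual SAT metric
with(d, ((mean(SATM) + mean(RSATM)) / 2 - (mean(SATV) + mean(RSATV)) / 2) * 10)
# should come out to roughly 60.35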

So which difference score gives us the correct intercept estimate?

lmSummary(lm(skill1 ~ 1, data = d))
## 
## Call:
## lm(formula = skill1 ~ 1, data = d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -49.071 -11.571   1.929   9.929  63.929 
## 
## Coefficients:
##             Estimate Std. Error t value f value   R^2 Pr(>|t|)    
## (Intercept)   12.071      1.349   8.950  80.096 0.322 6.19e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.59 on 169 degrees of freedom
lmSummary(lm(skill2 ~ 1, data = d))
## 
## Call:
## lm(formula = skill2 ~ 1, data = d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.535  -5.785   0.965   4.965  31.965 
## 
## Coefficients:
##             Estimate Std. Error t value f value   R^2 Pr(>|t|)    
## (Intercept)   6.0353     0.6744  8.9497 80.0965 0.322 6.19e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.793 on 169 degrees of freedom

The second one does. Notice that the expression we just computed from intuition matches exactly the expression I used to compute the difference score, skill2.

I will skip the source difference because it is exactly the same as the skill difference.
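If you want to run that check yourself, the model is exactly parallel (not shown here); its intercept should be the mean self-reported-minus-actual difference, again on the 1/10-scaled metric:

lmSummary(lm(source2 ~ 1, data = d))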

Interaction

The interaction can be thought of as the way that the skill difference differs across the levels of source. Equivalently, it is the way that the source difference differs across the levels of skill. I am going to focus on the first interpretation.

Let’s get the skill difference for each source separately:

\[skillDiff_{selfreport} = SATM - SATV\]

\[skillDiff_{actual} = RSATM - RSATV\]

Importantly, these are both in the metric of SAT scores already. Since we are not averaging anything within these expressions, we do not divide by anything. We ultimately want the difference between those:

\[skillDiff_{selfreport} - skillDiff_{actual} = (SATM - SATV) - (RSATM - RSATV)\]

Distributing the negative signs gives us our interaction difference score:

\[sxs = SATM - SATV - RSATM + RSATV\]

Let’s check this intuition-derived difference score. Using the cell means, the skill difference for the self-reported scores minus the skill difference for the actual-recorded scores is:

\[(575.62 - 521.74) - (578.29 - 511.47) = -12.94\]

Does it work?

lmSummary(lm(sxs ~ 1, data = d))
## 
## Call:
## lm(formula = sxs ~ 1, data = d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.706  -1.706   1.294   1.294  22.294 
## 
## Coefficients:
##             Estimate Std. Error t value f value   R^2 Pr(>|t|)    
## (Intercept)  -1.2941     0.3647 -3.5484 12.5914 0.069 0.000502 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.755 on 169 degrees of freedom

Yes, it does!! It’s off by a factor of 10 because of that quirk of the dataset, but the estimate is correct.
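If you want the estimate back on the usual SAT metric, you can simply undo that scaling (a trivial check I’m adding here):

# multiply the intercept by 10 to undo the dataset's 1/10 scaling
coef(lm(sxs ~ 1, data = d)) * 10
# should give roughly -12.94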

Comparison to Contrast Codes

This is different than contrast codes!! There is a critical distinction to be made here. We have said that these difference scores are, in some way, just a version of contrasts, but for factors that vary within units. It is true that contrast codes, as we constructed them for between-unit factors, will give you the correct coefficients for difference scores. But the scaling turns out to be quite different. Let’s try to unpack that.

We have seen the difference scores we would use for the skill effect and the skill by source effect, scaled to yield intercept estimates that match the mean differences. Here they are again:

skill1 <- SATM - SATV + RSATM - RSATV
source1 <- SATM + SATV - RSATM - RSATV
sxs <- SATM - SATV - RSATM + RSATV

What if we were to write the contrast codes that match, but pretending these were between-participant factors? I want to write the fractional codes so that we know they would yield mean differences. Our first step is to write down a set of orthogonal codes, one for each main effect, and then multiply those two pair-wise to yield the interaction code.
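Here is a minimal sketch of what those codes could look like (the particular vectors are my own illustration, ordering the cells as SATM, SATV, RSATM, RSATV and using the usual +1/-1 convention):

# one code per main effect, cells ordered: SATM, SATV, RSATM, RSATV
skillC  <- c( 1, -1,  1, -1)   # math vs. verbal
sourceC <- c( 1,  1, -1, -1)   # self-reported vs. actual-recorded
sxsC    <- skillC * sourceC    # pair-wise product gives the interaction code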

Now we need to scale them. How do we know how to scale them? Because we understand what the coefficients mean when we add them to the model, like so:

\[SAT_i = \beta_0 + \beta_1 * skillC_i + \beta_2 * sourceC_i + \beta_3 * sxsC_i + \varepsilon_i\]

We know we will want to interpret \(\beta_1\) for the test of the skill effect. We know that a coefficient for a predictor is interpreted as the predicted increase in the outcome for every 1-unit increase in the predictor. So, we want to cleverly scale our code so that a 1-unit increase in the code represents the difference between the groups that we care about. Right now, our code has a 2-unit increase between the math and verbal skills, so we need to scale it by a factor of .5. Same thing for the source code. Then, it turns out that we simply re-multiply those codes and our interaction still works, and is still interpretable as the mean difference in mean differences.
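Concretely (and again, these particular numbers are my own illustration), the scaling step halves the two main-effect codes and then re-forms the interaction code from their product:

# rescaled so that each main-effect code spans exactly 1 unit
skillC  <- c( .5, -.5,  .5, -.5)   # math minus verbal is now a 1-unit step
sourceC <- c( .5,  .5, -.5, -.5)   # self-reported minus actual is a 1-unit step
sxsC    <- skillC * sourceC        # = c(.25, -.25, -.25, .25)

With these values, \(\beta_1\) is the math-minus-verbal mean difference, \(\beta_2\) is the self-reported-minus-actual mean difference, and \(\beta_3\) works out to the difference between the two skill differences.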

Super critical big point number 1: this scaling procedure is based on constructing a predictor variable that will be multiplied by the coefficient of interest. That is, we are interested in interpreting a slope!!

\[\beta_1 * skillC\]

That is not true for difference scores!! When we construct models with difference scores in them, they become outcomes, not predictor variables. Then, we interpret intercepts, not slopes!!!

Super critical big point number 2: the scaling procedure for difference scores must be based on the fact that we interpret intercepts, not slopes!!

Let’s go back to our actual situation, where we have two factors and their interaction, all varying within-participants. Here are the three separate models we want:

\[skill2_i = \beta_0 + \varepsilon_i\]

\[source2_i = \beta_0 + \varepsilon_i\]

\[sxs_i = \beta_0 + \varepsilon_i\]

Slopes are nowhere to be found in these equations! Even if we added a between-participant predictor, like sex, we are still only interested in the intercept.

So: how do we interpret an intercept? It is simply the predicted value of the outcome at the zero point of all other predictors. Did you see it? The intercept does not get multiplied by anything. That means that we want to scale our difference scores so that they are meaningful, in their raw form, as units on the outcome variable. That leads to the intuitions presented in the previous section.
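In an intercept-only model, that predicted value is just the sample mean of the outcome, so the whole exercise boils down to making sure the difference score itself is on the right scale. A quick illustration (my own check):

# the intercept of an intercept-only model is just the mean of the outcome
with(d, mean(skill2))
coef(lm(skill2 ~ 1, data = d))   # same number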

So, let me clarify one important point. You can use contrast coding schemes, as you would write them for between-unit predictors, to get the correct signs and multipliers for your difference scores. But know that, just like with contrast codes, writing those raw difference scores will not automatically result in interpretable scores. You still must take one more step to scale them so that your intercepts will represent units for your outcome2.
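In this example, that extra step is exactly the division from the first section: divide each raw (sign-only) difference score by the number of scores being averaged on each side of the comparison, and leave the interaction score alone (see footnote 1).

# rescale the raw difference scores to the metric of the outcome
d$skill2  <- d$skill1  / 2   # two scores averaged on each side
d$source2 <- d$source1 / 2
# sxs needs no rescaling, since nothing in it is being averaged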


  1. except for the interaction because the added terms there are not being averaged.

  2. You can use the method described in the first section, for example.