Estimating the Mean Square Error, MSE
In linear regression, we are modeling the dependent variable using this model:
Y = β0 + β1X + ε
Here, Y is the dependent variable and X is the independent variable. In the population, β0 is the expected value of Y when X = 0, β1 is the effect of X on Y, and ε is the random variation left unexplained by the model.
To perform statistical inference, we make the usual assumption that
ε ~ Normal(0, σ²)
The mean square error is an estimate of that σ². It is also a measure of the variance left unexplained by the model and data. There are several methods for estimating this value; ordinary least squares is one. Its strengths are that it is easy to perform, it is exact, and it is straightforward to derive the estimators.
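To see what it means for the MSE to estimate σ², here is a quick sketch (with hypothetical parameter values, not the leaf data below): simulate data from the linear model with a known σ², fit the model, and check that the MSE comes out close to the true value.

```r
## Simulate from Y = beta0 + beta1 * X + eps, eps ~ Normal(0, sigma^2)
## (the values 1, 0.5, and sigma = 2 are chosen purely for illustration)
set.seed(42)
n   <- 200
x   <- runif(n, 0, 10)
eps <- rnorm(n, mean = 0, sd = 2)   # true sigma^2 = 4
y   <- 1 + 0.5 * x + eps

## Fit the model and compute the MSE from the residuals
mod <- lm(y ~ x)
mse <- sum(residuals(mod)^2) / (n - 2)
mse   # should be close to the true sigma^2 = 4
```

With a larger simulated sample, the MSE will tend to land even closer to the true σ².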
The Problem
Example #593: Let us model (explain) the weight of a leaf using its thickness. To explore this, we collect data. The data consist of two measurements on each unit (leaf): thickness and weight. Thus, our data are
Data table
Leaf Number | Thickness [μm] | Weight [g] |
1 | 3 | 7.2 |
2 | 3.2 | 4.9 |
3 | 7.8 | 3 |
4 | 3.6 | 5.7 |
5 | 3.3 | 5.8 |
6 | 3.9 | 6.5 |
7 | 6.4 | 4.4 |
8 | 5.6 | 6.2 |
9 | 6.4 | 3.3 |
With this information, we estimated the linear model to be
weight = 8.2696 + (-0.6349) thickness
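As a quick aside on reading this fitted line (using a hypothetical thickness of 5 μm, which is not one of the observed leaves), it predicts a leaf's weight from its thickness:

```r
## Fitted coefficients from the estimated model
b0 <- 8.2696
b1 <- -0.6349

## Predicted weight for a given thickness (in micrometers)
predict_weight <- function(thickness) b0 + b1 * thickness

predict_weight(5)   # 5.0951: a 5-micrometer-thick leaf is predicted to weigh about 5.1 g
```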
What is the mean square error?
Information given:
To summarize the above, the values of import are:
Summary statistics from the problem
\( \bar{x} = 4.8 \)
\( \bar{y} = 5.2222 \)
\( b_0 = 8.2696 \)
\( b_1 = -0.6349 \)
Note that \( b_0 \) is the OLS estimate of the y-intercept. If you are unsure how to calculate it, or if you would like more practice doing so, please see the OLS estimate of the y-intercept tutorial.
Note also that \( b_1 \) is the OLS estimate of the slope and is needed to calculate the mean square error. If you are unsure how to calculate it, or if you would like more practice doing so, please see the OLS estimate of the slope tutorial.
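If you would like to check \( b_0 \) and \( b_1 \) yourself, the OLS formulas \( b_1 = S_{xy}/S_{xx} \) and \( b_0 = \bar{y} - b_1 \bar{x} \) can be computed directly in R from the data table above:

```r
## The leaf data from the table above
thickness <- c(3, 3.2, 7.8, 3.6, 3.3, 3.9, 6.4, 5.6, 6.4)
weight    <- c(7.2, 4.9, 3, 5.7, 5.8, 6.5, 4.4, 6.2, 3.3)

## Sample means
xbar <- mean(thickness)   # 4.8
ybar <- mean(weight)      # 5.2222

## OLS slope and y-intercept
b1 <- sum((thickness - xbar) * (weight - ybar)) / sum((thickness - xbar)^2)
b0 <- ybar - b1 * xbar

round(c(b0 = b0, b1 = b1), 4)   # 8.2696 and -0.6349
```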
Assistance
Solution
$$ \begin{align}
MSE &= \frac{1}{n-2}\ \sum_{i=1}^n (y_i - b_0 - b_1 x_i)^2 \\[3em]
&= \frac{1}{9-2}\ \sum_{i=1}^9 \Big(y_i - 8.2696 - (-0.6349) x_i\Big)^2 \\[1em]
&= \frac{1}{7}\ \sum_{i=1}^9 \Big(y_i - 8.2696 - (-0.6349) x_i\Big)^2 \\[1em]
&= \frac{1}{7}\ \Big[
\Big(7.2 - 8.2696 - (-0.6349)3\Big)^2 + \Big(4.9 - 8.2696 - (-0.6349)3.2\Big)^2 + \Big(3 - 8.2696 - (-0.6349)7.8\Big)^2 + \Big(5.7 - 8.2696 - (-0.6349)3.6\Big)^2 + \Big(5.8 - 8.2696 - (-0.6349)3.3\Big)^2 + \Big(6.5 - 8.2696 - (-0.6349)3.9\Big)^2 + \Big(4.4 - 8.2696 - (-0.6349)6.4\Big)^2 + \Big(6.2 - 8.2696 - (-0.6349)5.6\Big)^2 + \Big(3.3 - 8.2696 - (-0.6349)6.4\Big)^2 \Big] \\[1em]
&= \frac{1}{7}\ \Big[
\Big(7.2 - 8.2696 - (-1.9046)\Big)^2 + \Big(4.9 - 8.2696 - (-2.0316)\Big)^2 + \Big(3 - 8.2696 - (-4.952)\Big)^2 + \Big(5.7 - 8.2696 - (-2.2856)\Big)^2 + \Big(5.8 - 8.2696 - (-2.0951)\Big)^2 + \Big(6.5 - 8.2696 - (-2.476)\Big)^2 + \Big(4.4 - 8.2696 - (-4.0632)\Big)^2 + \Big(6.2 - 8.2696 - (-3.5553)\Big)^2 + \Big(3.3 - 8.2696 - (-4.0632)\Big)^2 \Big] \\[1em]
&= \frac{1}{7}\ \Big[
\Big(0.835\Big)^2 + \Big(-1.338\Big)^2 + \Big(-0.3176\Big)^2 + \Big(-0.2841\Big)^2 + \Big(-0.3745\Big)^2 + \Big(0.7064\Big)^2 + \Big(0.1936\Big)^2 + \Big(1.4857\Big)^2 + \Big(-0.9064\Big)^2 \Big] \\[1em]
&= \frac{1}{7}\ \Big[
\Big(0.6972\Big) + \Big(1.7903\Big) + \Big(0.1009\Big) + \Big(0.0807\Big) + \Big(0.1403\Big) + \Big(0.499\Big) + \Big(0.0375\Big) + \Big(2.2072\Big) + \Big(0.8216\Big) \Big] \\[1em]
&= \frac{1}{7}\ 6.3747 \\[1em]
&= 0.9107 \\[1em]
\end{align}
$$
For these data, the mean square error is MSE = 0.9107. This is the point estimate of σ² in the model.
The R Code
Copy and paste the following code into your R script window, then run it from there.
## Import the data
thickness <- c(3, 3.2, 7.8, 3.6, 3.3, 3.9, 6.4, 5.6, 6.4)
weight    <- c(7.2, 4.9, 3, 5.7, 5.8, 6.5, 4.4, 6.2, 3.3)

## Model the data
mod <- lm(weight ~ thickness)
summary(mod)
In the R output, you have the typical regression table. The mean square error (MSE) enters into many of the table's entries, but it does not appear explicitly. To obtain the MSE, run these two additional lines:
n <- length(thickness)
sum(residuals(mod)^2) / (n - 2)
Here, the number output is the mean square error. Note that the penultimate line determines the sample size. The final line calculates the residuals, squares them, sums the squares, then divides by n − 2... all in one line!
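As a cross-check, the same number can be pulled from standard R summaries without computing it by hand: `summary(mod)$sigma` is the residual standard error, so squaring it gives the MSE, and `anova(mod)` reports it directly in the "Mean Sq" column of the Residuals row.

```r
## The leaf data and model, as above
thickness <- c(3, 3.2, 7.8, 3.6, 3.3, 3.9, 6.4, 5.6, 6.4)
weight    <- c(7.2, 4.9, 3, 5.7, 5.8, 6.5, 4.4, 6.2, 3.3)
mod <- lm(weight ~ thickness)

## MSE as the squared residual standard error
summary(mod)$sigma^2                 # 0.9107 (rounded)

## MSE from the ANOVA table
anova(mod)["Residuals", "Mean Sq"]   # same value
```

Both agree with the hand calculation of MSE = 0.9107 above.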