A common homework problem in texts covering calculus-based simple linear regression asks for the set of values of the independent variable which minimizes the standard error of the estimated slope. Every discussion of this problem the authors have heard, and every text familiar to the authors which includes it, provides either no solution, a partial solution, or an outline of a solution without theoretical proof; moreover, where a solution is provided for an odd number of observations, that solution is incorrect. Going back to first principles, we provide the complete, correct solution to this problem.

A homework question, occurring in several oft-cited, best-selling introductory texts covering calculus-based simple linear regression, goes something like this:

Suppose we are to collect data and fit a straight-line simple linear regression

$$Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad i = 1, \ldots, n,$$

where the values $x_1, \ldots, x_n$ of the independent variable may be chosen anywhere in a fixed interval $[A, B]$. How should the $x_i$ be chosen so as to minimize the standard error of the estimated slope?

The least squares estimator of the slope is

$$\hat{\beta}_1 = \frac{S_{XY}}{S_{XX}},$$

which has standard deviation

$$\mathrm{sd}(\hat{\beta}_1) = \frac{\sigma}{\sqrt{S_{XX}}}.$$

The estimated standard deviation, or standard error, is found by replacing the unknown $\sigma$ with its estimate $s$, giving

$$\mathrm{se}(\hat{\beta}_1) = \frac{s}{\sqrt{S_{XX}}}.$$

The error variance $\sigma^2$ does not depend on the choice of the $x_i$, so minimizing the standard error is equivalent to maximizing

$$S_{XX} = \sum_{i=1}^{n} (x_i - \bar{x})^2,$$

the corrected sum of squares of the x's.
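As a quick numerical illustration (our own sketch, not part of the original problem; the interval $[0, 1]$ and $n = 4$ are arbitrary choices), the following Python snippet computes $S_{XX}$ for two candidate designs and confirms that pushing the points to the endpoints yields the larger $S_{XX}$, and hence the smaller standard error for any fixed $s$:

```python
def s_xx(xs):
    """Corrected sum of squares: sum of (x_i - mean)^2."""
    xbar = sum(xs) / len(xs)
    return sum((x - xbar) ** 2 for x in xs)

# Two candidate designs on [0, 1] with n = 4.
evenly_spaced = [0.0, 1/3, 2/3, 1.0]
endpoints = [0.0, 0.0, 1.0, 1.0]

print(s_xx(evenly_spaced))  # 5/9, about 0.556
print(s_xx(endpoints))      # 1.0
```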

Many texts which include this problem provide no solution. Every solution the authors have heard discussed or seen in a solutions manual asserts, without proof, that in order to maximize $S_{XX}$ when $n$ is even, half of the observations should be taken at $A$ and half at $B$. Many texts that include a solution ignore the possibility that $n$ is odd, even though no condition on $n$ was provided in the question. When a solution is provided for odd $n$, every solution we have seen asserts, without proof, that $(n-1)/2$ observations should be taken at each endpoint and the remaining observation at the midpoint $(A+B)/2$ of the interval.

In the sequel we show that for $n$ even, the "usual" solution of choosing half of the observations to be taken at $A$ and the other half to be taken at $B$ is correct. For $n$ odd we show that in order to minimize the standard error, $(n-1)/2$ of the observations should be taken at one endpoint and the remaining $(n+1)/2$ at the other; in particular, no observation should be taken in the interior of the interval.

Our goal is to find the set of values $x_1, \ldots, x_n$, each confined to the interval $[A, B]$, which maximizes $S_{XX} = \sum_{i=1}^{n}(x_i - \bar{x})^2$.

Since the $x_i$ range over the compact set $[A, B]^n$ and $S_{XX}$ is continuous, a maximum exists. The partial derivative of $S_{XX}$ with respect to $x_j$ is

$$\frac{\partial S_{XX}}{\partial x_j} = 2(x_j - \bar{x}).$$

Setting this equal to zero we have $x_j = \bar{x}$. If this held for every $j$, all of the observations would be taken at the common value $\bar{x}$ and $S_{XX}$ would equal zero, its minimum; hence the maximum cannot occur at an interior critical point and must lie on the boundary of $[A, B]^n$.

Let $n_1$ be the number of observations taken at $A$, $n_2$ be the number of observations taken at $B$, and $n_3$ be the number of observations taken in the open interval $(A, B)$.

The quantity $S_{XX}$ is then to be maximized subject to the constraints that $n_1$, $n_2$, and $n_3$ are non-negative integers with $n_1 + n_2 + n_3 = n$.

The function with constraints given in the previous paragraph may be maximized in any number of ways. Possibilities considered by the authors include the following: taking the variables of interest to be continuous and maximizing the function through the use of calculus, hoping for integer values which would then be the optimal solution; and arguing directly over the integers. We take the latter, elementary route.
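Before the formal argument, a brute-force numerical check (our own illustration; the grid, the interval $[0, 1]$, and $n = 5$ are arbitrary choices) supports the claim that an optimal design places every observation at an endpoint:

```python
from itertools import combinations_with_replacement

def s_xx(xs):
    """Corrected sum of squares: sum of (x_i - mean)^2."""
    xbar = sum(xs) / len(xs)
    return sum((x - xbar) ** 2 for x in xs)

# Discretize [0, 1] and enumerate all unordered designs of size n.
grid = [i / 10 for i in range(11)]
n = 5
best = max(combinations_with_replacement(grid, n), key=s_xx)
print(sorted(best))  # every point lands at 0.0 or 1.0, split 3 and 2
```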

Let $(x_1^*, \ldots, x_n^*)$ be a design attaining the maximum, with $n \ge 2$.

We first claim that to maximize $S_{XX}$, $n_3$ must be zero. Assume that $n_3 > 0$, so that some $x_j^* \in (A, B)$. Holding the remaining observations fixed, $S_{XX}$ is a parabola in $x_j$ with positive leading coefficient $1 - 1/n$; being strictly convex, it attains its maximum over $[A, B]$ only at an endpoint. Moving $x_j^*$ to the better of $A$ and $B$ therefore strictly increases $S_{XX}$,

which is a contradiction to the assumption that the design was optimal. Hence $n_3 = 0$.

Now one of our constraints reduces to $n_1 + n_2 = n$, and a direct computation gives

$$S_{XX} = \frac{n_1 n_2 (B - A)^2}{n} = \frac{n_1 (n - n_1)(B - A)^2}{n}.$$

This last is simply a parabola which we need to maximize over the integers $n_1 \in \{0, 1, \ldots, n\}$. Its vertex is at $n_1 = n/2$; for $n$ even the maximum is attained at $n_1 = n/2$, while for $n$ odd it is attained at the two nearest integers, $n_1 = (n-1)/2$ and $n_1 = (n+1)/2$.
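The integer maximization of this parabola is easy to illustrate numerically (our own sketch, with the arbitrary choices $n = 7$, $A = 0$, $B = 1$):

```python
def s_xx_endpoints(n1, n, A, B):
    """S_XX when n1 observations sit at A and n - n1 sit at B."""
    return n1 * (n - n1) * (B - A) ** 2 / n

n, A, B = 7, 0, 1
values = {n1: s_xx_endpoints(n1, n, A, B) for n1 in range(n + 1)}
best = max(values, key=values.get)
print(best, values[best])  # n1 = 3 (tied with n1 = 4): S_XX = 12/7
```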

For the common homework problem appearing in approximately half of the texts covering calculus-based simple linear regression with which the authors are familiar, and which was posed at the beginning of this paper, we have shown that if $n$ is even, the oft-given solution to choose half of the points at which to take observations at either end of the interval is correct. However, for odd $n$ we have shown that the only previously given solution, to place one point in the center of the interval and half of the remaining points at each end of the interval, is incorrect: the correct solution is to choose nearly half, either $(n-1)/2$ or $(n+1)/2$, of the observations at one end of the interval and the remainder at the other end, with no observation taken in the interior.
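For concreteness, the two odd-$n$ designs can be compared directly (our own illustration, with the arbitrary choices $n = 7$ on $[0, 1]$); the previously published midpoint design is strictly beaten by the endpoint-only design:

```python
def s_xx(xs):
    """Corrected sum of squares: sum of (x_i - mean)^2."""
    xbar = sum(xs) / len(xs)
    return sum((x - xbar) ** 2 for x in xs)

n, A, B = 7, 0.0, 1.0
# "Textbook" odd-n design: (n-1)/2 at each end, one at the midpoint.
textbook = [A] * 3 + [(A + B) / 2] + [B] * 3
# Correct design: (n-1)/2 at one end, (n+1)/2 at the other.
correct = [A] * 3 + [B] * 4

print(s_xx(textbook))  # (n-1)(B-A)^2/4 = 1.5
print(s_xx(correct))   # (n^2-1)(B-A)^2/(4n) = 12/7, about 1.714
```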

We part with the common caveat that this oft-given textbook problem is of little use in most realistic applications unless it is known that the true relationship among the data is linear, as the solution affords us no opportunity to check this assumption with the observed data. However, the authors would submit that there is a difference between a problem's being useless in practical situations and its teaching something fundamental about simple linear regression. We believe that it is important for a student to understand the theory underlying simple linear regression, and this importance is supported by the inclusion of the problem in a large number of highly cited and best-selling texts. Unfortunately, many of these texts provide no solution, some provide a partial solution, and others provide an incorrect solution. No texts with which we are familiar, nor their solutions manuals, provide a complete and correct solution. This common textbook problem affords the student the opportunity to understand what drives the variance of the parameter estimate, and as such it deserves a correct solution.

The authors wish to thank Dr. Ho Kuen Ng for a useful discussion on optimization.