
Let's say I am trying to fit a linear regression between the weight and height of a person: $W=b_0+b_1 H+e$.

The data I have gathered from 8 people is like this:

 #   W (kg)   H (cm)
 1.  68       168
 2.  64       170
 3.  ?        160
 4.  ?        180
 5.  ?        145
 6.  ?        191
 7.  69       185
 8.  80       191

Where I know that the sum of the ?'s is 280, but I do not know the exact value of each of them (because, let's say, at the time I only had a scale with a minimum reading of 200 kg, so those four people could only be weighed together. Dumb reason, I know, but that's just for the sake of the example).

So, my question is this: how do I create my $W$ matrix so I can do the computations (to find $b_0$ and $b_1$ using the least squares method)? :)

  • Since both answers, including comments, have ended up rather long, here's a summary: both answers lead to the same result with different approaches; Gottfried's iteratively uses the fact that the slope can be computed using only the deviations from the means, whereas mine integrates over all values of the missing weights consistent with the known sum to show that the four measurements should be replaced by four measurements of their averages. (2012-02-23)

3 Answers


Just for future generations, I am going to put here a link to a program that I used to disaggregate the data for me: ECOTRIM. Even though it is an old program, it works decently and disaggregated the data for me very well. I compared the real GDP data with the disaggregated data, and it was off by no more than 5%. Quite a decent tool.

---

The minimization of the mean square error can be regarded as a result of maximizing the likelihood that the data resulted from normally distributed errors. To deal with the aggregated data, we could integrate this likelihood over all values consistent with the constraint. The likelihood is

$\prod_i\mathrm e^{-\beta(w_i-(b_1h_i+b_0))^2}\;,$

and the integral over all values consistent with the constraint is

$ \iiiint\mathrm dw_1\mathrm dw_2\mathrm dw_3\mathrm dw_4\delta(w_1+w_2+w_3+w_4-280)\prod_i\mathrm e^{-\beta(w_i-(b_1h_i+b_0))^2} $

(where I've numbered the unknown values $1$ to $4$ for convenience). Integrating out $w_4$, i.e. using the delta function to substitute $w_4=280-w_1-w_2-w_3$, yields

$ \prod_{i\gt4}\mathrm e^{-\beta(w_i-(b_1h_i+b_0))^2}\iiint\mathrm dw_1\mathrm dw_2\mathrm dw_3\,\mathrm e^{-\beta(w^\top Aw-2s^\top w+c)} $

with

$ \begin{eqnarray} A_{ij}&=&\delta_{ij}+1\;,\\ s_i&=&280+b_1(h_i-h_4)\;,\\ c&=&280^2+\sum_{k=1}^4{(b_1h_k+b_0)^2}-2\cdot280(b_1h_4+b_0)\;, \end{eqnarray} $

where $i$ and $j$ run from $1$ to $3$. The integral is proportional to $\mathrm e^{\beta(s^\top A^{-1}s-c)}$.

With

$\displaystyle A^{-1}=\frac14\pmatrix{3&-1&-1\\-1&3&-1\\-1&-1&3}\;,$

this is $\mathrm e^{-4\beta(\overline w-(b_1\overline h+b_0))^2}$, with $\overline w=(w_1+w_2+w_3+w_4)/4$ and $\overline h=(h_1+h_2+h_3+h_4)/4$.
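As a sanity check, the reduction of $s^\top A^{-1}s-c$ to this single squared term can be verified symbolically for the question's data (heights $160,180,145,191$ and sum $280$); here is a small sympy sketch, with variable names of my own choosing:

```python
import sympy as sp

b0, b1 = sp.symbols('b0 b1')
h = [160, 180, 145, 191]  # heights of the rows with unknown weights
S = 280                   # known sum of the four unknown weights
m = [b1 * hi + b0 for hi in h]

A = sp.Matrix(3, 3, lambda i, j: (1 if i == j else 0) + 1)  # A_ij = delta_ij + 1
s = sp.Matrix([S + b1 * (h[i] - h[3]) for i in range(3)])
c = S**2 + sum(mi**2 for mi in m) - 2 * S * (b1 * h[3] + b0)

lhs = (s.T * A.inv() * s)[0] - c
wbar, hbar = sp.Rational(S, 4), sp.Rational(sum(h), 4)
rhs = -4 * (wbar - (b1 * hbar + b0))**2
print(sp.simplify(lhs - rhs))  # prints 0, confirming the exponent reduction
```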

Thus, you should treat these four measurements as if you had made four measurements of the average weight $\overline w$ at the average height $\overline h$.
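For the data in the question, this means fitting with four copies of the pair $(\overline h,\overline w)=(169,70)$, since $(160+180+145+191)/4=169$ and $280/4=70$. A minimal numpy sketch (the array names are my own):

```python
import numpy as np

# Complete rows of the table: heights and weights of persons 1, 2, 7, 8.
H = np.array([168.0, 170.0, 185.0, 191.0])
W = np.array([68.0, 64.0, 69.0, 80.0])
# Replace the four incomplete rows by four copies of the averages.
H = np.concatenate([H, np.full(4, 169.0)])  # average height of rows 3-6
W = np.concatenate([W, np.full(4, 70.0)])   # 280 / 4

# Least squares for W = b0 + b1*H via the design matrix [1, H].
X = np.column_stack([np.ones_like(H), H])
(b0, b1), *_ = np.linalg.lstsq(X, W, rcond=None)
print(f"W = {b0:.6f} + {b1:.5f}*H")  # -> about 9.707035 + 0.34773*H
```

This is the same line that the iterative answer below converges to.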

  • OK, I've printed it. (2012-02-23)
---

Here I propose an iterative approach.

[Update 2]: Unfortunately I took the regression in the wrong direction, but the general idea is unaffected by this. The numerical results with the regression taken in the other direction are at the end.


First we know that the regression parameter for the slope can be computed from the deviations from the means alone. In the setup of the problem we have two groups of data:

A = 4 $(x,y)$ measurements where both $x$ and $y$ are known, and

B = 4 $(x,y)$ measurements where the $x$-values are not known but sum to 280. While their mean is known, their deviations from the mean are arbitrary except that they must sum to zero. Thus we can set the $x$-deviations to the same values as the known $y$-deviations, scaled by some constant factor $b$, which then represents an arbitrary slope in the scatterplot.

Next we do a regression based on the data in A only. What we get is the regression equation $\small \hat{y}_A = 83.4 + 1.35 x_A $ (here the slope is $b \approx 1.35$).
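(For reference, a quick numpy check of this fit, with variable names of my own:)

```python
import numpy as np

# The four complete measurements; here x = weight, y = height.
x_A = np.array([68.0, 64.0, 69.0, 80.0])
y_A = np.array([168.0, 170.0, 185.0, 191.0])
b, a = np.polyfit(x_A, y_A, 1)  # slope, intercept
print(f"y_A_hat = {a:.1f} + {b:.4f}*x_A")  # -> about 83.4 + 1.3535*x_A
```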

Since we can choose the $x$-deviations in B arbitrarily, subject only to their summing to zero, we can take the deviations of the $y$-values and rescale them by the factor $\small {1 \over 1.35} $. What we get then is the table for the B-data:
$\small \text{ B =} \begin{array} {rr} x & y \\ \hline 63.350& 160\\ 78.127& 180\\ 52.268& 145\\ 86.255& 191 \end{array} $
If we insert that into the original table, we get a sum of squares of the residues of about 291.01 (which was also the minimum I could reach by some experimenting).

This might still be incomplete (and thus suboptimal), because the mean of the $x$-values in the complete data set (70.125) differs slightly from the means of the $x$-values in A (70.25) and in B (70), and the common optimum must be determined over the complete data set; so possibly this must be extended to a recursive procedure. If the above is not completely wrong or misleading but useful so far, that recursive procedure might be added later.


Update: I did the recursion to adapt the solution to the problem of the different means in the A and B $x$-data. I got a small improvement. B now becomes
$\small \text{ B =} \begin{array} {rr} x & y \\ \hline 63.50640& 160\\ 77.93662& 180\\ 52.68374& 145\\ 85.87324& 191 \end{array} $

The equation becomes $\small \hat{y} = 76.55811894 + 1.385980479 x $, and the sum of squares of the residues becomes 290.887311, an improvement of about 0.26. After this, the recursion is stable in the leading six decimals.


[Update 2,3]: Oops, I had taken the wrong direction for the regression. If I take it the other way round, I get
$\small \text{ B =} \begin{array} {rr} x & y \\ \hline 66.8704363308& 160\\ 73.8250222623& 180\\ 61.6544968822& 145\\ 77.6500445247& 191 \end{array} $

The equation becomes $\small \hat{x} = 9.707035 + 0.34773 y $, and the sum of squares of the residues becomes 72.980855 after a couple of recursions.
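To make the procedure concrete, here is a minimal Python sketch of the iteration as I understand it (impute the group mean, fit on all eight rows, re-impute the missing values along the fitted slope, repeat); the array names and the fixed iteration count are my own, and the fit is written as weight as a function of height, matching the corrected direction above:

```python
import numpy as np

# Complete rows (persons 1, 2, 7, 8): heights and weights.
H_A = np.array([168.0, 170.0, 185.0, 191.0])
W_A = np.array([68.0, 64.0, 69.0, 80.0])
# Incomplete rows: heights known, weights only known to sum to 280.
H_B = np.array([160.0, 180.0, 145.0, 191.0])
S = 280.0

W_B = np.full(4, S / 4)  # start by imputing the mean, 70
for _ in range(50):
    H = np.concatenate([H_A, H_B])
    W = np.concatenate([W_A, W_B])
    b1, b0 = np.polyfit(H, W, 1)  # fit W = b0 + b1*H on all eight rows
    # Re-impute along the fitted slope; the deviations of H_B from its
    # own mean sum to zero, so the constraint sum(W_B) == 280 is kept.
    W_B = S / 4 + b1 * (H_B - H_B.mean())

print(f"W = {b0:.6f} + {b1:.5f}*H")  # -> about 9.707035 + 0.34773*H
print(W_B)  # -> about [66.87, 73.83, 61.65, 77.65], as in the table above
```

The fixed point reproduces the table and equation above, and it is the same line as the averaged-measurements fit in the other answer.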


[Update 4]

[Excel-generated image: scatterplot of the data with the three regression lines described below]

The Excel-generated image shows the regression lines for the complete data (black), for the incomplete/estimated data (red) and for the complete data (blue). The solution of joriki might be explained by the effect that the four-fold imputation of the mean of the incomplete set adds the same "weight" of errors to the complete model as the imputation found by the iterative method, because the slope for the incomplete data can be set arbitrarily.

  • I've made 5 pictures of the iteration process. Initially I insert the mean 70 for each of the missing values and compute the regression for the complete set and for the estimated set. Next I adapt the estimated values according to the regression slope of the complete set and repeat. I show the zeroth, first, second, tenth and fiftieth iteration: http://go.helms-net.de/math/divers/mse/MSE_120223_Regr.htm The optimum is achieved when the two slopes are identical (the four y-distances are equal). (2012-02-23)