0
$\begingroup$

I have a simple best-fit-line algorithm similar to this description.

Without memorizing the points history, it is easy to calculate a rolling best fit line as long as we remember (store) the intermediate values used to calculate the BFL:

sumX,sumY           //  the sum sumX2, sumY2,       //  sum of squares sumXY, and count    //  sum of (X*Y), and count 

when a new point arrives (Xn,Yn), the line simply update the sums by adding the latest value based on Xn, Yn. then calculate BFL:

XMean = SumX / Count YMean = SumY / Count Slope = (SumXY - SumX * YMean) / (SumX2 - SumX * XMean) YInt = YMean - Slope * XMean 

to get

 Y = Slope * X + YInt 

However, as the number of points grow, the new point will have less of an effect on the best-fit-line.

is there a way to mathematically reduce this weight? for example when the count reaches 20, modify the intermediate variables to represent a line with a weight of 10 points:

when count = 20 ->      count = count/2     sumX = sumX/2    // this will result in the same XMean, but the same line?     sumY = sumY/2    // same YMean here as well 

and the values for sumXY,sumX2,sumY2 are even more perplexing to me.

Specifically, I'm looking for equations for these intermediate values to result to the same line but with less weight.

-TIA

  • 0
    You have nowhere defined "weight", so it's hard to answer your question. In any event, if the total number of points is, say, 20, then the 20th point has exactly as much influence on the final answer as the first point, so it's not clear to me that there is actually a problem here.2012-11-19
  • 1
    @Gerry Actually, if you measure "influence" of a data point as, say, the derivative of the fitted slope with respect to the point's location, then influences vary--a lot. There also is a fairly [current and natural definition of "weight"](http://en.wikipedia.org/wiki/Least_squares#Weighted_least_squares) in this context.2012-11-19
  • 0
    A duplicate of this question appeared two years ago at http://stats.stackexchange.com/questions/6920/efficient-online-linear-regression and has some answers there.2012-11-19
  • 0
    @Whuber the question is the same indeed. Their answers, however has me specifically no closer to finding answers to this problem. You are correct, as by "weight", I'm referring to influence. if I knew better, I'd state that I'm looking for a limited window regression. One approach would to be to simply have 2 sets of intermediate variable representing k points each (as opposed to 2*K data points collected) and empty one set to increase the influence of the new points and alternate. That way the influence is at least size of one set and at most two set sizes.2012-11-19
  • 0
    Actually, "weight" and "influence" are two completely different things. (See http://www.casact.org/pubs/proceed/proceed94/94123.pdf *inter alia.*) You appear to be looking for the regression version of a time-weighted average. This can be done--as I suggest in an answer to that other question--with a suitable updating algorithm.2012-11-19
  • 0
    @whuber, sure, influences vary a lot --- but they vary depending on where the point is, not on whether it's the 1st point or the 20th, right? The way I read the question, OP wants the 20th point to have the same ability to move the best-fit line that the 10th point had. Also the link you give for weight seems to be for situations where the data points have varying uncertainties --- this isn't what OP is asking about, is it?2012-11-19
  • 0
    @GerryMyerson apologies for the lack definition of weight and influence. Though you are correct that i'd like to discard the influence of the very old points, not to redistribute the uncertainty weights. The idea was to simply come up with an equation that would allow me to represent a 20 point regression as though it only has 10 or even 19. Any solution that is not O(1) won't work on this tiny device. The two "bucket" method i described is more like an air-lock as opposed to a sliding window. Something less crude would be nice. Thanks to you both, it's already more clear.2012-11-20
  • 0
    I'm still not 100% sure I understand, but how about this: make the update for the 20th point (and subsequent points) twice. That should give it twice the "importance" of the earlier points.2012-11-20
  • 0
    @GerryMyerson that would mean re-doubling the updates every 20 points which would result in overflow. Since simply dividing the intermediate variables (SumX,SumY,...) by a half would distort the line, and not being able to carry "sliding window" data, what I ended up doing was to use two sets of intermediate variables (a+b). when 'a' reaches 10 points, copy it to 'b' and zero 'a'. That way the BFL based on the combination of 'a' and 'b' will always represent the latest 10-19 points.2012-11-21

0 Answers 0