2
$\begingroup$

I have a stream of integer values being generated $V_1,\cdots, V_n$ and want to calculate how the value of $V_{n+1}$ is correlated to $V_n.$ I would also like to calculate this at run time as additional values arrive. How do I do this?

I am implementing a local search algorithm and the values I am trying to correlate are the cost weightings of consecutive local optimum. I do not know how these are related but am trying to find out! That said .. I know the values will not have a zero mean. As the quality of the final solution I get is dependent on quickly evaluating lots of solutions on the way I need these stats to be generated efficiently.

Also it is worth mentioning that the frequency with which I need to read the correlation is less than the frequency with which new data points arrive, so if there is an efficiency to be gained in leaving parts of the calculation to when I really need to know the value then this is relevant.

I am separately calculating the Standard Deviation of the values at run time so can use this to convert from covariance to correlation.

Best Regards

  • 0
    How would you calculate it from the full sequence?2011-06-30

2 Answers 2

2

Assuming the sequence is stationary and zero mean:

You want to estimate

$ r_1 = E(v_n v_{n+1})$

The usual way (though the normalizing factor can also be $M$) is simply an average $\hat{r}_1 = \frac{1}{M-1}\sum_{n=1}^{M-1} v_n \; v_{n+1} $

If you want to compute it iteratively, it's straigforward. For any sequence $x_n$ its iterative average $m_n$ is trivially $m_n = \frac{(n-1) \; m_{n-1} + x_n}{n} $

You just plug that into the above formula (i.e. replace $x_n$ by $v_n v_{n-1}$). This is exact, but can have numerical problems for large $n$; if that's a problem you can choose a memory factor $\alpha \lesssim 1 $ and compute instead $m_n = \alpha \; m_{n-1} + (1-\alpha) \, x_n $

Added: This also has the advantage of being able to adapt to non-stationary sequences, if the statistics vary slowly. You then must make a compromise on the election of $\alpha$: too small (little memory) corresponds to an effective average of few past values, and hence the estimator will have high variance; too big values will remember too much of the past, and will not adapt to the non-stationary. You could adopt some "characteristic time" $M$, an order-of-magnitude value of how many past values you want to keep, and use $\alpha = -1/\log(M)$

  • 0
    @Howard: If the process does not have zero mean, but it's constant (stationary sequence) you can estimate it by a similar procedure, and subtract it from the sequence values.2011-07-01
1

Do you know how to calculate the correlation of two data sets $\lbrace x_1,x_2,\dots,x_n\rbrace$ and $\lbrace y_1,y_2,\dots,y_n\rbrace$? Why not do that with the second set being $\lbrace x_2,x_3,\dots,x_{n+1}\rbrace$?

  • 0
    @Howard, I guess it means that you don't have to update both $\sum x_i$ and $\sum y_i$, you can just update the latter and use its previous value as the new value of the former. Similarly for the sums of the squares.2011-07-01