Say I have a range of $[0 \dots 256]$ and a stream of data representing changes in an attribute. How do I determine whether two points differ by a significant amount, e.g. $[\dots, 5, 240, \dots]$?

Is there an algorithm that works in this situation that I could use to see if the points are different in a significant way, based only on the two numbers and the range of values?

I plan to use this in image analysis to find boundaries between differently colored pixels.

  • Subtraction? 2017-01-13
  • @copper.hat I mean in context, I want to tell if something is following a pattern or if it's a random deviation. 2017-01-13

2 Answers

1

What do you call significant? You can certainly subtract them and divide by the range to see what fraction of the range they differ by. Is differing by $10\%$ of the range (here about $25$) significant?
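As a minimal sketch of that suggestion in R: the helper name `range_fraction`, its `lo`/`hi` defaults, and the 10% cutoff are illustrative assumptions taken from the numbers in this thread, not an established rule.

```r
# Fraction of the full range by which two values differ.
# lo and hi default to the 0..256 range from the question (an assumption).
range_fraction <- function(a, b, lo = 0, hi = 256) {
  abs(a - b) / (hi - lo)
}

frac <- range_fraction(5, 240)   # 235 / 256, roughly 0.92
frac > 0.10                      # TRUE under a 10% cutoff (the cutoff itself is a judgment call)
```

Whether 10% is the right cutoff still depends on what counts as significant in your data.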

  • That's part of my problem: if the values change steadily across the range, I don't want that marked as different. I want a point marked as different only when there is a drastic, random change. 2017-01-13
  • If I have $[1,3,7,10,15,21,100]$, I want the last element to stand out. 2017-01-13
  • There is no magic to it. You need to look at a bunch of data and see what you think is significant. Until you have some sort of model, you can't tell what is significant. 2017-01-13
0

Comment. @RossMillikan is correct that you need to develop a criterion for significance. In your comment on his answer you used the phrase "drastic, random change." What might that mean? Not having actual data at hand, I decided to do some simulation.

Long strings. Here are results of one experiment. I looked at strings of $n = 100$ integers chosen at random (with replacement) from $\{0, 1, 2, \dots, 256\}$ and recorded the maximum absolute difference between successive values (max.dif) in each of a million such strings. The mean was about 233.5 with SD about 12.5; about 90% exceeded 216, and about 99% exceeded 199. To me, it seems clear that a difference between successive values that exceeds 200 should always be considered remarkable.

A histogram of these max.difs is shown below.

[Figure: histogram of max.dif for n = 100]

Short strings. To see how big a difference is remarkable for relatively short strings, I repeated the simulation with $n=10.$ The mean was about 180.6 with SD about 37.7; about 90% exceeded 129, and about 99% exceeded 89. To me, it seems clear that a difference between successive values in a string of 10 that exceeds 90 should always be considered remarkable.

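# Simulate m strings of n integers drawn uniformly from 0:256 and record
# the largest absolute difference between successive values in each string.
m = 10^6;  pop = 0:256;  n = 10;  max.dif = numeric(m)
for (i in 1:m) {
  string = sample(pop, n, replace = TRUE)
  max.dif[i] = max(abs(diff(string)))
}
mean(max.dif);  sd(max.dif)
## 180.5514
## 37.70648
quantile(max.dif, c(.01, .10, .90))
## 1% 10% 90%
## 89 129 229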

Perhaps these particular experiments are not useful, but at least they are based on something approaching an objective criterion. Based on your experience, you may be able to do simulations with results that are more useful.

Control charts. An entirely unrelated path of inquiry is for you to look at the literature on 'control charts', long used by quality engineers to detect when a process is going 'out of control'. Several criteria are commonly used as signals for trouble, some of which may be useful here.
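To make the control-chart idea concrete, here is a rough sketch of an individuals (Shewhart) chart applied to the successive differences of the example $[1,3,7,10,15,21,100]$ from the comments. The function names `control_limits` and `out_of_control`, and the choice to treat the first five differences as an in-control baseline, are assumptions made for illustration. The constant 1.128 is the standard $d_2$ factor for moving ranges of span 2.

```r
# Individuals-chart limits estimated from a baseline assumed to be in control.
control_limits <- function(baseline) {
  center <- mean(baseline)
  mr_bar <- mean(abs(diff(baseline)))   # average moving range of span 2
  sigma_hat <- mr_bar / 1.128           # d2 constant for subgroups of size 2
  c(lower = center - 3 * sigma_hat, center = center, upper = center + 3 * sigma_hat)
}

# TRUE wherever a value falls outside the 3-sigma limits.
out_of_control <- function(x, limits) {
  x < limits[["lower"]] | x > limits[["upper"]]
}

x <- c(1, 3, 7, 10, 15, 21, 100)
diffs <- diff(x)                        # 2 4 3 5 6 79
limits <- control_limits(diffs[1:5])    # baseline choice is an assumption
which(out_of_control(diffs, limits))    # 6: the jump from 21 to 100
```

Charting the differences rather than the raw values is what lets a steady increase pass unflagged while a sudden jump stands out. Real control-chart practice adds further run rules beyond the 3-sigma check shown here.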

Change point in time series. (Added later.) Over lunch a colleague suggested you might look in the literature of time series for criteria to detect 'change points', usually detecting a point in time at which the mean of the series appears to change.
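One simple way to illustrate a change in mean is a least-squares search for a single split point, sketched below. This is only one of many change-point methods and is not necessarily what the time-series literature would recommend for your data; the function name `change_point` and the sample series are made up for illustration.

```r
# Return the index k that best splits x into two segments with different
# means, by minimizing the total within-segment sum of squares.
change_point <- function(x) {
  n <- length(x)
  cost <- sapply(1:(n - 1), function(k) {
    left <- x[1:k]
    right <- x[(k + 1):n]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  which.min(cost)
}

# Hypothetical series whose mean shifts after the fifth value.
x <- c(10, 12, 11, 13, 12, 40, 42, 41, 39, 43)
change_point(x)   # 5: the last index of the first segment
```

Note that this always returns some split, even when the series has no real change, so in practice it would need to be paired with a test of whether the shift is large enough to matter.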