0
$\begingroup$

I'm not very good at math but I need to know how to get a set of numbers that are closest to each other.

It means that if I have 5 numbers [-1, 2, 7, 11, 3222]. Now it is obvious (to me) without any rules to see that correct numbers would be -1, 2, 7, 11 or even -1, 2, 7. However I have no idea how to implement some kind of rule that will allow programatically filter a set of numbers and get viable ones.

You can clearly see that simple solutions like calculating median from all numbers and then getting closest numbers to it would be wrong.
Maybe if I would take median of all numbers except 3222, calculate median, get all numbers from set that are +10/-10 it would work well, however, how do I make computer know that 3222 does not belong here...

So basically I'm pretty stuck here. As I said I'm not very good with math so don't judge me by asking this silly question.

  • 0
    What exactly is your definition of "closest" here? I do not really see a pattern in your example, except "the differences are small"...2012-03-20
  • 2
    I would recommend posting to the statistics website to ask "how do I recognize the outliers in a given data set?"2012-03-20
  • 0
    What sequences are you looking at ... what are - in your model of "closeness" - the closest numbers in $[1,2,3, 1000, 1001, 1002]$, the median ansatz doesn't fit well here.2012-03-20
  • 0
    @JohannesKloos Well I would say that closest ones are the ones who are in rage of +50 and -50. Like in example that I wrote - if I take away `3222`, calculate average of the ones who are left and than take everything from set that is +50 and -50 I would have a viable array.2012-03-20
  • 0
    @martini yes, I didn't see that. Too bad, however in my data sets there is only 1-3 numbers that are so big compared to others. So if I ignore this problem is it still possible?2012-03-20
  • 0
    @GerryMyerson Thanks for advice, will do.2012-03-20
  • 0
    Would something like [bucket sort](http://en.wikipedia.org/wiki/Bucket_sort) help here? Perhaps not directly. But might give some useful ideas?2012-03-20

1 Answers 1

3

You might one to sort your data in an increasing (or decreasing, I have no problem with that) order and compute a coarse discrete derivative. Then you can split your data in a certain number of groups (let's call them clusters) which are "dense".

For instance, if you have a serie as follows: $[1, 2, 9, 7, 260, 6, 5, -8, 255, -300, -2, 0, 15, -302]$ you can sort them using a quicksort algorithm (you can find some fast implementation) and you get: $[-302, -300, -8, -2, 0, 1, 2, 5, 6, 7, 9, 15, 255, 260]$ (hope I didn't forget any) and apply a simple finite difference filter $[-1, 1]$ on you vector which gives you $[2, 242, 6, 2, 1, 1, 3, 1, 1, 2, 6, 240, 5]$. Having a look at this serie clearly tells you that you have three groups of number and the high peaks in the derivative tells you how to split your numbers.

This is a really naive approach. If you have really large, evolving data, you might want to have a look at (online) clustering algorithm.

Hope it helped