
Let's say you have a set of numbers $S$ and you want to partition it into subsets $s_1, s_2, \ldots, s_i$, where $i \le N$ for some given bound $N$.

Is there a particular operation that will group the numbers that are "close to each other"?

I will give an example to clarify the question:

Let's say you have the set $S = \{1.4, 2, 2.5, 2.7, 14, 16, 49, 57, 58\}$.

And let's say that you can have a maximum of $N=4$ subsets.

So, you might end up with a result that looks like this:

$s_1 = \{1.4, 2, 2.5, 2.7\}$
$s_2 = \{14, 16\}$
$s_3 = \{49\}$
$s_4 = \{57, 58\}$

I am looking for either the name of this kind of problem (which I could then use to research and write an algorithm) or, even better, a simple solution.

Thank you for any help with this.

  • Thanks - I have implemented jug's comment and it is working for some cases, but as whuber pointed out there are cases where it does not do what I would consider ideal. Either way it is a great starting point. I will probably end up going with what whuber suggested for my final implementation. Thanks everybody! (2010-10-06)

2 Answers

9

There are many solutions to this problem, each rediscovered many times. Search for "statistical clustering methods" or "cluster analysis." A particularly nice one for your problem is called "K-means". It was rediscovered by cartographers (!), where it is known as "Jenks' method" (worth mentioning because it's used by some popular mapping software, so perhaps more people have heard of it than of any other method). The idea behind K-means is to minimize the population-weighted sum of variances of the subsets.
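
As a rough illustration of the K-means idea in one dimension, here is a minimal Python sketch of Lloyd's algorithm; the function name `kmeans_1d`, the evenly spaced seeding, and the iteration cap are my own choices for the example, not part of Jenks' method.

```python
# Minimal 1-D k-means (Lloyd's algorithm) sketch -- an illustration,
# not Jenks' exact optimization.

def kmeans_1d(values, k, iterations=100):
    """Partition a list of numbers into at most k groups of nearby values."""
    data = sorted(values)
    # Seed the centroids with evenly spaced entries of the sorted data.
    centroids = [data[i * (len(data) - 1) // max(k - 1, 1)] for i in range(k)]

    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each value to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda j: abs(x - centroids[j]))
            clusters[nearest].append(x)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [sum(c) / len(c) if c else centroids[j]
                         for j, c in enumerate(clusters)]
        if new_centroids == centroids:  # converged
            break
        centroids = new_centroids
    return [c for c in clusters if c]

S = [1.4, 2, 2.5, 2.7, 14, 16, 49, 57, 58]
print(kmeans_1d(S, 4))
```

Run on the example set with $k=4$, this particular seeding converges to [[1.4], [2, 2.5, 2.7], [14, 16], [49, 57, 58]], a local optimum that differs from the grouping in the question. That is the usual caveat with K-means: the result depends on initialization, so in practice one does several random restarts (or uses an exact 1-D dynamic-programming variant) and keeps the partition with the smallest weighted sum of variances.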

BTW, the challenge in most clustering algorithms lies in determining the number of clusters; usually that needs to be specified (as $N = 4$ in this example) or at least hinted at by the user. For instance, there always exists a minimum K-means solution: partition $S$ into singletons. If you require fewer subsets than elements ($i < n$, where $n$ is the size of $S$), merge the two nearest neighbors but leave all the other singletons as they are: the sum of variances is then just one-quarter the square of the smallest gap between elements of $S$.
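
To spell out that last figure (my own check, using the variance of the merged pair): if the two closest elements are $a$ and $b$, the merged cluster has mean $(a+b)/2$ and variance
$$\frac{1}{2}\left[\left(a-\frac{a+b}{2}\right)^2+\left(b-\frac{a+b}{2}\right)^2\right]=\frac{(b-a)^2}{4},$$
while every singleton contributes zero, so the total is one quarter of the smallest squared gap.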

  • I remembered "clustering", but forgot "k-means". Thanks! (2010-10-05)
0

There is a somewhat related operation, condensation, in the theory of linear orderings. For example, you can take a linear ordering and identify any two elements $x, y$ that are connected by a finite chain, i.e. there are elements $x = v_1,\ldots,v_n = y$ such that $v_1 < \cdots < v_n$ and there are no elements 'in between': no $z$ with $v_i < z < v_{i+1}$ for any $i$. This will compress all copies of $\mathbb{Z}$ and $\mathbb{N}$ (or its opposite) to a single point.

This particular condensation can be used to (de)construct all countable linear orders, if I remember correctly. There are also some other condensations; take a look at Rosenstein's "Linear Orderings".