4

For an increasing sequence of numbers, $a_1 < a_2 < \dots < a_n$, I am looking for a measure that describes the extent of clustering in the sequence.

For instance, compare the sequence $1, 2, 3, 4, 100, 101, 102, 103$ with the sequence $10, 12, 32, 45, 66, 77, 89, 102$: the first is clearly more clustered.

  • 0
    Hmm, you could actually cluster the sequences, say with the k-means algorithm, and compare the objective values for various values of k. For example, running k-means with k=2 on your two examples gives an objective value of 9.611687812379864E-4 for the first sequence and 0.1847530718336484 for the second. The value for the first sequence is lower, which implies that it is much better clustered. 2012-05-27
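A minimal sketch of the k=2 comparison from the comment above. The comment does not say which implementation or normalization it used, so everything here is an assumption: the function names are made up, the best split is found by brute force instead of the iterative k-means algorithm (for sorted one-dimensional data the optimal two clusters are always contiguous), and the within-cluster sum of squares is divided by the squared range $(a_n - a_1)^2$, which is the same as rescaling the data to $[0, 1]$ first. That scaling appears to reproduce the values quoted in the comment.

```python
def sse(points):
    """Sum of squared deviations of the points from their mean."""
    m = sum(points) / len(points)
    return sum((x - m) ** 2 for x in points)


def two_means_objective(seq):
    """Best k=2 objective for a sorted sequence, on a [0, 1] scale.

    Tries every split into two contiguous non-empty parts, keeps the
    smallest within-cluster sum of squares, and divides it by the
    squared range (an assumed normalization, see the text above).
    """
    best = min(sse(seq[:i]) + sse(seq[i:]) for i in range(1, len(seq)))
    return best / (seq[-1] - seq[0]) ** 2


clustered = [1, 2, 3, 4, 100, 101, 102, 103]
spread = [10, 12, 32, 45, 66, 77, 89, 102]

print(two_means_objective(clustered))
print(two_means_objective(spread))
```

A lower value means the two clusters explain more of the spread of the sequence. Repeating this for larger k would need a general k-means routine or a dynamic program over split points.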

1 Answer

1

This was too long for a comment, so I am posting it as an answer.

Well, clustering is a tough nut to crack, but there are some approximate approaches. Let $b_n = a_{n+1}-a_{n}$ be the sequence of consecutive differences; then any function that measures how close the $(b_n)$ are to zero is more or less what you are looking for. For example, take a histogram $(c_k)$ of $(b_n)$, where $c_k$ counts the differences that fall in bin $k$, and calculate $\sum_k \frac{c_k}{2^k}$. Small differences then contribute the most, so a larger sum indicates more clustering. (It is easy to come up with different but similar formulas; you should pick the one that suits your sequences best.)
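Here is a rough sketch of the histogram score, under the assumption that the differences are positive integers, so that each distinct difference $k$ can be its own bin. The names `gaps` and `histogram_score` are just illustrative; for real-valued data you would need to choose a binning first.

```python
from collections import Counter


def gaps(seq):
    """Consecutive differences b_i = a_{i+1} - a_i of an increasing sequence."""
    return [b - a for a, b in zip(seq, seq[1:])]


def histogram_score(seq):
    """Sum of c_k / 2**k, where c_k is the number of gaps equal to k.

    Assumes integer gaps, so every distinct gap value is its own bin.
    """
    counts = Counter(gaps(seq))
    return sum(c / 2 ** k for k, c in counts.items())


clustered = [1, 2, 3, 4, 100, 101, 102, 103]
spread = [10, 12, 32, 45, 66, 77, 89, 102]

print(histogram_score(clustered))
print(histogram_score(spread))
```

On the two sequences from the question this gives roughly 3.0 for the first and roughly 0.25 for the second, so the first scores as more clustered. Note that the score grows with the length of the sequence, so dividing by the number of gaps may be sensible when comparing sequences of different lengths.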

As another example, take the variance: for two sequences of differences $(b_n)$ with the same mean, low variance means constant, steady growth (no clustering), while high variance necessarily means some clustering. You might also want to look at higher moments, e.g. kurtosis, but I have no idea whether that will be relevant to the sequences you are considering.
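A small sketch of the variance idea. Since the two example sequences from the question do not have exactly the same mean difference, this version divides the standard deviation by the mean (the coefficient of variation). That normalization is my own assumption, added to make sequences with different mean differences comparable; the function names are illustrative.

```python
from statistics import mean, pstdev


def gaps(seq):
    """Consecutive differences b_i = a_{i+1} - a_i of an increasing sequence."""
    return [b - a for a, b in zip(seq, seq[1:])]


def gap_cv(seq):
    """Coefficient of variation of the gaps: population std / mean.

    0 means perfectly even spacing; larger values mean the gaps are
    more uneven, which is what clustering looks like.
    """
    b = gaps(seq)
    return pstdev(b) / mean(b)


clustered = [1, 2, 3, 4, 100, 101, 102, 103]
spread = [10, 12, 32, 45, 66, 77, 89, 102]

print(gap_cv(clustered))
print(gap_cv(spread))
```

For the sequences in the question this comes out at about 2.3 for the first and about 0.44 for the second, in line with the first being more clustered. An evenly spaced sequence such as $1, 2, 3, 4, 5$ gives 0.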

Finally, you could calculate the probability of your sequence in some probability space. For example, you could assume a distribution whose mean is the mean of your sequence and whose concentration around that mean reflects your idea of a cluster.

I know that what I wrote is very vague, but I hope it gives you some ideas. Cheers!