
I'm a programmer, not a math guy, so please answer in English. ;)

Suppose I have a multi-modal univariate distribution like:

.. . ..                         ...........                    .. . .. .

but with each "cluster" (where each cluster is normally distributed) much further apart, and with more clusters. If I do a density plot of this in R, it will be spiky, and some of the less dense spikes may come out rough because the "optimal" bandwidth is dominated by the denser clusters.

Compare to a unimodal distribution like:

.          .         . ..    .  . . ... .. . .      .. .    .   .      .

The density plot of this distribution would look just fine.

What property describes the multi-modality of a distribution? I'm pretty sure the former distribution would be better modeled by separating each cluster into its own distribution and plotting the density of each separately. But I'm unsure how to separate the distribution into these clusters robustly.
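
To make the bandwidth issue concrete, here is a minimal R sketch; the data, cluster locations, and cut points below are all invented purely for illustration:

    # Three tight, well-separated clusters of very different sizes, analogous
    # to the first sketch above (all values invented for illustration).
    set.seed(1)
    x <- c(rnorm(500, mean = 0,   sd = 0.5),   # dense cluster
           rnorm(30,  mean = 50,  sd = 0.5),   # sparse cluster
           rnorm(30,  mean = 100, sd = 0.5))   # sparse cluster

    # density() picks a single global bandwidth; with most of the data in the
    # dense cluster, that bandwidth is too small for the sparse clusters, so
    # their spikes come out rough.
    plot(density(x), main = "One global bandwidth for all clusters")

    # Splitting the data by cluster and estimating each density separately
    # lets each piece choose its own bandwidth.
    groups <- cut(x, breaks = c(-Inf, 25, 75, Inf))
    par(mfrow = c(1, 3))
    for (g in levels(groups)) plot(density(x[groups == g]), main = g)

Of course, the cut points here are hard-coded, which is exactly the part I don't know how to do robustly.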

  • It's NP-hard to actually figure out a good k-means approximation. One idea would be to help k-means by doing an initial peak-detection step. If you convolve your distribution with a sliding box function, it will effectively average out the values at each point in the interval of the box. If you then set a threshold value for the set of averages, you'll effectively be pulling out the peaks (a rough sketch follows below). (2012-09-18)
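
For what it's worth, a rough R sketch of that box-filter idea; the bin count, window width, and threshold are arbitrary illustrative choices, and peak_centers is just a made-up helper name:

    peak_centers <- function(x, bins = 200, window = 5, threshold = 2) {
      h <- hist(x, breaks = bins, plot = FALSE)
      # Convolving the bin counts with a sliding box is just a moving average.
      smoothed <- stats::filter(h$counts, rep(1 / window, window), sides = 2)
      smoothed[is.na(smoothed)] <- 0
      # Bins whose smoothed count clears the threshold form runs; take the
      # middle of each run as one peak location.
      above <- as.vector(smoothed > threshold)
      runs <- rle(above)
      ends <- cumsum(runs$lengths)
      starts <- ends - runs$lengths + 1
      mapply(function(s, e) mean(h$mids[s:e]),
             starts[runs$values], ends[runs$values])
    }

    # The detected peaks can then seed k-means instead of random starts:
    # km <- kmeans(x, centers = matrix(peak_centers(x), ncol = 1))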

1 Answer


If I understand your question, you have a set of data points that represent a single variable (say, height measurements of people), and you outline two scenarios: one where the data have one mode, the other multiple modes. When plotted against another variable, data points near different modes will tend to appear in different groups or clusters, and typically you want to determine which data points belong to which groups. In the height example, a plot of height on the y-axis versus weight on the x-axis may show distinct clusters of people which largely explain the multi-modality.

Somebody mentioned k-means clustering to determine which data points belong to which cluster, and that can be used. Another method is model-based clustering using mixture models (finite normal mixture models when the component densities are normal). These models can be fitted in R; they use likelihood functions to fit the models and to perform the clustering (determining which data points are in each cluster). As far as I know, one way to determine whether 1, 2, 3, … clusters are present in the data is to fit a sequence of finite mixture models with 1, 2, 3, … components and to compare the overall likelihoods. Usually measures called AIC and/or BIC are used for the comparisons; these correct for the fact that the different models use different numbers of parameters. As I understand things, formal statistical tests for these comparisons are not valid (i.e. we cannot make probability statements about the comparisons); rather, we simply compare the AIC/BIC values.
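
One convenient way to do this in R is the mclust package, which fits finite normal mixtures for a range of component counts and picks among them by BIC. A minimal sketch, assuming your data are in a numeric vector x:

    library(mclust)

    # Fit univariate normal mixtures with 1 to 5 components and compare by BIC.
    fit <- Mclust(x, G = 1:5)
    summary(fit)               # chosen number of components, mixing proportions
    plot(fit, what = "BIC")    # BIC for each candidate number of components
    head(fit$classification)   # cluster assignment for each data point

    # A separate density plot per cluster, as suggested in the question:
    par(mfrow = c(1, fit$G))
    for (g in seq_len(fit$G)) plot(density(x[fit$classification == g]))

As far as I know, mclust does the selection by BIC; if you want AIC you would have to compute it yourself from the reported log-likelihood and number of parameters.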

Have a look at the paper “Finite mixture models and model-based clustering” by Melnykov (2010) for a good summary. You can find it at http://projecteuclid.org/DPubS?service=UI&version=1.0&verb=Display&handle=euclid.ssu/1272547280.