
Given a set of two-dimensional independent data, how can I eliminate outliers to perform more accurate clustering? I have considered calculating the IQR for each coordinate individually and then checking whether a point is an outlier in either the $x$ or the $y$ values, but that doesn't cover all cases. Take this graph, for instance:

[Figure: dataset with outlier]

The point clearly doesn't fit the pattern and would ruin a good cluster analysis, but it is an outlier in neither the $x$ nor the $y$ values.

1 Answer


Comment. I am confused about what you mean by '2-D independent data'.

The example you show has points (except one) exactly on a line. In such linear (or nearly linear) cases you might try to eliminate points that have large residuals from the regression line (or from the principal axis of a cluster).
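As a minimal sketch of the residual idea, the Python below fits a least-squares line and flags points whose residual is large relative to a robust scale. The data are made up to mimic the picture in the question (points on $y = 2x + 1$ with one point moved off the line), and the function names, the MAD-based scale, and the threshold of 3.5 are illustrative assumptions, not a standard you must use.

```python
from statistics import median

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def flag_residual_outliers(xs, ys, threshold=3.5):
    """Return indices whose residual is far from the median residual.

    Scale is 1.4826 * MAD (median absolute deviation); the threshold
    is an assumed, adjustable cutoff.
    """
    slope, intercept = fit_line(xs, ys)
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    center = median(residuals)
    deviations = [abs(r - center) for r in residuals]
    scale = 1.4826 * median(deviations)
    if scale == 0:  # residuals are (almost) all identical: nothing to compare against
        return []
    return [i for i, d in enumerate(deviations) if d / scale > threshold]

# Hypothetical data: y = 2x + 1, except the point at x = 5 is moved to y = 3.
# That point lies inside the range of both the x values and the y values.
xs = list(range(10))
ys = [2 * x + 1 for x in xs]
ys[5] = 3
flagged = flag_residual_outliers(xs, ys)
print(flagged)
```

Note that the outlier itself pulls the fitted line, so the other residuals are not exactly zero; using the median and MAD rather than the mean and standard deviation keeps that distortion from hiding the bad point.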

However, this is a very artificial example. My experience in practice with regressions on actual data is that it is easy to remove a few points that have large residuals from a model. But then if you do a new regression with the remaining points, you will find another set of 'annoying' outliers. And so on. At some point this 're-modeling and deletion' has to stop or you will delete all your data.

More generally, you need to settle on a clear definition of outlier, find a metric that matches that definition, then delete points with large values of the metric. For sequential approaches, you need a sensible stopping rule for 'outlier' removal.
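One possible way to make "metric plus stopping rule" concrete is sketched below. It assumes the squared Mahalanobis distance from the sample mean as the metric, removes the single worst point per pass, and stops when no point exceeds a cutoff or when a fixed fraction of the data has been removed. The cutoff of 5.99 (roughly the 95% point of a chi-square with 2 degrees of freedom), the 20% cap, the function names, and the data are all assumptions chosen for illustration.

```python
def sq_mahalanobis(points):
    """Squared Mahalanobis distance of each 2-D point from the sample mean."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det <= 0:
        raise ValueError("covariance is numerically singular (collinear points)")
    # Closed-form inverse of the 2x2 covariance matrix.
    return [((x - mx) ** 2 * syy - 2 * (x - mx) * (y - my) * sxy
             + (y - my) ** 2 * sxx) / det for x, y in points]

def trim_outliers(points, cutoff=5.99, max_fraction=0.2):
    """Remove the worst point one at a time; return removed original indices.

    Stopping rule (assumed): stop when the largest distance is at most
    `cutoff`, or when `max_fraction` of the points has been removed.
    """
    keep = list(range(len(points)))
    removed = []
    max_removed = int(max_fraction * len(points))
    while len(removed) < max_removed:
        d2 = sq_mahalanobis([points[i] for i in keep])
        worst = max(range(len(keep)), key=d2.__getitem__)
        if d2[worst] <= cutoff:
            break
        removed.append(keep.pop(worst))
    return removed

# Hypothetical data: roughly y = 2x + 1 with small noise, one point moved off the line.
noise = [0.3, -0.2, 0.1, -0.3, 0.2, 0.0, -0.1, 0.3, -0.2, 0.1]
points = [(x, 2 * x + 1 + e) for x, e in zip(range(10), noise)]
points[5] = (5, 3.0)
removed = trim_outliers(points)
print(removed)
```

The cap on the fraction removed is what prevents the "delete everything" failure: without it, each refit can promote a new point to "worst", and a loose cutoff would keep the loop going.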

It is useful to realize that not all 'outliers' need to be removed. In some settings they may be the most informative points.

(a) In a set of data on earthquakes, the only ones of interest to the general public are the outliers. Regions that are susceptible to large devastating quakes usually have a continual 'undercurrent' of small quakes, mainly detectable only with sensitive instrumentation.

(b) In some emerging technologies, such as those for detecting genetic anomalies in DNA, all of the 'signal' may be in the outliers, and the stuff that is easy to model is almost entirely the 'noise'.

(c) There is a story (possibly a rumor) that the ozone hole over the South Pole was discovered only belatedly because monitoring equipment systematically screened out early warnings as 'useless outliers'.

Maybe the title of your post should be "Understand Outliers in Multivariate Data."