6
$\begingroup$

I have some sets of numbers, and I'd like to have a way to talk about how close these sets are to each other. I'm not sure what properties it should have (e.g. does it need to be a metric?). But I hope there are examples of better defined problems like this out there with some good solutions.

Some considerations. {1} should be closer to {1.1} than it is to {2}. And {.9,1.1} should be closer to {1} than {0,2} is. But is {1,1.1} closer to {1} than {2}? I suspect any distance function will have at least one parameter to control the tradeoff between nearness of points and cardinality of the sets.

  • 5
    http://en.wikipedia.org/wiki/Hausdorff_distance2011-09-28
  • 0
    You could consider each set as a uniform probability distribution on a discrete set and consider the [earth mover's distance](http://en.wikipedia.org/wiki/Earth_mover%27s_distance) between them. That article also describes an extension that would be useful if you want to take differences in cardinalities into account.2011-09-28
  • 0
    @RahulNarain Does this really capture the differences in cardinalities well? If I have the sets {1} and {1,1+epsilon}, the EMD will be very small.2011-09-28
  • 0
    @pythonic: I suppose I should have said "uniform *measures*" instead of "uniform *probability distributions*", as they shouldn't sum to unity. If you consider $\mu_1(1) = 1$ and $\mu_2(1) = 1$, $\mu_2(1+\epsilon) = 1$, then the ([modified](http://en.wikipedia.org/wiki/Earth_mover%27s_distance#Extensions)) EMD is $0$ (move $1$ unit of mass from position $1$ to position $1$) plus $\sigma$ (delete $1$ unit of mass from position $1+\epsilon$). This $\sigma$ acts like a tradeoff parameter.2011-09-29

2 Answers 2

6

Let $A$ be a finite subset of $\mathbb{R}$. For $x\in\mathbb{R}$ define $d(x,A)$, the distance from $x$ to $A$, to be $$d(x,A) = \min\{\vert x-a\vert:a\in A\},$$ the distance from $x$ to the closest point of $A$. If $B$ is another finite subset of $\mathbb{R}$, define $d^*(A,B)$, the asymmetric distance from $A$ to $B$, to be $$d^*(A,B) = \max\{d(a,B):a\in A\}.$$ Finally, define $d(A,B)$, the distance between $A$ and $B$, to be $$d(A,B) = \max\{d^*(A,B),d^*(B,A)\}.$$ This is the Hausdorff distance mentioned by deinst.

A few examples: $$\begin{align*}d(\{0.9,1.1\},\{1\}) &= \max\{0.1,0.1\} = 0.1\\ d(\{0.9,1.1\},\{2\}) &= \max\{1.1,0.9\} = 1.1\\ d(\{1,1.1\},\{1\}) &= \max\{0.1,0\} = 0.1\\ d(\{1,1.1\},\{2\}) &= \max\{1,0.9\} = 1\\ d(\{0.9,1.1\},\{1.9,2.1\}) &= \max\{1,1\} = 1\\ d(\{0.9,1.1\},\{1.9,2.0,2.1\}) &= \max\{1,1\} = 1\\ d(\{1.9,2.0,2.1\},\{1.9,2.1\}) &= \max\{0.1,0\} = 0.1 \end{align*}$$

3

There are multiple distances which you might consider in principle. Given that you're considering finite sets, it makes sense to reduce to metrics popular on real vector spaces.

Let $S$ and $T$ be finite sets of real numbers. In the case that $S$ and $T$ have the same cardinality $n$, let $\mathbf s, \mathbf t \in \mathbb R^n$ be vectors whose coefficients are elements of $S$ and $T$ respectively (in any order). Then, for any positive integer $p$ (and for $p = \infty$), you may consider a $p$-distance defined by $$ \min_\Pi\; \| \Pi \mathbf s - \mathbf t \|_p$$ where $\Pi$ ranges over all permutation matrices on $\mathbb R^n$, and where $$\| \mathbf v \|_p := \sqrt[\Big.^{\scriptstyle p}]{\sum_{j=1}^n |v_j|^p\;}$$ (taking the limit as $p \to \infty$ for the norm $\| \mathbf v \|_\infty$). If $S$ and $T$ have different cardinalities, you may vary this (in the case that $S$ is larger) by replacing $\Pi$ with an arbitrary matrix whose columns are distinct standard basis vectors (selecting some subset of $S$ to perform the distance computation with).

Regardless of the dimensions, you could also generalize this to consider a distance of the form $$ \min_{\Pi_S, \Pi_T}\; \bigl\| \Pi_{\!S} \;\mathbf s \;-\; \Pi_{\!T} \;\mathbf t \bigr\|_p$$ where $\Pi_S$ and $\Pi_T$ range over matrices whose columns are standard basis vectors, possibly with repetition; this allows you to compute vector metrics using arbitrary pairs of elements of $S$ and $T$.

Note that as the cardinalities of the sets grow, these quantities are likely to be prohibitive to compute; but the sets of vectors which I describe using these permutation/selection matrices will describe nice convex sets, for which the usual vector metrics make a certain amount of sense, and which may prove amenable to analysis. And if you consider instead the limit $p \to -\infty$, you should encounter the minimum-pair-distance norm described by Brian in his earlier post.

  • 0
    Any reason you didn't just use `\sqrt[p]{...}` ($\sqrt[p]{...}$) in the expression for $\lVert\mathbf v\rVert_p$?2011-09-28
  • 0
    Yes: because for whatever reason, it renders more poorly in this case. At least for me. You appear to have noticed that I didn't do that: does it look strange to you?2011-09-28
  • 0
    @NieldeBeaudrap: on my Firefox screen the $p$ that is supposed to be the root is partly within the radical sign.2011-09-28
  • 0
    In my browser, the superscript $p$ is drawn directly on top of the vertical stem of the radical sign. Another unfortunate example of the web being not like print...2011-09-28
  • 0
    Also, what you are describing is the $p$th [Wasserstein distance](http://en.wikipedia.org/wiki/Wasserstein_metric) (a.k.a. the [earth mover's distance](http://en.wikipedia.org/wiki/Earth_mover%27s_distance) when $p = 1$), isn't it?2011-09-28
  • 0
    I've edited the TeX to be simpler, because slight suboptimality on my browser is trumped by bad display in multiple browsers.2011-09-28