The usual formula for Euclidean distance is
$d(x,y):=\sqrt{\sum (x_i - y_i)^2}$
Now, as far as I know, sums of squares usually come with some problems with respect to numerical precision.
There is an obviously equivalent formula:
$d(x,y):= c \sqrt{\sum \left(\frac{x_i - y_i}{c}\right)^2}$
where it seems to be common practice to choose $c = \max_i |x_i - y_i|$.
For 2d, this simplifies to a formula of the form $d(x,y):= c \sqrt{1 + \left(\frac{b}{c}\right)^2}$, where $b$ is the smaller of the two absolute differences.
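To make the comparison concrete, here is a minimal Python sketch of both variants. The function names `naive_distance` and `scaled_distance` are made up for illustration, and the zero check for $c$ is an assumption added to avoid a division by zero when both points coincide.

```python
import math

def naive_distance(x, y):
    """Square root of the plain sum of squared differences."""
    return math.sqrt(sum((a - b) * (a - b) for a, b in zip(x, y)))

def scaled_distance(x, y):
    """Two-pass variant: first find c = max |x_i - y_i|, then sum scaled squares."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    c = max(diffs, default=0.0)
    if c == 0.0:
        # All differences are zero (or the input is empty): avoid dividing by zero.
        return 0.0
    return c * math.sqrt(sum((d / c) * (d / c) for d in diffs))

# Differences whose squares overflow or underflow a double.
big = scaled_distance([3e200, 0.0], [0.0, 4e200])      # close to 5e200
small = scaled_distance([3e-200, 0.0], [0.0, 4e-200])  # close to 5e-200
```

For the same inputs, `naive_distance` returns `inf` in the first case and `0.0` in the second, because the squared differences leave the representable range before the square root is taken.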
Some questions here:
- How big is the gain in precision of doing this, in particular for high dimensionalities?
- How much does it increase computational costs?
- Is this choice of $c$ optimal?
To compute $c$, this needs two passes over the data. However, it should be possible in a single pass, by starting with $c_0=1$, and then adjusting it when necessary for optimal precision.
E.g. let $c_0=1$, $S_0=0$ and $c_i=\max_{j\leq i} |x_j-y_j|$ for $i \geq 1$. Then $S_i:=\sum_{j\leq i} \left(\frac{x_j - y_j}{c_i}\right)^2 = \sum_{j\leq i-1} \left(\frac{x_j - y_j}{c_{i-1}}\right)^2 \cdot \frac{c_{i-1}^2}{c_i^2}+\left(\frac{x_i - y_i}{c_i}\right)^2 = S_{i-1} \cdot \left(\frac{c_{i-1}}{c_i}\right)^2+\left(\frac{x_i - y_i}{c_i}\right)^2$, and the distance is $c_n \sqrt{S_n}$ for dimensionality $n$. This should allow single-pass computation of this formula, right?
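The recurrence can be sketched in Python as follows. This is only an illustration under some assumptions: the name `single_pass_distance` is made up, inputs are assumed to be finite, and the running maximum starts at 0 instead of $c_0=1$, which gives the same result because the running sum is still 0 when it is first rescaled.

```python
import math

def single_pass_distance(x, y):
    """Single-pass scaled Euclidean distance, rescaling when the maximum grows."""
    c = 0.0  # running maximum of |x_j - y_j|
    s = 0.0  # running sum of ((x_j - y_j) / c)^2 relative to the current c
    for a, b in zip(x, y):
        d = abs(a - b)
        if d > c:
            # New maximum: rescale the old sum by (c_old / c_new)^2,
            # then add the new term, which is (d / d)^2 = 1.
            r = c / d
            s = s * r * r + 1.0
            c = d
        elif c > 0.0:
            # Maximum unchanged: add the scaled square of the current term.
            r = d / c
            s += r * r
    return c * math.sqrt(s)

d = single_pass_distance([3.0, 0.0], [0.0, 4.0])  # a 3-4-5 triangle
```

Compared with the naive loop, each element costs one extra comparison and one division, plus two more multiplications whenever the maximum changes.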
Any comments, in particular on the computational cost and precision benefits of computing Euclidean distance this way? Why is everybody using the naive way? Is the gain in precision too small for low dimensionality and the associated computational cost too high?
P.S. At least to my understanding, the usual formula should be safe for differences in the range from sqrt(Double.MIN_NORMAL) to sqrt(Double.MAX_VALUE), which covers around $10^{\pm 154}$, at most divided by the dimensionality. So even for 1000 dimensions, that should be fine for most uses of a distance function ...
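A quick way to check these bounds, sketched in Python rather than Java: `sys.float_info.max` and `sys.float_info.min` are the Python counterparts of `Double.MAX_VALUE` and `Double.MIN_NORMAL`, and the helper `naive_norm` is a made-up name for the plain sum-of-squares formula applied to the differences.

```python
import math
import sys

# Square roots of the largest finite and smallest normal double:
# roughly 1.3e154 and 1.5e-154.
upper = math.sqrt(sys.float_info.max)
lower = math.sqrt(sys.float_info.min)

def naive_norm(diffs):
    """Plain sqrt of the sum of squares of already computed differences."""
    return math.sqrt(sum(d * d for d in diffs))

# 1000 dimensions, each difference 1e150: the sum of squares is about 1e303,
# still below the largest double, so the result is finite.
ok = naive_norm([1e150] * 1000)

# Each difference 1e155: a single square already overflows, so the result is inf.
overflowed = naive_norm([1e155] * 1000)
```

This only probes overflow at the top of the range. It says nothing about rounding error inside the range.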