5
$\begingroup$

Given the Mahalanobis distance:

$D_M^2(x) = (x-\mu)^T S^{-1}(x-\mu)$

How can I obtain the probability that $x = ( x_1, x_2, x_3, \dots, x_N )^T$ belongs to the data set described by the covariance matrix $S$ and mean vector $\mu = ( \mu_1, \mu_2, \mu_3, \dots , \mu_N )^T$? If the sample count is needed, it is denoted $m$.

I would like something I can use in a computer algorithm.

Related to this: how can I obtain the hyper-ellipsoid that defines the confidence region for, e.g., 95%?

  • 0
    I don't understand the question. You need the probability that a certain point belongs to a certain data set? Also, are you assuming some distribution, like a multivariate normal distribution? 2012-05-09
  • 1
    As for your second question, I believe the $\chi^2$-distribution will be helpful. 2012-05-09
  • 0
    I am assuming a multivariate normal distribution, yes. 2012-05-18
  • 0
    I think the question is very clear and I would also love to see a good answer. Let's say I have a GMM of some data and would like to know whether a random point x is represented in that model. Let's say I have the statistics for 20 Gaussian distributions that are mixed to represent my data. I can use the Mahalanobis distance to get a good idea of the distance between my point and each of the 20 centroids, but I'd prefer to have the probability of that point belonging to my model distribution. 2012-09-25
  • 1
    I agree the question is very clear and would also like a clear mathematical answer. Given that HarryMath is referencing the Mahalanobis distance, it follows that he is using multivariate data with a Gaussian assumption. The Mahalanobis distance is the number of standard deviations that a point is from the center of a cluster. Therefore the question is: given cov(cluster) and the Mahalanobis distance to a point, what is the probability that the point is in the cluster? I think p(x=C) is simply 1-cdf(MahalD). I would like to have it verified. 2013-11-15
  • 0
    @JerryGregoire This is correct, with the center of the cluster being the mean $\mu$ of the cluster. 2013-11-17

2 Answers

1

The question is still worded poorly. I am assuming you have a data point and want to know whether it could come from a specified multivariate normal distribution. If $S$ and $\mu$ are sample estimates from a data set, I interpret "belongs to the data set" as meaning that those estimates define the multivariate normal distribution of interest.

Under that interpretation, the Mahalanobis distance determines ellipsoids of constant probability density for that multivariate normal: a fixed distance $d$ defines one such ellipsoid. Choose $d$ large enough that the ellipsoid contains, say, 95% of the probability mass. If the point has distance less than $d$, it is inside the ellipsoid and can reasonably be assumed to come from the multivariate normal (at least you don't have strong evidence that it does not). For a distance greater than $d$, you conclude that it more likely is not from the specified normal.

Now the question is how to get $d$. I will sketch the approach, as this looks like it could be a homework problem in a course on multivariate analysis. $S$ should be a symmetric positive definite matrix and hence is diagonalizable. So there exists a matrix that transforms the coordinates such that the new variables, which are linear combinations of the original correlated normals, are uncorrelated and hence independent. Rescaling these normals then gives them all the same variance, so the ellipsoid is transformed into a sphere. The sum of squares of these random variables (normalized to all have variance 1) has a chi-square distribution with $p$ degrees of freedom, where $p$ is the dimension of the multivariate normal distribution, i.e. the $N$ of the question (this depends on the assumption that $S$ has full rank, which follows from positive definiteness).

The sum of squares that determines the chi-square value can be calculated directly from the Mahalanobis distance $d$ of your point: because the transformations are exactly the ones that diagonalize and rescale $S$, the value is $r = d^2$, the squared Mahalanobis distance. You compare $r$ to the critical value of the chi-square distribution with $p$ degrees of freedom to get your answer. You should find this in Anderson's or Mardia's multivariate book, and the orthogonalization is there or in almost any linear algebra book. I am sure that many mathematics computer packages can do all of this for you, except for the chi-square critical value, which you can get from an Excel function or a statistical table.
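The steps sketched in this answer could be put into code roughly as follows. This is a minimal sketch using only the Python standard library, under the assumption that $S$ is symmetric positive definite and that the chi-square approximation is acceptable (i.e. $S$ and $\mu$ are treated as known rather than estimated). All function names (`cholesky`, `mahalanobis_sq`, `chi2_cdf`, `chi2_quantile`) are hypothetical helpers written for illustration; the series used for the chi-square CDF is only intended for moderate values of $D_M^2$.

```python
import math

def cholesky(S):
    """Lower-triangular L with L L^T = S (S must be symmetric positive definite)."""
    n = len(S)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = S[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    raise ValueError("S must be symmetric positive definite")
                L[i][j] = math.sqrt(s)
            else:
                L[i][j] = s / L[j][j]
    return L

def mahalanobis_sq(x, mu, S):
    """Squared Mahalanobis distance (x - mu)^T S^{-1} (x - mu)."""
    L = cholesky(S)
    d = [xi - mi for xi, mi in zip(x, mu)]
    # Forward substitution solves L z = d; z holds the whitened coordinates,
    # and their sum of squares equals the squared Mahalanobis distance.
    z = []
    for i in range(len(d)):
        z.append((d[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i])
    return sum(zi * zi for zi in z)

def chi2_cdf(x, df):
    """Chi-square CDF via the power series of the regularized lower
    incomplete gamma function P(df/2, x/2)."""
    if x <= 0.0:
        return 0.0
    a, t = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while term > 1e-16 * total:
        term *= t / (a + n)
        total += term
        n += 1
    return min(1.0, total * math.exp(a * math.log(t) - t - math.lgamma(a)))

def chi2_quantile(prob, df):
    """Critical value c with chi2_cdf(c, df) = prob (0 < prob < 1), by bisection."""
    lo, hi = 0.0, 1.0
    while chi2_cdf(hi, df) < prob:
        hi *= 2.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if chi2_cdf(mid, df) < prob:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Illustrative numbers only (not from the question).
mu = [0.0, 0.0]
S = [[4.0, 0.0], [0.0, 1.0]]
x = [2.0, 1.0]

d2 = mahalanobis_sq(x, mu, S)                    # 2^2/4 + 1^2/1 = 2
p_value = 1.0 - chi2_cdf(d2, len(mu))            # tail probability beyond x
inside_95 = d2 <= chi2_quantile(0.95, len(mu))   # is x inside the 95% ellipsoid?
```

Here `p_value` is the probability that a draw from the normal distribution lies at least as far from $\mu$ as $x$ does, and the 95% hyper-ellipsoid is the set of points whose squared distance is at most `chi2_quantile(0.95, N)`. If SciPy is available, `scipy.stats.chi2.sf` and `scipy.stats.chi2.ppf` give the same two quantities without the hand-written series.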

0

Say you have 3 classes, with Mahalanobis distances

$D_1^2 = (x-\mu_1)^T S_1^{-1}(x-\mu_1)$

$D_2^2 = (x-\mu_2)^T S_2^{-1}(x-\mu_2)$

$D_3^2 = (x-\mu_3)^T S_3^{-1}(x-\mu_3)$

NF (normalization factor): $\mathrm{NF} = e^{-D_1^2} + e^{-D_2^2} + e^{-D_3^2}$

P1 (probability of belonging to collection 1): $P_1 = e^{-D_1^2} / \mathrm{NF}$
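As a rough illustration, the normalization in this answer could be coded as below. The function names are hypothetical. The first function follows the formula exactly as written here. The second is not part of this answer: it is a variant under the assumption that each class is multivariate normal with known prior weight, in which case Bayes' rule weights class $k$ by $\pi_k \, |S_k|^{-1/2} e^{-D_k^2/2}$ (the common $(2\pi)^{-N/2}$ factor cancels).

```python
import math

def membership_probs(d2_list):
    """Normalize exp(-D_k^2) over the classes, as in the formula above."""
    m = min(d2_list)
    # Subtracting the smallest squared distance leaves all ratios unchanged
    # but keeps the exponentials from underflowing to 0/0.
    w = [math.exp(-(d2 - m)) for d2 in d2_list]
    total = sum(w)
    return [wi / total for wi in w]

def gaussian_posteriors(d2_list, log_dets, priors):
    """Assumed variant: posterior class probabilities for Gaussian classes.

    d2_list  -- squared Mahalanobis distances D_k^2
    log_dets -- log-determinants log|S_k| of the class covariances
    priors   -- prior class weights pi_k (positive, summing to 1)
    """
    logs = [math.log(p) - 0.5 * ld - 0.5 * d2
            for d2, ld, p in zip(d2_list, log_dets, priors)]
    m = max(logs)
    w = [math.exp(v - m) for v in logs]
    total = sum(w)
    return [wi / total for wi in w]

# Illustrative squared distances for 3 classes (made-up numbers).
probs = membership_probs([1.0, 4.0, 9.0])
```

Both functions return a list that sums to 1, with the largest entry for the class whose center is closest in Mahalanobis distance (for the second function, this holds when the priors and determinants are equal).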