
We have a random variable $x$ with an unknown distribution. We want to find a transformation $y=f(x)$, where $f$ is a monotonically increasing function, such that $y$ has a Gaussian distribution. We randomly pick $N$ samples of $x$, say $x_1,\dots,x_N$. How can we define $f$ from the values $x_1,\dots,x_N$ so that, if $N$ is large enough, $y$ becomes Gaussian?

EDIT: I think the question was misunderstood, so let me clarify a little. We are given $N$ random samples from an unknown distribution, and we are asked to design a monotonically increasing function $f:\mathbb{R} \to \mathbb{R}$ so that $y=f(x)$ is Gaussian. One approach is to estimate the distribution of $x$ from the $N$ given samples, map the samples to $N$ values from a normal distribution, for example $N(0,1)$, and then, for any given $x$, find $f(x)$ by interpolating between the mapped values of the known samples. Is there a simple way to do this?

  • How about considering $\Phi(y) = {\rm P}(f(X) \le y) = {\rm P}(X \le f^{-1}(y)) = F_X(f^{-1}(y))$, where $\Phi$ is the ${\rm N}(0,1)$ distribution function and $F_X$ the distribution function of $X$ (which can be well approximated if $N$ is large enough), leading to $f^{-1}(y) = F_X^{-1}(\Phi(y))$, or equivalently $f(x) = \Phi^{-1}(F_X(x))$ when $F_X$ is invertible. (2011-06-21)
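
As a rough illustration of the relation in this comment, rearranged as $f(x) = \Phi^{-1}(F_X(x))$, here is a minimal Python sketch for the case where $F_X$ is known exactly. The function names and the choice of an Exp(1) distribution are assumptions made for the example only; in the question itself $F_X$ is unknown and would have to be estimated from the samples.

```python
import math
from statistics import NormalDist


def gaussianize_with_known_cdf(x, cdf):
    """Return Phi^{-1}(F_X(x)) for a continuous, strictly increasing CDF F_X."""
    p = cdf(x)
    # inv_cdf requires 0 < p < 1, so x must lie strictly inside the support.
    return NormalDist().inv_cdf(p)


def exponential_cdf(x):
    # Hypothetical example distribution: Exp(1), with F_X(x) = 1 - exp(-x), x >= 0.
    return 1.0 - math.exp(-x)


# The median of Exp(1) is ln 2, so it should map (up to rounding)
# to the median of N(0, 1), which is 0.
print(gaussianize_with_known_cdf(math.log(2.0), exponential_cdf))
```

Because both $F_X$ and $\Phi^{-1}$ are increasing, the composition is increasing as well, which is the monotonicity the question asks for.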

1 Answer


Here is an example of an algorithm: it may not be optimal but it is simple.

Sort the values so that $x_1 \le x_2 \le \cdots \le x_n$. If $x = t x_k + (1-t) x_{k+1}$ with $0 \le t \le 1$ and $1 \le k \le n-1$, then let $y = f(x) = t \Phi^{-1} \left(\frac{2k-1}{2n}\right) + (1-t) \Phi^{-1} \left(\frac{2k+1}{2n}\right)$, where $\Phi^{-1}$ is the inverse of the cumulative distribution function of a standard normal distribution. This is monotonically increasing (strictly so if the sample values are all different).
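
A minimal Python sketch of this interpolation, using only the standard library, might look as follows. The name `make_gaussianizer` is made up for the example, and the handling of tied sample values at the top of the range is my own assumption, since the description above does not specify it. The code uses the weight $w = 1 - t$.

```python
from bisect import bisect_right
from statistics import NormalDist


def make_gaussianizer(samples):
    """Build f from the samples; f is only defined on [min(samples), max(samples)]."""
    xs = sorted(samples)
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two samples")
    std_normal = NormalDist()
    # Target value for the k-th smallest sample: Phi^{-1}((2k - 1) / (2n)), k = 1..n.
    zs = [std_normal.inv_cdf((2 * k - 1) / (2 * n)) for k in range(1, n + 1)]

    def f(x):
        if not xs[0] <= x <= xs[-1]:
            raise ValueError("x is outside the sample range")
        # 0-based index i with xs[i] <= x, capped so that xs[i + 1] exists.
        i = min(bisect_right(xs, x) - 1, n - 2)
        x_lo, x_hi = xs[i], xs[i + 1]
        if x_hi == x_lo:
            # Assumption: tied values at the top of the range map to the largest target.
            return zs[i + 1]
        w = (x - x_lo) / (x_hi - x_lo)  # w = 1 - t in the notation above
        return zs[i] + w * (zs[i + 1] - zs[i])

    return f


f = make_gaussianizer([3.0, 1.0, 2.0])
print(f(1.0), f(1.5), f(2.0), f(3.0))
```

With three samples the middle one is sent to $\Phi^{-1}(1/2) = 0$, and the smallest and largest are sent to $\Phi^{-1}(1/6)$ and $\Phi^{-1}(5/6)$, which are symmetric about zero.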

There are other possible algorithms, but like this linear interpolation, they make implicit assumptions about the unknown distribution.

The other problem is extrapolation outside the range of the sample, where you really have no information at all. You might do something like this: with $\sigma$ the standard deviation of the sample, if $x=x_1-k\sigma$ for positive $k$ then let $y = f(x) = \Phi^{-1} \left(\frac{1}{2n}\right) - k$, and if $x=x_n+k\sigma$ for positive $k$ then let $y = f(x) = \Phi^{-1} \left(\frac{2n-1}{2n}\right) + k$. This too is monotonically increasing. But it is arbitrary.
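
The tail rule can be sketched in Python in the same spirit. The function name `extrapolate_tail` is hypothetical, and the use of `statistics.stdev` (the $n-1$ denominator) for $\sigma$ is an assumption, since the text does not say which estimate of the standard deviation is meant.

```python
from statistics import NormalDist, stdev


def extrapolate_tail(x, samples):
    """Return f(x) for x strictly below the smallest or above the largest sample."""
    xs = sorted(samples)
    n = len(xs)
    sigma = stdev(xs)  # assumption: sample standard deviation with n - 1 denominator
    if sigma == 0:
        raise ValueError("all samples are equal, so sigma is zero")
    std_normal = NormalDist()
    if x < xs[0]:
        k = (xs[0] - x) / sigma  # distance below x_1 in units of sigma
        return std_normal.inv_cdf(1 / (2 * n)) - k
    if x > xs[-1]:
        k = (x - xs[-1]) / sigma  # distance above x_n in units of sigma
        return std_normal.inv_cdf((2 * n - 1) / (2 * n)) + k
    raise ValueError("x lies inside the sample range; interpolate instead")


samples = [1.0, 2.0, 3.0]
print(extrapolate_tail(0.0, samples), extrapolate_tail(4.0, samples))
```

Each step of one $\sigma$ away from the sample range moves $y$ by exactly one unit, so the tails are straight lines of slope $1/\sigma$ attached to the end values $\Phi^{-1}\left(\frac{1}{2n}\right)$ and $\Phi^{-1}\left(\frac{2n-1}{2n}\right)$.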

In fact any algorithm will be arbitrary. And it is often an error to normalize data, since by doing so you are losing distributional information from the sample.