I have a hard time understanding the meaning of the coefficient $\frac{1}{nh}$ in the KDE formula $$\hat f_h(x) = \frac{1}{nh}\sum_{i=1}^nK\left(\frac{x-X_i}{h}\right),$$ where the kernel is $$K(y)=\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{y^2}{2}\right).$$

What is the meaning/goal of this coefficient? I have a feeling that $\frac{1}{n}$ is something like uniform weights, but what's the point of $\frac{1}{h}$?

Thank you.

  • 0
    $h$ can be interpreted as a bandwidth, i.e. how closely around your data points you are looking. Unfortunately I haven't mastered this theory yet either, so I'm afraid I can't help much more. 2017-01-03
  • 0
    @Sanderr, thanks for your reply. I understand the meaning of $h$ itself, but I don't understand why we divide the sum by that bandwidth. 2017-01-03

1 Answer

I have always read the formula as consisting of two parts. Given a kernel $K$ (Gaussian in your case, but others are possible), you can define a rescaled version, say $\dot{K}$, as \begin{align} \dot{K}(x) = \frac{1}{h}K\left(\frac{x}{h}\right). \end{align} Changing $h$ changes the width of the kernel, while the factor $\frac{1}{h}$ keeps $\dot{K}$ a proper density function. If your kernel $K$ has finite support, say $[-1,1]$, then $\dot{K}$ is supported on $[-h,h]$, so changing $h$ changes which points get zero weight and which points get positive weight. The Gaussian kernel is supported on all of $\mathbb{R}$, so there is no hard cut-off in this case and the effect is perhaps less visible.

The second part of the formula is a simple average. With $\dot{K}$ we can rewrite $\hat{f}_h(x)$ as \begin{align} \hat{f}_h(x) &= \frac{1}{n} \sum_{i=1}^n \dot{K}(x - X_i)\\ &= \frac{1}{n} \sum_{i=1}^n \frac{1}{h}K\left(\frac{x - X_i}{h}\right)\\ &= \frac{1}{nh} \sum_{i=1}^n K\left(\frac{x - X_i}{h}\right). \end{align} So the $\frac{1}{n}$ in $(nh)^{-1}$ comes from averaging $\dot{K}$ at the points $x-X_i$, and the $\frac{1}{h}$ comes from rescaling the kernel to bandwidth $h$, which determines how much weight points far away from $x$ receive.
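As a quick numerical check of the role of $\frac{1}{h}$, here is a minimal sketch in plain Python. It is only an illustration under stated assumptions: the function names (`gaussian_kernel`, `kde`, `integrate`), the simulated sample, the integration range $[-15, 15]$ and the midpoint rule are all choices made for this example and are not part of the question. It integrates $\hat f_h$ for several bandwidths and compares the result with what you would get if the $\frac{1}{h}$ factor were dropped.

```python
import math
import random


def gaussian_kernel(y):
    """Standard normal density, i.e. the kernel K from the question."""
    return math.exp(-0.5 * y * y) / math.sqrt(2.0 * math.pi)


def kde(x, samples, h):
    """Evaluate f_hat_h(x) = 1/(n*h) * sum_i K((x - X_i) / h)."""
    n = len(samples)
    return sum(gaussian_kernel((x - xi) / h) for xi in samples) / (n * h)


def integrate(f, lo, hi, steps=2000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    dx = (hi - lo) / steps
    return sum(f(lo + (k + 0.5) * dx) for k in range(steps)) * dx


# Hypothetical data: 50 draws from a standard normal (an assumption for the demo).
random.seed(0)
samples = [random.gauss(0.0, 1.0) for _ in range(50)]

areas = {}
for h in (0.25, 0.5, 1.0, 2.0):
    # Area under the estimate with the full 1/(n*h) factor.
    areas[h] = integrate(lambda x: kde(x, samples, h), -15.0, 15.0)
    # Dropping 1/h multiplies the estimate by h at every x, so the area scales by h.
    area_without_h = areas[h] * h
    print(f"h={h}: area with 1/(nh) = {areas[h]:.6f}, "
          f"area with 1/n only = {area_without_h:.6f}")
```

With the $\frac{1}{nh}$ factor the area should come out close to $1$ for every bandwidth, whereas with $\frac{1}{n}$ alone it comes out close to $h$, so the estimate would only be a density when $h = 1$.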

  • 0
    Thanks for your answer! I understand the meaning of $x/h$ (the number of standard deviations we are away from the mean of the distribution, i.e. a z-score). The bigger $h$ is, the smoother the graph. But I cannot understand why we multiply $K$ by $1/h$ in the rescaled version. Is it that the bigger $h$ is, the less influence each individual Gaussian curve has? 2017-01-03
  • 0
    We need to multiply by $1/h$ to guarantee that $\dot{K}$ integrates to $1$ for every value of $h>0$, i.e. that $\dot{K}$ is a valid density function for every $h$. 2017-01-03
  • 0
    Maybe that was not clear in my answer, but it is necessary to require that $\dot{K}$ is a density function (i.e. integrates to $1$) to guarantee that $\hat{f}_h$ will be a valid density function: \begin{align} \int_{\mathbb{R}} \hat{f}_h(x) dx = \frac{1}{n}\sum_{i=1}^n \int_{\mathbb{R}} \dot{K}(x - X_i) dx = \frac{1}{n} \sum_{i=1}^n 1 = 1. \end{align} The scaling $\tfrac{1}{h}K(\cdot/h)$ is the same as in a location-scale model (just to give another perspective). 2017-01-03
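
For completeness, the step that each rescaled kernel integrates to $1$ can be written out with a change of variables. This is a sketch that assumes only that $K$ itself integrates to $1$ and that $h>0$; the substitution variable $u$ is introduced here for illustration. With $u = \frac{x - X_i}{h}$, so that $dx = h\,du$, \begin{align} \int_{\mathbb{R}} \frac{1}{h}K\left(\frac{x - X_i}{h}\right) dx = \int_{\mathbb{R}} \frac{1}{h}K(u)\, h\, du = \int_{\mathbb{R}} K(u)\, du = 1. \end{align} The factor $h$ produced by the substitution is exactly what the $\frac{1}{h}$ cancels; without it each term would integrate to $h$ instead of $1$.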