
I am going through a book on statistical learning and ran into a problem concerning k-nearest-neighbour methods. The book says that using least squares to determine the optimal $k$ will lead to $k=1$. I tried proving this but am not sure how to differentiate the summation part, i.e. $\sum_{x_{i} \in N_{k}(x)} Y_{i}$, since $k$ is involved in the summation.

To help me out, could someone show me how to differentiate the kNN formula itself with respect to $k$, please?

Thanks! PS: the formula for kNN is:

$\hat{Y}(x)=\frac{1}{k}\sum_{x_{i} \in N_{k}(x)} Y_{i}$
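For intuition, here is a minimal numeric sketch of the claim, using made-up toy data (the names `knn_predict` and `training_rss` are mine, not from the book): if the training points themselves are used to score the fit, the least-squares error is already zero at $k=1$, because each training point is its own nearest neighbour.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 10.0, size=20)            # made-up 1-D inputs
Y = np.sin(X) + rng.normal(0.0, 0.1, size=20)  # noisy targets

def knn_predict(x, k):
    """kNN regression: average the Y of the k training points nearest to x."""
    nearest = np.argsort(np.abs(X - x))[:k]
    return Y[nearest].mean()

def training_rss(k):
    """Least-squares error (residual sum of squares) over the training set itself."""
    return sum((knn_predict(x, k) - y) ** 2 for x, y in zip(X, Y))

# training_rss(1) == 0.0: each x_i is its own nearest neighbour, so Y-hat(x_i) = Y_i,
# while training_rss(k) for k > 1 mixes in other noisy Y values and is positive.
```

So optimising $k$ against the training error always selects $k=1$; the book's point is that this criterion is degenerate, not that $k=1$ generalises well.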

  • Could you tell what $Y_i$ is? (2017-02-28)
  • @Med, the formula predicts the value of $y$ given $x$ by looking at the $k$ closest $x_i$ and their corresponding $Y_i$. (2017-02-28)
  • I am not sure, but I think you need more assumptions and conditions to be able to proceed. Specifically, there should be an analytical way (like a function) to find the labels (classes) of the neighbours of an input $x$. The formula you have provided is very general. (2017-02-28)
  • What reasonable assumptions do you have in mind? Could you post a solution with the assumptions you make? (2017-03-03)

1 Answer


Here is a crude sketch of what I have in mind. Note that it is only one way of formulating the problem.

Firstly, note that a real machine-learning problem is a discrete problem, in the sense that some data points with their corresponding labels (classes) are given to you, and you have to guess the label of a new data point according to the rule

$$\hat{Y}(X)=\frac{1}{k}\sum_{X_{i} \in N_{k}(X)} Y_{i}$$

So, we are given labelled data points $T=\{(X_i,Y_i)\mid X_i \in \mathbb{R}^n,\; Y_i \in\{0,1,\dots,c-1\}\}$, where we have $c$ classes.

At this point, we would like to define a continuous function $f:\mathbb{R}^n \rightarrow\mathbb{R}$ such that its floor, $\lfloor f(X)\rfloor$, gives the correct class whenever an input $X_i$ from the set $T$ is given to it.

For any optimisation, you need to define a cost function. One can be defined as below.

$$Cost=\sum_{i}\Big(\hat{Y}(X_i)-\lfloor f(X_i)\rfloor\Big)^2=\sum_{i}\bigg(\frac{1}{k}\sum_{X_{j} \in N_{k}(X_i)} Y_{j}-\lfloor f(X_i)\rfloor\bigg)^2$$

In order to use calculus to analyse the problem, we need to make further modifications.

  1. Firstly, the set $T$, which is finite in practice, should be extended so that it is defined for all inputs $X \in \mathbb{R}^n$.
  2. The second thing is to get rid of the discontinuity produced by the function $\lfloor f(X)\rfloor$. To that end, we allow any number in $\mathbb{R}$ to be a possible class; in other words, we now have an infinite number of classes, so we no longer need the floor function.
  3. The last thing is to have a density defined on the data space. For simplicity, take the distribution to be uniform. The result is that the cost function can be written as below.

$$Cost=\int_{X \in \mathbb{R}^n} \bigg(\frac{1}{V(S_r(X))}\int_{T \in S_r(X)}f(T)dT -f(X) \bigg )^2 dX$$

$S_r(X)$ is the open sphere centred at $X$ with radius $r$, and $V(S_r(X))$ is its volume. $V(S_r(X))$ is supposed to be the continuous counterpart of $k$. Also, we have $V(S_r(X))=\alpha r^n$, where $\alpha$ is a constant depending only on $n$.


We may rewrite the cost function to get:

$$Cost=\int_{X \in \mathbb{R}^n} \bigg( \frac{1}{\alpha r^n} \int_{T \in S_r(X)}f(T)dT -f(X) \bigg )^2 dX$$

The term $\int_{T \in S_r(X)}f(T)dT$ can be approximated by

$$\int_{T \in \mathbb{R}^n} \delta (X,r)\,f(T)\,dT$$

where $\delta(X,r)$ is a Gaussian density in $T$ with mean $X$ and variance $r$, so that points at a large Euclidean distance $d(T,X)$ from $X$ contribute little. As $r$ goes to $0$, the sphere integral $\int_{T \in S_r(X)}f(T)dT$ goes to $V(S_r(X))\,f(X)=\alpha r^n f(X)$.

From the last two equations, you should conclude that $r=0$ gives the minimum: the integrand of the cost is non-negative and vanishes in that limit. Note that $r=0$ is equivalent to $k=1$, which means that you consider only the point $X$ itself when evaluating the label $\hat{Y}(X)$.
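To see this limiting behaviour numerically, here is a small 1-D sketch (the helper name `sphere_average` and the choice $f=\cos$ are mine; any smooth $f$ works the same way): the sphere average $\frac{1}{V(S_r(X))}\int_{S_r(X)}f$ approaches $f(X)$ as $r \to 0$, so the integrand of the cost shrinks to zero.

```python
import numpy as np

def sphere_average(f, x, r, n_pts=10001):
    """Average of f over the interval (x - r, x + r), the 1-D 'open sphere',
    approximated by the mean of f on a fine uniform grid."""
    t = np.linspace(x - r, x + r, n_pts)
    return f(t).mean()

f = np.cos   # arbitrary smooth test function
x0 = 1.0
errs = [abs(sphere_average(f, x0, r) - f(x0)) for r in (1.0, 0.1, 0.01)]
# errs shrinks as r -> 0, mirroring the claim that r = 0 minimises the cost
```

For smooth $f$ the discrepancy is of order $r^2$, which is why shrinking the neighbourhood always reduces this cost.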

You can also take the derivative of the cost function with respect to $r$ and set it equal to $0$, but that is a tedious process.

  • Where does the volume function come from? Also note that it is still possible to use kNN for continuous response variables, which would reduce the three modifications you make. (2017-03-03)
  • But I have to appreciate the elegance of this method! (2017-03-03)
  • Also, could you put a link to an article, another question, or a book that has followed a similar procedure to convert a discrete problem to a continuous one? (2017-03-03)
  • I do not have a reference for it; I just used what I knew previously. The proof is not rigorous: I only tried to show that you need more assumptions to be able to perform a formal optimisation. Also, the volume function is in direct relation with $k$; the larger the volume, the more prototypes you have in the neighbourhood. (2017-03-03)