
Background Information:

Start with the sample $X_1,\ldots, X_{N}$ and sort it so that $X_1\leq X_2\leq \cdots \le X_N$. In our case the data set is $x_1 = 0.2$, $x_2 = 0.6$, $x_3 = 0.7$. Suppose $X\sim \mathcal{U}(0,1)$; then the cumulative distribution function of $X$ is $$F(x) = x \quad\text{for } 0\leq x\leq 1$$ We have $$D_N = \sum_{-\infty < x < \infty}|F_N(x) - F(x)|$$ where $$F_N(x) = \begin{cases} 0 \ &\text{if } x < X_1\\ k/N \ &\text{if } X_k\leq x < X_{k+1}\\ 1 \ &\text{if } x \geq X_N \end{cases}$$ The first and last terms are $$\sup_{x < X_1}|-F(x)| = F(X_1)$$ $$\sup_{x \geq X_N} |1 - F(x)| = 1 - F(X_N)$$ For the other terms, observe that $$\sup_{X_k\leq x < X_{k+1}}\left|\frac{k}{N} - F(x)\right| = \max\left(F(X_{k+1}) - \frac{k}{N},\frac{k}{N} - F(X_k)\right), \quad k = 1, \ldots, N - 1$$

$D^{+}$ and $D^{-}$: $$\begin{aligned}D_{N}&=\max\left[F(X_{1}),\max_{k=1,\ldots,N-1}\left(F(X_{k+1})-\tfrac{k}{N},\tfrac{k}{N}-F(X_{k})\right),1-F(X_{N})\right]\\ &= \max\left[\underbrace{\max_{k=1,\ldots,N}\left(\tfrac{k}{N}-F(X_{k})\right)}_{D^{+}},\underbrace{\max_{k=1,\ldots,N}\left(F(X_{k})-\tfrac{k-1}{N}\right)}_{D^{-}}\right]\\&=\max\{D^{+},D^{-}\}\end{aligned}$$ The formulas simplify if $X\sim\mathcal{U}(0,1)$, since then $F(x)=x$.

Question:

Compute $D_N = \max\{D^{+},D^{-}\}$ for the data set $x_1 = 0.2$, $x_2 = 0.6$, $x_3 = 0.7$. Take $F$ to be the c.d.f. of $U(0,1)$, the uniform distribution on $(0,1)$. (Do these computations by hand; no computer code.) What do you think $D^{+}$, $D^{-}$, $D_N$ measure, intuitively?

Attempted solution: in our case, with $x_1 = 0.2$, we get $$D_1 = F(x_1) = 0.2$$ and $$D_3 = F(x_3) = 0.7$$

I am not sure how to get $D_2$, or whether I am doing this correctly. I am also unsure about the intuition asked for in the last part of the question. Any suggestions are greatly appreciated.

  • 0
    I think it's a good idea to plot the empirical CDF $F_3(x)$ on the same plot as the theoretical CDF $F(x)$; then the problem reduces to simple geometry. This geometry could also suggest an answer to the question about intuition, at least from a geometric point of view. 2017-02-05
  • 0
    @AlexanderRodin could you provide a solution if you know this? 2017-02-05
  • 0
    @AlexanderRodin started a bounty in case you are interested. 2017-02-07
  • 1
    The CDF of a standard uniform distribution $X\sim U(0,1)$ in the $(0,1)$ range is $F(x) = x $. Why do you set $F(x) = x -1$? 2017-02-08

1 Answer


I am somewhat unsure about one detail in your question: in the standard KS test, the $D_{N}$ quantity is defined as $D_{N}=\sup_{x\in\mathbb{R}}|F_{N}(x)-F(x)|$. I think the expression you give is a typo in MathJax from "\sup" to "\sum". My answer below is for "\sup".

With this, your $F_{N}$ is the empirical cdf calculated from your data, and $D_{N}$ is the largest absolute difference between the empirical cdf and the theoretical cdf. (Your $D_{1}$ and $D_{3}$ make no sense to me: the subscript of $D_{N}$ is the sample size, and there are $3$ observations, that is, $N=3$.)

What is the empirical cdf? It is, substituting your data into your definition of $F_{N}$: $$F_{N}(x)=\begin{cases} 0&\text{ if }x<0.2\\ \frac{1}{3}&\text{ if }x\in[0.2,0.6)\\ \frac{2}{3}&\text{ if }x\in[0.6,0.7)\\ 1&\text{ if }x\geq0.7\\ \end{cases}$$
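The exercise asks for a hand computation, so the following is only a way to check the piecewise formula above. It is a minimal Python sketch; the helper name `ecdf` and the list of test points are illustrative choices, not part of the question.

```python
from bisect import bisect_right

def ecdf(sample):
    """Return the empirical cdf F_N of `sample` as a function of x."""
    xs = sorted(sample)
    n = len(xs)

    def F_N(x):
        # bisect_right counts how many observations are <= x,
        # so F_N is right-continuous and jumps by 1/n at each data point.
        return bisect_right(xs, x) / n

    return F_N

F3 = ecdf([0.2, 0.6, 0.7])

# Evaluate on each piece of the case distinction, including the jump points.
for x in (0.1, 0.2, 0.5, 0.6, 0.65, 0.7, 0.9):
    print(x, F3(x))
```

The values at $0.2$, $0.6$ and $0.7$ are the "higher" values $\tfrac13$, $\tfrac23$, $1$, matching the half-open intervals in the definition.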

What the question asks you to calculate is $D_{N}$. (The hints about the first and the last terms are supposed to help with that.) Let me try to help with the figure below (it's straightforward to draw it by hand, since you are prevented from using code). The blue line is the theoretical cdf, the magenta line is the empirical cdf, and the yellow line is the absolute difference between the two. Your $D_{N}$ is the $\sup$ of this difference. I think it is $0.3$, attained at $x=0.7$.

[Figure: theoretical cdf (blue), empirical cdf (magenta), and their absolute difference (yellow)]
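For readers who cannot see the figure, the yellow curve can be approximated numerically. This is a rough sketch only: a finite grid approximates the $\sup$ rather than computing it exactly, and the grid size and the names `abs_diff`, `sup_val`, `x_at` are arbitrary assumptions made for illustration.

```python
from bisect import bisect_right

data = sorted([0.2, 0.6, 0.7])
n = len(data)

def abs_diff(x):
    """|F_N(x) - F(x)| for the U(0,1) cdf F(x) = x on [0, 1]."""
    F_N = bisect_right(data, x) / n   # empirical cdf (right-continuous)
    return abs(F_N - x)

# Evaluate the difference on a fine grid over [0, 1] and take the largest value.
grid = [i / 10000 for i in range(10001)]
x_at = max(grid, key=abs_diff)
sup_val = abs_diff(x_at)

print(round(sup_val, 3), round(x_at, 3))
```

On this grid the largest difference is about $0.3$ and occurs at about $x=0.7$, which is what the figure suggests.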

How do we get $D_{N}=0.3$? A more systematic approach is to calculate $D^{+}$ and $D^{-}$. You have $D^{+}=\max_{k=1,\ldots,N}\left(\frac{k}{N}-F(X_{k})\right)$ and $D^{-}=\max_{k=1,\ldots,N}\left(F(X_{k})-\frac{k-1}{N}\right)$. What are these? Both $D^{+}$ and $D^{-}$ focus on the difference between the empirical and the theoretical cdf at the data values. This is where the difference is largest: between data points the empirical cdf is constant while the theoretical cdf is nondecreasing, so the gap can only peak at a data point or just to its left. At each data point, the empirical cdf jumps up, and hence has a "lower" value in the left neighborhood of the data point and a "higher" value in the right neighborhood. $D^{+}$ is the largest difference between the higher value of the empirical cdf and the theoretical cdf. $D^{-}$ is the largest difference between the theoretical cdf and the lower value of the empirical cdf. Loosely speaking, $D^{+}$ is the largest positive difference between the two cdfs (the empirical cdf lies above the theoretical one) and $D^{-}$ is the largest negative difference (the empirical cdf lies below).

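With the help of the figure, substituting into the definitions gives $$\begin{aligned} D^{+}&=\max\{\tfrac{1}{3}-0.2,\tfrac{2}{3}-0.6,\tfrac{3}{3}-0.7\}\approx\max\{0.133,0.067,0.3\}=0.3\\ D^{-}&=\max\{0.2-\tfrac{0}{3},0.6-\tfrac{1}{3},0.7-\tfrac{2}{3}\}\approx\max\{0.2,0.267,0.033\}\approx0.267\\ \end{aligned}$$ Hence $D_{N}=\max\{D^{+},D^{-}\}=0.3$, attained at $k=3$, that is, at $x=0.7$, in agreement with the value read off the figure.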
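As a check on the hand computation (the exercise itself asks for no code), the $D^{+}$ and $D^{-}$ formulas can be evaluated directly. This is a minimal sketch; the function name `ks_statistics` and its `cdf` parameter are hypothetical names chosen here, with the $U(0,1)$ cdf $F(x)=x$ as the default.

```python
def ks_statistics(sample, cdf=lambda x: x):
    """Return (D_plus, D_minus, D_N) for `sample` against the given cdf.

    The default cdf is F(x) = x, i.e. the U(0,1) case from the question.
    """
    xs = sorted(sample)
    n = len(xs)
    # D+ : empirical cdf just after each jump (k/n) minus theoretical cdf.
    d_plus = max(k / n - cdf(x) for k, x in enumerate(xs, start=1))
    # D- : theoretical cdf minus empirical cdf just before each jump ((k-1)/n).
    d_minus = max(cdf(x) - (k - 1) / n for k, x in enumerate(xs, start=1))
    return d_plus, d_minus, max(d_plus, d_minus)

d_plus, d_minus, d_n = ks_statistics([0.2, 0.6, 0.7])
print(round(d_plus, 3), round(d_minus, 3), round(d_n, 3))  # prints: 0.3 0.267 0.3
```

Rounding is used only for display; the unrounded values differ from $0.3$ and $4/15$ by floating-point error.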

  • 0
    I made some edits to my question; sorry for the typos. 2017-02-08
  • 0
    @Snarf No problem with the typos. Don't worry. I made some edits to my answer. I hope it is clear now. 2017-02-09