Let $X$ be a discrete random variable with Laplacian distribution with mean $0$ and scale $\lambda$, as
$$ p(X) = \frac{1}{2\lambda} \exp\left(-\frac{|x|}{2\lambda}\right), \\ X \in \{0,1,...,255\}. $$
Suppose we have $N$ observations $\{x_1,x_2,...,x_N\}$ for $X$ and want to estimate the $\lambda$. One solution is to use Maximum likelihood method which simply estimates $\lambda$ as the mean of $x_i$.
My confusion is about another solution employed in a C++ code. Apparently the implemented approach is to estimate $\lambda$ such that minimizes the conditional entropy. Here is the this code:
You can skip this code. I have represented it as a mathematical expression in the following
double BestLambda, BestEntropy;
BestEntropy = LARGE;
for (double lambda = 0; lambda <= 10.0; lambda += 0.01) {
    // Form Q vector
    double sum = 0.0;
    for (int i = 0; i < 255; i++) {
        pQVec[i] = exp( -lambda * abs(i) );
        sum += pQVec[i];
    }
    //Normalize pQVec
    for(int i = 0; i < 255; i++)
    {
        pQVec[i] = pQVec[i] / sum;
    }
    // Calculate conditional entropy
    double EntropySum = 0.0;
    for (int i = 0; i < 255; i++) {
        if (XFreq[i] > 0) {
            EntropySum -= (XFreq[i] * log(pQVec[i])/log(2.0));
        }
    }
    // Compare against best entropy
    if (EntropySum < BestEntropy) {
        BestEntropy = EntropySum;
        BestLambda = lambda;
    }
}
In the mathematical terms, this program estimates $\lambda$ in this way,
$$ \DeclareMathOperator*{\argmin}{\arg\!\min} \hat\lambda = \argmin_{\lambda} \left\{ - \sum\limits_{i=1}^{255} f(x=i)\log_2(p_{\lambda}(x=i)) \right\}, $$
where $f(x=i)$ is the frequency of $x$'s with value $i$ in the samples and $p_\lambda(x)$ is a Laplacian distribution with mean 0 and scale $\lambda$.
While this code is implemented and published by the guys at Stanford and supported by their highly-cited paper, I have doubt if their approach in estimating $\lambda$ is accurate. I have studied a little bit on Information Theory and cannot figure out what we are actually minimizing in this expression.
Shouldn't we minimize the distance between two distributions? Using Kullback–Leibler divergence we could estimate $\lambda$ as
$$ \hat\lambda = \argmin_{\lambda} \left\{ D(f(x)||p_{\lambda}(x) ) \right\} = \argmin_{\lambda} \left\{ \sum_{i=1}^{255} \left[ f(x=i)\log_2\left(\frac{f(x=i)}{p_{\lambda}(x=i)}\right) \right] \right\} . $$
