
The topic/problem is related to hashing for data structures used in programming, but I seek formal treatment. I hope that by studying the problem I will be enlightened of the fundamental limitations of hashing. So, the purpose is educational.

Let's consider a family of functions $f_i$ which will be applied on a random variable $X$, with unknown weight function $w$, drawn from some finite set $P$.

$f_i: P \rightarrow Q$

$\Vert P\Vert=n>\Vert Q\Vert =m$ (therefore $f$ is irreversible)

The entropy for $X$ is:

$H(X) = q$

What is the function $f_i$ that maximizes

$\min_w \{H(f_i(X))\}\qquad(\max)$.

Clarification: The objective function we are trying to maximize is the minimum entropy for $f_i(X)$, taken over the various possible weight functions $w$ for $X$.
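To make the objective concrete, here is a small sketch in Python (toy sizes and the choice $f(x) = x \bmod m$ are my own, purely illustrative): it computes $H(X)$ and $H(f(X))$ for one particular weight function $w$.

```python
import math

def entropy(ps):
    """Shannon entropy (bits) of a sequence of probabilities."""
    return sum(-p * math.log2(p) for p in ps if p > 0)

n, m = 8, 4                       # |P| = 8, |Q| = 4
f = lambda x: x % m               # one candidate f_i (arbitrary choice)

# one particular weight function w on P = {0, ..., 7}
w = {x: 1 / n for x in range(n)}  # uniform, so H(X) = log2(8) = 3 bits

# push w through f to obtain the distribution of f(X)
fw = {}
for x, p in w.items():
    fw[f(x)] = fw.get(f(x), 0) + p

print(entropy(w.values()))   # 3.0
print(entropy(fw.values()))  # 2.0 = log2(m); this w loses log2(n/m) bits
```

The question then asks which $f_i$ makes the worst case of the second number, over all $w$ with $H(X) = q$, as large as possible.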

To paraphrase: what is the transformation that promises to preserve the most entropy in its output, regardless of the input's distribution (as long as the entropy of the input is some fixed $q$)? The question can also be raised with the ratio

$\min_w \left\{\dfrac{H(f_i(X))}{H(X)}\right\}\qquad(\max)$,

or the difference

$\min_w \{H(f_i(X)) - H(X)\}\qquad(\max)$,

and then $q$ can be permitted to vary. I suspect there will be subtleties that I am not trained well enough to see (such as the degenerate case $q=0$).

The asymptotic behavior is interesting when $P$ is much larger than $Q$ and $Q$ is itself sufficiently large, that is, $\frac nm\to+\infty$ and $m\to+\infty$. In particular, finding a sequence of functions that preserves the entropy completely as $m$ and $n$ grow indefinitely, or at least preserves it up to $\min(q, \log_2 m)$, would have great significance. If no such functions exist, is there a proof that this is the case?

I will be happy to have such result for the continuous case, if it simplifies things.

Also, I am interested in finding literature that comparatively studies the effect of certain well-known functions (modulo, division, etc.) on the entropy of their input, based on general properties of the input's distribution (not bound to particular distributions).

Post-graduate CS level. Not the most math-savvy, but I have passed introductory courses in probability and statistics, and in information theory.

Thanks and Regards,

  • Edit: the size of $P$ is greater than that of $Q$; the inequality in my post originally pointed the wrong way. (2012-06-11)

1 Answer


You won't find any very good solution that maximizes the minimal output entropy over all possible probability distributions for $X$. No matter which $f$ you decide on, the adversary could then choose a probability distribution that concentrates all of the entropy of $X$ in values that map to the same $Q$ value -- for a resulting entropy of zero. This makes your ratio of entropies rather uninteresting.
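The adversary's move can be made concrete with a sketch (the choice $f(x) = x \bmod m$ and the toy sizes are arbitrary): put the entire probability mass, uniformly, on inputs that all map to the same bucket.

```python
import math

def entropy(ps):
    """Shannon entropy (bits) of a sequence of probabilities."""
    return sum(-p * math.log2(p) for p in ps if p > 0)

n, m = 16, 4
f = lambda x: x % m               # arbitrary fixed choice of f

# adversarial w: uniform over the n/m inputs that all land in bucket 0
support = [x for x in range(n) if f(x) == 0]   # {0, 4, 8, 12}
w = {x: 1 / len(support) for x in support}

out = {}
for x, p in w.items():
    out[f(x)] = out.get(f(x), 0) + p

print(entropy(w.values()))    # 2.0 bits of input entropy ...
print(entropy(out.values()))  # 0.0 -- all of it destroyed by f
```

The same construction works against any fixed $f$, since some bucket always has a preimage of size at least $\lceil n/m\rceil$.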

It works slightly better to ask about the amount of lost entropy. In the worst case where the output entropy is 0, the input entropy cannot be greater than the log of the largest number of $p$ values that go into the same $q$ bucket. So in order to minimize the lost entropy you should choose an $f$ that maps either $\lfloor n/m\rfloor$ or $\lceil n/m\rceil$ different $p$s to each $q$. Any deviation from this will give the adversary a chance to make you lose more than $\log\frac nm$ entropy.
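A brute-force sketch of this comparison (both toy functions below are my own, chosen only to contrast balanced against unbalanced bucketing): the worst case an adversary can force, given output entropy 0, is the log of the largest bucket.

```python
import math
from collections import Counter

n, m = 10, 4   # toy sizes

def worst_case_loss(f):
    """Entropy (bits) an adversary can force f to lose: a uniform
    distribution on the largest preimage carries log2(bucket size)
    bits of input entropy yet yields output entropy 0."""
    buckets = Counter(f(x) for x in range(n))
    return math.log2(max(buckets.values()))

balanced = lambda x: x % m                        # bucket sizes 3, 3, 2, 2
unbalanced = lambda x: 0 if x < n - m + 1 else x  # one bucket of size 7

print(worst_case_loss(balanced))    # log2(3) ~ 1.58 bits
print(worst_case_loss(unbalanced))  # log2(7) ~ 2.81 bits
```

The balanced map attains the optimum $\log_2\lceil n/m\rceil$; any unbalanced map necessarily does worse.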

However, that's the most specific advice you're going to get out of the problem as formulated, since you have abstracted away all structure of $P$, $Q$, and $X$ except for their cardinality.

What one does in practice is make further (often implicit) assumptions about $X$ and its probability distribution. In the pessimistic extreme, if all one knows is that $X$ is generated by some stochastic process that does not involve inordinate amounts of computation, then one can hope that a good cryptographic hash function mixes things up enough to extract all of the entropy that $Q$ has room for (because if it didn't, it wouldn't be a "good" hash function after all).
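As a sketch of that practice (SHA-256 is one standard choice; truncating by reduction mod $m$ is my own simplification):

```python
import hashlib

m = 2**16   # size of Q

def h(x: bytes) -> int:
    """Hash into Q = {0, ..., m-1} via SHA-256.  For inputs whose
    distribution has no special structure, a cryptographic hash is
    expected to retain entropy up to the log2(m) ceiling that |Q|
    imposes -- about 16 bits here."""
    digest = hashlib.sha256(x).digest()
    return int.from_bytes(digest, "big") % m

print(h(b"some input"))   # a deterministic value in [0, m)
```

Note that this expectation is heuristic: no fixed function can beat the adversarial bound above; the hope is merely that naturally occurring distributions are not adversarial with respect to a cryptographic hash.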

  • Then the best I can do is study the state of practice more carefully, and digest the related fields later if time and skill permit. Well, I have been warned. Thanks for the response. (2012-06-11)