1
$\begingroup$

I need to design a hash function such that it is possible to enumerate all possible 'candidate' input strings (preimages) for a given hash (image). It must otherwise be a normal hash function [i.e. small differences in input (preimage) --> large difference in hash (image) ]

Does this already exist? If so, what's it called, and if not, how would I go about designing one?

Cheers, a.

  • 0
    By a hash function, I assume you mean a function from the set of arbitrary length strings from a fixed alphabet, to a fixed subset of $\mathbb{N}$? I.e. a function $\bigcup_{n=0}^\infty A^n \rightarrow \{1,\dots,N\}$ where $A$ is a finite set of characters?2012-05-31

2 Answers 2

1

Here's an idea. First define the shift map $f$ by

$$f(n) = 2n\,({\rm mod}\,N)$$

for some large prime $N$. The hash function works as follows:

  • Convert every character to its ASCII value
  • Map the string $s$ to an integer by summing the ASCII values: $g(s) = \sum_{i=0}^{n-1} s_i$
  • Apply the shift map $M$ times

Does this satisfy the common properties of being a hash function?

  • The shift map is ergodic, so it distributes values uniformly throughout the range $\{1\dots N\}$
  • The shift map exhibits sensitive dependence on initial conditions, so small changes in the input can result in large changes in the output (although, see my note later).

However, it is relatively easy to invert the shift map. We exploit the fact that $\mathbb{Z}/N\mathbb{Z}$ is a field for $N$ prime, which means that division is possible. Therefore given a hash value $h$, we compute $g=h/2^M$ in the field $\mathbb{Z}/N\mathbb{Z}$. We now know that any string whose ASCII values sum to a member of the set

$$\{g, g+N, g+2N, \dots \}$$

will be hashed to $h$. The problem is then to enumerate all strings whose ASCII values sum to members of this set, which is a much easier task.

Note: This answer isn't complete. One obvious deficiency is that two strings whose ASCII values sum to the same number will always hash to the same value - therefore we can make small changes to a string that will guarantee a hash collision. In particular, any permutation of a string will result in a hash collision, so "star", "rats" and "tars" will all hash to the same value. However, these are structured edits. A generic edit, such as randomly inserting, deleting or modifying a single character, will in general result in a different hash value.

Edit: One trivial modification that overcomes the above difficulty is to take a different map from String -> Int. Rather than simply summing values, we could consider

$$g(s) = \sum_{i=0}^{n-1}2^ia_i$$

where $n$ is the length of the string. Now, permuting the string will result in a different value of $g(s)$, so permutations will hash to different values. There may still be other problems with the approach, however.

  • 0
    This seems very promising, thank you! Haven't accepted it yet as I need to look at it properly - did *try* to up-vote, but don't have the 'rep' for it! :)2012-05-31
  • 0
    I'm liking this answer more and more... am very grateful for your spelling out of details! Would there be a way to use other information about the input string (e.g. initial size) to prune the enumeration? My outputs will be much smaller than the input, so I need a way to cut down the search space.2012-06-06
0

If your input is the same size as your output, you are looking for a trapdoor function, such as modular exponentiation (RSA). For RSA, since this is a permutation, there is exactly one preimage for each image.

If your input is larger than your output, then obviously the enumeration of candidate inputs is going to be exponential in the difference of sizes. The simplest way to do it is to truncate the output to whatever size you want, and enumerate all the possible values for the truncated part when recovering the preimage.

  • 0
    Many thanks for that. But surely this isn't really sensitive to initial conditions? If I strip off the last X bits, no alteration to them will cause a change in the hash...2012-05-31
  • 0
    You are truncating the *output* and working in a space with the same size as your input. This will of course cause collisions as in any hash function, but you are not ignoring any part of the input. And in case you worry about performance: this is not a problem since significant speedups in preimage enumeration (as compared to brute force) are only possible if the input is not much longer than the output, so you cannot lose too much performance by working in the input space.2012-05-31
  • 0
    I think I'm with you. Would there be a way to use other information about the input string (e.g. initial size) to prune the enumeration? My outputs will be much smaller than the input, so I need a way to cut down the search space.2012-06-06