I'm looking for help explaining the distribution of leading digits in integers that have been converted from randomly generated hexadecimal values.

If I construct a set of one million random hexadecimal values via SHA-256 hashes in Python:

import hashlib

# Returns the 64-character hex digest of the SHA-256 hash of a seed string
def sha256_hex(data): return hashlib.sha256(data.encode()).hexdigest()

# Get a testRange size set of random hexadecimal values
testRange = 1000000 # test with 1 mil
randomHexes = [sha256_hex(str(x)) for x in range(testRange)]

And look at the distribution of leading digits via this rather crude function:

def showDistribution(dataSet, leadingEntries):
    size = float(len(dataSet))
    for l in leadingEntries:
        matches = [x for x in dataSet if len(str(x)) > 0 and str(x)[0] == str(l)]
        percentage = (len(matches)/size) * 100 
        print("{0}: {1}%".format(l, percentage))

# The leading hex characters to tally (a leading 0 is also possible but is not tallied here)
hexValues = list(map(str, range(1, 10))) + ["a", "b", "c", "d", "e", "f"]
print("Distribution of hex leading character:")
showDistribution(randomHexes, hexValues)

The leading digit character is evenly distributed, as expected.
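
The same tally can also be done in a single pass over the data. A minimal sketch using `collections.Counter`, run on a smaller sample to keep it quick; `leading_distribution` and `sample` are hypothetical names, not part of the code above:

```python
import hashlib
from collections import Counter

def leading_distribution(values):
    """Percentage of values starting with each character, in a single pass."""
    counts = Counter(str(v)[0] for v in values)
    total = float(sum(counts.values()))
    return {ch: 100.0 * n / total for ch, n in counts.items()}

# Smaller sample than the one million values used above, to keep the run short
sample = [hashlib.sha256(str(x).encode()).hexdigest() for x in range(100000)]

for ch, pct in sorted(leading_distribution(sample).items()):
    print("{0}: {1:.2f}%".format(ch, pct))
```

Unlike the list of leading characters passed to `showDistribution`, this version also reports a leading `0`, since it tallies whatever first characters actually occur.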

If I then convert those hex values to integers, and examine the leading digit in each of them:

integerVersion = [int("0x" + h, 16) for h in randomHexes]
print("\nDistribution of integer leading digit:")
showDistribution(integerVersion, map(str, range(1,10)))

I get a much higher percentage of values whose leading digit is 1. Output:

Distribution of integer leading digit:
1: 23.2646%
2: 9.5473%
3: 9.5823%
4: 9.57%
5: 9.5738%
6: 9.638%
7: 9.6136%
8: 9.5718%
9: 9.6386%

Now if I look at the distribution of the hex values that ended up mapping to an integer whose leading digit is 1:

matching = [h for h, n in zip(randomHexes, integerVersion) if str(n)[0] == "1"]
showDistribution(matching, hexValues)

I get an odd distribution of leading hex characters among the values whose integer form starts with a 1:

1: 16.6153727122%
2: 20.5436586058%
3: 0.0%
4: 0.0%
5: 0.0%
6: 0.0%
7: 0.0%
8: 0.0%
9: 0.0%
a: 0.0%
b: 0.0%
c: 0.0%
d: 4.86232301437%
e: 27.008846058%
f: 26.8094014081%

Why is this?

1 Answer


Your hex values are, in effect, chosen uniformly at random from the range $0$ to $2^{256} - 1$, and $2^{256} \approx 10^{77.06368} \approx 1.158 \times 10^{77}$. Let $h$ be one of your hex values, and look at the value of $d = \frac h{10^{77}}$. Note that $\frac {0.1}{1.158} \approx 0.086 = 8.6\%$.

If $0 < d < 0.1$, then the number will have 76 or fewer digits. The leading non-zero digit will be uniformly distributed among the nine non-zero digits, so this will contribute approximately $\frac {8.6\%}{9} \approx 0.96\%$ of occurrences to each of the nine digits.

If $\frac n{10} \le d < \frac {n+1}{10}$ for some digit $n > 0$, then $h$ has 77 digits, and the leading digit will be $n$. This contributes $8.6\%$ of occurrences to each of the nine digits.

If $1 \le d$, then $h$ will have 78 digits, the first of which will be 1. This accounts for about $\frac {0.158}{1.158} \approx 13.6\%$ of all occurrences.
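
The weights of these three cases can be checked with exact integer arithmetic. A minimal sketch, assuming the digests behave like uniform draws from $[0, 2^{256})$; the variable names are illustrative only:

```python
N = 2**256      # SHA-256 digests, read as integers, lie in [0, N)
BAND = 10**76   # width of each band n/10 <= d < (n+1)/10

# d < 0.1: the value has 76 or fewer digits
short_pct = 100.0 * BAND / N
# n/10 <= d < (n+1)/10: 77 digits with leading digit n (one band per digit)
per_digit_pct = 100.0 * BAND / N
# d >= 1: 78 digits, and the leading digit must be 1
long_pct = 100.0 * (N - 10 * BAND) / N

print("digits in 2**256: {0}".format(len(str(N))))
print("76 or fewer digits: {0:.2f}%".format(short_pct))
print("77 digits, per leading digit: {0:.2f}%".format(per_digit_pct))
print("78 digits: {0:.2f}%".format(long_pct))
```

The one short band, the nine 77-digit bands and the 78-digit band together cover the whole range, so the percentages add up to 100%.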

So, we have a $0.96\% + 8.6\% + 13.6\% \approx 23.2\%$ chance that the leading digit will be 1, and a $0.96\% + 8.6\% \approx 9.6\%$ chance that it will be any given one of the other eight digits.
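
The same counting argument also accounts for the lopsided hex distribution in the question. Each leading hex character corresponds to one block of width $16^{63}$, and only some of those blocks overlap a range of integers starting with 1. A sketch under the same uniformity assumption (`count_leading` and `BUCKET` are hypothetical names, not from the question's code):

```python
N = 2**256        # digests, read as integers, lie in [0, N)
BUCKET = 16**63   # width of the block of values sharing one leading hex character

def count_leading(d, lo, hi):
    """Count integers n with lo <= n < hi whose leading decimal digit is d."""
    total = 0
    power = 1  # runs over 1, 10, 100, ...
    while d * power < hi:
        # Integers with leading digit d and this many digits: [d*power, (d+1)*power)
        start = max(d * power, lo)
        stop = min((d + 1) * power, hi)
        if stop > start:
            total += stop - start
        power *= 10
    return total

# Share of each leading decimal digit over the whole range
print("Leading decimal digit:")
for d in range(1, 10):
    print("{0}: {1:.4f}%".format(d, 100.0 * count_leading(d, 0, N) / N))

# Among values whose decimal form starts with 1: share of each leading hex character
ones = count_leading(1, 0, N)
print("Leading hex character, given a leading decimal 1:")
for v in range(16):
    in_bucket = count_leading(1, v * BUCKET, (v + 1) * BUCKET)
    print("{0:x}: {1:.4f}%".format(v, 100.0 * in_bucket / ones))
```

Under these assumptions the first loop gives roughly 23.2% for a leading 1 and 9.6% for every other digit, and the second gives 0% for hex characters 3 through c, in line with the observed output. It also assigns roughly 4% to a leading hex `0`, which the question's `hexValues` list does not tally.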