I'm looking for help explaining the distribution of leading digits in integers that have been converted from randomly generated hexadecimal values.
If I construct a set of one million random hexadecimal values via a SHA-256 hash in Python 2 (with hashlib already imported):
# Returns a 64-character random hex value based on some seed
def hash(data):
    return hashlib.sha256(data).hexdigest()
# Get a testRange size set of random hexadecimal values
testRange = 1000000 # test with 1 mil
randomHexes = [hash(str(x)) for x in xrange(testRange)]
And look at the distribution of leading digits via this rather crude function:
def showDistribution(dataSet, leadingEntries):
    size = float(len(dataSet))
    for l in leadingEntries:
        # Entries whose string form starts with the character l
        matches = [x for x in dataSet if len(str(x)) > 0 and str(x)[0] == str(l)]
        percentage = (len(matches)/size) * 100
        print("{0}: {1}%".format(l, percentage))
# The possible values that could be a hex leading character
hexValues = map(str, range(1,10)) + ["a", "b", "c", "d", "e", "f"]
print("Distribution of hex leading character:")
showDistribution(randomHexes, hexValues)
The leading hex character is evenly distributed, as expected.
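As a cross-check on the crude function above, the same tally can be sketched in a single pass with collections.Counter. The helper name leading_char_percentages and the smaller sample size of 10,000 are my own choices for illustration, and the seed is encoded to bytes so the snippet should also run on Python 3. Note that this version also reports a leading "0", which the hexValues list above leaves out.

```python
import hashlib
from collections import Counter

def leading_char_percentages(values):
    """Return {leading character: percentage} over the string form of each value."""
    strings = [str(v) for v in values]
    # Count the first character of every non-empty string in one pass
    counts = Counter(s[0] for s in strings if s)
    total = float(len(strings))
    return dict((char, 100.0 * n / total) for char, n in counts.items())

# Smaller sample than the one million used above, to keep the sketch quick
sample = [hashlib.sha256(str(x).encode("utf-8")).hexdigest() for x in range(10000)]
hex_percentages = leading_char_percentages(sample)
for char in sorted(hex_percentages):
    print("{0}: {1:.2f}%".format(char, hex_percentages[char]))
```

With a uniform leading hex character, each of the 16 entries should come out near 6.25%.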
If I then convert those hex values to integers, and examine the leading digit in each of them:
integerVersion = [int("0x" + h, 16) for h in randomHexes]
print("\nDistribution of integer leading digit:")
showDistribution(integerVersion, map(str, range(1,10)))
I get a much higher percentage of values whose leading digit is 1. Output:
Distribution of integer leading digit:
1: 23.2646%
2: 9.5473%
3: 9.5823%
4: 9.57%
5: 9.5738%
6: 9.638%
7: 9.6136%
8: 9.5718%
9: 9.6386%
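To check that this is not just sampling noise, here is a rough sketch that counts the leading digits exactly instead of sampling. It assumes the SHA-256 digests behave like integers drawn uniformly from [0, 2**256); the helper name leading_digit_fractions is mine and not part of the code above.

```python
def leading_digit_fractions(upper):
    """Exact share of the integers in [1, upper) whose decimal form starts with each digit 1..9."""
    counts = dict((d, 0) for d in range(1, 10))
    power = 1
    while power < upper:
        for d in range(1, 10):
            low = d * power          # smallest integer of this length starting with d
            high = (d + 1) * power   # one past the largest such integer
            # Clip the block [low, high) to the range below upper
            counts[d] += max(0, min(high, upper) - low)
        power *= 10
    total = float(upper - 1)
    return dict((d, counts[d] / total) for d in range(1, 10))

# Assumption: a 64-character hex digest is a uniform integer below 2**256
expected = leading_digit_fractions(2 ** 256)
for digit in range(1, 10):
    print("{0}: {1:.2f}%".format(digit, 100 * expected[digit]))
```

Under that assumption the exact share comes out at about 23.2% for a leading 1 and about 9.6% for each of the other digits, which is in line with the sampled output above, so the skew appears to be a property of the range itself.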
Now if I look at the distribution of the hex values that ended up mapping to an integer whose leading digit is 1:
matching = [randomHexes[i] for i in xrange(len(randomHexes)) if str(integerVersion[i])[0] == "1"]
showDistribution(matching, hexValues)
I get an odd distribution of leading hex characters among the values whose integer form starts with a 1:
1: 16.6153727122%
2: 20.5436586058%
3: 0.0%
4: 0.0%
5: 0.0%
6: 0.0%
7: 0.0%
8: 0.0%
9: 0.0%
a: 0.0%
b: 0.0%
c: 0.0%
d: 4.86232301437%
e: 27.008846058%
f: 26.8094014081%
Why is this?