5
$\begingroup$

This ties back to a SO question, where someone wanted to increase performance of an algorithm, and part of the solution was "test the most likely situation first". The question involved checking for the presence of one or more of a particular subset of base-10 digits, and so the answer involved first testing that each digit wasn't in the subset, because that was more likely.

This got me thinking. In the infinite set of all natural base-10 numbers, is any single digit more prevalent than any other? And a corollary: is the answer to that question different for a different number base (such as binary, octal, or hexadecimal)?

The common-sense answer to both questions would be "no". However, a simple exercise in binary numbers would demonstrate that the digit 1 is more common than the digit 0 in the significand of a fixed set of binary digits:

0 1 10 11 //maximum value of 2 bits; there have been 4 1s and 2 0s. 100 101 110 111 //maximum value of 3 bits; 12 1s and 6 0s have occurred. 1000 1001 1010 1011 1100 1101 1110 1111 //maximum value of four bits; 32 1s and 18 0s have occurred. 

I know enough about math to know the common-sense answer is often wrong; for instance, trans-infinite numbers defining differing cardinalities of "infinite" sets. There are also pseudo-rules that look like a sure thing until the problem space becomes large; you can already see just through four bits that, while 1 is more common than 0, 0 is occurring more often as the number of bits in the significand increases. So, I thought I'd ask and see if anyone else could prove or disprove it in the general case.

  • 3
    That depends on what you mean by "more."2012-02-23
  • 8
    Sounds a little like [Benford's Law](http://en.wikipedia.org/wiki/Benford%27s_law).2012-02-23
  • 0
    @draks: good reference. However, that article says it isn't always true.2012-02-23
  • 2
    Your listing has hidden some $0$'s. For example, the byte $00001101$ would represent "thirteen." In your count, you eliminated these leading $0$'s. The lack of uniformity is entirely due to this sort of thing. In the same way, in the numbers $0$ to $9999$, there aren't "enough" $0$'s, but the rest of the digits are fine.2012-02-23
  • 1
    But I state that I'm talking about the set of all natural numbers. Natural numbers are neither said nor written with leading zeroes, and so I structured the example in a similar way.2012-02-23
  • 0
    @André: That's only true if the values are uniformly distributed up to a power of two. The lack of uniformity of digits that occur in practice comes about because numbers are rarely distributed like that.2012-02-23
  • 0
    @Keith: "that article says it isn't always true": The article talks about the real-life distribution of digits. Your question doesn't make sense as stated because, as Qiaochu points out, you haven't defined what you mean by "more" in the context of *all* natural numbers. However, if you're interested in the distribution of numbers actually occurring in practice, then Benford's Law is often quite relevant.2012-02-23
  • 0
    @joriki: The OP writes "in the set of all natural numbers," and not "in the set of naturally occurring numbers."2012-02-23
  • 1
    Besides, if you DO count leading zeroes, then for any finite N binary digits there would be exactly N(N-1)/2 zeroes that are not part of the significand (the space in which 1 cannot occur by definition), and less than N(N-1)/2 ones in the significand (because zeroes also occur in the significand), which would prove that zero occurs more often than one. However, this is flawed; natural numbers are not stated in terms of a maximum possible order of magnitude; only the maximum significant order of magnitude for the specific number.2012-02-23
  • 1
    @KeithS: It is fine if you do not want to count leading $0$'s. But, at least for numbers in the interval $0\le i\le 2^k-1$, the total number of $0$'s is exactly the same as the total number of $1$'s, by a symmetry argument. Small example, $k=3$. We have $000$ paired with its $1$'s complement $111$, $001$ paired with $110$, $010$ paired with $101$, and so on.2012-02-23
  • 0
    except that 000 isn't 000, it's 0, and 010 isn't 010, it's 10. That's my point; the number of digits occurring in all natural binary numbers M that can be expressed in any N digits is not equal to the space N*M (it would be the area under the curve log2(X) over 0<=x<=M, such that log2(M)==N), so whatever symmetry occurs in the rectilinear space N*M does not indicate symmetry in the space bounded by log2(x).2012-02-23
  • 0
    @KeithS: One thing to keep in mind is that it's important to distinguish between notation and what is being notated. By any reasonable definition of "the decimal digit in the thousands place", the number seventeen has such a thing, and it is 0. The digit strings '000' and '0' both denote the same number, despite being different strings. In fact, there is another layer here: the way decimal numerals (and notation for them) are usually defined, the two different strings of characters '000' and '0' actually both denote the same *decimal numeral* of infinitely many zeros (and nothing else).2012-02-23
  • 0
    Fair point, @Hurkyl. So, the set of numbers I am talking about are the set of unique ascending natural numbers, as displayed in the "natural" manner; that is, including all (and only) digits up to and including the highest-magnitude nonzero digit. Again, as was mentioned earlier (though slightly incorrectly), if we count leading zeroes up to a finite maximum of N digits, then there will be N-logb(X) zeroes in any value X, and so zero would have an extreme preponderance among the first X natural numbers when displayed in this way.2012-02-23

2 Answers 2

2

If you order them, 0.1234567891011..., then this is Champernowne's constant, which is normal (a much stronger condition of uniformity) in base-10.

  • 1
    I think normality is actually _less_ strong than the OP is talking about; it only says that all digit patterns occur with asymptotically equal frequency, but it's easily possible for this to happen while one digit always has more occurrences 'so far' than another. (Consider the string AABABAB...; then the asymptotic frequency of As and Bs is identical, but any initial substring always has more As than Bs.)2012-02-23
  • 0
    Except that we're considering the general case, in which there is no "so far". The article states it's proven that for base 10 in the general case, all digits occur with equal frequency, which is the answer to my question in base 10. Now the corollary is, will you get a different answer for any other base? The article says that is unknown (not proven at least).2012-02-23
  • 2
    KeithS: there's _always_ a 'so far' - none of these concepts considering the entire list of numbers at once; they're all limiting processes. The 'equal frequency' just says that the limit of ('# of As in the first n symbols'/n) is the same as the limit of ('# of Bs in the first n symbols'/n), but it's still possible to have this happen while '# of As in the first n symbols' is greater than '# of Bs in the first n symbols' for all n. It does come down to your definition of 'more prevalent', but there are multiple valid interpretations of that phrase.2012-02-23
5

For finite intervals you will have a lack of zeros as you have seen. If you go up to $b^n-1$ in base $b$ the rest of the digits will be even. The discrepancies get fractionally smaller as you go up. In the interval $[b^n,b^{n+1}-1]$ you will have $n(b^{n+1}-b^n)$ digits total. All but the leading digit will be equidistributed, giving $(n-1)(b^n-b^{n-1})$ of each. The leading digit will contribute an additional $(b^n-b^{n-1})$ of all the digits except $0$. The fractional excess is then $\frac 1{n-1}$