1
$\begingroup$

Suppose I have a multiset of numbers. I'm interested in the number of unique n-tuples that can exist using the numbers from this multiset. Now of course a closed form is of interest here, but what I'm really after is some way to distinguish how "full" the set of unique n-tuples is, perhaps related to the information theory notion of entropy. The more repeats that exist in the multiset, the less full it is; the more unique numbers there are, the more unique n-tuples exist. For argument's sake, let's call this measure dimensionality.

Some motivation and explanation: I'm interested in tying parts of information theory back into run-time complexity analysis (and possibly reinventing the wheel along the way). When computer scientists think of algorithms for problems like sorting, they often consider questions like "How many comparisons need to be done to sort this multiset of numbers?" In that case, a comparison is analogous in some sense to a two-tuple. Many other problems, I believe, are also largely based on the number of unique n-tuples of a multiset. Often times these problems are approached with what I consider to be crude tools where worst-case and average-case are considered, and often little more detail is provided (or at least this what my limited college education has provided me with). Back to the sorting example for a second: there are certainly differences between sorting a multiset of 100000 unique numbers or 100000 identical numbers (to use a tad bit of hyperbole), but how do you measure the in-between cases? I believe that the measure provided here can potentially tie back into providing a more fine-grained approach to run-time complexity analysis.

In summary:

  • I'll accept any answer that gives some pre-existing measure of "dimensionality" as used in this context.
  • I'll give upvotes for closed forms of the number of unique n-tuples. I'm sure this is a well-studied problem by somebody, so I don't want to mark these as answers for a while in case a measure exists.
  • I'd love to hear if anybody has applied some notion of entropy to analyzing run-time complexities for given datasets.
  • 0
    Thanks! I had this vague recollection of generating functions but I had this notion that they wouldn't work for this scenario... Glad you fixed that!2012-06-18

1 Answers 1

1

Let $A$ be the underlying set of your multiset and $m(a)$ the multiplicity of $a\in A$. For $n=\sum_{a\in A} m(a)$ the relation is is direct (probably Tim Duff's question was oriented towards this). The number of distinct tuples $N$ is just: \begin{equation} N=\frac{n!}{\prod_{a\in A} np(a)!} \end{equation} where $p(a)=m(a)/n$. N has a nice relationship with the entropy of the induced probability distribution $A$ (abusing notation) \begin{equation} N2^{-nH(A)}\leq 1 \end{equation} See Cover and Thomas Ch 11.1 The method of types for two different proofs.

Then for $n<\sum_{a\in A} m(a)$ you have for all the $\mathcal P$ combinatorial probability distributions induced from a set of $n$ elements of the multiset:

\begin{equation} \sum_{P\in\mathcal P}N_P2^{-nH(P)}\leq1 \end{equation} where $N_P$ is the number of sequences for the distribution $P$. Then the following naive bound holds for the total number of sequences $N$: \begin{equation} 1\geq\sum_{P\in\mathcal P}N_P2^{-nH(P)}\geq2^{-nH(Q)}\sum_{P\in\mathcal P}N_P=N2^{-nH(Q)} \end{equation} that is: \begin{equation} N\leq2^{nH(Q)} \end{equation} where $Q$ is the most uniform distribution you can get taking $n$ elements from your multiset.

  • 0
    Well, dimensionality should, I think, scale from 1 to n, so essentially I'm removing the scaling by n. I have this thought that if we're dealing with n-tuples that "full" dimensionality should be n. However, a problem with my current approach is that, like entropy, I would expect a uniform pdf of single elements (all elements in the bag are unique) to yield a "full" dimensionality. But it wouldn't. It would only yield N = n C (cardinality of the underlying set). ... Hmmm... Anyways, thanks - I'll accept the answer now.2012-06-19