2

Let's say I have a series of 100 digits forming a number. 15 of those digits are always the same and in the same positions; the other 85 digits are each chosen at random from 1 to 5. I generate 10000 numbers this way. What would be the average lowest uniqueness of a number compared to every other number, where you start at 100% uniqueness and subtract 1% for each position at which the two numbers have the same digit?

Additional info on what I mean by average lowest uniqueness

By average lowest uniqueness I mean: what would be the average uniqueness of all numbers, each compared to the one it has the most in common with? It's kind of hard to explain, so here is an example. I generate 10000 numbers this way. I take number 1 and find that it has the most in common with number 7, being X% unique. I do this for all numbers and then take the average of all the X% values I get.
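The procedure described above can be sketched in Python as a brute-force check. This is only an illustration: the function names are made up here, the fixed digits are arbitrarily placed at the front, and the sample is kept to 100 numbers because every number is compared with every other one.

```python
import random

def uniqueness(a, b):
    """Percentage of positions at which two equal-length strings differ."""
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * (len(a) - matches) / len(a)

def average_lowest_uniqueness(strings):
    """For each string, take its uniqueness against its closest match, then average."""
    lowest = []
    for i, s in enumerate(strings):
        lowest.append(min(uniqueness(s, t) for j, t in enumerate(strings) if j != i))
    return sum(lowest) / len(lowest)

def generate(count, length=100, fixed=15, digits="12345", seed=0):
    """Random digit strings; the first `fixed` positions are identical (an assumption)."""
    rng = random.Random(seed)
    prefix = "1" * fixed
    return [prefix + "".join(rng.choice(digits) for _ in range(length - fixed))
            for _ in range(count)]

# Small sample only: the number of comparisons grows with the square of the count.
print(average_lowest_uniqueness(generate(100)))
```

With the full 10000 numbers the same loop would need about 100 million comparisons, so a pure-Python run would be slow.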

Real life application

The real-life application, and the reason I ask, is this. I write a 100-word text, and for each word that has synonyms I list them, for example: This is a {very,extremely} {difficult,complicated,hard} {question,equation}. I then generate 10000 texts from this with software, which each time picks one of the options between the brackets, and I put them online. Google will try to judge how unique each text is compared to all the text that is online. I want, for example, each text to be at least 60% unique compared to any text found online (only my own texts will really be a factor in this, seeing as I'm writing a 100-word text). I want to get an idea of how many synonyms and what text length I need to aim for, given that I want at least 60% uniqueness and want to generate 5000, 10000 or even 20000 texts. If I can get an idea of how it's calculated, or what the value is for my example, I can roughly guess how long the text must be and how many synonyms I'll need in case I generate 1000, 5000 or even 20000 texts.

  • 0
    @henry Where in your equation is the factor for the 10000 generated numbers? You don't just generate 2 numbers and compare them; you generate 10000 numbers and then compare each one to the number it has the most in common with. 2012-09-05

1 Answer

0

Repeating my comment: comparing any two strings, $15$ digits are definitely the same, and the number of other matches has a binomial distribution with parameters $n=85$ and $p=\frac15$, so the expected number of matches is $15+17=32$. In your scoring system this is 68% uniqueness, and it is the average uniqueness of one string compared with one other randomly chosen string.
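Spelling out the arithmetic behind those figures (the symbol $M$ for the total number of matching positions is introduced here only for convenience):

$$E[M] = 15 + np = 15 + 85\cdot\tfrac15 = 32, \qquad 100 - E[M] = 68.$$

The random part has variance $np(1-p) = 85\cdot\tfrac15\cdot\tfrac45 = 13.6$, that is, a standard deviation of about $3.7$ matches, which is why the closest of many comparisons can sit well below 68%.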

From the binomial distribution, you can calculate the probability that the number of matches in all $9999$ comparisons with a given string is less than or equal to a particular value, and thus calculate the expected "lowest uniqueness" for a given string. Using R you could try something like:

    > 100 - 15 - sum( diff(c(0, pbinom(0:85, size=85, prob=1/5)^9999 ) ) * (0:85) )
    [1] 52.65595
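For readers without R, here is a rough Python equivalent of that one-liner, using only the standard library. The function name and its parameters are made up for this sketch; like the R expression, it treats the comparisons as independent. For the question's values it should come out near the 52.66 reported by R.

```python
from math import comb

def expected_lowest_uniqueness(comparisons, length=100, fixed=15, choices=5):
    """Expected uniqueness (in %) against the closest of `comparisons` other strings.

    Assumes the match counts of the individual comparisons are independent.
    """
    n = length - fixed            # positions that vary at random
    p = 1.0 / choices             # chance that one random position matches
    expected_max, cdf, prev = 0.0, 0.0, 0.0
    for k in range(n + 1):
        cdf = min(1.0, cdf + comb(n, k) * p**k * (1 - p)**(n - k))
        cur = cdf ** comparisons  # P(largest match count <= k)
        expected_max += k * (cur - prev)
        prev = cur
    return 100.0 * (n - expected_max) / length

print(expected_lowest_uniqueness(9999))
```

Changing `comparisons`, `length`, `fixed` or `choices` gives a quick way to explore other text lengths, synonym counts and numbers of generated texts, under the simplifying assumption that every variable position has the same number of equally likely options.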

You cannot legitimately repeat this for all $49995000$ possible string comparisons, as their uniquenesses are not independent: if strings $A$ and $B$ are close together then string $C$ is more likely to be further away from both of them than if $A$ and $B$ were further apart. But in practice this is unlikely to have a major effect here and, for what it is worth, replacing $9999$ by $49995000$ in the expression above would change the result to about 45.1, a reasonable approximation to the expected "lowest uniqueness" across all comparisons of the $10000$ strings.
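One way to get a feel for how little the dependence matters is to compare the independence-based formula with a direct simulation at a much smaller scale. The Python sketch below does that; the function names, the sample size of 120 strings and the 8 trials are arbitrary choices made to keep the run short, not values from the question.

```python
import random
from itertools import combinations
from math import comb
from operator import eq

LENGTH, FIXED, CHOICES = 100, 15, 5   # values from the question
FREE = LENGTH - FIXED                 # the 85 randomly chosen positions

def formula_lowest(comparisons):
    """Expected lowest uniqueness if the match counts were independent."""
    p = 1.0 / CHOICES
    expected_max, cdf, prev = 0.0, 0.0, 0.0
    for k in range(FREE + 1):
        cdf = min(1.0, cdf + comb(FREE, k) * p**k * (1 - p)**(FREE - k))
        cur = cdf ** comparisons      # P(largest match count <= k)
        expected_max += k * (cur - prev)
        prev = cur
    # With LENGTH = 100, one position is one percentage point.
    return LENGTH - FIXED - expected_max

def simulated_lowest(count, trials, seed=0):
    """Average over several trials of the lowest uniqueness among all pairs."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Only the random positions are simulated; the fixed ones always match.
        strings = [[rng.randrange(CHOICES) for _ in range(FREE)] for _ in range(count)]
        most_matches = max(sum(map(eq, a, b)) for a, b in combinations(strings, 2))
        total += LENGTH - FIXED - most_matches
    return total / trials

count = 120                           # hypothetical small size
pairs = count * (count - 1) // 2
print(formula_lowest(pairs), simulated_lowest(count, trials=8))
```

If the two printed numbers are close, that supports using the formula with the total number of pairs as an approximation; a handful of trials at this size is only a rough check, not a proof.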

  • 0
    [R](http://www.r-project.org/) is easy and free and powerful. 2012-09-06