1

Suppose I have two text articles (article 'a' with x words, article 'b' with y words).

I find the total number of words in 'a' = x and the total number of occurrences of the word "the" in the article = x1.

The probability of the word 'the' in article 'a' is then x1/x. (I have stored this as a decimal, so I don't have the exact x1 and x at a later stage.)

Now I want to merge (add?) this with the probability of the same word 'the' in article 'b', which is y1/y.

How should I do that? I suppose simply adding them would be wrong.

At a later stage I will get more articles 'c', 'd', ... and I want to keep the probability updated. How should I do this?

Thank you.

2 Answers

3

You need to keep track of the total size of the articles that you've processed to date. Suppose that $p$ is the probability that a randomly chosen word in the articles processed so far is 'the', and those articles contain altogether $n$ words. You now get a new article with $k$ instances of 'the' in $m$ words. You should then update $p$ to $\frac{pn+k}{n+m}$ and $n$ to $n+m$ in preparation for processing the next article.
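The update rule above can be sketched in Python as follows. This is only an illustration under stated assumptions: the class name `RunningProportion`, its method `update`, and the article sizes in the usage lines are hypothetical and not part of the original answer.

```python
from dataclasses import dataclass


@dataclass
class RunningProportion:
    """Running proportion of one word over all articles processed so far.

    Hypothetical helper: p and n mirror the symbols in the answer above.
    """
    p: float = 0.0  # proportion of the word in the articles seen so far
    n: int = 0      # total number of words in the articles seen so far

    def update(self, k: int, m: int) -> float:
        """Fold in a new article with k occurrences of the word among m words."""
        if m < 0 or not 0 <= k <= m:
            raise ValueError("need 0 <= k <= m")
        if m == 0:
            return self.p  # an empty article changes nothing
        # p <- (p*n + k) / (n + m), then n <- n + m
        self.p = (self.p * self.n + k) / (self.n + m)
        self.n += m
        return self.p


# Made-up counts for illustration only.
tracker = RunningProportion()
tracker.update(k=6, m=100)    # article 'a': 6 of 100 words
tracker.update(k=10, m=300)   # article 'b': 10 of 300 words
print(tracker.p, tracker.n)
```

Only the pair $(p, n)$ has to be stored between articles, so the same call works unchanged for articles 'c', 'd', and so on.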

3

If I understand correctly what you are doing, then you want your final result to be (total number of occurrences of "the") / (total number of words). So in this case, you would take

$$\dfrac{x_1 + y_1}{x + y}$$

Now if you didn't record $x, y, x_1, y_1$ separately, then I guess you'll have to recount them. Or you at least need to know the ratio of the number of words in one article to the number of words in the other article.
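One way to see why the article lengths matter, sketched with the same symbols: the pooled value is a length-weighted average of the two per-article proportions,

$$\dfrac{x_1 + y_1}{x + y} = \dfrac{x}{x+y}\cdot\dfrac{x_1}{x} + \dfrac{y}{x+y}\cdot\dfrac{y_1}{y}$$

As a made-up illustration (these numbers are not from the question), take $x = 100$, $x_1 = 6$, $y = 300$, $y_1 = 10$. The pooled value is $16/400 = 0.04$, whereas adding the two proportions gives $0.06 + 0.0333\ldots \approx 0.0933$ and taking their plain average gives about $0.0467$. Neither matches, because the weights $x/(x+y)$ and $y/(x+y)$ depend on the article sizes.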

  • 0
    You don't need to record *all* of $x,y,x_1,y_1$. For example, if you only know $x' = x_1/x$ and $y' = y_1/y$ then you can calculate $(x'x + y'y)/(x+y)$. 2011-07-10
  • 0
    Yuval is absolutely right. Knowing $x$ and $y$, or the proportion of $x$ to $y$, would suffice. 2011-07-10
  • 0
    @mixedmath At a later stage, I only have **x1/x** (as a decimal), **y1**, and **y**. You also say the other option is that I need to know the proportion of words in one article to the number of words in the other article. How does this work when I have a 3rd, 4th, ... and more articles and have to update the probability?? Thanks. 2011-07-10
  • 0
    @Phoenix: Being able to do it with 2 is all you need. Just treat the first 2 articles as a single article and then rinse, wash, and repeat. What does having a second question mark mean at the end of an interrogative?? Is it intoned differently? ;p 2011-07-10
  • 0
    @mixedmath: Sorry, I did not understand what you meant by your question. Thanks for the answer though :) 2011-07-10
  • 0
    @Pheonix: No problem! I've also just noticed that I misspelt your name the first time I wrote it. I'm sorry - 2011-07-10
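The point made in the comments, that merging two articles is enough and any further articles can be folded in one at a time, can be sketched in Python. This assumes each article's word count is kept alongside its stored decimal proportion. The function name `merge` and the sample values are hypothetical.

```python
from functools import reduce


def merge(a, b):
    """Merge two (proportion, word_count) pairs into one pooled pair.

    Implements (x'x + y'y) / (x + y) from the comment above, where
    x' and y' are the stored proportions and x, y the word counts.
    """
    p_a, n_a = a
    p_b, n_b = b
    total = n_a + n_b
    if total == 0:
        return (0.0, 0)  # two empty articles: nothing to pool
    return ((p_a * n_a + p_b * n_b) / total, total)


# Made-up (proportion, word_count) pairs for articles 'a', 'b', 'c'.
articles = [(0.06, 100), (10 / 300, 300), (0.05, 200)]

# Treat the first two as a single article, then merge in the next, and so on.
pooled = reduce(merge, articles)
print(pooled)
```

Because the merged result is again a (proportion, word count) pair, it can be stored and merged with the next article whenever one arrives.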