1

Suppose I have two text articles (article 'a' with x words, article 'b' with y words).

I find the total number of words in 'a', which is x, and the total number of occurrences of the word "the" in the article, which is x1.

The probability of the word 'the' in article 'a' is x1/x. (I have stored this as a decimal, so I don't have the exact x1 and x at a later stage.)

Now I want to merge (add?) the probability of the same word 'the' in article 'b', which is y1/y.

How should I do that? I suppose simply adding will be wrong.

At a later stage I will get more articles 'c', 'd', ... and I want to keep the probability updated. How should I do this?

Thank you.

2 Answers

3

You need to keep track of the total size of the articles that you've processed to date. Suppose that $p$ is the probability that a randomly chosen word in the articles processed so far is 'the', and those articles contain altogether $n$ words. You now get a new article with $k$ instances of 'the' in $m$ words. You should then update $p$ to $\frac{pn+k}{n+m}$ and $n$ to $n+m$ in preparation for processing the next article.
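The update rule above can be sketched in a few lines of Python. This is only a minimal illustration, not a definitive implementation: the class name `RunningProbability` and its `update` method are made up here, while `p`, `n`, `k` and `m` follow the symbols used in this answer.

```python
from dataclasses import dataclass


@dataclass
class RunningProbability:
    """Hypothetical helper: tracks p (probability so far) and n (words so far)."""

    p: float = 0.0  # fraction of processed words that are 'the'
    n: int = 0      # total number of words processed so far

    def update(self, k: int, m: int) -> None:
        """Fold in a new article with k instances of the word in m words."""
        if m < 0 or k < 0 or k > m:
            raise ValueError("need 0 <= k <= m")
        if self.n + m == 0:
            return  # nothing processed yet, nothing to update
        # p*n recovers the count so far; add the new count, divide by the new total
        self.p = (self.p * self.n + k) / (self.n + m)
        self.n += m


rp = RunningProbability()
rp.update(k=5, m=100)    # article 'a': 5 of 100 words
rp.update(k=20, m=300)   # article 'b': 20 of 300 words
print(rp.p, rp.n)
```

With these made-up counts the result is the pooled fraction 25/400, not the sum 0.05 + 0.0667 of the two per-article probabilities. Only `p` and `n` need to be stored between articles.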

3

If I understand correctly what you are doing, then what you want in the end is (total number of occurrences of "the") / (total number of words). So in this case, you will take

$\dfrac{x_1 + y_1}{x + y}$

Now if you didn't record $x, y, x_1, y_1$ separately, then I guess you'll have to recount them. Or you at least need to know the ratio of the number of words in one article to the number of words in the other article.
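To see why knowing only the relative sizes is enough, write $p_a = x_1/x$, $p_b = y_1/y$ and $r = y/x$ (these three names are shorthand introduced here, not from the question). Substituting $x_1 = p_a x$ and $y_1 = p_b y = p_b r x$ gives

$$\frac{x_1 + y_1}{x + y} = \frac{p_a x + p_b r x}{x + r x} = \frac{p_a + r\,p_b}{1 + r}.$$

For instance, with the made-up values $p_a = 0.05$, $p_b = 0.02$ and $r = 3$, this gives $\frac{0.05 + 0.06}{4} = 0.0275$, which matches $\frac{5 + 6}{100 + 300}$ for articles of 100 and 300 words.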

  • 0
    @Pheonix: No problem! I've also just noticed that I misspelt your name the first time I wrote it. I'm sorry. (2019-02-14)