What's up, guys? I was going to post this on MathOverflow initially, but got scared off by their FAQ, haha.
Apologies in advance if I butcher any of the terminology; I majored in math in undergrad but a lot of that knowledge left me like water through a sieve.
So, what I've been playing with are similarity calculations for n-dimensional vectors. Specifically, cosine similarity. What I start out with are multiple arrays/matrices/sets. For example:
SetN = { red, white, blue }
Set1 = { 30, 25, 25 }
Set2 = { 20, 18, 6 }
Where the individual elements represent frequency, e.g. for Set1, there were 30 instances of red, 25 instances of white, and 25 instances of blue, and so on.
The examples above are simplified; the actual sets I'm working with have exponential distributions and other complications, which raise other questions I'll save for StackExchange. At the moment, though, what I want to figure out is whether I'm calculating the cosine similarity correctly, because with certain sets I get answers that look plain wrong.
SetX = { 20, 0, 0 }
SetY = { 20, 20, 20 }
Norm = |( red, white, blue )| = sqrt( red^2 + white^2 + blue^2 )
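For concreteness, here's a quick sketch of that norm calculation in plain Python (the names `set_x`, `set_y`, and `norm` are just mine for illustration):

```python
import math

set_x = [20, 0, 0]
set_y = [20, 20, 20]

def norm(v):
    # Euclidean (L2) norm: sqrt(red^2 + white^2 + blue^2)
    return math.sqrt(sum(c * c for c in v))

print(norm(set_x))  # 20.0
print(norm(set_y))  # sqrt(1200) ≈ 34.641
```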
Then I use the norm to convert the sets to unit vectors by dividing each element in the set by the norm:
uvX = { 1, 0, 0 }
uvY = { 1/sqrt(3), 1/sqrt(3), 1/sqrt(3) }
Then I take the dot product to get the cosine similarity, which comes out to 1/sqrt(3) ≈ 0.577, i.e. about 57.7%.
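The whole pipeline can be sketched end to end like this (a minimal Python version; `cosine_similarity` is just my illustrative name, not a library function):

```python
import math

set_x = [20, 0, 0]
set_y = [20, 20, 20]

def cosine_similarity(a, b):
    # dot product of the raw vectors...
    dot = sum(x * y for x, y in zip(a, b))
    # ...divided by the product of their norms
    # (equivalent to normalizing each to a unit vector first)
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity(set_x, set_y))  # ≈ 0.57735, i.e. 1/sqrt(3)
```

So at least mechanically, the arithmetic matches what I did by hand.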
The thing is, even accounting for the fact that I normalized everything to unit vectors, 20 units of red in Set X doesn't seem 57.7% similar to Set Y's 20 red, 20 white, and 20 blue.
It seems like the similarity figure is a bit too large. Are my calculations botched? Or is there some kind of mental illusion I'm failing to see here?