0
$\begingroup$

I'm not very mathematically oriented (working on fixing that) so excuse me if this is basic :)

I need to calculate "fragmentation ratio" - the best name I could come up with the thing I'm trying to achieve.

I have 400 different lists where each list contains 20 named items (with some meta data). Same name item can exist in multiple lists. I calculated that in my dataset there's 326 unique items among those 8000 items (400*20).

So where I stumble is that how do I calculate a meaningful value for fragmentation of items between those lists?

--- edit ---

It looks like I was even more vague and hard to understand than I thought.

I researched this a bit more and I think better term for this would be similarities between lists and calculating that from the collection.

Here's a small example of the data structures:

List 1: ['this','is','a','list'] List 2: ['is','it'] List 3: ['yes,'it','is','list'] 

Now what I would like to quantify is how much similarities these arrays share.

Does this explain my problem any better? :)

  • 0
    I added a bit more to hopefully make it more understandable.2012-12-21

1 Answers 1

0

One option would be to define similarity as the number of common items between two lists. Does that do what you want?

  • 0
    I thought of that but it seems highly inefficient when you have millions of lists. Hmm, however with map/reduce I could make it run in parallel which would then make it actually feasible solution.2012-12-22