
I have a collection of more than 400,000 word-pairs. Each word-pair has an association strength, which is a measure of how related the two words are to each other (as in cow-milk). Each word-pair also has a word-frequency, which is how often the two words occur in normal conversation. The two variables are correlated: as association strength increases, word-frequency increases roughly linearly.

I want to study the effects of association strength and take word-frequency out of the picture. This means I have to correct for word-frequency, so I want to construct a subset of about 1,000 word-pairs that retains the distribution of the association strength but is no longer correlated with word-frequency. I know such a subset exists, but I am at a loss as to how to construct it efficiently.
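To make the goal concrete, here is a minimal sketch of one possible approach, not an established method. It draws a sample stratified by quantile bins of association strength, so the subset roughly keeps that distribution. It then swaps items within the same bin whenever the swap lowers the absolute Pearson correlation with word-frequency. The function name `decorrelated_subset`, its parameters (`n_bins`, `max_iter`, `tol`) and the choice of NumPy are all assumptions made for illustration.

```python
import numpy as np

def decorrelated_subset(strength, freq, k=1000, n_bins=20,
                        max_iter=20000, tol=0.01, seed=0):
    """Hypothetical helper: return (indices, |r|) for a subset of about k items."""
    rng = np.random.default_rng(seed)
    strength = np.asarray(strength, dtype=float)
    freq = np.asarray(freq, dtype=float)
    n = len(strength)

    # Quantile bins of association strength; bin labels run from 0 to n_bins - 1.
    inner_edges = np.quantile(strength, np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.digitize(strength, inner_edges)
    members = [np.flatnonzero(bins == b) for b in range(n_bins)]

    # Stratified start: each bin contributes in proportion to its size,
    # so the subset size can differ slightly from k because of rounding.
    idx = np.concatenate([
        rng.choice(m, size=min(len(m), round(k * len(m) / n)), replace=False)
        for m in members if len(m) > 0
    ])
    selected = np.zeros(n, dtype=bool)
    selected[idx] = True

    def abs_corr(i):
        return abs(np.corrcoef(strength[i], freq[i])[0, 1])

    best = abs_corr(idx)
    for _ in range(max_iter):
        if best < tol:
            break
        pos = rng.integers(len(idx))
        # Candidate replacement from the same strength bin as the item at pos.
        candidate = rng.choice(members[bins[idx[pos]]])
        if selected[candidate]:
            continue
        trial = idx.copy()
        trial[pos] = candidate
        r = abs_corr(trial)
        if r < best:  # keep the swap only if it reduces |correlation|
            selected[idx[pos]] = False
            selected[candidate] = True
            idx, best = trial, r
    return idx, best
```

Because swaps only happen inside a bin, the histogram of association strength over the bins is unchanged by the search. Note that a sample Pearson correlation near zero only removes the linear relationship, and that the subset is selected on word-frequency, so it is no longer a simple random sample.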

  • Maybe the phrasing of the question is not very clear; any suggestions on rephrasing it are also greatly appreciated. (2012-01-20)
