
I have a collection of more than 400,000 word-pairs. Each word-pair has an association strength, which measures how related the two words are to each other (as in cow-milk). Each word-pair also has a word-frequency, which is how often the two words occur in normal conversation. The two variables are correlated: as association strength increases, word-frequency increases roughly linearly.

I want to study the effects of association strength and take word-frequency out of the picture. This means I have to correct for word-frequency, so I want to construct a subset of about 1,000 word-pairs which retains the distribution of the association strength but is no longer correlated with word-frequency. I know such a subset exists, but I am at a loss as to how to construct it efficiently.

  • 0
    To do this you need to tell us how association strength and word frequency are calculated. If both are essentially linear functions of counts of how often the two words appear together in the same sentence, then you may not be able to achieve your aim. 2012-01-20
  • 0
    I fail to see how this matters. The values have been calculated and are associated with the word-pairs. There is a correlation between *association strength* and *word-frequency*, but there is enough room to select pairs with a high *association strength* and relatively low *word-frequency* and vice versa. 2012-01-20
  • 0
    Maybe the phrasing of the question is not very clear; any suggestions on rephrasing it are also greatly appreciated. 2012-01-20
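
For illustration, one way such a subset might be constructed is sketched below. It is only a sketch under stated assumptions, not a method given in the question: it assumes the two values are available as plain numeric arrays, uses the Pearson correlation as the measure to drive to zero, and approximates "retains the distribution" by keeping the share of pairs in each quantile bin of association strength. The function name `decorrelated_subset` and all of its parameters are hypothetical. The idea is to draw a stratified sample over the strength bins and then hill-climb: swap a selected pair for an unselected pair from the same bin whenever that lowers the absolute correlation.

```python
import numpy as np

def decorrelated_subset(strength, freq, n_select=1000, n_bins=20,
                        n_iter=20000, seed=0):
    """Hypothetical helper: pick ~n_select indices whose strength histogram
    matches the full data while |Pearson r(strength, freq)| is pushed to 0."""
    rng = np.random.default_rng(seed)
    strength = np.asarray(strength, dtype=float)
    freq = np.asarray(freq, dtype=float)

    # Quantile bins on association strength (bin labels 0 .. n_bins-1).
    edges = np.quantile(strength, np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.digitize(strength, edges)
    members = [np.flatnonzero(bins == b) for b in range(n_bins)]

    # Stratified start: each bin contributes in proportion to its size.
    chosen = []
    for m in members:
        k = min(round(n_select * len(m) / len(strength)), len(m))
        if k > 0:
            chosen.append(rng.choice(m, size=k, replace=False))
    sel = np.concatenate(chosen)

    in_sel = np.zeros(len(strength), dtype=bool)
    in_sel[sel] = True

    def abs_r(idx):
        return abs(np.corrcoef(strength[idx], freq[idx])[0, 1])

    best = abs_r(sel)
    for _ in range(n_iter):
        pos = rng.integers(len(sel))
        old = sel[pos]
        # Candidate replacement from the SAME strength bin, so the
        # per-bin counts (the strength histogram) never change.
        new = rng.choice(members[bins[old]])
        if in_sel[new]:
            continue
        sel[pos] = new
        r = abs_r(sel)
        if r < best:            # keep the swap only if |r| improved
            best = r
            in_sel[old] = False
            in_sel[new] = True
        else:                   # otherwise undo it
            sel[pos] = old
    return sel, best
```

Calling `decorrelated_subset(strength, freq)` returns the selected indices and the remaining absolute correlation. Because swaps never cross bin boundaries, the strength distribution is only preserved at the resolution of the bins, and a Pearson correlation near zero does not rule out a nonlinear dependence on word-frequency.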
