
Recently I came across the problem: define a metric on a space formed by nouns. Here is my formulation:

Let $W$ be the set of all words and let $S \subset W$ be the set of all nouns. Let $A$ be the set of sequences from $W$ such that, for each element $a \in A$, the range of $a$ contains at least one element from $S$.

Define the function $N: S \times S \rightarrow \mathbb{Z}$ by $N(x, y)$ = the number of sequences in $A$ that have both $x \in S$ and $y \in S$ in their range, where $\mathbb{Z}$ is the set of all integers. Define the map $dep: S \times S \rightarrow \mathbb{R}_+$ by $dep(x,y) = N(x,x) - N(x,y)$, where $\mathbb{R}_+$ is the set of all non-negative real numbers.
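A minimal sketch of these definitions, modeling each article as a list of words; the corpus `A` and the example nouns are made up for illustration:

```python
# Sketch of N and dep as defined above.
# Articles are lists of words; the corpus and nouns are illustrative.

def N(x, y, A):
    """Number of sequences in A whose range contains both x and y."""
    return sum(1 for a in A if x in a and y in a)

def dep(x, y, A):
    """dep(x, y) = N(x, x) - N(x, y); non-negative, since every
    article counted by N(x, y) is also counted by N(x, x)."""
    return N(x, x, A) - N(x, y, A)

A = [["cat", "dog", "fish"], ["cat", "dog"], ["cat"]]
print(dep("cat", "dog", A))  # 3 - 2 = 1
print(dep("dog", "cat", A))  # 2 - 2 = 0
```

Note already from this toy corpus that $dep$ need not be symmetric.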

Question: is $dep$ a metric? If not, is it a quasimetric?

  • 1
    Actually I don't want to measure similarity between strings; I want to measure similarity between the meanings of strings. $A$ represents the set of all journal articles. (2012-08-21)

2 Answers

1

I suggest the following approach: to each noun $x$ there corresponds the set $A_x=\{a\in A: x\in a\}$. In plain English, $A_x$ is the set of articles that mention $x$. There is a natural metric on the space of subsets of $A$: namely, $d(U,V)=|U\,\Delta\,V|$, the cardinality of the symmetric difference. This metric induces a pseudo-metric on the set of nouns, $d(x,y)=d(A_x,A_y)$ (a pseudo-metric because it vanishes whenever $A_x=A_y$, even if $x\ne y$). This sort of makes sense: if $x$ and $y$ always appear together, they can be treated as the same entity. Chances are they form a popular compound noun.
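The symmetric-difference pseudometric can be sketched as follows; articles are modeled as sets of words, and the corpus is made up:

```python
# Symmetric-difference pseudometric d(x, y) = |A_x Δ A_y|.
# Articles are sets of words; the corpus is illustrative.

def articles_with(x, A):
    """A_x: indices of the articles that mention x."""
    return {i for i, a in enumerate(A) if x in a}

def d(x, y, A):
    # ^ is Python's set symmetric-difference operator.
    return len(articles_with(x, A) ^ articles_with(y, A))

A = [{"cat", "dog"}, {"cat"}, {"dog", "fish"}]
# A_cat = {0, 1}, A_dog = {0, 2}, so A_cat Δ A_dog = {1, 2}.
print(d("cat", "dog", A))  # 2
```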

Another way to look at the above: to each $x$ we associate a function $F_x$ such that $F_x(a)=1$ if $x\in a$ and $F_x(a)=0$ otherwise. The symmetric difference distance is just the $L^1$ distance with the counting measure: $d(x,y)=\sum_a |F_x(a)-F_y(a)|$. But it may be that not all articles are equally important. They could be weighted by the number of citations, by length, by the journal's prestige, etc. The weighted $L^1$ space fits the problem well: $d_w(x,y)=\sum_a |F_x(a)-F_y(a)|\,w(a)$.
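The weighted $L^1$ variant is a one-line change; the weights below (standing in for, say, citation counts) and the corpus are made up:

```python
# Weighted L1 distance d_w(x, y) = sum_a |F_x(a) - F_y(a)| * w(a),
# with F_x(a) = 1 if x appears in article a, else 0.
# Weights and corpus are illustrative.

def F(x, a):
    return 1 if x in a else 0

def d_w(x, y, A, w):
    return sum(abs(F(x, a) - F(y, a)) * w[i] for i, a in enumerate(A))

A = [{"cat", "dog"}, {"cat"}, {"dog", "fish"}]
w = [5.0, 1.0, 2.0]  # e.g. citation counts
# Articles 1 and 2 mention exactly one of the two nouns: 1.0 + 2.0.
print(d_w("cat", "dog", A, w))  # 3.0
```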

Finally, we may want to take into account the number of appearances of $x$ in $a$, denoted $n(x,a)$ below. The straightforward choice $F_x(a)=n(x,a)$ does not look appropriate. But I would at least experiment with the exponential model $F_x(a)=1-2^{-n(x,a)}$. In yet another direction, it is worth considering the einmal ist keinmal ("once is never") principle: $F_x(a)=1$ if $n(x,a)\ge 2$ and $F_x(a)=0$ otherwise.
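The two count-sensitive choices of $F_x$ can be compared side by side; a small sketch, with made-up occurrence counts:

```python
# Two count-sensitive choices of F_x(a); n is the occurrence
# count n(x, a). The sample counts are illustrative.

def F_exp(n):
    """Exponential model 1 - 2^(-n): saturates toward 1 as n grows."""
    return 1 - 2 ** (-n)

def F_einmal(n):
    """'einmal ist keinmal': a single mention counts for nothing."""
    return 1 if n >= 2 else 0

counts = [0, 1, 2, 3]
print([F_exp(n) for n in counts])     # [0, 0.5, 0.75, 0.875]
print([F_einmal(n) for n in counts])  # [0, 0, 1, 1]
```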

  • 0
    Very good answer! You are effectively representing the "popularity" of a noun as a function that takes an article and returns a quantity related to the number of appearances of the noun in that article. About your pseudometric $d(x, y)$: if $x$ and $y$ are closely related, then $d(x,y)$ is very close to $0$, which is exactly what I need. About the "pseudo": I can always construct the set $A$ so that, for every $x \in S$, there exists an article $a \in A$ with $R(a) \cap S = \{x\}$, where $R(a)$ denotes the range of $a$. Under this restriction, $d(x,y)$ is a metric, which solves my problem. (2012-08-23)
1

I hope I'm not misunderstanding your definitions.

Suppose $x$, $y$ and $z$ are nouns.

If $A=\{xyz,xy,x\}$ then $dep(x,y)=3-2=1$ and $dep(y,x)=2-2=0$. Non-symmetric!

As I pointed out in the comments, if $A=\{xy,xyz\}$ then $dep(x,y)=0$ with $x\neq y$.
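Both counterexamples can be checked mechanically; a small sketch, with nouns represented as single characters and articles as strings:

```python
# Checking the two counterexamples above; nouns are characters,
# articles are strings.

def N(x, y, A):
    return sum(1 for a in A if x in a and y in a)

def dep(x, y, A):
    return N(x, x, A) - N(x, y, A)

# Non-symmetry: A = {xyz, xy, x}.
A1 = ["xyz", "xy", "x"]
print(dep("x", "y", A1), dep("y", "x", A1))  # 1 0

# Vanishes for distinct nouns: A = {xy, xyz}.
A2 = ["xy", "xyz"]
print(dep("x", "y", A2))  # 0
```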

As for the triangle inequality, we can rewrite it equivalently: $\begin{align} dep(x,y)+dep(y,z)&\geq dep(x,z)\\ N(x,x) - N(x,y) + N(y,y) - N(y,z) &\geq N(x,x) - N(x,z)\\ N(y,y) - (N(x,y) + N(y,z)) &\geq - N(x,z) \end{align}$

In fact it does hold: writing $N(u,v)=|A_u\cap A_v|$, where $A_u$ is the set of sequences containing $u$, inclusion-exclusion inside $A_y$ gives $N(x,y)+N(y,z)-N(y,y)=|A_x\cap A_y|+|A_y\cap A_z|-|A_y|\leq |A_x\cap A_y\cap A_z|\leq |A_x\cap A_z|=N(x,z)$, which is exactly the last inequality above.

It looks like you have a pseudoquasimetric.