1
$\begingroup$

Fast String Correction with Levenshtein-Automata

This paper uses the terms descendent and stroke, and I can't find definitions for them by googling. Could you recommend a book, or page.

Grazie.

1 Answers 1

1

I don’t know whether this terminology is standard, so I can’t give you a reference, but I can explain how it’s used in the paper. I think that the easiest way to explain is by an example.

Consider the words $u=$ abcde and $v=$ aebbfg; the Levenshtein distance between them is $5$, as illustrated by this sequence of edit operations:

   abcde      abde                              delete c    abbe                              substitute b for d    aebbe                             insert e    aebbf                             substitute f for e      aebbfg                            insert g 

(Other sequences of length $5$ are possible.) The letters of $v$ are of three types.

  1. Those that were in $u$ and were never changed: these are the a and the first b. These letters are the descendants of the a and b in $u$, respectively. (In the paper these letters are the ones ‘not touched by any edit operation’ in the discussion immediately following Lemma 2.0.2.)

  2. Those that are the result of substituting a letter for a letter in $u$: these are the second b and the f. These letters are the descendants of the d and e in $u$, respectively.

  3. Those that were inserted: these are the e and the g. These are not descendants of any letter of $u$.

Note also that the c of $u$ has no descendant in $v$, because it was deleted.

Now construct a bipartite graph $G(u,v)$. Its vertices are the letters of $u$ as one of the parts and the letters of $v$ as the other part. Connect a letter $x$ in $u$ to a letter $y$ in $v$ if and only if $y$ is the descendant of $x$. $G(u,v)$ is the paper’s trace representation, and a stroke in the paper’s terminology is simply one of these edges. In my example the picture looks like this:

                a e b b f g                   |  /  | |                 a b c d e  

(I put $u$ on the bottom and $v$ on the top in order to follow the convention used in Figure 2.1 of the paper.) $G(u,v)$ has four edges; in the paper’s terminology, this trace representation has four strokes. As noted in the discussion immediately following Lemma 2.0.2, no strokes cross, each letter of $u$ that has no attached edge was deleted, and each letter of $v$ that has no attached edge was inserted.

  • 0
    Great explanation. Thank you. (>'-')> <('-'<) ^('-')^ v('-')v <('-'<) (>'-')> ^(^-^)>2012-04-09