2
$\begingroup$

A DNA molecule can be represented as a string of symbols $A$, $C$, $G$ and $T$, such as $GGATAATTCTAG. . .GACCGTACCC$ For the purposes of this question, we will assume that all DNA molecules contain the same (large!) number $N$ of symbols. Thus, a DNA molecule is an $N$-tuple $x = (x_1, . . . , x_N)$ where $x_i ∈ \{A,C,G,T\}$ for each $i$.

Define the distance between two DNA molecules $x$, $y$ as the number of elements $i ∈ {1, . . . ,N}$ such that $x_i \neq y_i$. Prove that this defines a metric on the set of DNA molecules.

  • 2
    This is an example of the [Hamming distance](http://en.wikipedia.org/wiki/Hamming_distance) (but the fact that people call it like this should not prevent you to show this is indeed a metric, and for this there is no better starting point than @Brian's comment).2012-04-10

1 Answers 1

5

I’ll use $d(x,y)$ for the distance between $x$ and $y$. To show that $d$ is a metric, you must show the following four things:

  1. $d$ is non-negative: For all DNA molecules $x$ and $y$, $d(x,y)\ge 0$.

  2. $d$ separates points: $d(x,y)=0$ if and only if $x=y$.

  3. $d$ is symmetric: For all DNA molecules $x$ and $y$, $d(x,y)=d(y,x)$.

  4. $d$ satisfies the triangle inequality: For all DNA molecules $x,y$, and $z$, $d(x,y)\le d(x,z)+d(z,y)$.

You should have absolutely no trouble verifying (1)-(3): they are trivial consequences of the definition of $d(x,y)$. The only one that requires some work is the triangle inequality. Here’s a suggestion.

Start with three arbitrary DNA molecules $x,y$, and $z$. Let $D_{xy}$ be the set of positions at which $x$ and $y$ differ, i.e., $D_{xy}=\{k:x_k\ne y_k\}$. Similarly, let $D_{xz}$ be the set of positions at which $x$ and $z$ differ, and let $D_{yz}$ be the set of positions at which $y$ and $z$ differ. By definition $d(x,y)$ is just $|D_{xy}|$, the number of elements in $D_{xy}$, and similarly for the other two pairs. Thus, you want to show that $|D_{xy}|\le|D_{xz}|+|D_{yz}|$.

Play with some examples for smallish $N$ and try to see how the sets $D_{xy},D_{xz}$, and $D_{yz}$ are related. For instance, if a position $k$ is in both $D_{xz}$ and $D_{yz}$, is it in $D_{xy}$? To say the same thing in different words, if $x$ and $z$ differ in position $k$, and $y$ and $z$ differ in position $k$, how do $x$ and $y$ compare in position $k$? Then go on to consider the various other possibilities.