1
$\begingroup$

I'm onto a problem about binary sequence similarity for which I have not found any existing solution. I want to share it and the approaches I have taken, although none of them convince me.

Consider a partition of a range from 1 to N, with K subsets represented as binary sequences: a value of 1 at index i in sequence j means that subset j contains the range's value i. For example, for N=7 and K=3, a possible partition could be k1=1001100, k2=0110000, k3=0000011.

Here, an appropriate distance function should regard as similar those pairs of sequences that contain 1's at close indexes.

Cases with (equally) similar pairs of sequences ("..." means many 0's)

-Case A

*1010000101000...0000101

*0101000010100...0001010

-Case B (both more sparse)

*100010001001000...0000101

*010001000100100...0001010

-Case C (less regions)

*1010000101000...0000000

*0101000010100...0000000

-Case D (different sparseness)

*1101100110110...0011011

*0010000001000...0000100

Case with not so similar sequences, but still quite similar:

-Case M

*1010000101000...1110000

*0101000010100...0001110

Case with rather different sequences:

-Case S

*1010000101000...1110...0000

*0101000010100...0000...0111

I have tried:

*Measuring the distance between each 1 in a sequence to the closest 1 in the other sequence, and summing them. Has the problem that sequences may have different amount of 1's. I tried correcting directly for the amount of 1's and more, but found no solution.

*For two sequences X and Y, using the variation in amount of information between X and xor(X,Y).

*The edit distance.

*Local alignment, where a pair of 1's in the same final position has a positive score.

*How probable it is that a model that generates sequence X also generates sequence Y, but I don't know how to model this.

I hope to hear any comments you may have.

Cheers,

David

  • 0
    You are right. It is corrected now.2012-08-24

2 Answers 2

2

Since we care primarily about the positions of $1$s, the Hausdorff distance between the sets of 1s seems appropriate. But since the standard definition of this distance is sensitive to outliers, I will modify it. First, for an integer $r\ge 0$ let $A_r$ be the $r$-neighborhood of a set $A\subset \{1,\dots,N\}$. For example, the 1-neighborhood of 0100000100011000 is 1110001110111100, and the 2-neighborhood is 1111011111111110. These strings are easy to generate using bitwise shifts: $A_1 = A \text{ or } A_{\rightarrow} \text{ or } A_{\leftarrow}$, and after that $A_{r+1}=(A_r)_1$.

We can measure the distance of $B$ from $A$ by counting the cardinality of $B\setminus A$, $B\setminus A_1$, $B\setminus A_2$, etc., and adding them up, possibly with a weight depending on $r$. Even without any weight, the sum $ \sum_{r=0}^N |B\setminus A_r| $ looks like a reasonable measure of how far and how often elements of $B$ deviate from $A$. (Here $|\cdot |$ is the cardinality of a set.) Of course, this sum will be zero if $B\subset A$, and in any case the similarity measure should be symmetric. So $ d(A,B)=\sum_{r=0}^N |B\setminus A_r|+\sum_{r=0}^N |A\setminus B_r| $ is what I would actually use.

You can fine-tune this by weighting the terms in some way, to penalize the large deviations more or less severely: $ d(A,B)=\sum_{r=0}^N w(r)\, |B\setminus A_r|+\sum_{r=0}^N w(r)\,|A\setminus B_r| $

0

For a sequence of size $n$, $X=x_1x_2...x_n$, consider the operator $\sigma_k$ such that $\sigma_k(X)=x_{k+1}x_{k+2}...x_nx_1...x_k$ let $\sigma_{-k}=\sigma_{n-k}$ and let ${\mathcal N}(X)$ the number of $1$ in the sequence $X$.

$d(X,Y)=\min_{-n

then $d$ should be close to what you need ?

EDIT :

you can consider such definition if you don't want cyclic behavior (but I'm not sure it will be a distance then).

$\sigma_k(X)=x_{k+1}x_{k+2}...x_n0^k$

$\sigma_{-k}(X)=0^kx_{1}x_{2}...x_{n-k}$

  • 0
    Although it would be nice to define a metric, I do not need the similarity measure to be a metric.2012-08-24