1
$\begingroup$

I'm onto a problem about binary sequence similarity for which I have not found any existing solution. I want to share it and the approaches I have taken, although none of them convince me.

Consider a partition of a range from 1 to N, with K subsets represented as binary sequences: a value of 1 at index i in sequence j means that subset j contains the range's value i. For example, for N=7 and K=3, a possible partition could be k1=1001100, k2=0110000, k3=0000011.

Here, an appropriate distance function should regard as similar those pairs of sequences that contain 1's at close indexes.

Cases with (equally) similar pairs of sequences ("..." means many 0's)

-Case A

*1010000101000...0000101

*0101000010100...0001010

-Case B (both more sparse)

*100010001001000...0000101

*010001000100100...0001010

-Case C (less regions)

*1010000101000...0000000

*0101000010100...0000000

-Case D (different sparseness)

*1101100110110...0011011

*0010000001000...0000100

Case with not so similar sequences, but still quite similar:

-Case M

*1010000101000...1110000

*0101000010100...0001110

Case with rather different sequences:

-Case S

*1010000101000...1110...0000

*0101000010100...0000...0111

I have tried:

*Measuring the distance between each 1 in a sequence to the closest 1 in the other sequence, and summing them. Has the problem that sequences may have different amount of 1's. I tried correcting directly for the amount of 1's and more, but found no solution.

*For two sequences X and Y, using the variation in amount of information between X and xor(X,Y).

*The edit distance.

*Local alignment, where a pair of 1's in the same final position has a positive score.

*How probable it is that a model that generates sequence X also generates sequence Y, but I don't know how to model this.

I hope to hear any comments you may have.

Cheers,

David

  • 0
    I don't understand how you're using the term "partition". Usually a partition of a set contains subsets of that set; in your case it appears to contain elements of the set; I think you should add a definition to clarify that.2012-08-23
  • 0
    I mistakenly wrote partitions instead of subsets. Additionally, I added an example, although the key of the problem is better seen in the subsequent examples. Thanks for the comment.2012-08-23
  • 0
    If I understand correctly what you mean, then you forgot to change one of the occurrences of "partition" to "subset".2012-08-23
  • 0
    You are right. It is corrected now.2012-08-24

2 Answers 2

2

Since we care primarily about the positions of $1$s, the Hausdorff distance between the sets of 1s seems appropriate. But since the standard definition of this distance is sensitive to outliers, I will modify it. First, for an integer $r\ge 0$ let $A_r$ be the $r$-neighborhood of a set $A\subset \{1,\dots,N\}$. For example, the 1-neighborhood of 0100000100011000 is 1110001110111100, and the 2-neighborhood is 1111011111111110. These strings are easy to generate using bitwise shifts: $A_1 = A \text{ or } A_{\rightarrow} \text{ or } A_{\leftarrow}$, and after that $A_{r+1}=(A_r)_1$.

We can measure the distance of $B$ from $A$ by counting the cardinality of $B\setminus A$, $B\setminus A_1$, $B\setminus A_2$, etc., and adding them up, possibly with a weight depending on $r$. Even without any weight, the sum $$ \sum_{r=0}^N |B\setminus A_r| $$ looks like a reasonable measure of how far and how often elements of $B$ deviate from $A$. (Here $|\cdot |$ is the cardinality of a set.) Of course, this sum will be zero if $B\subset A$, and in any case the similarity measure should be symmetric. So $$ d(A,B)=\sum_{r=0}^N |B\setminus A_r|+\sum_{r=0}^N |A\setminus B_r| $$ is what I would actually use.

You can fine-tune this by weighting the terms in some way, to penalize the large deviations more or less severely: $$ d(A,B)=\sum_{r=0}^N w(r)\, |B\setminus A_r|+\sum_{r=0}^N w(r)\,|A\setminus B_r| $$

0

For a sequence of size $n$, $X=x_1x_2...x_n$, consider the operator $\sigma_k$ such that $$\sigma_k(X)=x_{k+1}x_{k+2}...x_nx_1...x_k$$ let $\sigma_{-k}=\sigma_{n-k}$ and let ${\mathcal N}(X)$ the number of $1$ in the sequence $X$.

$$d(X,Y)=\min_{-n

then $d$ should be close to what you need ?

EDIT :

you can consider such definition if you don't want cyclic behavior (but I'm not sure it will be a distance then).

$$\sigma_k(X)=x_{k+1}x_{k+2}...x_n0^k$$

$$\sigma_{-k}(X)=0^kx_{1}x_{2}...x_{n-k}$$

  • 0
    First of all, thank you for answering. The operator you define matches the sequences in a circular way, but I was intending for a simple longitudinal similarity. If you repeatedly apply a similar operator which leaves gaps instead of filling the displaced sequence with the other end, the objective would be to minimise the number of operations and the K's, while maximising N(X xor Y).2012-08-24
  • 0
    why maximizing the xor ? The xor is 0 in case of equal sequences, as the distance should be.2012-08-24
  • 0
    You are right, it needs to be minimised too. Now, moving the whole sequence at once only gives the oportunity for matching to some 1's. Applying the operator as many times as needed in order to match each 1 at least once would be more comprehensive. This is similar to counting the distance between each 1 in a sequence and the closest 1 in the other sequence.2012-08-24
  • 0
    Although it would be nice to define a metric, I do not need the similarity measure to be a metric.2012-08-24