I'm onto a problem about binary sequence similarity for which I have not found any existing solution. I want to share it and the approaches I have taken, although none of them convince me.
Consider a partition of a range from 1 to N, with K subsets represented as binary sequences: a value of 1 at index i in sequence j means that subset j contains the range's value i. For example, for N=7 and K=3, a possible partition could be k1=1001100, k2=0110000, k3=0000011.
Here, an appropriate distance function should regard as similar those pairs of sequences that contain 1's at close indexes.
Cases with (equally) similar pairs of sequences ("..." means many 0's)
-Case A
*1010000101000...0000101
*0101000010100...0001010
-Case B (both more sparse)
*100010001001000...0000101
*010001000100100...0001010
-Case C (less regions)
*1010000101000...0000000
*0101000010100...0000000
-Case D (different sparseness)
*1101100110110...0011011
*0010000001000...0000100
Case with not so similar sequences, but still quite similar:
-Case M
*1010000101000...1110000
*0101000010100...0001110
Case with rather different sequences:
-Case S
*1010000101000...1110...0000
*0101000010100...0000...0111
I have tried:
*Measuring the distance between each 1 in a sequence to the closest 1 in the other sequence, and summing them. Has the problem that sequences may have different amount of 1's. I tried correcting directly for the amount of 1's and more, but found no solution.
*For two sequences X and Y, using the variation in amount of information between X and xor(X,Y).
*The edit distance.
*Local alignment, where a pair of 1's in the same final position has a positive score.
*How probable it is that a model that generates sequence X also generates sequence Y, but I don't know how to model this.
I hope to hear any comments you may have.
Cheers,
David