
I have $n$ data blocks and $k$ parity blocks distributed across $m$ boxes, where each box can contain at most $b$ blocks. Each parity block is the XOR of some data blocks (for ease of understanding we can think of each data/parity block as a single bit), and each data block is involved in at least one XOR that forms a parity block. A data block can be involved in more than one parity block; in fact it usually will be, since that increases the fault tolerance of the code. So we have $k$ parity equations as follows:

$\begin{aligned} P_1 &= D_{11} \oplus D_{12} \oplus \cdots \\ \;\vdots \\ P_k &= D_{k1} \oplus D_{k2} \oplus \cdots \end{aligned}$

If some data or parity blocks fail and we cannot recover the lost data blocks from the non-failed data/parity blocks, we call this a "dataloss" situation. A simple example: assume we have $4$ data blocks and $3$ parity blocks, where the parity equations are $\begin{aligned} P_1 &= D_{1} \oplus D_{3} \oplus D_{4} \\ P_2 &= D_{1} \oplus D_{2} \oplus D_{3} \\ P_3 &= D_{2} \oplus D_{3} \oplus D_{4} \end{aligned}$

We represent them by a $4 \times 7$ matrix called the generator matrix:

$ G=\left(\begin{array}{ccccccc} 1&0&0&0&1&1&0\\ 0&1&0&0&0&1&1\\ 0&0&1&0&1&1&1\\ 0&0&0&1&1&0&1 \end{array}\right) $

Now suppose $D_1$, $D_2$, $D_3$ fail and we want to check whether that causes dataloss. We remove the columns corresponding to $D_1$, $D_2$, $D_3$ from $G$ and call the resulting matrix the "recovery matrix", which here is

$ G'=\left(\begin{array}{cccc} 0&1&1&0\\ 0&0&1&1\\ 0&1&1&1\\ 1&1&0&1 \end{array}\right) $ Now we check the rank of $G'$: if it is less than $n$ we have dataloss, otherwise not. Here $\operatorname{rank}(G')=4$ and $n=4$, so there is no dataloss. Concretely, if we XOR $P_1$, $P_2$, $P_3$ we recover $D_3$, and from that we can recover $D_1$ and $D_2$.
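Such a rank check is easy to do mechanically. The following Python sketch (the bitmask row encoding and the function names are my own, not part of the question) verifies that the recovery matrix above has full rank over GF(2):

```python
def gf2_rank(rows):
    """Rank over GF(2); each row is an int bitmask of its 0/1 entries."""
    pivots = []                      # (pivot_bit, pivot_row) pairs found so far
    for row in rows:
        for bit, prow in pivots:
            if row & bit:            # eliminate this pivot from the row
                row ^= prow
        if row:                      # row is independent of earlier pivots
            pivots.append((row & -row, row))
    return len(pivots)

def to_mask(bits):
    """Pack a 0/1 list into an integer; leftmost entry = highest bit."""
    m = 0
    for b in bits:
        m = (m << 1) | b
    return m

# Recovery matrix G' after deleting the columns of D1, D2, D3
G_prime = [to_mask(r) for r in
           [[0, 1, 1, 0],
            [0, 0, 1, 1],
            [0, 1, 1, 1],
            [1, 1, 0, 1]]]
n = 4
print(gf2_rank(G_prime) == n)   # True: full rank, so no dataloss
```

The same function works for any failure pattern: just delete the corresponding columns before building the rows.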

A data/parity block fails if and only if the box containing it fails, and if a box fails then all the data and parity blocks inside it fail. Now we want to formulate the following optimization problem:

Given $n$, $k$, $m$ and the $k$ parity equations (we are not allowed to choose which data blocks are XOR-ed to form a parity block; the erasure code is given), we want an assignment of the data and parity blocks across the $m$ boxes that minimises the number of cases where a box failure causes "dataloss"; i.e. I want to minimise the function $G=\sum_{i=1}^m g(i)$, where

$g(i) = \begin{cases} 1 & \text{if failure of box }i\text{ causes "dataloss"} \\ 0 & \text{otherwise} \end{cases}$

The assignment of data/parity blocks across the $m$ boxes can be represented by a matrix $B$ of dimension $(n+k) \times m$: for $i \le n$, the $(i,j)$-th entry is $1$ if the $j$-th box contains the $i$-th data block, and for $i > n$ it is $1$ if the $j$-th box contains the $(i-n)$-th parity block; otherwise it is $0$. So the function $G$ is a function of all those matrix entries, and the constraints we have are:

  1. each matrix entry is either 0 or 1
  2. sum of the entries in each row is exactly 1 (i.e. each data/parity block is present in exactly one box)
  3. sum of the entries in each column is at least 1 and at most $b$ (i.e. no box is empty and no box exceeds its capacity)
  4. summation of all the matrix entries is $n+k$

Now, the function $G$ is a non-linear function of the matrix entries. Also, the domain of $G$ is a set of tuples whose entries are 0 or 1, which is not a convex set, so convex optimization techniques cannot be applied here. How can this optimization problem be solved?
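For small instances one can at least fall back on exhaustive search. The sketch below (all names and the column-bitmask encoding are my own) enumerates every capacity-respecting assignment for the $4+3$ example above and reports one minimizing $G$; it is only a brute-force baseline, since the search space grows as $m^{n+k}$:

```python
from itertools import product

def gf2_rank(vecs):
    """Rank over GF(2); each vector is an int bitmask."""
    pivots = []
    for v in vecs:
        for bit, pv in pivots:
            if v & bit:
                v ^= pv
        if v:
            pivots.append((v & -v, v))
    return len(pivots)

# Columns of the 4x7 generator matrix G (top row = highest bit):
cols = [0b1000, 0b0100, 0b0010, 0b0001,   # D1..D4
        0b1011, 0b1110, 0b0111]           # P1..P3
n, m, b = 4, 3, 3

def G_value(assign):
    """Number of boxes whose single failure causes dataloss."""
    bad = 0
    for box in range(m):
        survivors = [c for c, a in zip(cols, assign) if a != box]
        if gf2_rank(survivors) < n:
            bad += 1
    return bad

best = None
for assign in product(range(m), repeat=len(cols)):
    counts = [assign.count(j) for j in range(m)]
    if max(counts) > b or min(counts) == 0:   # capacity / non-empty
        continue
    val = G_value(assign)
    if best is None or val < best[0]:
        best = (val, assign)
print(best)   # minimal G and one assignment achieving it
```

For this example the minimum is $0$: e.g. putting $D_1,D_2$ in one box, $D_3,D_4$ in another, and the three parities in the third means no single box failure causes dataloss.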

If I refine the problem like this: how much of the $n+k$ blocks can be covered using $m$ subsets, each of size at most $b$? Then I think it is possible to formulate it as a maximum independent set problem. We form a graph on $n+k$ vertices, where each vertex corresponds to one block (data/parity), but the definition of the edges is not clear to me (this is the main problem I am trying to solve). The maximum independent set of this graph would then give us the maximum subset (of the $n+k$ columns) that can be covered.

Somehow I feel that the problem is NP-complete.

Hope I have described the problem clearly.

======

Edit (comment from Jyrki Lahtonen): This was also posted at MO. I will try to redirect any interested MO posters here, because at the moment I think we have a better description of the question here.

  • It seems that the minimization problem is NP-complete even if we assume that the entries of $B$ are in $[0,1]$ rather than in $\{0,1\}$. (2011-08-23)

3 Answers

2

Sorry to add another answer, but I feel that my first answer contains relevant clarifications to the problem definition, so I think that keeping it separate is justified. Also that other answer is already so long that editing it is painfully slow given that the preview renderer is eating all the cycles.

IMHO the only worthy goal here is that it never happens that the loss of a single box results in data loss. If you cannot tolerate the loss of a single box, you need to engineer that particular box to be fool-proof, and therefore you might as well keep all the data in there. Ok, you may give some weight to partial data recovery, but you didn't ask for it, and then we definitely need to know more about the erasure recovery code.

Here's my first suggestion. The good old greedy algorithm. First randomize the order of the $n+k$ data and check blocks (not necessary, but I want to add a non-deterministic element), so we have blocks $b_1,b_2,\ldots,b_{n+k}$. The algorithm will assign each and every one of these to one of the boxes as follows.

  1. Initialize $j=1$. Initialize the list of unassigned blocks to the list $b_1,\ldots,b_{n+k}$.
  2. Initialize box $\#j$ to be empty. Initialize a pointer $p$ to point to the first entry in the list of unassigned blocks.
  3. Check whether adding the block pointed at by $p$ to box $\#j$ violates the rank condition (Gaussian elimination does this in polynomial time). If it does not, assign that block to box $\#j$ and remove it from the list of unassigned blocks; in either case, advance the pointer $p$.
  4. If box $\#j$ is full, or the pointer $p$ went beyond the end of the list, then go to step 5. Otherwise go to step 3.
  5. If you have assigned all the blocks, exit with success. Otherwise increment $j$. If you are out of boxes, exit with failure. Otherwise go to step 2.

If one random order doesn't work, try a different one.

It seems to me that the complexity of this is polynomial in $n+k$. It exits successfully if it finds an assignment of the desired type. You may also want to leave the number of boxes open, in which case the algorithm always succeeds, but should also output the number of boxes it needed.
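A minimal Python sketch of the greedy algorithm above (representing each block by its generator-matrix column as a bitmask; the encoding and helper names are my own, and the empty-box bailout is an extra safeguard not in the numbered steps):

```python
import random

def gf2_rank(vecs):
    """Rank over GF(2); each vector is an int bitmask."""
    pivots = []
    for v in vecs:
        for bit, pv in pivots:
            if v & bit:
                v ^= pv
        if v:
            pivots.append((v & -v, v))
    return len(pivots)

def greedy_assign(cols, n, m, b, rng=random):
    """Greedily pack column indices into at most m boxes of capacity b
    so that losing any single box leaves the other columns at rank n.
    Returns a list of boxes (lists of indices) or None on failure."""
    unassigned = list(range(len(cols)))
    rng.shuffle(unassigned)              # the non-deterministic element
    boxes = []
    while unassigned:
        if len(boxes) == m:              # out of boxes
            return None
        box, remaining = [], []
        for idx in unassigned:
            if len(box) == b:            # box is full, defer the rest
                remaining.append(idx)
                continue
            candidate = set(box) | {idx}
            survivors = [cols[i] for i in range(len(cols))
                         if i not in candidate]
            if gf2_rank(survivors) == n:  # rank condition still holds
                box.append(idx)
            else:
                remaining.append(idx)
        if not box:                      # nothing fits: give up early
            return None
        boxes.append(box)
        unassigned = remaining
    return boxes

# Columns of the [7,4,3] Hamming generator matrix: D1..D4, then checks
cols = [0b1000, 0b0100, 0b0010, 0b0001, 0b1011, 0b1110, 0b0111]
print(greedy_assign(cols, n=4, m=3, b=3))
```

The output depends on the random order; rerunning with a fresh shuffle implements the "try a different order" step.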

But I would guess that you (or somebody in your team) have already tried this. What's wrong with it? No proof? Would need to have a (heuristically) justifiable reason to prefer one random order over another?

  • @Prasenjit: But I agree that this is not a solution to your original question. The greedy algorithm has the air of: given $n$, $k$, $b$ and the erasure recovery code, minimize $m$. It doesn't solve that one either, but I would think that for a widish range of values the output would not be too far off the minimum $m$. But: no proof, no cigar :-( (2011-08-28)
1

This is too long to be a comment, but I need to ask this to get/give a better idea of the problem setting. I cannot tell for sure, but based on my exposure to related problems it looks like the following: if the $n\times (n+k)$ binary matrix describing the presence of a data block in a data or a check block is $A$, then $A$ has an $I_n$ block on the left (each data block $D_i$ copied to its own block) and then a `random' (or carefully designed) $n\times k$ binary block $B$ telling us which data blocks are XORed to form each check block: one column per check block, so an entry $1$ in row $i$ and column $j$ of $B$ tells us that data block $\#i$ is among those XORed to form check block $\#j$.

Now distributing these $n+k$ blocks of data into boxes amounts to partitioning the columns of the matrix $A$ into $m$ subsets. AFAICT data loss occurs if and only if removal of the columns belonging to the box in question leaves us with a matrix $A'$ that is no longer of full rank $n$. This is because, if $A'$ were of full rank, we could select a subset of $n$ linearly independent columns. As they form an invertible matrix, data recovery could be done (e.g. by Gaussian elimination).

Is this what you have in mind? This seems to be closely related to erasure correction/recovery codes, and you may benefit from using that as a buzzword, when searching for more information.

Before we get serious, let me describe a toy example of the above. Let $ A=\left(\begin{array}{ccccccc} 1&0&0&0&1&1&0\\ 0&1&0&0&0&1&1\\ 0&0&1&0&1&1&1\\ 0&0&0&1&1&0&1 \end{array}\right) $ be this matrix, which some of you will recognize as the generator matrix of the $[7,4,3]$ Hamming code. Here the last 3 columns tell us that the first check block is constructed by XORring data blocks 1, 3 and 4, the second check block by XORring data blocks 1, 2, 3, and the last check block by XORring data blocks 2, 3, 4.

I claim that in this case using $m=3$ boxes allows us to always recover from the loss of any single box. Put data blocks 1,2 into the first box; data blocks 3,4 into the second box; and all the check blocks into the third box. If we lose the third box, then we obviously have all the data intact. If we lose the first box, then we can recover data block #1 by XORring the first check block with data blocks 3 and 4, and similarly we can recover data block #2 by XORring the last check block with data blocks 3 and 4. Similarly if we lose the second box, then we can recover the missing data blocks 3 and 4 using the saved check blocks and the saved data blocks 1 and 2.

Because this binary code has minimum Hamming distance 3, it can recover from any two erasures. Therefore I knew in advance that I would be able to recover the lost contents of either of the first two boxes.
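The claim is also easy to verify mechanically: delete each box's columns from $A$ and check that the remaining columns still have rank $4$ over GF(2). A small sketch (the bitmask encoding and names are mine):

```python
def gf2_rank(vecs):
    """Rank over GF(2); each vector is an int bitmask."""
    pivots = []
    for v in vecs:
        for bit, pv in pivots:
            if v & bit:
                v ^= pv
        if v:
            pivots.append((v & -v, v))
    return len(pivots)

# Columns of A as bitmasks (top row = highest bit): data then checks
cols = {'D1': 0b1000, 'D2': 0b0100, 'D3': 0b0010, 'D4': 0b0001,
        'C1': 0b1011, 'C2': 0b1110, 'C3': 0b0111}
boxes = [['D1', 'D2'], ['D3', 'D4'], ['C1', 'C2', 'C3']]

for box in boxes:
    survivors = [v for name, v in cols.items() if name not in box]
    assert gf2_rank(survivors) == 4   # losing this box is recoverable
print("all single-box failures recoverable")
```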

Please comment. I realize that you may be looking for a different approach. May be one with simpler recovery schemes?

  • @Jyrki: basically I want a general algo/optimization technique (2011-08-14)
1

The OP has also posted this problem at StackOverflow and clarified it in Comments there.

The Question can be framed as a combinatorial one. Given a "universe" set $S$ of size $n$, consider a collection $A$ of subsets of $S$ consisting of the $n$ singleton subsets together with $k$ additional (distinct) subsets of size at least two. Thus $|A| = n+k$.

We are then asked to partition $A$ using at most $m$ parts and with parts of size at most $b$, where $mb \ge n+k$. The objective is to find such a partition in which as few of the parts as possible have a certain property, which Prasenjit refers to as "critical". This motivates the definition:

A subset $B$ of $A$ has property C iff the Boolean algebra generated by $A \backslash B$ is a proper subalgebra of $\mathscr{P}(S)$.

A little reflection should convince the reader that the point set condition above amounts to the criterion Prasenjit gives for the removal of columns from generator matrix $G$ resulting in a matrix of rank less than $n$.

Prasenjit's question is then: considering all partitions of $A$ having at most $m$ parts, each of size at most $b$, what is the least possible number of parts having property C?

My suggestion is to first determine how much of $A$ can be covered by at most $m$ subsets $B$ of size at most $b$ not having property C. Since the condition of not having property C is closed under taking subsets, such a cover can always be refined to a partition with no parts having property C. If possible this would attain a minimum.

If not possible then at least one part having property C would be required.

Since all the $k$ non-singleton sets in $A$ form a subset not having property C, it is only a singleton set in $A$ that might be impossible to include in a "block" $B$ (subset of $A$ of size at most $b$) not having property C.

However the parameter $m$ might be specified so low that even though every set in $A$ is included in a "block" not having property C, it is nonetheless impossible to cover all $A$ with just $m$ such blocks. This would of course be the case if $m=1$, although it would then be clear that the minimum number of blocks with property C required is 1.

  • Another discussion is going on at http://cstheory.stackexchange.com/questions/8050/is-this-minimization-problem-np-complete (2011-09-02)