My concern is about matrix inversion. Consider This page. I was thinking about creating four different threads for every component of the final matrix. In order to be more specific, I am going to write the formulas written in the link I mentioned before.
$$ \begin{bmatrix} \mathbf{A} & \mathbf{B} \\ \mathbf{C} & \mathbf{D} \end{bmatrix}^{-1} = \begin{bmatrix} (\mathbf{A}-\mathbf{BD}^{-1}\mathbf{C})^{-1} & -(\mathbf{A}-\mathbf{BD}^{-1}\mathbf{C})^{-1}\mathbf{BD}^{-1} \\ -\mathbf{D}^{-1}\mathbf{C}(\mathbf{A}-\mathbf{BD}^{-1}\mathbf{C})^{-1} & \mathbf{D}^{-1}+\mathbf{D}^{-1}\mathbf{C}(\mathbf{A}-\mathbf{BD}^{-1}\mathbf{C})^{-1}\mathbf{BD}^{-1}\end{bmatrix} $$
As you can see in the expression above, there are four independent subexpressions in the final matrix. Well, I know, there is this term: $(\mathbf{A}-\mathbf{BD}^{-1}\mathbf{C})^{-1}$ which is in common, but four threads can calculate it independently.
So, if the fastest algorithm is considered (Coppersmith–Winograd algorithm), $O(n^{2,376})$, every thread has to invert two matrices (matrix $\mathbf{A}-\mathbf{BD}^{-1}\mathbf{C}$ and $\mathbf{D}$ are the only inverted ones) then every thread has to bear a complexity like: $2 \cdot O\left[ (n/2)^{2,376} \right]$.
The problem of this approach is that resources are the same. Meaning that if I wanted to use a distributed grid I should send the entire matrix.
Is there some other algorithm to make the inversion faster and resource independent (for threads)?
