
I have a machine learning regression problem. I need to minimize

$\sum_i \|Ax_i-b_i\|_2^2$

However, I am trying to find the matrix $A$, not the usual $x$, and I have plenty of example data for the pairs $x_i$ and $b_i$. In general, $A$ has more rows than columns.

Additionally, I would like a solution for minimizing the Mahalanobis-distance version, where $C$ is the symmetric positive-definite covariance matrix (so that $C^{-1}$ exists):

$\sum_i(Ax_i-b_i)^TC^{-1}(Ax_i-b_i)$

I think my problem can be rewritten as the usual least-squares problem, but I am not sure whether I am doing it correctly.

  • I corrected the text to clarify the L2 norm. (2012-09-18)

3 Answers


Yes. Write $x$ for the matrix whose $i$th column is $x_i$ and $b$ for the matrix whose $i$th column is $b_i$. Expanding the sum, we get

$\begin{align*}\sum_i \|Ax_i-b_i\|^2 &= \sum_i \sum_j (Ax_i-b_i)_j^2\\ &= \sum_i \sum_j \left(\sum_k A_{jk}x_{ki} - b_{ji}\right)^2\\ &= \sum_i \sum_j \left(\sum_k x^T_{ik} A^T_{kj} - b^T_{ij}\right)^2\\ &= \sum_i \sum_j (x^T A^T_j - b^T_j)_i^2\\ &= \sum_j \|x^T A^T_j - b_j^T\|^2. \end{align*}$

So you can solve the problem using regular least squares: the matrix $x^T$ has rows equal to your original known vectors $x_i$, the unknown vector $A_j^T$ is the $j$th row of $A$, and $b_1^T$ is the vector containing the first entries of all the $b_i$, and so on.
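
For concreteness, here is a minimal NumPy sketch of that reformulation (the names `X`, `B` and the synthetic shapes are illustrative assumptions, not from the question). Stacking the known $x_i^T$ as rows of `X` and the $b_i^T$ as rows of `B`, the problem becomes $\min_A \|XA^T-B\|_F^2$, which `np.linalg.lstsq` solves for all rows of $A$ at once:

```python
import numpy as np

# Synthetic example data; the shapes are illustrative assumptions.
# n samples, x_i in R^d, b_i in R^m, unknown A is m x d (more rows than columns).
rng = np.random.default_rng(0)
n, d, m = 200, 3, 5
X = rng.normal(size=(n, d))                        # rows are the known x_i^T
A_true = rng.normal(size=(m, d))
B = X @ A_true.T + 0.01 * rng.normal(size=(n, m))  # rows are the b_i^T

# min_A sum_i ||A x_i - b_i||^2  =  min ||X A^T - B||_F^2.
# lstsq handles all m right-hand sides (the columns of B) in one call,
# i.e. one ordinary least-squares problem per row of A, exactly as above.
At, *_ = np.linalg.lstsq(X, B, rcond=None)
A_est = At.T                                       # each column of At is one row of A

print(np.allclose(A_est, A_true, atol=1e-2))       # True for this low-noise example
```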

  • Sorry about that, it is supposed to be the L2 norm. (2012-09-18)

For the second problem, note that $\frac{\rm d}{\rm{d} A}\big((Ax-b)^TC^{-1}(Ax-b)\big)=\big(C^{-1}+{C^{-1}}^T\big)(Ax-b)x^T$, so if $C^{-1}$ is symmetric (hence $C$ is also symmetric), setting the gradient of $\sum_i(Ax_i-b_i)^TC^{-1}(Ax_i-b_i)$ to zero gives $2C^{-1}\left(A\sum_ix_ix_i^T-\sum_ib_ix_i^T\right)=0$. Since $C^{-1}$ is invertible, the normal equation reduces to $A\sum_ix_ix_i^T=\sum_ib_ix_i^T$, or, transposing, $\sum_ix_ix_i^TA^T=\sum_ix_ib_i^T$; the covariance $C$ drops out, so the unconstrained minimizer is the same as in the unweighted problem.

So $A$ can be computed row by row: let $a_j$ be the $j$-th column of $A^T$ (the $j$-th row of $A$) and $d_j$ be the $j$-th column of $\sum_ix_ib_i^T$; then $a_j$ is obtained by solving $\left(\sum_ix_ix_i^T\right)a_j=d_j$.
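
A small NumPy sketch of this, with illustrative names `S`, `D` and synthetic data that are not from the question: build $S=\sum_ix_ix_i^T$ and $D=\sum_ix_ib_i^T$, solve $SA^T=D$, and check numerically that the gradient of the Mahalanobis objective vanishes there and that the result matches plain least squares:

```python
import numpy as np

# Illustrative synthetic data; names and shapes are assumptions, not from the question.
rng = np.random.default_rng(1)
n, d, m = 200, 3, 5
X = rng.normal(size=(n, d))            # rows x_i^T
B = rng.normal(size=(n, m))            # rows b_i^T
M = rng.normal(size=(m, m))
C = M @ M.T + m * np.eye(m)            # symmetric positive-definite covariance

# Normal equations: (sum_i x_i x_i^T) A^T = sum_i x_i b_i^T, i.e. S @ A.T = D.
S = X.T @ X
D = X.T @ B
A = np.linalg.solve(S, D).T            # solves for all rows of A at once

# Sanity checks: the Mahalanobis objective has zero gradient at this A,
# and A agrees with the ordinary least-squares solution (C drops out).
Cinv = np.linalg.inv(C)
R = X @ A.T - B                        # residual rows (A x_i - b_i)^T
grad = 2 * Cinv @ R.T @ X              # gradient of sum_i (Ax_i-b_i)^T C^{-1} (Ax_i-b_i)
print(np.allclose(grad, 0, atol=1e-8))
print(np.allclose(A, np.linalg.lstsq(X, B, rcond=None)[0].T))
```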

  • OK, that's helpful, thanks. (2012-09-19)

If you're really interested in the $L^1$ norm, then this is computationally nontrivial. Luckily, there have been many advances in the last ten years. Look at the key 2005 paper by Nesterov, Smooth minimization of non-smooth functions.

Another competitive method is Bregman $L^1$ optimization, but I don't know very much about it.

  • This is useful, but I did intend the L2 norm, which is consistent with the second matrix equation. (2012-09-18)