The construct $A^TA$, for $A$ any $m \times n$ matrix, seems to appear often in formulae and results. For example, I was reading that the square roots of the eigenvalues of $A^TA$ (an $n \times n$ matrix) are the singular values of $A$. If I work it out on paper, it seems like the diagonal contains sums of squares and the off-diagonal entries contain the mixed quadratic combinations (and the matrix is of course symmetric). What is a good way to think of this construct, and how is it used other than in that singular value statement?
Intuitively (concretely actually) what happens when you multiply a matrix by its transpose?
3 Answers
Let $\langle v, w \rangle$ be the usual dot product on $\mathbb{R}^n$. Then any symmetric bilinear form on $\mathbb{R}^n$ can be uniquely represented in the form $\langle v, Bw \rangle$ where $B$ is some symmetric matrix. (It is a good idea to see explicitly how the coefficients of $B$ determine the coefficients of the corresponding quadratic form $\langle v, Bv \rangle$, which you can think of as determining an ellipse $\langle v, Bv \rangle = 1$.)
In particular, when you change coordinates so that your old vectors $v$ are replaced by new vectors $Av$ for some matrix $A$, then the dot product behaves on new vectors like
$\langle Av, Aw \rangle = \langle v, A^T A w \rangle$
so $A^T A$ is a matrix describing a bilinear form related to the dot product by change of coordinates. More concretely, you can think of $A^T A$ as describing the coefficients of an equation for an ellipse $\langle Av, Av \rangle = 1$ (an ellipsoid in higher dimensions, provided $A$ has full column rank) related to the unit sphere $\langle v, v \rangle = 1$ by a change of coordinates.
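To see the identity $\langle Av, Aw \rangle = \langle v, A^T A w \rangle$ on actual numbers, here is a minimal sketch in plain Python. The matrix `A`, the vectors `v` and `w`, and the helper names are arbitrary choices made for illustration, not anything canonical.

```python
def transpose(M):
    # Rows of the result are the columns of M.
    return [list(col) for col in zip(*M)]

def matmul(X, Y):
    # Entry (i, j) is the dot product of row i of X with column j of Y.
    Yt = transpose(Y)
    return [[sum(x * y for x, y in zip(row, col)) for col in Yt] for row in X]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def dot(v, w):
    return sum(x * y for x, y in zip(v, w))

A = [[1, 2], [3, 4], [5, 6]]   # arbitrary 3 x 2 example
v = [1, -1]
w = [2, 3]

G = matmul(transpose(A), A)             # A^T A, a 2 x 2 symmetric matrix
lhs = dot(matvec(A, v), matvec(A, w))   # <Av, Aw>, computed in R^3
rhs = dot(v, matvec(G, w))              # <v, A^T A w>, computed in R^2

print(G)         # [[35, 44], [44, 56]]
print(lhs, rhs)  # -54 -54
```

Once $A^TA$ has been formed, the dot product of the transformed vectors can be evaluated without ever applying $A$ again.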
$A^TA$ is a sort of projection operator onto the row space, or coimage (without normalization), while $A^+A$ is the 'real' projector, where $A^+$ is the pseudoinverse of $A$. In fact, when the left inverse $A_L^{-1}$ exists (that is, when $A$ has full column rank), $A^+=A_L^{-1}=(A^TA)^{-1}A^T$.
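As a small worked example of the "without normalization" remark, take the single-row matrix $A = \begin{pmatrix} 1 & 2 \end{pmatrix}$ (an arbitrary choice for illustration). It has full row rank rather than full column rank, so the formula that applies here is the right-inverse one, $A^+ = A^T(AA^T)^{-1}$:

$A^TA = \begin{pmatrix} 1 & 2 \\ 2 & 4 \end{pmatrix}, \qquad AA^T = 5, \qquad A^+ = \frac{1}{5}\begin{pmatrix} 1 \\ 2 \end{pmatrix}, \qquad A^+A = \frac{1}{5}\begin{pmatrix} 1 & 2 \\ 2 & 4 \end{pmatrix}$

One can check that $(A^+A)^2 = A^+A$, so $A^+A$ is the orthogonal projector onto the row space spanned by $(1, 2)$. In this rank-one case $A^TA = 5\,A^+A$ is the same projector scaled by $\|(1,2)\|^2 = 5$.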
A particularly good way to think of this construct is as an object that behaves in the way that those often-seen formulae and results need things to behave. :)
IMO, many of those behaviors resemble "squared norms", making $A^T A$ a better candidate for a quadratic function of $A$ than $A^2$ is. This is especially so when you consider complex matrices and use the conjugate transpose instead of the transpose, or when you view $A$ as a collection of column vectors.
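A rough sketch of the "squared norm" and "collection of column vectors" readings, in plain Python: the entries of $A^TA$ are the pairwise dot products of the columns of $A$, the diagonal holds the squared column lengths, and the quadratic form $v^T A^T A v$ equals $\|Av\|^2$. The matrix and vector below are arbitrary examples, and the variable names are just for illustration.

```python
A = [[1, 2], [3, 4], [5, 6]]   # arbitrary 3 x 2 example
v = [1, -1]

# View A as a collection of column vectors.
cols = [list(c) for c in zip(*A)]

# Entry (i, j) of A^T A is the dot product of column i with column j.
gram = [[sum(x * y for x, y in zip(ci, cj)) for cj in cols] for ci in cols]

# The diagonal holds the squared lengths of the columns ...
diag = [gram[i][i] for i in range(len(cols))]
col_norms_sq = [sum(x * x for x in c) for c in cols]

# ... so the trace is the sum of squares of all entries of A.
entries_sq = sum(x * x for row in A for x in row)

# The quadratic form v^T (A^T A) v equals |Av|^2, hence is never negative.
Av = [sum(a * x for a, x in zip(row, v)) for row in A]
quad_form = sum(v[i] * gram[i][j] * v[j]
                for i in range(len(v)) for j in range(len(v)))
norm_Av_sq = sum(x * x for x in Av)

print(gram)                    # [[35, 44], [44, 56]]
print(diag, col_norms_sq)      # [35, 56] [35, 56]
print(sum(diag), entries_sq)   # 91 91
print(quad_form, norm_Av_sq)   # 3 3
```

By contrast, $A^2$ is not even defined for this non-square $A$, and for square $A$ it need not be symmetric or have a nonnegative quadratic form.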