Two classic books where you can find all the details about this stuff:
- Gantmacher, The Theory of Matrices, Chelsea.
- Lancaster-Tismenetsky, The Theory of Matrices, Academic Press.
Actually, for "hand" calculations, this works through the Jordan canonical form: you find the Jordan canonical form $J$ of your matrix, together with a change-of-basis matrix $S$:
$ A = SJS^{-1} \ . $
Then you prove that, for any polynomial $p(t)$, you have
$ p(A) = S p(J) S^{-1} \ . $
Hence, taking $p(t)$ to be a polynomial that agrees with $f(t)$ on the spectrum of $A$ (more on this below),
$ f(A) = S f(J) S^{-1} $
and you only need to compute $p(J)$ for Jordan matrices.
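If you want to see this identity in action, here is a minimal SymPy sketch for the illustrative choices of $f = \exp$ and a small matrix with a nontrivial Jordan block (SymPy's `jordan_form` returns the pair $(S, J)$ with $A = SJS^{-1}$):

```python
# A minimal SymPy check of f(A) = S f(J) S^{-1} for f = exp.
from sympy import Matrix, simplify, zeros

A = Matrix([[4, 1],
            [-1, 2]])        # double eigenvalue 3, a single 2x2 Jordan block

S, J = A.jordan_form()       # returns S, J with A = S*J*S**(-1)
assert simplify(A - S * J * S.inv()) == zeros(2, 2)

# SymPy can exponentiate both sides directly:
assert simplify(A.exp() - S * J.exp() * S.inv()) == zeros(2, 2)
```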
You compute $p(J)$ as follows: first, if you have a block diagonal matrix
$ J = \begin{pmatrix} J_1 & 0 & \dots & 0 \\ 0 & J_2 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & J_r \end{pmatrix} $
you can easily prove that
$ p(J) = \begin{pmatrix} p(J_1) & 0 & \dots & 0 \\ 0 & p(J_2) & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & p(J_r) \end{pmatrix} $
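A quick SymPy sanity check of this block-by-block behavior, with illustrative blocks and $p(t) = t^2 + 3$:

```python
# p applied to a block-diagonal matrix acts block by block.
from sympy import Matrix, diag, eye

J1 = Matrix([[2, 0],
             [1, 2]])   # lower-triangular Jordan block, lambda = 2
J2 = Matrix([[5]])      # 1x1 block, lambda = 5
J = diag(J1, J2)        # J = diag(J1, J2)

p = lambda M: M**2 + 3 * eye(M.shape[0])   # p(t) = t^2 + 3

assert p(J) == diag(p(J1), p(J2))
```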
So again, on one hand,
$ f(J) = \begin{pmatrix} f(J_1) & 0 & \dots & 0 \\ 0 & f(J_2) & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & f(J_r) \end{pmatrix} $
and, on the other hand, you just need to know $p(J)$ when $J$ is a Jordan block. If
$ J = \begin{pmatrix} \lambda & 0 & 0 & \dots & 0 & 0 \\ 1 & \lambda & 0 & \dots & 0 & 0 \\ 0 & 1 & \lambda & \dots & 0 & 0 \\ \vdots & \vdots & \ddots & \ddots & \vdots & \vdots \\ 0 & 0 & \dots & 1 & \lambda & 0 \\ 0 & 0 & \dots & 0 & 1 & \lambda \end{pmatrix} $
is an $r\times r$ Jordan block, then
$ p(J) = \begin{pmatrix} p(\lambda ) & 0 & 0 & \dots & 0 & 0 \\ p'(\lambda)/ 1! & p(\lambda) & 0 & \dots & 0 & 0 \\ p''(\lambda)/ 2! & p'(\lambda)/ 1! & p(\lambda) & \dots & 0 & 0 \\ \vdots & \vdots & \ddots & \ddots & \vdots & \vdots \\ p^{(r-2)}(\lambda)/(r-2)! &p^{(r-3)}(\lambda)/(r-3)! & \dots & p'(\lambda)/ 1! & p(\lambda) & 0 \\ p^{(r-1)}(\lambda)/(r-1)! &p^{(r-2)}(\lambda)/(r-2)! & \dots & p''(\lambda)/2! & p'(\lambda)/ 1! & p(\lambda) \end{pmatrix} $
Hence, again, you have everything in terms of $f$:
$ f(J) = \begin{pmatrix} f(\lambda ) & 0 & 0 & \dots & 0 & 0 \\ f'(\lambda)/ 1! & f(\lambda) & 0 & \dots & 0 & 0 \\ f''(\lambda)/ 2! & f'(\lambda)/ 1! & f(\lambda) & \dots & 0 & 0 \\ \vdots & \vdots & \ddots & \ddots & \vdots & \vdots \\ f^{(r-2)}(\lambda)/(r-2)! &f^{(r-3)}(\lambda)/(r-3)! & \dots & f'(\lambda)/ 1! & f(\lambda) & 0 \\ f^{(r-1)}(\lambda)/(r-1)! &f^{(r-2)}(\lambda)/(r-2)! & \dots & f''(\lambda)/2! & f'(\lambda)/ 1! & f(\lambda) \end{pmatrix} $
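Here is a sketch of this last formula in SymPy, for the illustrative choices $f = \exp$, $\lambda = 2$ and $r = 4$; the block is built by hand in the lower-triangular convention used above, with the $1$'s below the diagonal:

```python
# Build f(J) for a single lower-triangular Jordan block from the formula
# above, and compare with SymPy's matrix exponential for f = exp.
from sympy import Matrix, symbols, exp, diff, factorial, simplify, zeros

t = symbols('t')
f = exp(t)          # illustrative choice of f
lam, r = 2, 4       # illustrative eigenvalue and block size

# Lower-triangular r x r Jordan block, as in the text.
J = Matrix(r, r, lambda i, j: lam if i == j else (1 if i == j + 1 else 0))

# Entry (i, j) on or below the diagonal is f^{(i-j)}(lambda) / (i-j)!.
fJ = Matrix(r, r, lambda i, j:
            diff(f, t, i - j).subs(t, lam) / factorial(i - j) if i >= j else 0)

assert simplify(fJ - J.exp()) == zeros(r, r)
```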
And, in this version of the story, you actually don't need to know the polynomial $p(t)$ for your function $f(t)$ and matrix $A$. But it's not difficult to find it anyway: it's called the Lagrange-Sylvester polynomial, and it's a sort of hybrid of the classical Lagrange interpolation polynomial and a Taylor expansion: it matches the values of $f$, and of finitely many of its derivatives, at each eigenvalue of $A$.
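Just as an illustration (not the general construction), here is how you could find this polynomial in a concrete case with SymPy: take $f = \exp$ and a $3\times 3$ matrix with a $2\times 2$ Jordan block for $\lambda = 3$ and a simple eigenvalue $\lambda = 1$. Then $p$ has to match $f$ and $f'$ at $3$, and $f$ at $1$: three conditions, so a polynomial of degree two suffices.

```python
# Sketch: build the Lagrange-Sylvester polynomial for f = exp and a
# concrete illustrative matrix, then check that p(A) = exp(A).
from sympy import Matrix, symbols, exp, diff, solve, eye, simplify, zeros

t = symbols('t')
f = exp(t)

A = Matrix([[3, 0, 0],
            [1, 3, 0],
            [0, 0, 1]])      # 2x2 Jordan block for lambda = 3, plus lambda = 1

# Conditions: p(3) = f(3), p'(3) = f'(3), p(1) = f(1).
a0, a1, a2 = symbols('a0 a1 a2')
p = a0 + a1 * t + a2 * t**2
sol = solve([p.subs(t, 3) - f.subs(t, 3),
             diff(p, t).subs(t, 3) - diff(f, t).subs(t, 3),
             p.subs(t, 1) - f.subs(t, 1)],
            [a0, a1, a2])

# Evaluate p at the matrix A and compare with exp(A).
pA = sol[a0] * eye(3) + sol[a1] * A + sol[a2] * A**2
assert simplify(pA - A.exp()) == zeros(3, 3)
```

In general, you need as many conditions at each eigenvalue as the size of its largest Jordan block.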
EDIT
Nevertheless, it seems that I forgot to answer the most important question: "Why does this whole thing actually work?"
That is, why does defining
$ f(A) = p(A) $
for some polynomial $p(t)$ that agrees with $f(t)$ on the spectrum of $A$ make sense at all? That is, for what reason can we actually call $p(A)$ (the computable thing) the "value" of $f(t)$ at the matrix $A$?
Because of the following:
Theorem. (Gantmacher, chapter V, $\S 4$, theorem 2). If the function $f(t)$ can be expanded in a power series in the circle $\vert t - \lambda \vert < r$,
$ f(t) = \sum_{n=0}^\infty a_n(t-\lambda)^n \ , $
then this expansion remains valid when the scalar argument $t$ is replaced by a matrix $A$ whose eigenvalues lie within the circle of convergence.
That is, under the hypotheses of the theorem, you have
$ f(A) = \sum_{n=0}^\infty a_n(A-\lambda I)^n \ , $
where the $f(A)$ on the left-hand side means $p(A)$, the value of the Lagrange-Sylvester polynomial at $A$.
So, why not define $f(A)$ as this last power series (i.e., the Taylor series of $f(t)$)? Well, because then you would have to talk a long time about convergence issues for matrix series... And you would finally end up at the same point: relying on the Jordan canonical form for actual computations. So, the Lagrange-Sylvester device allows you to get rid of convergence issues, if you're willing to accept $f(A) = p(A)$ as a sound definition.
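For what it's worth, here is a quick numerical illustration of the theorem (and of those convergence issues being harmless in this case), for the illustrative choice $f = \exp$ expanded around $\lambda = 0$, so that $a_n = 1/n!$:

```python
# Partial sums of the Taylor series of exp at 0, evaluated at A, approach exp(A).
from sympy import Matrix, factorial, eye, zeros

A = Matrix([[4, 1],
            [-1, 2]])   # eigenvalue 3 lies inside the (infinite) circle of convergence

partial = zeros(2, 2)
term = eye(2)                            # A**0
for n in range(31):
    partial += term / factorial(n)       # a_n (A - 0*I)**n with a_n = 1/n!
    term = term * A

print(float((partial - A.exp()).norm()))  # prints something tiny: the truncation error
```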