I'm trying to understand this algorithm by Charles Van Loan for evaluating a matrix polynomial $p(\mathbf A)=\sum\limits_{k=0}^q b_k \mathbf A^k=b_0\mathbf I+b_1\mathbf A+\cdots$ (where $\mathbf A$ is $n\times n$):
$\displaystyle s=\lfloor q\rfloor, \quad r=\lfloor q/s\rfloor$
$\displaystyle \mathbf Y=\mathbf A^s$
$\displaystyle \text{for}\quad j=1,\dots,n$
$\displaystyle\quad\quad \mathbf y_0^{(j)}=\mathbf e_j$ ($\mathbf e_j$ is the $j$th column of the identity matrix)
$\displaystyle\quad\quad \text{for}\quad k=1,\dots,s-1$
$\displaystyle\quad\quad\quad \mathbf y_k^{(j)}=\mathbf A\mathbf y_{k-1}^{(j)}$
$\displaystyle\quad\quad \mathbf f_0^{(j)}=\sum_{k=q}^{rs}b_k\mathbf y_{k-rs}^{(j)}$
$\displaystyle\quad\quad \text{for}\quad k=1,\dots,r$
$\displaystyle\quad\quad\quad \mathbf f_k^{(j)}=\mathbf Y\mathbf p_{k-1}^{(j)}+\sum_{i=0}^{s-1}b_{s(r-k)+i}\mathbf y_{i}^{(j)}$
I've tried going through it a number of times, but it seems that the quantity (vector?) $\mathbf p_{k-1}^{(j)}$ was never defined anywhere in the paper. What might it be? Also, the paper claims (or more probably it's how I (mis)understood the paper) that only three matrices need to be stored for the evaluation: $\mathbf A$, $\mathbf Y$, and the final $p(\mathbf A)$. However, I can't see how to properly organize things so that an extra array for storing the $\mathbf y_k^{(j)}$ is not needed (otherwise, four matrices are now needed instead of the three claimed in the paper).
Could anyone enlighten me on how to implement this algorithm properly? Better yet, is there any code available anywhere that implements this algorithm (I'm language-agnostic, so whatever code you post, I can likely translate into the language I'm currently using).
Thanks for any help!