Please, O mathematicians, help me understand the approach to the problem of estimating transition probabilities given only aggregate data in Kalbfleisch & Lawless' 1984 paper "Least-Squares Estimation of Transition Probabilities from Aggregate Data". (Or some other approach—it's the estimation that interests me, not that approach in particular.)
Forgive the somewhat janky but hopefully comprehensible notation.
Here's my particular problem. We have some number of people; each person $i$ is in a state $X_i(t)$ at time $t$ and can move at time $t+1$ into any state $0 \ldots X_i(t)+1$ inclusive. At time $t=0$ everyone is in state $0$.
The individual state transitions are unobserved; all we observe are the counts of people in a given state $j$ at time $t$: $\mathbf{M}_{t,j} = \#\{i : X_i(t) = j\}$. Because everyone starts in state $0$ at time $0$, and because from state $k$ you can move up to at most state $k+1$, this generates a triangular matrix of counts: $\mathbf{M}_{t,j} = 0$ whenever $j > t$ (lower-triangular if rows are indexed by $t$ and columns by $j$).
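To make the setup concrete, here is a minimal simulation sketch of the process described so far. The names `simulate_counts` and `uniform_probs` are my own hypothetical helpers, and the uniform transition probabilities are made up purely for illustration; they are not taken from the paper or from my real data.

```python
import random

def simulate_counts(n_people, n_steps, transition_probs, seed=0):
    """Return M, where M[t][j] is the number of people in state j at time t.

    transition_probs(k) is a hypothetical helper: it returns k + 2 weights,
    one for each state 0..k+1 reachable from state k.
    """
    rng = random.Random(seed)
    states = [0] * n_people  # everyone starts in state 0 at time 0
    M = [[0] * (n_steps + 1) for _ in range(n_steps + 1)]
    M[0][0] = n_people
    for t in range(1, n_steps + 1):
        # Each person moves from state k to one of the states 0..k+1.
        states = [
            rng.choices(range(k + 2), weights=transition_probs(k))[0]
            for k in states
        ]
        # Only the aggregate counts are recorded, not who moved where.
        for j in states:
            M[t][j] += 1
    return M

def uniform_probs(k):
    # Made-up example: all k + 2 reachable states are equally likely.
    return [1.0 / (k + 2)] * (k + 2)

M = simulate_counts(n_people=1000, n_steps=4, transition_probs=uniform_probs)
for row in M:
    print(row)
```

Every row of `M` sums to the number of people, and `M[t][j]` is zero whenever `j > t`, which is the triangular pattern mentioned above.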
We want to know the probabilities of the possible state transitions $\mathbf{p}_{k,j} = \Pr(X_i(t+1) = j \mid X_i(t) = k)$ given the counts in $\mathbf{M}$.
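As far as I can tell, the relation between consecutive rows of counts that any such estimate has to rest on is the following. This is my own restatement of the setup, assuming the probabilities $\mathbf{p}_{k,j}$ are the same for every person and every $t$; it is not a formula quoted from the paper.

$$\mathbb{E}\left[\mathbf{M}_{t+1,j} \mid \mathbf{M}_{t,0}, \ldots, \mathbf{M}_{t,t}\right] = \sum_{k=\max(j-1,\,0)}^{t} \mathbf{M}_{t,k}\,\mathbf{p}_{k,j}, \qquad \sum_{j=0}^{k+1} \mathbf{p}_{k,j} = 1, \qquad \mathbf{p}_{k,j} = 0 \text{ for } j > k+1.$$

The sum starts at $k = \max(j-1, 0)$ because state $j$ can only be reached from states $k \ge j-1$; the zeros $\mathbf{p}_{k,j} = 0$ for $j > k+1$ are what I mean by structural zeroes in my problem.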
Kalbfleisch & Lawless' paper seems to be about exactly this. (The paper is available here if you want to look at it.) However, I can't figure out how to apply their approach at all. (Doubtless because of my more general ignorance.) They give a couple of examples, so I tried to work through them, but I hit a wall pretty quickly. Some things in particular I am failing to understand:
- On p. 171 several equations are introduced with the clause "if there are no structural $0$'s, so that $r = k(k-1)$" or "if $r = k(k-1)$". However, the first example does have structural zeroes, and the answers given for it presume that they have been taken into account (if you don't take the structural zeroes into account, the dimensions of several matrices will differ from what they are claimed to be in the text). There are also structural zeroes in my particular problem, so I would like to know if there's a straightforward way to proceed in that case.
- Also on p. 171 the matrices $\mathbf{B}_t$ are defined for $t = 1, \ldots, m$, but later equations involve sums over $t = 1, \ldots, m$ of terms involving $\mathbf{B}_{t-1}$, meaning that $\mathbf{B}_0$ must be defined as well.
- Perhaps more fundamentally, I don't really understand why the last column of their matrix $\mathbf{P}$ is cut off. Don't we want to know e.g. $p_{13}$ as well as $p_{12}$?
I suspect I am missing some very basic things, but I know very little about the subject (I'm interested in it for a particular practical problem). If anyone can help me understand how to compute what I'm after, or the mechanism described in the paper, that would be immensely helpful. Thanks.
