The problem is that queries to TM2 do not return a single answer. Instead, the subroutine results in a branching of the computation, without any way to combine branches before the computation ends.
Consider the co-NP language UNSAT (the negation of SAT). The $NP^{NP}$ oracle algorithm is trivial -- query the SAT oracle and return the opposite.
We now try to replace the SAT oracle with an actual TM2 that guesses solutions and checks if the satisfy the given formula. Note that TM1 no longer gets a cleanly-packaged solution. Instead, it has to handle each branch of the sub-computation individually.
Let's run through what happens on input $\phi$.
If a branch of the query returns "YES", then THAT BRANCH knows that $\phi$ is satisfiable and so knows to return a final answer of "NO".
But what if a branch of the query returns "NO"? That branch doesn't know the true status of $\phi$. It could assume that the true answer is "NO" and therefore accept, but that causes the overall computation to accept -- incorrect behavior if $\phi$ was actually satisfiable.
The other option is to abort the "NO branch of the query since it doesn't have enough information to go on. But suppose that $\phi$ is actually unsatisfiable. Then every branch aborts, and the machine (incorrectly) rejects.