I find the remark about going through $L^1$ unnecessary. It would be more appropriate for the strong derivative (limit of $(y(t+h)-y(t))/h$ as $h\to 0$), which naturally takes values in the same vector space as $y$. But the weak derivative is different.
Weak derivatives are defined by an axiomatic version of integration by parts: $y_t$ is a function $g$ such that $\int_0^T g\varphi=-\int_0^T y\varphi'$ for all "test functions" $\varphi$. It's up to us to decide what the space of allowable "test functions" should be; the smaller it is, the more permissive our definition of derivative becomes. The natural choice in this setting is $\varphi\in C^{\infty}_0(0,T;V)$, because $V$ is the smallest Hilbert space we have here. Then $\int_0^T g\varphi$ makes sense whenever $g\in L^2(0,T;V^*)$, because the integral is parsed as $\int_0^T \langle g(t),\varphi(t)\rangle\,dt$, with angle brackets indicating the pairing between $V^*$ and $V$.
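Spelled out, and writing $(\cdot,\cdot)_H$ for the inner product of the pivot space $H$ with $V\subset H\subset V^*$ (the symbol $H$ is my notation, not something fixed in the question), the defining identity would read

$$\int_0^T \langle g(t),\varphi(t)\rangle\,dt=-\int_0^T \bigl(y(t),\varphi'(t)\bigr)_H\,dt\qquad\text{for all }\varphi\in C^{\infty}_0(0,T;V).$$

The right-hand side makes sense as soon as $y$ is square-integrable with values in $H$, since $\varphi'$ is smooth with compact support; the left-hand side only asks that $g(t)$ be a functional on $V$, which is why nothing smaller than $V^*$ is required of $y_t$.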
Here is another way to see why $y_t$ needs to take values in $V^*$. This setup is supposed to help us reformulate the heat equation $y_t=\Delta_x y$ (and other parabolic problems) in terms of functional analysis. According to the equation, the first-order time derivative should belong to the same function space as the Laplacian of $y$. The appropriate Gelfand triple here is $H^{1}\subset L^2\subset H^{-1}$, in which the Laplacian is an operator from $V=H^1$ to $V^*=H^{-1}$.
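To make that last claim concrete, here is a sketch under the assumption that boundary terms vanish (for instance zero Dirichlet data, in which case one takes $V=H^1_0(\Omega)$; the domain $\Omega$ and the test element $v$ are my notation). For $y,v\in V$, integration by parts suggests defining

$$\langle \Delta y, v\rangle := -\int_\Omega \nabla y\cdot\nabla v\,dx,\qquad |\langle \Delta y,v\rangle|\le \|\nabla y\|_{L^2}\,\|\nabla v\|_{L^2}\le \|y\|_{V}\,\|v\|_{V}.$$

So $\Delta y$ is a bounded linear functional on $V$, that is, an element of $V^*$, even though it need not be a function in $L^2$. The heat equation is then read as $\langle y_t(t),v\rangle=-\int_\Omega \nabla y(t)\cdot\nabla v\,dx$ for all $v\in V$ and almost every $t$, which only makes sense if $y_t(t)\in V^*$.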