I am reading Tu's Introduction to Manifolds. Given a smooth function $F:N \to M$ between manifolds and $p \in N$, he defines the differential $F_*:T_pN \to T_{F(p)}M$ as follows.
If $X_p \in T_pN$ is a tangent vector (i.e., a derivation on the algebra of germs of $C^{\infty}$ functions at $p$) and $f: M \to \Bbb{R}$ is a smooth function, then $F_*(X_p)$ is the tangent vector at $F(p)$ which acts on $f$ by $F_*(X_p)f = X_p(f \circ F)$.
He shows that this definition generalizes the derivative of a smooth function between Euclidean spaces, and that the operator taking $F$ to $F_*$ satisfies a functorial property (the chain rule).
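For concreteness, here is the Euclidean case as I understand it. The notation is my own assumption rather than a quotation from Tu: take $N = \Bbb{R}^n$ with coordinates $x^1, \dots, x^n$, $M = \Bbb{R}^m$ with coordinates $y^1, \dots, y^m$, and write $F = (F^1, \dots, F^m)$. Applying the definition to a coordinate vector and using the ordinary chain rule gives

$$
\begin{aligned}
F_*\!\left(\frac{\partial}{\partial x^j}\bigg|_p\right) f
&= \frac{\partial}{\partial x^j}\bigg|_p (f \circ F)
= \sum_{i=1}^{m} \frac{\partial f}{\partial y^i}\big(F(p)\big)\,\frac{\partial F^i}{\partial x^j}(p), \\
\text{so}\qquad
F_*\!\left(\frac{\partial}{\partial x^j}\bigg|_p\right)
&= \sum_{i=1}^{m} \frac{\partial F^i}{\partial x^j}(p)\,\frac{\partial}{\partial y^i}\bigg|_{F(p)}.
\end{aligned}
$$

In other words, relative to the coordinate bases, the matrix of $F_*$ is the Jacobian matrix of $F$ at $p$.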
Clearly this is the "correct" generalization of the derivative. However, looking only at the formula for the differential, $F_*(X_p) = X_p(\,\cdot\, \circ F)$, it is difficult to see immediately that it has anything to do with the derivative of $F$. In particular, the identification of tangent vectors with point-derivations does a lot to obscure the connection for me.
For lack of a better phrasing, is there some way to arrive at this formula from first principles, or to show that it satisfies some kind of desirable universal property? I just need some motivation: Tu merely presents the definition and proves its properties afterwards. Also, when did this definition originate?