Let $\rho_0$ and $\rho_1$ be two probability density functions on $\mathbb{R}^d$. The Wasserstein metric is defined as
$W_p(\rho_0,\rho_1)^p = \inf E(|X-Y|^p)$ where the infimum is taken over all joint distributions of random variables $X$ and $Y$ whose marginals are $\rho_0$ and $\rho_1$ respectively.
I read a paper in which $W_p(\rho_0,\rho_1)$ was defined differently:
We say that a map $M: \mathbb{R}^d \to \mathbb{R}^d$ realizes a transfer of $\rho_0$ to $\rho_1$ if, for all bounded subsets $A$ of $\mathbb{R}^d$, $\int_{x \in A} \rho_1(x)\,dx = \int_{M(x) \in A}\rho_0(x)\,dx$. In particular, if $M$ is a smooth one-to-one map, the above just means $\det(\nabla M(x))\,\rho_1(M(x)) = \rho_0(x)$. Using this, they defined the Wasserstein distance as $W_p(\rho_0,\rho_1)^p = \inf_M \int|M(x)-x|^p\rho_0(x)\,dx$, where the infimum is taken over all maps $M$ which transport $\rho_0$ to $\rho_1$.
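To make the map-based definition concrete, here is a minimal numerical sketch in Python for $d=1$ and $p=2$. Everything specific in it is my own assumption and not taken from the paper: $\rho_0$ is the standard normal density, $\rho_1$ is the normal density with mean $2$ and standard deviation $3$, and $M(x) = 2 + 3x$ is the increasing affine map between them. It checks the change-of-variables identity $M'(x)\,\rho_1(M(x)) = \rho_0(x)$ on a grid, and compares $\int |M(x)-x|^2\rho_0(x)\,dx$ with the known closed form $(m_0-m_1)^2 + (\sigma_0-\sigma_1)^2$ for $W_2^2$ between one-dimensional Gaussians.

```python
import math

# Hypothetical parameters of rho_1 (illustration only, not from the paper).
MEAN, STD = 2.0, 3.0


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and std."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def rho0(x):
    """Source density: standard normal."""
    return gaussian_pdf(x, 0.0, 1.0)


def rho1(x):
    """Target density: normal with mean MEAN and std STD."""
    return gaussian_pdf(x, MEAN, STD)


def transport_map(x):
    """Increasing affine map sending N(0, 1) to N(MEAN, STD^2)."""
    return MEAN + STD * x


def transport_map_derivative(x):
    """Derivative of the affine map (the 1D Jacobian determinant)."""
    return STD


def max_jacobian_residual(points):
    """Largest violation of M'(x) * rho1(M(x)) = rho0(x) over the points."""
    return max(
        abs(transport_map_derivative(x) * rho1(transport_map(x)) - rho0(x))
        for x in points
    )


def monge_cost(p=2, lo=-12.0, hi=12.0, n=20000):
    """Midpoint-rule approximation of the integral of |M(x) - x|^p * rho0(x)."""
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * h
        total += abs(transport_map(x) - x) ** p * rho0(x)
    return total * h


grid = [-3.0 + 0.1 * k for k in range(61)]
residual = max_jacobian_residual(grid)
cost = monge_cost()
closed_form = (0.0 - MEAN) ** 2 + (1.0 - STD) ** 2

print("max Jacobian residual:", residual)
print("transport cost of M:  ", cost)
print("closed-form W_2^2:    ", closed_form)
```

This only illustrates the two formulas agreeing in one easy case where the optimal map is known to be the monotone one; it is not an argument for equivalence in general.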
My question is: why are the two definitions above equivalent?