Allow me to shamelessly copy an answer of mine to this question first.
The idea behind Strassen multiplication is the same as the idea behind Karatsuba multiplication.
In Karatsuba multiplication, if we have two numbers and a base $B$ (say $10^6$), written $a = x_0B + x_1$ and $b = y_0B + y_1$, and we want to calculate $ab$, then naively we might compute
$ab = x_0 y_0 B^2 + (x_0 y_1 + x_1 y_0)B + x_1 y_1 \qquad$ (requiring 4 multiplications)
But we can be wittier, as $x_0y_1 + x_1y_0 = (x_0 + x_1)(y_0 + y_1) - x_0y_0 - x_1y_1 $
So we can calculate $ab$ from $x_0y_0$, $x_1y_1$, and $(x_0 + x_1)(y_0 + y_1)$, which uses only 3 multiplications. This is just a slick idea, akin to many other slick arithmetic tricks that might be found by inspection, and it became popular in the 1960s. (There's a good story behind it - Karatsuba responded to a challenge of Kolmogorov, ultimately culminating in Kolmogorov writing a paper in Karatsuba's name.)
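A one-level sketch of this trick in Python (the function name and the single-level split are my own illustration; a real implementation would recurse on the three subproducts):

```python
def karatsuba_step(a, b, B=10**6):
    """One level of Karatsuba: compute a*b with 3 multiplications
    instead of 4, using a = x0*B + x1 and b = y0*B + y1."""
    x0, x1 = divmod(a, B)
    y0, y1 = divmod(b, B)
    p0 = x0 * y0                              # multiplication 1
    p2 = x1 * y1                              # multiplication 2
    p1 = (x0 + x1) * (y0 + y1) - p0 - p2      # multiplication 3
    # Reassemble: ab = p0*B^2 + p1*B + p2
    return p0 * B * B + p1 * B + p2
```

The extra additions and subtractions are cheap compared to multiplications, which is why trading one multiplication for a handful of additions pays off when the idea is applied recursively.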
For matrices, a direct port of Karatsuba doesn't quite work. It's easy to verify that Strassen's identities are true (just write them out), but it's very unclear how Strassen came across them. It should be mentioned that Strassen was working on many numerical-methods-style techniques to improve matrix calculations. It should also be mentioned that Strassen has an improved algorithm, better than this technique, using fast Fourier transforms as well.
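Writing the identities out is easy to do in code as well. Here is a sketch of Strassen's seven products for a $2 \times 2$ block multiplication, using the standard formulation (the function name and tuple-of-tuples representation are my own choices):

```python
def strassen_2x2(A, B):
    """Multiply 2x2 matrices (tuples of row-tuples) with Strassen's
    7 multiplications instead of the naive 8."""
    (a11, a12), (a21, a22) = A
    (b11, b12), (b21, b22) = B
    m1 = (a11 + a22) * (b11 + b22)
    m2 = (a21 + a22) * b11
    m3 = a11 * (b12 - b22)
    m4 = a22 * (b21 - b11)
    m5 = (a11 + a12) * b22
    m6 = (a21 - a11) * (b11 + b12)
    m7 = (a12 - a22) * (b21 + b22)
    # Recombine the seven products into the four entries of A*B.
    return ((m1 + m4 - m5 + m7, m3 + m5),
            (m2 + m4,           m1 - m2 + m3 + m6))
```

Because the entries never need to commute, the same seven formulas apply when $a_{ij}$ and $b_{ij}$ are themselves matrix blocks, which is what makes the recursion (and the $O(n^{\log_2 7})$ bound) work.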
One might ask: well, we can do it with 7 multiplications - can we do it with 6? Strassen multiplication has been around since 1969, so surely it's known? This was an open problem until just a few years ago, when Landsberg showed that 7 is optimal. It's nontrivial and a bit far afield for me, but his article (corrected after it was published) can be found on the arXiv.
One might also wonder about $3 \times 3$ matrices. How few multiplications are necessary? Naively, it takes 27 multiplications. But it can be done in 23. (See Julian D. Laderman, A noncommutative algorithm for multiplying 3×3 matrices using 23 multiplications, Bull. Amer. Math. Soc. 82 (1976) 126–128, MR0395320 (52 #16117).) Is that optimal? That's unknown - all that's known is that it takes at least 19. (But if you're just interested in computational savings, even 19 multiplications yield smaller savings when iterated than the $2 \times 2$ case does.)
So in short: $7$ is optimal, but it's not obvious that it's optimal. Even finding the $7$ took a very clever idea.