The finite field $GF(2^8)$ is better thought of a collection of $7$th degree polynomials modulo $2$ and some (irreducible) $8$th degree polynomial $P$. Let's see what all of this means.
First, each member of $GF(2^8)$ is of the form $\sum_{i=0}^7 a_i x^i$, where the $a_i$ are coefficients "modulo $2$", i.e. bits. When you add two of the $a_i$s, you calculate the answer modulo $2$, so it's the same as XORing.
Adding two polynomials is easy: $\sum_{i=0}^7 a_i x^i + \sum_{i=0}^7 b_i x^i = \sum_{i=0}^7 (a_i+b_i) x^i.$ So addition corresponds to XOR.
Multiplication is more involved. For AES the polynomial in question is $P(x) = x^8 + x^4 + x^3 + x + 1.$ Written in bits, it is $(100011011)_2 = (283)_{10}$. The point is that $x^8 = x^4 + x^3 + x + 1$ (since everything's mod $2$), so in order to multiply a polynomial by $x$, you do the following:
- Remember the MSB (that's the power of $x^7$).
- Shift the number left once (replacing $x^i$ with $x^{i+1}$).
- If the old MSB was $1$, XOR $(11011)_2$ (since $x^8 = x^4 + x^3 + x + 1$).
Using this primitive, you can multiply two polynomials $A = \sum_{i=0}^7 a_i x^i$ and $B = \sum_{i=0}^7 b_i x^i$ as follows:
- Initialize the result $C = 0$.
- Add to $C$ the value of $a_0 B$, that is, if the LSB of $A$ is $1$, XOR $B$ to $C$.
- Add to $C$ the value of $a_1 x B$, that is, if the second least bit of $A$ is $1$, XOR $xB$ to $C$; to calculate $xB$, use the method above.
- And so on, until you get to $a_7$.
- Return $C$.
In practice, you implement it as a loop:
- Initialize $C = 0$.
- If $LSB(A)=1$, $C = C + B$.
- Set $B = xB$, $C=C/2$ (i.e. shift $C$ right once).
- Repeat the previous two steps $7$ more times.
In a real implementation, this multiplication table is stored in some condensed form. At the very least, you store the table of multiplication by $x$. The other extreme is storing all $2^{16}$ possible products. In between, you can store the product of any $A$ by any $2$-bit $B$ (size $2^{10}$ bytes), any $3$-bit $B$ (size $2^{11}$ bytes) or any $4$-bit $B$ (size $2^{12}$ bytes). It all depends on how much memory you can spare, and on the trade-off between memory access (i.e. cache sizes) and ALU performance.