The probabilistic definition of independence is related to the idea of causal influence (in a sense they are opposite concepts). If knowledge of whether one event occurred changes the probability of another, there is a dependence between them. Independence, on the other hand, is a certain balance point on the spectrum between exclusive and "inclusive" events.
Probability also talks about an event space, the set of all possible outcomes of an "experiment". For example, rolling one die, or rolling two dice.
Your first confusion is whether, if $A$ is rolling a $1$ or $2$ and $B$ is rolling a $5$ or $6$, these refer to the same event space (the same roll of the die) or to different event spaces (different rolls of the die). Given a fair die, $A$ and $B$ will be independent if they come from different event spaces. But if they both refer to the same roll of the die, then of course they are mutually exclusive. In that case, there is a strong dependence at work:
$$ \matrix{ P(A)=\frac13 &\qquad& P(A|B)=0 &\qquad& P(A|\overline{B})=\frac12 \\\\ P(B)=\frac13 &\qquad& P(B|A)=0 &\qquad& P(B|\overline{A})=\frac12 } $$

One source of confusion is that when we say $A$ and $B$ come from separate events, we probably mean they live in different event spaces, for example the first and second roll of a die. But if we say they are distinct events, it is unclear whether we mean different events in the same space (the same roll of the die) or events from different spaces. Unfortunately the problem of how to clearly describe identity and difference of object types and instances cannot be avoided, and one must be careful to clarify this as both a reader and a writer.
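If you want to check these numbers mechanically, here is a minimal Python sketch (purely illustrative, nothing beyond the standard library) that enumerates the six outcomes of a single roll and computes the same probabilities:

```python
from fractions import Fraction

outcomes = set(range(1, 7))   # the six faces of one fair die
A = {1, 2}                    # event A: the roll is a 1 or 2
B = {5, 6}                    # event B: the roll is a 5 or 6
not_B = outcomes - B

def prob(event):
    """Probability of an event under a uniform distribution on the faces."""
    return Fraction(len(event), len(outcomes))

def cond(event, given):
    """Conditional probability P(event | given)."""
    return prob(event & given) / prob(given)

print(prob(A), cond(A, B), cond(A, not_B))   # 1/3  0  1/2
```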
Given any events $A$ and $B$, we can draw a Venn diagram representing them, perhaps with the areas representing their probabilities $P(A)$ and $P(B)$. The area of intersection is then the probability $P(AB)$ of both events being true. If the events are mutually exclusive, then the areas have no intersection and so $P(AB)=0$ (each event precludes the other).

At the other extreme, one event may include the other, in which case one area is inside the other. For example if $A$ is contained in $B$, i.e. if $A \implies B$, then $P(AB)=P(A)$, and the conditional probability $P(B|A)=\frac{P(AB)}{P(A)}$ of $B$ given $A$ is thus $1$.
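For a concrete (purely illustrative) instance: on one roll of a fair die, let $A$ be rolling a $1$ and $B$ be rolling an odd number. Then $A \implies B$, so $P(AB)=P(A)=\frac16$ and $P(B|A)=\frac{1/6}{1/6}=1$.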

So a general law is that
$$0\le P(AB) \le \min\left( P(A),~ P(B) \right)$$
$$0\le P(A|B)=\frac{P(AB)}{P(B)} \le \min\left( \frac{P(A)}{P(B)},~1 \right)$$
$$0\le P(B|A)=\frac{P(AB)}{P(A)} \le \min\left(1,~\frac{P(B)}{P(A)} \right).$$
When the sandwiched quantity equals the minimum value on the right, we have inclusion/implication; when it equals zero, we have mutual exclusion.
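Both extremes can be checked against the die examples above (illustrative numbers only): with $A=\{1,2\}$ and $B=\{5,6\}$ on the same roll, $P(AB)=0$, the exclusive end; with $A$ a roll of $1$ and $B$ an odd roll, $P(AB)=\frac16=\min\left(\frac16,~\frac12\right)$, the inclusive end, and correspondingly $P(B|A)=1$.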

Between these two extremes of exclusive and "inclusive" events, there is a balance point. That balance point is when $P(AB)=P(A)\,P(B)$, and this is called independence. However, it is not so easy to tell from the diagrams alone whether two events merely intersect, or whether they are in fact independent. This is where we need the numerical formulation of independence. As an exercise, you should convince yourself from the formula for conditional probability above that
$P(A|B)=P(A) \iff P(AB)=P(A)\,P(B) \iff P(B|A)=P(B).$
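(In case you want to check your answer, one way to see the first equivalence: assuming $P(B)>0$, multiply both sides of $P(A|B)=\frac{P(AB)}{P(B)}=P(A)$ by $P(B)$ to get $P(AB)=P(A)\,P(B)$. The second equivalence is the same argument with the roles of $A$ and $B$ swapped, assuming $P(A)>0$.)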
Independence says that you can recover the joint probability from the marginal probabilities. If you take a conventional Venn diagram for two events $A$ and $B$ and draw them inside a unit square as rectangles, and if you can do this with $A$ along one axis and $B$ along the other and the areas of the rectangles and their intersection still represent the probabilities, then $A$ and $B$ are independent.
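To make that concrete with the numbers used in the next example: a rectangle of width $\frac23$ spanning the full height of the unit square (for $A$) and a rectangle of height $\frac13$ spanning the full width (for $B$) overlap in a rectangle of area $\frac23\cdot\frac13=\frac29$, which is exactly the joint probability $P(AB)$ required for independence.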

This is not a fluke of geometry. Independence has a geometrical interpretation because of its conceptual connection to causality, which in turn connects to Cartesian coordinates.
It's also good to have a tabular example. Let's say here that we roll two dice, a red die and a blue die. Let $A$ be the event that the blue die is in $\{1,2,3,4\}$ (or any particular four possibilities) and $B$ be the event that the red die is in $\{5,6\}$ (or any two values). These events should be independent. The table below is roughly the mirror image of the picture above (with a vertical flip).
$$\begin{array}{c|cc|c} & A & \overline{A} & \text{Total} \\\\ \hline B & \frac29 & \frac19 & \frac13 \\\\ \overline{B} & \frac49 & \frac29 & \frac23 \\\\ \hline \text{Total} & \frac23 & \frac13 & 1 \end{array}$$
We can also re-write the above probabilities as counts, reduced to smallest whole numbers (for the actual $36$-outcome event space, multiply every entry by $4$):
$$\begin{array}{c|cc|c} & A & \overline{A} & \text{Total} \\\\ \hline B & 2 & 1 & 3 \\\\ \overline{B} & 4 & 2 & 6 \\\\ \hline \text{Total} & 6 & 3 & 9 \end{array}$$
Try writing out the next few numeric examples you encounter this way and the concept will become second nature.
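If you like verifying such tables by brute force, here is a small Python sketch along the same lines (the names and structure are just my choice for illustration), enumerating the $36$ equally likely outcomes:

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely (blue, red) outcomes of rolling the two dice.
space = set(product(range(1, 7), repeat=2))

A = {(b, r) for (b, r) in space if b in {1, 2, 3, 4}}   # blue die shows 1-4
B = {(b, r) for (b, r) in space if r in {5, 6}}         # red die shows 5 or 6
not_A, not_B = space - A, space - B

def prob(event):
    return Fraction(len(event), len(space))

# The four cells of the probability table above.
print(prob(A & B), prob(not_A & B))          # 2/9  1/9
print(prob(A & not_B), prob(not_A & not_B))  # 4/9  2/9

# Independence: the joint probability equals the product of the marginals.
print(prob(A & B) == prob(A) * prob(B))      # True
```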
Independence means that the marginals factor the joint distribution.
If you are familiar with the concept of Cartesian products of sets...
If we start with a priori unrelated events $A\subseteq X$ and $B\subseteq Y$, where $p_X,~p_Y$ are the probability distributions on $X$ and $Y$, then $p_{X\times Y}(A\times B)=p_X(A)\cdot p_Y(B)$ defines a probability on the product event space $X\times Y$, under which every event $A$ of $X$ is independent of every event $B$ of $Y$. Independence means that the product space decomposes under the process of marginalization, and that the decomposition is a commutative diagram (i.e. the reverse process of marginalization, reintroducing each variable, recovers the correct/original joint distribution).
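Here is a rough sketch of that statement in code (the particular distributions are arbitrary, purely for illustration): form the product distribution, then marginalize it and recover the originals.

```python
from fractions import Fraction

# Two a priori unrelated experiments: a biased coin (X) and a fair die (Y).
# (The specific numbers are just for illustration.)
p_X = {"heads": Fraction(2, 3), "tails": Fraction(1, 3)}
p_Y = {y: Fraction(1, 6) for y in range(1, 7)}

# Product distribution on X x Y: each pair (x, y) gets p_X(x) * p_Y(y),
# which is exactly the statement that every event of X is independent
# of every event of Y.
p_XY = {(x, y): p_X[x] * p_Y[y] for x in p_X for y in p_Y}

# Marginalizing the product distribution recovers the originals,
# i.e. the decomposition "commutes".
marg_X = {x: sum(p_XY[(x, y)] for y in p_Y) for x in p_X}
marg_Y = {y: sum(p_XY[(x, y)] for x in p_X) for y in p_Y}

print(marg_X == p_X and marg_Y == p_Y)   # True
```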
Independence, and conditional independence, are also interesting from the perspectives of Bayesian networks (a kind of graphical model) and, via entropy, information theory. When a Bayesian network contains many repeated variables, there is a compact way of diagramming them called plate notation. In decision theory, there are also influence diagrams.