I think the question is related to why the space of images is 1-million dimensional, and not 3 dimensional, and why studying these spaces might be useful.
Well, think about the 3-dimensional space of RGB colors: what is a point in that space? It's simply a vector with 3 coordinates: $< R, G, B >$. So that's just the information for one pixel, not for a whole image. Any point in that space carries only the information about how much red, green and blue one pixel has.
Now think about what you would do if you needed to store the color information for TWO pixels: you would need one axis to represent the Red of pixel 1 and one axis to represent the Red of pixel 2; one for the Green of pixel 1 and one for the Green of pixel 2; and one for the Blue of pixel 1 and one for the Blue of pixel 2. So you would have 6 axes: 3 for each pixel.
If you want to represent the color of 1 million pixels, you need 3 * 1 million axes. These axes are orthogonal in the sense that, if you want, you can change the color of just one pixel without having to adjust the color of any other pixel. So you would have a space with 3 * 1 million dimensions. Each point in that space now corresponds to an image: specifically, the coordinates along each of the 3 * 1 million axes give you the value of R, G, or B for a given pixel.
So, in effect, if you want to have a space of IMAGES, and not a space of PIXELS, you need a 3 * 1 million-dimensional space, not a 3D space.
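To make the bookkeeping concrete, here is a minimal Python sketch of the idea. The helper names `image_to_point` and `point_to_image` are made up for this illustration, and an "image" is assumed to be nothing more than a list of (R, G, B) values between 0 and 1.

```python
def image_to_point(pixels):
    """Flatten a list of (r, g, b) pixels into one long coordinate vector."""
    return [channel for pixel in pixels for channel in pixel]

def point_to_image(point):
    """Regroup a flat coordinate vector back into (r, g, b) pixels."""
    return [tuple(point[i:i + 3]) for i in range(0, len(point), 3)]

# A tiny 2-pixel "image": one reddish pixel and one bluish pixel.
two_pixels = [(0.9, 0.1, 0.1), (0.1, 0.2, 0.8)]
point = image_to_point(two_pixels)
print(len(point))  # 6 coordinates: 3 for each pixel

# Moving along a single axis changes one channel of one pixel only.
moved = list(point)
moved[4] = 1.0  # index 4 is the Green axis of pixel 2
print(point_to_image(moved))

# The same counting for a 1-million-pixel image.
n_pixels = 1_000_000
print(3 * n_pixels)  # 3000000 axes
```

Nothing here depends on the image being small: a real picture just gives a much longer list.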
Now think about what would happen if you could take, for instance, 50 pictures of human faces, plot them in this 3 million-dimensional space, and see where they lie. Of course you can't visualize this, but you might expect that these pictures have SOMETHING in common (after all, they are all pictures of human faces, not arbitrary pictures of anything, or crazy combinations of pixel colors). If you could see how these images are spread over that space (where the points corresponding to those pictures are, that is), you would see that they are usually not located just ANYWHERE. For instance, human faces have typical colors -- they tend not to be green, for example. That means that the regions of the space corresponding to images full of green pixels are probably more or less empty. That's what they mean when they talk about identifying the subspace of human faces. It's just the "surface", or the sub-regions of that 3*1 million-dimensional space, where you'd expect to find points corresponding to human faces. Typically that sub-region can be described with less information than 3*1 million coordinates, if you find a better representation for your image than one that stores the value of every R, G, B component of every pixel. That is why image compression is possible: if you find the right way of representing your information (like the values of your 3*1 million RGB components), you might do it with FEWER than 3*1 million numbers, precisely because these numbers have a pattern.
It is possible to try to identify the FORM of the sub-region of the 3-million-dimensional space where human faces tend to appear. Then, given another image that may or may not be a human face, you could try to GUESS whether it is one. How? Well, check whether the point corresponding to that new image, when plotted in that 3*1 million-dimensional space, is close to the sub-region where the points of human faces usually are. The process of identifying these sub-regions is sometimes called "manifold learning".
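Here is a rough sketch of that "is the new point close to the known ones?" check. To be clear, this is not manifold learning proper, just a nearest-neighbor distance test on toy data. The "faces" are hypothetical 2-pixel images with skin-like tones, and the threshold of 0.2 is an arbitrary choice made for this example.

```python
import math
import random

random.seed(1)  # fixed seed so the run is repeatable

def toy_face():
    """Hypothetical 'face': 2 pixels (6 coordinates), each with a skin-like tone."""
    point = []
    for _ in range(2):
        point += [random.uniform(0.7, 0.9),   # Red
                  random.uniform(0.5, 0.7),   # Green
                  random.uniform(0.4, 0.6)]   # Blue
    return point

known_faces = [toy_face() for _ in range(500)]

def distance(p, q):
    """Ordinary Euclidean distance between two points of the same dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def looks_like_face(point, examples=known_faces, threshold=0.2):
    """Guess 'face' if the point lies close to at least one known example."""
    nearest = min(distance(point, e) for e in examples)
    return nearest < threshold

skin_like = [0.8, 0.6, 0.5, 0.8, 0.6, 0.5]
green = [0.1, 0.9, 0.1, 0.1, 0.9, 0.1]
print(looks_like_face(skin_like))  # True
print(looks_like_face(green))      # False
```

With real photographs you would need far more examples and a smarter description of the sub-region, but the question being asked is the same.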
Ok, lots of information. Just think about it for a while. It's hard (actually impossible) to visualize spaces with more than 3 dimensions, but once you get the idea of what's going on, you'll often see that your intuitions about what happens in 2D or 3D carry over.
Try this exercise: imagine a black-and-white image with just 3 pixels; each pixel is then just a value between 0 and 1 (0 being completely black and 1 being completely white, and values in between being shades of gray). Now imagine the set of images where the first pixel is darker than the second one, and the second one is darker than the third one. That is, images like this:
$< 0.3, 0.8, 1.0 >$ or $< 0.12, 0.53, 0.7 >$
Now generate a bunch of those (say, 10,000 images like that) and plot them in the 3D space where they lie. Notice that there is a pattern in these images: the value of the 2nd pixel is always larger than the value of the 1st pixel; the value of the 3rd pixel is always larger than the value of the 2nd. Clearly we shouldn't expect points corresponding to images like this to occupy just ANY place in the 3D space. For example, we would most certainly not see points in the space close to $< 0.5, 0.3, 0.1 >$.
We can actually see this in this plot, where I show 10,000 images like the ones I described. Since each axis corresponds to the value of one of the 3 pixels in a given image, we have 3 axes. Each point in the plot is thus a 3-pixel image.

Notice how the points are occupying just a small part of the whole 3D space. That happens because there is a relation between the values of the pixels. Images of that type all have something in common, so they occupy similar portions of the whole 3D space.
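If you want to try the exercise yourself, here is a minimal sketch using only Python's standard library. It is one possible way to generate such images (by sorting three random values), not necessarily how the plot was produced. It also estimates how much of the cube these images can occupy: since there are 3! = 6 possible orderings of three pixel values and each is equally likely for a random image, the answer should come out near 1/6.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

def ordered_image():
    """One 3-pixel grayscale image with pixel 1 <= pixel 2 <= pixel 3."""
    return tuple(sorted(random.random() for _ in range(3)))

def is_ordered(img):
    """True if the pixels get brighter from left to right."""
    return img[0] <= img[1] <= img[2]

# The 10,000 images of the exercise; each one is a point in the unit cube.
images = [ordered_image() for _ in range(10_000)]

# For comparison: arbitrary 3-pixel images, spread over the whole cube.
arbitrary = [tuple(random.random() for _ in range(3)) for _ in range(100_000)]
fraction = sum(is_ordered(img) for img in arbitrary) / len(arbitrary)
print(f"fraction of the cube occupied: {fraction:.3f}")  # close to 1/6

print(is_ordered((0.5, 0.3, 0.1)))  # False: no image of our type lies there
```

Feeding `images` to any 3D scatter-plot tool should show the points filling one wedge of the cube and leaving the rest empty.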
In the same way, if you could plot 1-million-pixel images (which would lie in a 3*1-million-dimensional space, as mentioned before), and all of those images were images of human faces, you would see some pattern like the one I showed above. Specifically, the points corresponding to images of human faces would most likely NOT occupy the whole 3-million-dimensional space. We could actually try to estimate the "shape" of the sub-region where human faces are by using techniques called "manifold learning".
Now, notice that you could use the same ideas as above to analyze any other kind of data. Results of a statistical survey? Imagine that you have 50 questions, each one answered with a value from 0 to 100. You put those questions to one person and get 50 numbers back. You put them to another person and get another 50 numbers. How to "visualize" them? Plot them in a 50-dimensional space where each axis corresponds to the value of one given answer. A point in that space then corresponds to 50 numbers (specifically, the answers given in one specific survey). If you plot, say, 1000 of these surveys in this space, you would get 1000 50-dimensional points. Maybe there is some pattern that can be found; maybe there isn't. If there is, it might be the case that these 1000 50-dimensional points lie in a subspace (or sub-region) of the 50-D space. That is what Terence Tao was saying when he said that it is useful to study these subspaces, or sub-regions, and when he said that the "subsets of this space correspond to various classes of images."
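As a small illustration of "maybe there is a pattern, maybe there isn't", here is a sketch with made-up survey data. Everything in it is an assumption for the sake of the example: the "patterned" respondents answer all 50 questions according to one hidden `mood` value plus a little noise, while the "random" respondents answer each question independently. Checking how strongly two questions are correlated is only one simple way to detect such a pattern.

```python
import random

random.seed(2)  # fixed seed so the run is repeatable
N_QUESTIONS = 50
N_SURVEYS = 1000

def patterned_survey():
    """Hypothetical respondent whose 50 answers all follow one hidden 'mood'."""
    mood = random.uniform(10, 90)
    return [min(100.0, max(0.0, mood + random.gauss(0, 5)))
            for _ in range(N_QUESTIONS)]

def random_survey():
    """Hypothetical respondent who answers every question independently."""
    return [random.uniform(0, 100) for _ in range(N_QUESTIONS)]

def correlation(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def question_correlation(surveys, i, j):
    """Correlation between the answers to questions i and j across surveys."""
    return correlation([s[i] for s in surveys], [s[j] for s in surveys])

patterned = [patterned_survey() for _ in range(N_SURVEYS)]
unpatterned = [random_survey() for _ in range(N_SURVEYS)]

# Each survey is one point in a 50-dimensional space.
print(len(patterned), len(patterned[0]))

# Strong correlation: the patterned points hug a thin sub-region (roughly a line).
print(round(question_correlation(patterned, 0, 1), 2))
# Correlation near zero: the random points spread over the whole 50-D cube.
print(round(question_correlation(unpatterned, 0, 1), 2))
```

In the patterned case, one number (the hidden mood) already tells you most of what the 50 answers say, which is the same kind of saving that makes image compression possible.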
Hope that helps!
Bruno