Introduction
When I was a kid, I thought computer graphics was the coolest thing ever. When I tried to learn about graphics, I realized it was harder than I thought to create those super slick programs I'd seen growing up. I tried to hack my way through by reading things like the OpenGL pipeline specs, blogs, websites, etc. on how graphics worked, did numerous tutorials, and I got nowhere. Tutorials like NeHe's helped to see how to set things up, but I would misplace one glXXX() call, and my program would either not work or function exactly as before without my new additions. I didn't know enough about the basic theory to debug the program properly, so I did what any teenager does when they're frustrated because they aren't instantly good at something...I gave up. However, I got the opportunity a few years later to take some computer graphics classes at the university (from one of Ivan Sutherland's doctoral students, no less) and I finally learned how things were supposed to work. If I had known this before, I would have had a lot more success earlier on. So, in the interest in helping others in a similar plight as mine, I'll try to share what I learned.The Idea Behind Graphics
Overview
Let's start by thinking about the real world. In the real 3D world, light gets emitted from lots of different sources, bounces off a lot of objects, and some of those photons enter your eye via the lens and stimulates your retina. In a real sense, the 3D world is projected on to a 2D surface. Sure, your brain takes visual cues from your environment and composites your stereoscopic vision to perceive the whole 3D space, but it all comes from 2D information. This 2D image on your retina is constantly being changed just by things moving in the scene, you moving in relation to your scene, lighting changing, and so on. Our visual system processes these images at a pretty fast rate and the brain constructs a 3D model.Horse movie image sequence courtesy of the US Library of Congress.
If we could take images and show them at a similar or higher rate, we could artificially generate a scene that would seem like a real space. Movies basically work on this same principle. They flash images from a 3D scene fast enough that everything looks continuous, like in the horse example above. If we could draw and redraw a scene on the computer that changed depending on motion through the scene, it would seem like a 3D world. Graphics works exactly the same way: it takes a 3D virtual world and converts the whole thing into an accurate 2D representation at a fast enough rate to make the brain think it's a 3D scene.Constraints
The human vision threshold to process a series of images as continuous is about 16 Hz. For computer graphics, that means we have at most 62.5 milliseconds to do the following:- Determine where the eye is looking in a virtual scene.
- Figure out how the scene would look from this angle.
- Compute the colors of the pixels on the display to draw this scene.
- Fill the frame buffer with those colors.
- Send the buffer to the display.
- Display the image.
Basic Graphics Theory
All the World's a Stage
Painting by the infamous Bob Ross courtesy of deshow.net.
Let's begin with an example. Let's say you're in a valley with mountains around you and a meadow in front of a river, similar to the Bob Ross painting above. You want to represent this 3D scene graphically. How do you do it? Well, we can try to paint an image that captures all the elements of the scene. That means we have to pick an angle we want to view the scene with, paint only the things we can see, and ignore the rest. We then have to determine what parts of which objects are behind others. We can see the meadow but it obscures part of the river. The mountains are way in the distance, but they obscure everything behind them, so we can ignore those objects behind the mountain. Since the real physical size of the scene is much bigger than our canvas, we have to figure out how to scale the things we see to the canvas. Then we can paint the objects, taking the lighting and shadows into account, the haze of the mountains in the distance, etc. This is a good analogue to how computer graphics processes a scene. The main steps are:- Determine what the objects in the world look like.
- Determine where the objects are in the world.
- Determine the position of the camera and a portion of the scene to render.
- Determine the relative position of the objects with respect to the camera.
- Draw the objects in the scene.
- Scale the scene to the viewport of the image.
Object Coordinates - Breaking up objects
How do we draw objects on the screen quickly? Computers are really great at doing relatively simple commands a lot of times in succession really fast. So, to take advantage of this, if we were able to represent the whole world with simple shapes, we could optimize graphics algorithms to process a lot of simple shapes really fast. This way, we don't have to make the computer recognize what a mountain or a meadow is in order to know how to draw it. We'll have to create some algorithms to break our shapes down to simple polygons. This is called tessellation. Although we can use squares, we'll probably use triangles. There are lots of advantages to them, such as that all triangle points are co-planar and the fact that you can approximate just about anything with triangles. The only problem we have is that round objects will look polygonal. However, if we make the triangles small enough, like 1 pixel in size, we won't notice them. There are lots of methods on the "best way" to do this and it might depend on the shape you're tessellating. Let's say we have a sphere that we want to tessellate. We can define the local origin of the sphere to be the center. If we do that, we can use an equation to pick points on the surface and then connect those points with polygons that we can draw. A common surface parameterization for a sphere is \(S(u,v) = [r\sin{u}\cos{v}, r\sin{u}\sin{v},r\cos{v}]\), where u and v are just variables with a domain of \(u\in[0,\pi],v\in[0,2\pi]\) and r is the radius of the sphere. As you can see in the above picture, the points on the surface are drawn with rectangles. We could have just as easily connected them with triangles. The points on the surface are in what we can call object coordinates. They are defined with respect to a local origin, in this case, the center of the sphere. If we want to place them in a scene, we can define a vector from the origin of the scene to the point we want to place the sphere's origin, and then add that vector to every point on the sphere's surface. This will put the sphere in world coordinates.World Coordinates - Putting our objects in the world
We really start our graphics journey here. We define an origin somewhere and every point in the scene is defined by a vector from the origin to that point. Although it's a 3D scene, we'll define each point as a 4-dimensional point \( [x,y,z,w] \), which will map to a 3D point at coordinates \([\frac{x}{w},\frac{y}{w},\frac{z}{w}]\). This kind of mapping is called homogeneous coordinates. There are advantages to using homogeneous coordinates, but I won't discuss them here. Just know we want to use them. A problem presents itself if we want to move around in our scene. If we want to move our view, we can either move the camera to another location, or just move the world around the camera. In the computer, it's actually easier to move the world around, so we do that and let the camera be fixed at the origin. The modelview matrix is a 4x4 matrix that we can use to move every point in the world around and keep our camera fixed at its location. This matrix is basically a concatenation of all the rotations, translations and scaling that we want to do to the scene. We multiply our points in world coordinates by the modelview matrix to move us into what we call viewing coordinates: \[ \left [ \begin{matrix} x \\ y \\ z \\ w \\ \end{matrix} \right ]_{view} = [MV] \left [ \begin{matrix} x \\ y \\ z \\ w \\ \end{matrix} \right ]_{world} \]Viewing Coordinates - Pick what we can see
After we've rotated, translated, and scaled the world, we can select just a portion of the world to consider. This we do by defining a viewing frustrum, or a truncated pyramid. This frustrum is formed by defining 6 clipping planes in viewing coordinates. The idea is that everything outside this frustrum will be clipped, or discarded, when drawing the final image. This frustrum is defined in a 4x4 matrix. The OpenGL glFrustrum() function defined this matrix as follows: \[ P = \left [ \begin{matrix} \frac{2*n}{r-l} & 0 & \frac{r+l}{r-l} & 0 \\ 0 & \frac{2*n}{t-b} & \frac{t+b}{t-b} & 0 \\ 0 & 0 & -\frac{f+n}{f-n} & -\frac{2fn}{f-n} \\ 0 & 0 & -1 & 0 \\ \end{matrix} \right ] \]Picture courtesy of Silicon Graphics, Inc.
We can adjust this matrix for perspective or orthographic viewing. Perspective has a vanishing point, but orthographic views don't. Perspective views are what you usually see in paintings, orthographic views are seen on technical drawings. Because this matrix controls how the objects are projected onto the screen, this is called the projection matrix. Here, t,b,l,r,n,f are the coordinates of the top, bottom, left, right, near, and far clipping planes. Multiplying by the projection matrix moves the point from viewing coordinates to what we call clip coordinates: \[ \left [ \begin{matrix} x \\ y \\ z \\ w \\ \end{matrix} \right ]_{ndc} =[MV] \left [ \begin{matrix} x \\ y \\ z \\ w \\ \end{matrix} \right ]_{world} \]