Why do I divide Z by W in a perspective projection?

I guess this is more of a math question than an OpenGL one. Anyway, if the whole purpose of the perspective divide is to get usable x and y coordinates, why bother dividing z by w? Also, how do I get w in the first place?

Actually, the explanation has much more to do with the limitations of the depth buffer than with math.

At its simplest, "the depth buffer is a texture in which each on-screen pixel is assigned a grayscale value depending on its distance from the camera. This allows visual effects to easily alter with distance." ^Source

More accurately, a depth buffer is a texture containing the value of z/w for each fragment, where:

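z is the fragment's depth measured from the near clipping plane, rescaled so that it is 0 at the near plane and equal to w at the far plane.
w is the fragment's depth measured from the camera along the view direction. You don't compute it yourself: the projection matrix writes it into the fourth component of each transformed vertex.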
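As for where w comes from, here is a minimal sketch in plain Python (no OpenGL calls) that builds a matrix with the layout documented for gluPerspective and applies it to one eye-space point. The helper names `perspective`, `transform`, and `window_depth`, and the sample values, are made up for illustration. Note that OpenGL's own numbers differ a little from the simplified picture: clip-space z runs from -n to f, so z/w runs from -1 to 1, and the default depth range then remaps it to 0..1.

```python
import math

def perspective(fovy_deg, aspect, z_near, z_far):
    """Row-major 4x4 matrix with the layout documented for gluPerspective."""
    t = 1.0 / math.tan(math.radians(fovy_deg) / 2.0)
    return [
        [t / aspect, 0.0, 0.0, 0.0],
        [0.0, t, 0.0, 0.0],
        [0.0, 0.0, (z_far + z_near) / (z_near - z_far),
         2.0 * z_far * z_near / (z_near - z_far)],
        [0.0, 0.0, -1.0, 0.0],  # this row copies -z_eye into w
    ]

def transform(m, v):
    """Multiply a 4x4 row-major matrix by a 4-component column vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]

def window_depth(m, eye_point):
    """Perspective divide on z, then the default 0..1 depth-range mapping."""
    x, y, z, w = transform(m, eye_point)
    ndc_z = z / w              # in [-1, 1] between the clipping planes
    return 0.5 * ndc_z + 0.5   # in [0, 1], the value a depth buffer stores

# Assumed sample values: 60 degree field of view, n = 0.1, f = 100.
proj = perspective(60.0, 16.0 / 9.0, 0.1, 100.0)

# The camera looks down -z in eye space, so this point is 5 units in front.
clip = transform(proj, [1.0, 2.0, -5.0, 1.0])
print("clip-space w:", clip[3])
print("depth at near plane:", window_depth(proj, [0.0, 0.0, -0.1, 1.0]))
print("depth at far plane: ", window_depth(proj, [0.0, 0.0, -100.0, 1.0]))
```

The last matrix row is the whole trick: it makes w equal to the distance in front of the camera, so dividing x, y, and z by w happens with the same value for all three.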

In the following diagram illustrating the relationship between z, w, and z/w, n is equal to the zNear parameter passed to gluPerspective, or an equivalent function, and f is equal to the zFar parameter passed to the same function.

At a glance, this system looks unintuitive. But as a result, z/w is always a floating-point value between 0 and 1 (0/n at the near plane and f/f at the far plane), and can therefore be represented as a single channel of a texture.

A second important note: the depth buffer is nonlinear, meaning an object exactly halfway between the near and far clipping planes is nowhere near a value of 0.5 in the depth buffer. As shown above, it would correspond to a value of about 0.999. Depending on your view, this could be good or bad; you may want the depth buffer to be more detailed close-up (which it is), or to offer even detail throughout (which it doesn't).
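To put numbers on the nonlinearity, here is a small sketch in plain Python. It assumes the 0..1 depth mapping described above, written in closed form as depth = f/(f-n) * (1 - n/d) for a fragment at distance d in front of the camera. The function names and the values n = 0.1 and f = 100 are assumptions picked for illustration, not taken from any particular program.

```python
def depth_value(d, n, f):
    """Depth-buffer value (0..1) for a fragment at distance d from the camera."""
    return (f / (f - n)) * (1.0 - n / d)

def distance_for_depth(depth, n, f):
    """Inverse of depth_value: the distance that produces a given depth value."""
    return (f * n) / (f - depth * (f - n))

n, f = 0.1, 100.0  # assumed near and far clipping distances

for d in (n, 0.2, 1.0, 10.0, (n + f) / 2.0, f):
    print(f"distance {d:8.3f} -> depth {depth_value(d, n, f):.6f}")

# Distance at which the depth buffer is "half full".
half = distance_for_depth(0.5, n, f)
print(f"depth 0.5 is reached at distance {half:.4f}")
```

With these assumed values the midpoint between the planes works out to f/(f+n), roughly 0.999, while a depth of 0.5 is already reached at about twice the near distance. Pushing the near plane further out is the usual way to spread the precision more evenly.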

TL;DR: