athlon64, 64bit:
 30.15 cycles in invsqrt_c
 34.02 cycles in invsqrt_q3
 34.02 cycles in invsqrt_union
  3.02 cycles in invsqrt_sse
  3.02 cycles in invsqrt_3dnow

athlon64, 32bit, -march=athlon-xp:
 30.15 cycles in invsqrt_c
 33.53 cycles in invsqrt_q3
 34.04 cycles in invsqrt_union
  3.02 cycles in invsqrt_sse
  3.02 cycles in invsqrt_3dnow

athlon64, 32bit, -march=i386:
 50.88 cycles in invsqrt_c
 33.35 cycles in invsqrt_q3
 30.02 cycles in invsqrt_union

pentium4, 32bit, -march=pentium4:
 46.21 cycles in invsqrt_c
 70.03 cycles in invsqrt_q3
 84.90 cycles in invsqrt_union
  4.06 cycles in invsqrt_sse

pentium4, 32bit, -march=i386:
 98.15 cycles in invsqrt_c
 67.60 cycles in invsqrt_q3
114.05 cycles in invsqrt_union

All of these were compiled with gcc-4.1.1, and the compiler version does matter: gcc-3.4.3 made them much slower.
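
For readers who don't have the benchmarked source at hand, here is a minimal sketch of what the two bit-trick variants might look like. It is an assumption based only on the names in the timings above: invsqrt_q3 is taken to be the well-known Quake III style approximation and invsqrt_union the same computation with the float/integer reinterpretation done through a union. The function bodies, the magic constant 0x5f3759df and the single Newton-Raphson step are the commonly published form, not necessarily the code that was measured. The q3 sketch uses memcpy where the commonly published version uses a pointer cast, to stay clear of strict-aliasing problems.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical reconstruction of a Quake III style 1/sqrt(x).
 * Uses memcpy for the float <-> integer reinterpretation
 * (the commonly published version uses a pointer cast instead). */
static float invsqrt_q3(float x)
{
    float half = 0.5f * x;
    uint32_t i;

    memcpy(&i, &x, sizeof i);        /* read the float's bit pattern */
    i = 0x5f3759df - (i >> 1);       /* initial guess from the bit trick */
    memcpy(&x, &i, sizeof x);        /* reinterpret the bits as a float */

    return x * (1.5f - half * x * x); /* one Newton-Raphson refinement */
}

/* Same computation, with the reinterpretation done through a union
 * (assumed to be what the "union" variant refers to). */
static float invsqrt_union(float x)
{
    union { float f; uint32_t i; } u;
    float half = 0.5f * x;

    u.f = x;
    u.i = 0x5f3759df - (u.i >> 1);

    return u.f * (1.5f - half * u.f * u.f);
}
```

Both functions return an approximation rather than an exact result, so any comparison against the plain C version or the hardware reciprocal-square-root instructions should allow for a small relative error.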
