improvement ideas:
more RD
2pass IDR: 1st pass uses no IDR frames, 2nd decides which I-frames could be IDRified without too much loss. (also allows long-term refs)
weighted prediction (P). (implemented, but lacking heuristics)
adjust reference frame cost in B-frames? I don't take it into account when comparing different mb types.
adjust fast skip for custom quant matrices. (currently always uses flat matrix)
psy adaptive quant: luminance/frequency/motion masking, edge enhancement, segmentation, region of interest, ...
psy motion search: smooth motion fields, consider future mvp, temporal direct, try to find real mv instead of min satd.
psy rd: use vssim, dvq or another HVS quality metric for decisions instead of psnr.  at a minimum, use for adjusting frame quantizer and in rate control.  at a maximum, use for motion estimation directly (potentially *very slow*).
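
sketch for the psy adaptive quant item above: one simple masking heuristic is to measure per-macroblock activity (sum of squared deviations of the luma samples) and raise the quantizer where activity is high, since texture hides quantization noise better than flat areas do. everything named here (mb_activity_16x16, aq_qp_offset, AQ_PIVOT, the strength parameter) is hypothetical and made up for illustration. this is not the encoder's actual code, and the pivot value is an untuned placeholder.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed "average" log2 activity; blocks above it get a positive QP offset.
 * Placeholder value, would need tuning on real content. */
#define AQ_PIVOT 14

/* Integer floor(log2(v)) for v >= 1; returns 0 for v == 0 or v == 1. */
static int ilog2_u32(uint32_t v)
{
    int n = 0;
    while (v >>= 1)
        n++;
    return n;
}

/* Sum of squared deviations from the mean over a 16x16 luma block
 * (i.e. 256 * variance). Zero for a perfectly flat block. */
static uint32_t mb_activity_16x16(const uint8_t *pix, int stride)
{
    uint32_t sum = 0, sqr = 0;
    assert(stride >= 16);
    for (int y = 0; y < 16; y++)
        for (int x = 0; x < 16; x++) {
            uint32_t p = pix[y * stride + x];
            sum += p;
            sqr += p * p;
        }
    /* sqr - sum^2/256, with the square done in 64 bits to be safe. */
    return sqr - (uint32_t)(((uint64_t)sum * sum) >> 8);
}

/* Map activity to a QP offset: negative for flat blocks (spend more bits),
 * positive for busy blocks (spend fewer). strength scales the effect. */
static float aq_qp_offset(uint32_t activity, float strength)
{
    int energy = ilog2_u32(activity + 1); /* +1 so a flat block maps to 0 */
    return strength * (float)(energy - AQ_PIVOT);
}
```

the offset would be added to the frame quantizer per macroblock and clipped to the legal QP range. luminance, frequency and motion masking would each need their own measure in place of, or combined with, the activity term.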

optimization ideas:
mb partitioning prediction
reuse me info from previously-tried mb partitioning modes
simultaneous motion search for multiple partition sizes, since SA(T)D or even RD scores can be reused.
prediction-based subpel search (http://research.microsoft.com/asia/dload_files/group/mcom/2005/PredictionDF.pdf)
remember when subpel search terminates prematurely, if using a subpel_refine mode that runs both before and after deciding block type.
free 2pass stats when done initializing (saves ~4MB RAM per hour of video). (but no big deal, since that memory can be swapped out.)
not all frames need all the video planes. queued frames (fenc and delayed B-frames) don't need the hpel interpolated planes, while old refs don't need the lowres planes.
center-biased frame selection (when using ref>1)
profile memory access, more prefetches
even faster first pass: skip encoding and just do the lowres? 1% sample and then crf?
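
sketch for the simultaneous motion search item above: SAD is additive over non-overlapping sub-blocks, so for one candidate mv the four 8x8 SADs of a macroblock can be computed once and summed to get the 16x8, 8x16 and 16x16 scores for free. the struct and function names below (partition_sads_t, sad_8x8, sad_all_partitions) are hypothetical, and a plain C loop stands in for whatever optimized SAD routine would really be used.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* SAD scores for every partition of one macroblock at one candidate mv. */
typedef struct {
    int sad8x8[4];  /* raster order: top-left, top-right, bottom-left, bottom-right */
    int sad16x8[2]; /* top, bottom */
    int sad8x16[2]; /* left, right */
    int sad16x16;
} partition_sads_t;

/* Plain sum of absolute differences over one 8x8 block. */
static int sad_8x8(const uint8_t *a, int a_stride,
                   const uint8_t *b, int b_stride)
{
    int sad = 0;
    for (int y = 0; y < 8; y++)
        for (int x = 0; x < 8; x++)
            sad += abs(a[y * a_stride + x] - b[y * b_stride + x]);
    return sad;
}

/* Compute the four 8x8 SADs once, then derive the larger partitions by
 * addition instead of re-reading the pixels. ref must already point at
 * the candidate position (ref block = reference frame + mv). */
static void sad_all_partitions(const uint8_t *enc, int enc_stride,
                               const uint8_t *ref, int ref_stride,
                               partition_sads_t *out)
{
    assert(enc_stride >= 16 && ref_stride >= 16);
    for (int i = 0; i < 4; i++) {
        int x = (i & 1) * 8;
        int y = (i >> 1) * 8;
        out->sad8x8[i] = sad_8x8(enc + y * enc_stride + x, enc_stride,
                                 ref + y * ref_stride + x, ref_stride);
    }
    out->sad16x8[0] = out->sad8x8[0] + out->sad8x8[1];
    out->sad16x8[1] = out->sad8x8[2] + out->sad8x8[3];
    out->sad8x16[0] = out->sad8x8[0] + out->sad8x8[2];
    out->sad8x16[1] = out->sad8x8[1] + out->sad8x8[3];
    out->sad16x16   = out->sad16x8[0] + out->sad16x8[1];
}
```

a search loop would call this once per candidate mv and keep a separate best mv per partition. the sharing only holds while all partitions are evaluated at the same candidate; the mv cost term differs per partition (different predictors), so it still has to be added separately to each score.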