Here, I use the verb "MC" to mean: take an MV and interpolate the reference frame at that offset, returning the interpolated block.

The EPZS algorithm:
- Make one or more predictions of a fullpel MV (usually the median of neighboring blocks' MVs, rounded to fullpel). That's a pointer to some position in the previous frame.
- For each prediction, compare the current block to that location (using SAD/SSD/SATD). Keep the best one.
- Repeat for each position neighboring the best prediction. If any of the neighbors is better than the predicted MV, move there and recurse. Otherwise, stop.
- Once you settle on the best fullpel MV, do it again for halfpel and then qpel (i.e. define a "neighbor" to be +/- .5 or .25 pixels), except that you now have to MC to find the reference block, since it's not just copied from the previous frame.

lavc uses EPZS diamond, so named from the positions of the 4 neighbors searched. x264 uses the same algorithm except that at each step it checks 6 neighbors, in a hexagon. (There's a rough C sketch of the fullpel diamond step at the end of this post.)

To decide block sizes: In ASP, the only allowed block sizes are 16x16 and 8x8. So run EPZS on a 16x16 block, then run it on 4 8x8 blocks. The residual (SATD + some tabulated costs based on the MVs) is an estimate of the number of bits you'll need to code the macroblock this way, so compare the sums of residuals to choose which size to use. Or, if you want a little better quality at the cost of a bunch of speed, really encode the macroblock each way and count bits + distortion. (This decision is also sketched at the end.)

In H.264 I do the same, and then if 8x8 is better than 16x16 (i.e. if the block has non-uniform motion and you're encoding at a low enough QP for it to be worth coding as such) I also check 16x8 and 4x4, and then if 4x4 is better than 8x8 I also check 8x4. (Smaller blocks start with an MV predicted from the larger block, in addition to the neighbors.) This is not the only possible method: JM first tries all the 4x4 blocks, then predicts larger blocks from the smaller ones, which is faster if you're going to test all the sizes anyway.

In Snow: OBME uses something similar to EPZS, but instead of a simple SAD/SSD comparison, for each MV it MCs the whole area of effect (twice the block size), calculates the overlap with the neighbors, and then compares against the input frame. (This is obviously at least 4x slower.) Then you can't just sum the residuals of separate blocks; you have to run OBMC again for each block size once the MVs are decided. Also, since the prediction of a given pixel depends on multiple MVs, you can get some more quality by jointly optimizing the MVs: run another ME pass now that all of the neighboring blocks are known.
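
To make the fullpel step concrete, here's a rough, self-contained C sketch of a diamond search over SAD. It's a toy version of the refinement loop described above, not lavc's or x264's actual code: the names (Frame, sad_block, diamond_search_fullpel) are made up, there's a single reference frame, no predictor set beyond the starting MV, no early termination, and the caller is assumed to pass a starting MV that already points inside the frame.

#include <stdint.h>
#include <stdlib.h>

typedef struct {
    const uint8_t *data;   /* luma plane */
    int stride;
    int width, height;
} Frame;

/* Sum of absolute differences over a bs x bs block. */
static int sad_block(const uint8_t *cur, int cur_stride,
                     const uint8_t *ref, int ref_stride, int bs)
{
    int sum = 0;
    for (int y = 0; y < bs; y++)
        for (int x = 0; x < bs; x++)
            sum += abs(cur[y * cur_stride + x] - ref[y * ref_stride + x]);
    return sum;
}

/* Refine a fullpel MV for the bs x bs block at (bx, by) in cur, starting
 * from the predicted MV (*mx, *my).  At each step, check the 4 diamond
 * neighbors of the current best position; move there if one of them has a
 * lower SAD, stop when none does.  Returns the best SAD found. */
static int diamond_search_fullpel(const Frame *cur, const Frame *ref,
                                  int bx, int by, int bs, int *mx, int *my)
{
    static const int dx[4] = {  0, 0, -1, 1 };
    static const int dy[4] = { -1, 1,  0, 0 };
    const uint8_t *src = cur->data + by * cur->stride + bx;

    int best = sad_block(src, cur->stride,
                         ref->data + (by + *my) * ref->stride + (bx + *mx),
                         ref->stride, bs);
    int improved = 1;
    while (improved) {
        improved = 0;
        for (int i = 0; i < 4; i++) {
            int nx = *mx + dx[i], ny = *my + dy[i];
            /* stay inside the reference frame */
            if (bx + nx < 0 || by + ny < 0 ||
                bx + nx + bs > ref->width || by + ny + bs > ref->height)
                continue;
            int cost = sad_block(src, cur->stride,
                                 ref->data + (by + ny) * ref->stride + (bx + nx),
                                 ref->stride, bs);
            if (cost < best) {
                best = cost;
                *mx = nx;
                *my = ny;
                improved = 1;
            }
        }
    }
    return best;
}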
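
And reusing that search, here's an equally rough sketch of the ASP-style 16x16 vs. 4x 8x8 decision by summed costs. It uses plain SAD instead of SATD to keep it short, and the lambda * MV_BITS_GUESS term is just a flat placeholder for the tabulated MV costs mentioned above; a real encoder looks the MV cost up per component from a table, or does the full encode-and-count-bits comparison instead.

/* MV_BITS_GUESS is a made-up flat estimate of the bits needed per MV;
 * real encoders use a table indexed by the MV difference. */
#define MV_BITS_GUESS 4

/* Returns 16 if one 16x16 MV looks cheaper for the macroblock at (bx, by),
 * or 8 if four separate 8x8 MVs win.  Every block starts its search from
 * the same predicted MV (pred_mx, pred_my). */
static int choose_block_size(const Frame *cur, const Frame *ref,
                             int bx, int by, int lambda,
                             int pred_mx, int pred_my)
{
    int mx = pred_mx, my = pred_my;
    int cost16 = diamond_search_fullpel(cur, ref, bx, by, 16, &mx, &my)
               + lambda * MV_BITS_GUESS;

    int cost8 = 0;
    for (int i = 0; i < 4; i++) {
        int sx = bx + (i & 1) * 8, sy = by + (i >> 1) * 8;
        int smx = pred_mx, smy = pred_my;
        cost8 += diamond_search_fullpel(cur, ref, sx, sy, 8, &smx, &smy)
               + lambda * MV_BITS_GUESS;   /* one MV per 8x8 block */
    }
    return cost16 <= cost8 ? 16 : 8;
}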