It's nice to be able to compare one search engine to another in some unbiased way.  You point a couple of search engines at the same documents and run some searches.  The test tool then produces a numerical grade.  These grades can either compare each engine to what humans have said are the correct answers, or compare one engine directly to another and at least determine which one did better.  TREC has been one of the groups trying to perfect this type of testing over the years.
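As a rough illustration of grading an engine against human judgments, here is a minimal sketch that computes precision and recall over the top k results for a single query.  The function name, the document IDs, and the two result lists are all made up for the example; real evaluations such as TREC's average many queries and use a richer set of measures.

```python
def precision_recall_at_k(ranked_ids, relevant_ids, k):
    """Grade one query.

    ranked_ids   -- the engine's results, best first (hypothetical doc IDs)
    relevant_ids -- the set of documents humans judged to be correct answers
    k            -- how many of the top results to grade
    """
    top_k = ranked_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    precision = hits / k if k else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall


# Invented data: one query, three documents judged relevant by humans.
judged_relevant = {"d1", "d4", "d7"}
engine_a = ["d1", "d2", "d4", "d9", "d7"]
engine_b = ["d3", "d1", "d8", "d5", "d6"]

score_a = precision_recall_at_k(engine_a, judged_relevant, k=5)
score_b = precision_recall_at_k(engine_b, judged_relevant, k=5)
print("Engine A (precision, recall):", score_a)
print("Engine B (precision, recall):", score_b)
```

In this made-up case engine A finds all three judged documents in its top five and engine B finds only one, so A gets the better grade on both measures.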

As simple as this sounds, quite a few issues come up that make this more difficult than you might think:

Some of these problems are so time-consuming that other groups have taken a radically different approach.  Instead of worrying about the "right" or "wrong" answers, they just have users try both engines and see which one they like better!  This type of A/B testing is much more common on the World Wide Web.  Of course, there are also many technical details with these tests.  For example, should users explicitly tell you which search results they like better, or should you just randomly switch engines behind the scenes and see which one produces more clicks?
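The "behind the scenes" variant can be sketched in a few lines.  This is only an illustration under stated assumptions: the names assign_engine and click_through_rates, the user IDs, and the log format are hypothetical, and hashing the user ID is swapped in for a random coin flip so that a returning user keeps seeing the same engine.

```python
import hashlib
from collections import Counter


def assign_engine(user_id):
    """Pick engine "A" or "B" for a user.

    Hashing the ID gives a stable, roughly even split: the same user
    always lands in the same group.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return "A" if digest[0] % 2 == 0 else "B"


def click_through_rates(search_log):
    """Return clicks per search for each engine.

    search_log -- iterable of (engine, clicked) pairs, one per search served
    """
    searches = Counter()
    clicks = Counter()
    for engine, clicked in search_log:
        searches[engine] += 1
        if clicked:
            clicks[engine] += 1
    return {engine: clicks[engine] / searches[engine] for engine in searches}


# Invented log: four searches served by each engine.
log = [("A", True), ("A", False), ("A", True), ("A", False),
       ("B", True), ("B", False), ("B", False), ("B", False)]
rates = click_through_rates(log)
print(rates)
```

A real test would also need enough traffic, and some statistical check, before declaring a winner; a difference on eight searches like the one above means nothing by itself.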

Things get a bit more technical at this point.  Below is an outline with links to more detailed pages.  Remember that this is a work in progress, and something you can contribute to.

Two General Types of Testing

Beyond Precision and Recall: How Engines are Judged

Sources of Variance, AKA "Problems"

Multilingual Search Evaluations

Non-Textual Search Evaluations

Code Implementation

Data Considerations