It's nice to be able to compare one search engine to another in some unbiased way. You point a couple of search engines at the same set of documents and have them run some test searches. The test tool then pops out a numerical grade. These grades can either compare each engine to what humans have said are the correct answers, or compare one engine directly to another and at least determine which one did better. TREC has been one of the groups trying to perfect this type of testing over the years. This approach has some practical problems:
- You need to get a large set of documents and searches to test with. This can be a problem for copyright reasons, among other things.
- You need to come up with searches that represent the types of things real users would search for, and get everybody to agree on what's a fair test search.
- Getting humans to sit down and record what the best matches should be is very time consuming for large data sets.
- Then you need to decide which mathematical formulas you'll use to crank out the final scores. For example, is it better to have an engine that gives modestly relevant results almost all the time, or an engine that gives really good answers sometimes, better on average than the other engine, but sometimes gives back complete garbage?
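To make the last question concrete, here is a minimal sketch in Python. The two engines, their per-query scores, and the `garbage_threshold` cutoff are all made up for illustration; real scores would come from whatever grading formula you settle on.

```python
from statistics import mean

# Hypothetical per-query relevance grades in [0, 1], one per test search.
# These numbers are invented purely to illustrate the trade-off.
steady_engine = [0.6, 0.6, 0.5, 0.6, 0.6, 0.5]
erratic_engine = [1.0, 1.0, 0.0, 1.0, 1.0, 0.0]


def summarize(scores, garbage_threshold=0.2):
    """Return the average grade, the worst grade, and the share of
    searches that scored below the (assumed) garbage threshold."""
    failures = sum(1 for s in scores if s < garbage_threshold)
    return {
        "mean": mean(scores),
        "worst": min(scores),
        "failure_rate": failures / len(scores),
    }


steady = summarize(steady_engine)
erratic = summarize(erratic_engine)

for name, stats in (("steady", steady), ("erratic", erratic)):
    print(name, stats)
```

With these invented numbers the erratic engine wins on the average but loses on the worst case and on the failure rate, so the "better" engine depends entirely on which formula you pick.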
Some of these problems are so time consuming that other groups have taken a radically different approach. Instead of worrying about the "right" or "wrong" answers, they just have users try both engines and see which one they like better! This type of A/B testing is much more common on the World Wide Web. Of course, there are also many technical details to work out with these tests. For example, should users specifically tell you which search results they like better, or should you just randomly switch engines behind the scenes and see which one produces more clicks?
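The click-counting variant can be sketched roughly as follows. This is only an illustration: the function names, the `(user_id, clicked)` event format, and the choice to bucket users by hashing their id (so each user always sees the same engine, rather than flipping a coin on every search) are assumptions, not a description of any particular tool.

```python
import hashlib
from collections import Counter


def assign_engine(user_id: str) -> str:
    """Place a user in bucket 'A' or 'B' by hashing the id.
    The split is pseudo-random but stable for a given user."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return "A" if digest[0] % 2 == 0 else "B"


def click_through_rates(events):
    """events: iterable of (user_id, clicked) pairs, one per search shown.
    Returns the fraction of searches that led to a click, per engine."""
    shown, clicked = Counter(), Counter()
    for user_id, did_click in events:
        engine = assign_engine(user_id)
        shown[engine] += 1
        if did_click:
            clicked[engine] += 1
    return {engine: clicked[engine] / shown[engine] for engine in shown}


# Hypothetical log: user "u1" ran two searches and clicked on one of them.
rates = click_through_rates([("u1", True), ("u1", False)])
print(rates)
```

A raw click-through rate is only a starting point; with real traffic you would also want enough searches per bucket to tell a genuine difference from noise.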
Things get a bit more technical at this point. Below is an overview, followed by an outline and links to more detailed pages. Remember that this is a work in progress, and something you can contribute to.
Two General Types of Testing