Two Types of Testing

We can generally talk about two types of relevancy testing:

  1. Absolute Truth / Matrix / Grid / TREC / Relevancy Assertions
  2. AB Testing / User Preference

Of course these can be further subdivided.  Or it could be argued that some tests might have characteristics of both, or tests could be categorized in another way.  Most tests seem to fit, at least loosely, into one of these two categories.  Happy to hear your thoughts!

Relevancy Assertion Testing

The central idea here is that you know what the answers are supposed to be ahead of time.  Humans (or some other judge) have compared every test search against every document and answered the question "is this document relevant to this search?" or "how relevant is this document to the search?".  This set of Relevancy Assertions can be thought of as an "Absolute Truth" grid or matrix.  This is the type of testing TREC focused on.
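To make the idea concrete, here's a minimal Python sketch of an assertion grid and one simple way to score an engine against it.  The searches, document IDs, and the precision-at-k metric are illustrative assumptions, not part of any particular test suite.

```python
# A toy "Absolute Truth" grid: each (search, document) pair gets a
# binary relevance judgment.  Searches and doc IDs are hypothetical.
assertions = {
    ("jaguar speed", "doc1"): 1,  # relevant
    ("jaguar speed", "doc2"): 0,  # not relevant
    ("jaguar speed", "doc3"): 1,  # relevant
}

def precision_at_k(search, ranked_docs, k):
    """Fraction of an engine's top-k results judged relevant for this search."""
    top_k = ranked_docs[:k]
    relevant = sum(assertions.get((search, d), 0) for d in top_k)
    return relevant / k

# An engine returning doc1 then doc2 scores 0.5 at k=2.
print(precision_at_k("jaguar speed", ["doc1", "doc2"], 2))  # 0.5
```

Because the judgments exist ahead of time, the same grid can score any engine, any number of times, with no users in the loop.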

The main characteristics of Relevancy Assertion Testing are:

Contrasting this with "AB Testing"

AB Testing displays the results from two or more search engines and records which search results users prefer.  The "A/B" refers to search engine A and search engine B.

There are many variations on this.  Modern web sites may show some users results from search engine A while other users see results from engine B, and the site tracks which search engine on average generates more clicks or purchases.  More formal testing may show users results from both A and B and ask them to judge which results are more relevant.
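The "track which engine generates more clicks" variation above can be sketched in a few lines of Python.  The bucketing rule, the toy traffic, and the click-through-rate comparison are all assumptions for illustration, not a real experiment framework.

```python
# Toy A/B bucketing: each user is deterministically assigned to engine
# A or engine B, and we tally clicks per bucket.
clicks = {"A": 0, "B": 0}
impressions = {"A": 0, "B": 0}

def assign_bucket(user_id):
    """Deterministic 50/50 split so a given user always sees the same engine."""
    return "A" if user_id % 2 == 0 else "B"

def log_search(user_id, clicked):
    """Record one search impression and whether the user clicked a result."""
    bucket = assign_bucket(user_id)
    impressions[bucket] += 1
    if clicked:
        clicks[bucket] += 1

# Simulated traffic (hypothetical user IDs and click outcomes).
for uid, clicked in [(1, False), (2, True), (3, True),
                     (4, True), (5, False), (6, True)]:
    log_search(uid, clicked)

ctr = {b: clicks[b] / impressions[b] for b in ("A", "B")}
print(ctr)  # engine A converts better in this toy data
```

In practice the comparison would use far more traffic and a significance test, but the skeleton is the same: split users, serve different engines, compare behavior.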

The main characteristics of AB Testing are:

This alternative type of testing is discussed here (TODO: link to page once created)

Now... back to Relevancy Assertions!

Types of Relevancy Assertions

Most people think of TREC when they think of this type of testing.  Certainly TREC-style assertion testing is important, but it's only one subtype of assertion testing.

Full-Grid Assertions (TREC-Style!)

TREC-style testing is well known and represents one end of the spectrum of Relevancy Assertion Testing, with rigorous data curation and a complete assertion grid.  The creation of the grid itself is carefully controlled, and every slot in it is considered to have been populated.

But there's an anti-economy of scale in producing a complete grid.  As the number of source documents and searches/topics grows, the number of assertions grows multiplicatively: for M documents and N searches, there are M x N slots in the assertion grid.  This adds up very quickly!  A very small corpus of 100 documents, evaluated for 30 searches, means 3,000 boxes to fill in.  That doesn't sound like much, unless you're the person filling it in!  A more reasonable corpus of 2,000 documents and 250 searches would mean half a million potential assertions.  This is an absurd number of "boxes" for a small team to fill in.
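The arithmetic above is trivial but worth making explicit, since it's what rules out complete grids for most teams:

```python
def grid_size(num_docs, num_searches):
    """Total judgment slots in a complete M x N assertion grid."""
    return num_docs * num_searches

print(grid_size(100, 30))    # 3000   -- the "very small corpus" case
print(grid_size(2000, 250))  # 500000 -- the "more reasonable corpus" case
```

At even a generous pace of one judgment every few seconds, half a million slots is person-years of work, which is why the optimization techniques below exist at all.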

Using multiple engines and judging only the top matches is called "Pooling" and is discussed here:
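The core of pooling can be sketched quickly: run each search on several engines, take the union of their top-k results, and send only that pool to the judges.  The engine names and result lists below are hypothetical.

```python
def pool_results(results_by_engine, k=10):
    """Union of each engine's top-k results for a single search.

    Only documents in this pool get judged; everything outside it is
    assumed non-relevant, which is pooling's central shortcut (and risk).
    """
    pool = set()
    for ranked in results_by_engine.values():
        pool.update(ranked[:k])
    return pool

results = {
    "engine_a": ["doc1", "doc2", "doc3"],
    "engine_b": ["doc2", "doc4"],
}
print(sorted(pool_results(results, k=2)))  # ['doc1', 'doc2', 'doc4']
```

Instead of M judgments per search, judges handle at most (number of engines) x k, which is what makes larger corpora tractable.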

All of these optimization techniques have flaws:

"Flaws" in the sense that it still requires too much, or doesn't insure the absolute best answers have been found.

Other forms of Relevancy Assertions

...

Order vs. grade

Delta grading

Web index

domain disambiguation