...

But there's an anti-economy of scale in producing a complete grid.  As the number of source documents and searches/topics grows, the number of assertions grows multiplicatively.  For M documents and N searches, there are M x N slots in the assertion grid.  This adds up very quickly!  A very small corpus of 100 documents, evaluated for 30 searches, means 3,000 boxes to fill in.  This doesn't sound like much, unless you're the person filling it in!  A more reasonable corpus of 2,000 documents and 250 searches would mean half a million potential assertions.  This is an absurd number of "boxes" for a small team to fill in.
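The arithmetic above can be checked with a few lines of Python.  This is only an illustrative sketch; the function name grid_size is made up for this example and is not part of any tool.

```python
def grid_size(num_documents: int, num_searches: int) -> int:
    """Number of (document, search) slots in a complete assertion grid."""
    return num_documents * num_searches


# The two corpus sizes discussed in the text.
for docs, searches in [(100, 30), (2_000, 250)]:
    slots = grid_size(docs, searches)
    print(f"{docs:,} documents x {searches:,} searches = {slots:,} assertions")
```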

With a giant virtual grid to fill in, the work can be broken up into patches; a small subset of 100 documents and 10 searches would give a patch of just 1,000 assertions that one person could potentially fill in.  Other techniques could skip some combinations of searches and documents entirely, on the presumption that none of the skipped pairs are related.
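One possible way to carve the grid into patches is sketched below.  It assumes documents and searches are simply numbered from zero; the function name grid_patches and the default patch dimensions (100 documents by 10 searches, taken from the example above) are illustrative assumptions, not a prescribed method.

```python
from itertools import product


def grid_patches(num_documents, num_searches, patch_docs=100, patch_searches=10):
    """Yield (document_range, search_range) patches that tile the whole grid.

    Patches on the right and bottom edges may be smaller than the others
    when the grid dimensions are not exact multiples of the patch size.
    """
    doc_starts = range(0, num_documents, patch_docs)
    search_starts = range(0, num_searches, patch_searches)
    for d0, s0 in product(doc_starts, search_starts):
        yield (
            range(d0, min(d0 + patch_docs, num_documents)),
            range(s0, min(s0 + patch_searches, num_searches)),
        )


# Split the 2,000 x 250 grid into patches one grader could handle.
patches = list(grid_patches(2_000, 250))
print(len(patches), "patches")
```

Each patch can then be handed to one person.  Note that this only divides the work; the total number of boxes is unchanged.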

Another optimization is to cluster a slightly larger set of documents, and then designate a few documents from each cluster as test searches.  Those documents are removed from the corpus and repurposed as searches.  Since each came from a particular cluster, we could assume that it is at least moderately relevant to the cluster of documents it was extracted from.
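A minimal sketch of the hold-out step follows.  It assumes the clustering has already been done by some other tool and arrives as a mapping from cluster id to document ids; the function name hold_out_searches, the per_cluster parameter, and the fixed random seed are all hypothetical choices for illustration.

```python
import random


def hold_out_searches(clusters, per_cluster=2, seed=0):
    """Pull a few documents out of each cluster to serve as test searches.

    clusters: dict mapping cluster id -> list of document ids.
    Returns (corpus, assumed), where corpus maps each cluster id to the
    documents left in the corpus, and assumed is a list of
    (search_doc, corpus_doc) pairs presumed at least moderately relevant
    because both came from the same cluster.
    """
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    corpus = {}
    assumed = []
    for cluster_id, doc_ids in clusters.items():
        held_out = rng.sample(doc_ids, min(per_cluster, len(doc_ids)))
        held_out_set = set(held_out)
        remaining = [d for d in doc_ids if d not in held_out_set]
        corpus[cluster_id] = remaining
        for search_doc in held_out:
            for corpus_doc in remaining:
                assumed.append((search_doc, corpus_doc))
    return corpus, assumed
```

The assumed pairs are only a starting point; they rest entirely on the quality of the clustering.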

Or a team could use existing search engines to run the test searches against the corpus and evaluate matching documents deep into the result list.  Say a grader runs a search and notices 4 relevant documents in the first 25 matches.  As a check, they could continue scanning 10 times further, out to result 250.  If no other relevant documents are found, they conclude that the original 4 documents in the first 25 results really were the only relevant ones.  For thoroughness, perhaps 3 different search engines are used, and the tests are repeated by 2 other graders as well.  This is still labor intensive, but the effort grows with the number of searches rather than with the full documents-by-searches grid.
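The deep-scan check could be sketched roughly as follows.  The grader is modeled as a callable is_relevant(doc_id), and each engine's output as a ranked list of document ids; the function names scan_results and pool_judgments are invented for this example, and the defaults of 25 and 10 come from the scenario above.

```python
def scan_results(ranked_doc_ids, is_relevant, first_pass=25, depth_factor=10):
    """Judge the first page of results, then keep scanning much deeper.

    Returns (early, late): relevant documents found within the first
    `first_pass` results, and relevant documents found after that, up to
    result number first_pass * depth_factor.
    """
    deep_limit = first_pass * depth_factor
    early = [d for d in ranked_doc_ids[:first_pass] if is_relevant(d)]
    late = [d for d in ranked_doc_ids[first_pass:deep_limit] if is_relevant(d)]
    return early, late


def pool_judgments(results_by_engine, is_relevant, first_pass=25, depth_factor=10):
    """Combine deep scans from several engines for a single search.

    results_by_engine: dict mapping engine name -> ranked list of doc ids.
    Returns (relevant, first_pass_was_complete); the flag is True only if
    no engine surfaced a relevant document beyond its first pass.
    """
    relevant = set()
    first_pass_was_complete = True
    for ranked in results_by_engine.values():
        early, late = scan_results(ranked, is_relevant, first_pass, depth_factor)
        relevant.update(early, late)
        if late:
            first_pass_was_complete = False
    return relevant, first_pass_was_complete
```

Repeating the process with other graders would mean calling pool_judgments once per grader and comparing the resulting sets.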

All of these optimization techniques have flaws:

Gathering Relevancy Assertions

...