...

All of these optimization techniques have flaws:

Skipping certain patches of documents that are very unlikely to have any relation to a set of searches makes assumptions that may not always be true.  Suppose, for example, I assume that searches related to healthy eating have nothing to do with skiing.  But perhaps there was an article about maintaining good health that extolled the virtues of nutritious food and outdoor winter sports - this document is relevant to both domains and therefore shouldn't be automatically skipped.  Perhaps this one example is considered a fluke.  But then another document is found that discusses all of the decadent and healthy food served at various ski lodges.  And then a third that talks about winterized insulated canteens, touching on clean water and specific outdoor activities; perhaps this document is deemed only marginally related to eating, because water is just "drinking" and very common, and of the many outdoor winter activities the article mentions, skiing comes up only once in passing.  But there's still some relevancy.  In a Yes/No grading system, perhaps it's decided that's a "no", but on a percentage basis the article is still marginally related to healthy eating and should really get at least a 25% relevance.  A statistician might argue that these instances are outliers that don't significantly affect the overall score, which might be true, but it still diverges from the "perfection" of a completely filled-in grid.
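
To make the Yes/No versus percentage distinction concrete, here is a minimal sketch in Python (the document names, scores, and 0.5 cutoff are illustrative assumptions, not taken from any real grading run):

    # Hypothetical graded judgments for a "healthy eating" search, on a 0.0 - 1.0 scale.
    graded_judgments = {
        "good-health-and-winter-sports": 0.80,  # relevant to both domains
        "ski-lodge-cuisine":             0.60,
        "insulated-canteens":            0.25,  # marginal, but not zero
    }

    BINARY_THRESHOLD = 0.5  # assumed cutoff for a Yes/No grading system

    def to_binary(score, threshold=BINARY_THRESHOLD):
        """Collapse a graded score into a Yes/No judgment."""
        return score >= threshold

    for doc, score in graded_judgments.items():
        verdict = "Yes" if to_binary(score) else "No"
        print(f"{doc}: graded={score:.2f}, binary={verdict}")

The canteen article becomes a flat "No" under the binary scheme, even though the graded scale preserves its 25% relevance - exactly the divergence from a perfectly filled-in grid described above.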

Clustering documents, and then repurposing a few documents from each cluster as test searches, just gives a starting point for grading.  So searches born from a particular cluster of docs are given a default grade of "B" for all of their siblings left in the cluster, and a default grade of "F" for documents in all other clusters.  With this method it's obvious that there will be numerous grading errors.  Documents within a cluster are likely more relevant to some of their siblings than to others.  And at least a few of those documents are likely related to documents in other clusters.  This becomes even clearer if you recluster the documents, or use a different clustering algorithm.  One fix is that humans might still double-check the work, perhaps scanning the default grades in some patches and correcting them where needed.  But this goes back to the M x N mathematics, although perhaps double-checking grades is much faster than creating them from scratch.  And there could even be a means of tracking and then predicting which patches need the most attention.  But again, all of these deviate from "perfection".  A sketch of the default-grading scheme follows.
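
Here is one way the scheme could look in Python (assuming scikit-learn's TfidfVectorizer and KMeans; the toy corpus, two-cluster setup, and "B"/"F" letter grades are placeholders for illustration):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    # Tiny placeholder corpus; a real collection would have thousands of documents.
    documents = [
        "nutritious food and balanced diets for good health",
        "vegetables, whole grains, and healthy eating habits",
        "ski resorts with fresh powder and groomed trails",
        "winter sports gear for downhill skiing",
    ]

    vectors = TfidfVectorizer().fit_transform(documents)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

    # Repurpose the first document of each cluster as a test search.
    searches = {}
    for doc_id, cluster in enumerate(labels):
        searches.setdefault(cluster, doc_id)

    # Default grades: "B" for the search's remaining cluster siblings, "F" for all others.
    grades = {}
    for cluster, search_doc in searches.items():
        for doc_id, doc_cluster in enumerate(labels):
            if doc_id == search_doc:
                continue  # the repurposed document leaves the graded pool
            grades[(search_doc, doc_id)] = "B" if doc_cluster == cluster else "F"

    for (search_doc, doc_id), grade in sorted(grades.items()):
        print(f"search from doc {search_doc} -> doc {doc_id}: {grade}")

Every document pair receives a grade, but none of those grades reflects an actual judgment - which is why the human double-checking pass described above is still needed.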

Gathering Relevancy Assertions

...