The intent of this page is a high-level subject overview, with links off to separate pages for deeper detail. If a section gets too big, please consider making a new page and linking to it.

Two General Types of Testing

  • Absolute Truth / Matrix / Grid / TREC
    • The correct answers for each search are known ahead of time
    • Human judges often decide these correct answers
    • Can be labor-intensive to set up
  • AB Testing / User Preference
    • Tracks explicit or implicit preferences between engines A and B
    • Often dispenses with the notion of the "correct" answer
    • Can be easier to set up, but some fear the best answers will be missed by both engines (a minimal preference-tally sketch follows this list)
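
A minimal sketch of how side-by-side A/B preferences might be tallied, assuming each tester simply records "A", "B", or "tie" for a given search; that encoding is only an illustration, since real studies often capture clicks, dwell time, or explicit ratings instead.

    from collections import Counter

    def tally_preferences(judgments):
        """Count which engine each tester preferred.

        judgments: iterable of "A", "B", or "tie" (hypothetical encoding).
        """
        counts = Counter(judgments)
        decided = counts["A"] + counts["B"]
        return {
            "A_wins": counts["A"],
            "B_wins": counts["B"],
            "ties": counts["tie"],
            "A_win_rate": counts["A"] / decided if decided else 0.0,
        }

    # Example: 10 side-by-side comparisons from a panel of testers
    print(tally_preferences(["A", "A", "B", "tie", "A", "B", "A", "tie", "A", "B"]))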

Beyond Precision and Recall: How Engines are Judged

  • Binary vs. Non-Binary Grading Systems
    • Early TREC had binary judgments: only a yes/no on whether each doc was relevant to a test search
    • More graded choices were added later
    • A system can use letter grades (A, B, C, D and F) or numeric grades
    • Another style asks testers to sort documents in their preferred order
  • Classic Measurements: Precision and Recall (see the worked sketch after this list)
    • Recall: "Did I find all the documents I expected to get back?  What percent?"
    • Precision: "Did the system bring back other documents that weren't relevant?  What percent were on target?"
  • Newer Ideas:
    • Rank: The order of documents that were returned
      • Generally, a single match in the #1 spot (a 1-in-20 hit rate) is judged better than a 50% match rate where all the matches land on the second page.
    • Interactivity: What navigators or visualizations were provided to help the user iteratively drill down and find what they were looking for
      • Facets and sorting: Clickable filters and sort options
      • Unsupervised Clustering: Related terms or phrases, or related searches
      • Spelling and thesaurus suggestions
    • Subject Disambiguation, Sentiment, Conflicting Information, Crowd Hints
      • kidney bean or kidney cell
      • "best football team in the UK"
  • Mathematical assessments are generally covered under Code Implementations, below.
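
The classic and rank-aware measures above can be computed directly from a results list and the set of judged-relevant documents. Below is a minimal sketch in Python; the document ids and judgments are invented for illustration.

    def precision_recall(retrieved, relevant):
        """Classic set-based measures for a single search.

        retrieved: ordered list of doc ids the engine returned
        relevant:  set of doc ids the judges marked relevant
        """
        retrieved_set = set(retrieved)
        hits = retrieved_set & relevant
        precision = len(hits) / len(retrieved_set) if retrieved_set else 0.0
        recall = len(hits) / len(relevant) if relevant else 0.0
        return precision, recall

    def reciprocal_rank(retrieved, relevant):
        """Rank-aware score: 1 / position of the first relevant hit, else 0."""
        for position, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                return 1.0 / position
        return 0.0

    # Hypothetical results list and judgments
    retrieved = ["d7", "d2", "d9", "d4"]
    relevant = {"d2", "d4", "d5"}
    print(precision_recall(retrieved, relevant))  # (0.5, 0.666...)
    print(reciprocal_rank(retrieved, relevant))   # 0.5 (first hit at position 2)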

Sources of Variance, AKA "Problems"

  • Different Goals
    • Perfect/Human vs. Best vs. Acceptable vs. Better than X
    • Constrained vs. Unconstrained Resources (time, CPU, storage)
  • Sample Size
    • Amount of Data
      • Fixed set or growing over time
    • Number of Testers (AB or Relevancy Judgments)
    • Number of Searches
  • Vertical vs. Horizontal Content
    • One extreme: Specific demo may cover just one discipline, for example Medical Journals
    • Other extreme: Internet covers vastly disparate domains
  • Vocabulary Variation / Mismatch: Search vs. Content
  • Users
    • Experienced vs. New Searcher
    • Subject Expert vs. Novice
    • Spelling, typing and computer proficiency
    • Reading Level, Native Language, IQ
    • Interface Medium (large visual display, small text display, audible, Braille, etc)
    • Amount of Effort to understand Search
    • Willingness to Iterate
    • Searching for specific answer vs. General Exploration
  • Type of Searches
    • Length / 1 or 2 words
    • Full question
    • Sample text
    • Internet Boolean
    • Advanced Boolean / Syntax / Proximity
      • Wildcards, etc.
  • Abbreviations
  • Punctuation
    • Chemical
    • Source Code
    • Units of Measure
    • Literal vs. Search Operator
  • Popular vs. Outlier / Researcher
  • Potential for Shared Search Engine Biases (see the TF-IDF sketch after this list)
    • TF/IDF
    • Shared Thesaurus
    • Similar fuzzy matching (Snowball, Soundex, etc)
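
To illustrate the shared-bias concern above: two engines that both lean on a TF-IDF-style weighting can favor, and miss, the same documents. Below is a naive sketch of one common variant, tf * log(N / df); the toy documents are invented for illustration.

    import math
    from collections import Counter

    def tf_idf_scores(query_terms, documents):
        """Score each tokenized document against the query with tf * log(N / df)."""
        n_docs = len(documents)
        doc_freq = Counter()
        for doc in documents:
            doc_freq.update(set(doc))  # count each term once per document

        scores = []
        for doc in documents:
            tf = Counter(doc)
            score = sum(
                tf[term] * math.log(n_docs / doc_freq[term])
                for term in query_terms
                if doc_freq[term]
            )
            scores.append(score)
        return scores

    docs = [
        "the kidney bean is a bean".split(),
        "kidney cells under a microscope".split(),
        "best football team in the uk".split(),
    ]
    print(tf_idf_scores(["kidney", "bean"], docs))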

Multilingual Search Evaluations

Non-Textual Search Evaluations

Code Implementations

Data Considerations

  • General Considerations
    • Character Encoding
    • Simple record files vs. XML
    • Interchange with Excel / OpenOffice / Numbers / Google Docs
    • Interaction w/ Databases...
    • Interaction w/ OpenSearch
    • Version Control
  • Specific Entities to Store
    • Sample Documents
    • Searches (AKA TREC Topics)
    • Relevancy Judgments (AKA TREC qrels; see the parsing sketch after this list)
    • AB Preferences (click-throughs, explicit ratings, etc)
    • Controlled Index vs. Federated Search (TODO: explain)
      • Search Engine Results List Formats / APIs
    • Textual vs. Non-Textual Data
      • Corpus, Searches and Judgments
      • See other section for non-text discussion
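
Relevancy judgments in the TREC qrels style are plain whitespace-separated lines of topic id, iteration, doc id, and relevance grade. Below is a minimal parsing sketch; the file name and doc id in the comments are hypothetical.

    def load_qrels(path):
        """Parse a TREC-style qrels file into {topic_id: {doc_id: relevance}}.

        Each line is expected to look like:
            topic_id  iteration  doc_id  relevance
        e.g. "301 0 SOMEDOC-0042 1"; the iteration column is usually ignored.
        """
        qrels = {}
        with open(path) as handle:
            for line in handle:
                if not line.strip():
                    continue
                topic_id, _iteration, doc_id, relevance = line.split()
                qrels.setdefault(topic_id, {})[doc_id] = int(relevance)
        return qrels

    # judged = load_qrels("qrels.adhoc.txt")   # hypothetical file name
    # relevant = {d for d, r in judged["301"].items() if r > 0}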