The intent of this page is to give a high-level overview of the subject and then break off to separate pages. If a section gets too big, please consider making a new page and linking to it.
Two General Types of Testing
- Absolute Truth / Matrix / Grid / TREC
- The correct answers for each search are known ahead of time
- Human judges often decide these correct answers
- Can be labor intensive to set up
- AB Testing / User Preference
- Tracks explicit or implicit preferences between engines A and B
- Often dispenses with the notion of the "correct" answer
- Can be easier to set up, but some fear the best answers will be missed by both engines (a rough data sketch of both styles follows this list)
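As a rough data-level sketch of the two styles (the record fields and document IDs below are invented for illustration, not any standard format), an absolute-truth test stores a judged answer key per search, while an A/B test stores only which engine a user preferred:

```python
# Absolute truth / TREC style: a human-judged answer key, built ahead of time.
relevance_judgments = [
    {"search": "kidney bean nutrition", "doc_id": "doc42", "relevant": True},
    {"search": "kidney bean nutrition", "doc_id": "doc17", "relevant": False},
]

# A/B / user-preference style: no answer key, just which engine the user favored,
# gathered explicitly (a rating) or implicitly (a click-through).
ab_preferences = [
    {"search": "kidney bean nutrition", "clicked_engine": "A"},
    {"search": "best football team in the UK", "clicked_engine": "B"},
]
```

The first style requires human judging up front; the second only needs the preference log, which is why it can be easier to set up but can miss answers that neither engine returned.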
Beyond Precision and Recall: How Engines are Judged
- Binary vs. Non-Binary Grading Systems
- Early TREC had binary judgments: only a Yes/No on whether each doc was relevant to a test search
- More choices were later added
- A system can use letter grades (A, B, C, D and F) or numeric grades
- Another style asks testers to sort documents in their preferred order
- Classic Measurements: Precision and Recall (see the sketch after this list)
- Recall: "Did I find all the documents I expected to get back? What percent?"
- Precision: "Did the system bring back other documents that weren't relevant? What percent were on target?"
- Newer Ideas:
- Rank: The order of documents that were returned
- Generally, a result list with only 1 match in 20 is better when that match is in the #1 spot than a list where 50% of the results match but all of the matches are on the second page.
- Interactivity: What navigators or visualizations were provided to help the user iteratively drill down and find what they were looking for
- Facets and sorting: Clickable filters and sort options
- Unsupervised Clustering: Related terms or phrases, or related searches
- Spelling and thesaurus suggestions
- Subject Disambiguation, Sentiment, Conflicting Information, crowd hints
- kidney bean or kidney cell
- "best football team in the UK"
- Mathematical assessments of rank are generally covered under Code Implementations.
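As a concrete illustration of the classic measurements and the rank point above, here is a minimal sketch (plain Python, not tied to any evaluation toolkit; the document IDs and judgments are invented) that computes recall, precision, and a simple precision-at-k from binary judgments for one test search:

```python
def recall(relevant, retrieved):
    """Fraction of the known relevant documents that were retrieved at all."""
    return len(set(relevant) & set(retrieved)) / len(relevant)

def precision(relevant, retrieved):
    """Fraction of the retrieved documents that are actually relevant."""
    return len(set(relevant) & set(retrieved)) / len(retrieved)

def precision_at_k(relevant, ranked_results, k):
    """Precision over only the top k results; rewards putting matches early."""
    top_k = ranked_results[:k]
    return len(set(relevant) & set(top_k)) / k

# Hypothetical binary judgments and one ranked results list for a single search.
relevant = ["doc3", "doc7", "doc9"]
ranked_results = ["doc3", "doc1", "doc7", "doc2", "doc5"]

print(recall(relevant, ranked_results))             # 0.667: found 2 of 3 expected docs
print(precision(relevant, ranked_results))          # 0.4: 2 of 5 results on target
print(precision_at_k(relevant, ranked_results, 1))  # 1.0: a match in the #1 spot
```

The precision-at-k variant is one simple way to reward a match in the #1 spot; fuller rank-aware formulas belong under Code Implementations.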
Sources of Variance, AKA "Problems"
- Different Goals
- Perfect/Human vs. Best vs. Acceptable vs. Better than X
- Constrained vs. Unconstrained Resources (time, cpu, storage)
- Sample Size
- Amount of Data
- Fixed set or growing over time
- Number of Testers (AB or Relevancy Judgments)
- Number of Searches
- Vertical vs. Horizontal Content
- One extreme: a specific demo may cover just one discipline, for example medical journals
- Other extreme: the Internet covers vastly disparate domains
- Vocabulary Variation / Mismatch: Search vs. Content
- Users
- Experienced vs. New Searcher
- Subject Expert vs. Novice
- Spelling, typing and computer proficiency
- Reading Level, Native Language, IQ
- Interface Medium (large visual display, small text display, audible, Braille, etc)
- Amount of Effort to understand Search
- Willingness to Iterate
- Searching for specific answer vs. General Exploration
- Type of Searches
- Length / 1 or 2 words
- Full question
- Sample text
- Internet Boolean
- Advanced Boolean / Syntax / Proximity
- Wildcards, etc.
- Abbreviations
- Punctuation
- Chemical
- Source Code
- Units of Measure
- Literal vs. Search Operator
- Popular vs. Outlier / Researcher
- Potential for Shared Search Engine Biases
- TF/IDF (see the sketch after this list)
- Shared Thesaurus
- Similar fuzzy matching (Snowball, Soundex, etc)
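To make the TF/IDF bullet concrete, here is a minimal sketch of textbook TF-IDF scoring (not any particular engine's implementation; the tiny corpus and query are invented). It shows how two engines that share the same term weighting can also share a blind spot: a relevant document that uses different vocabulary than the search scores zero in both, so an A/B comparison never surfaces it.

```python
import math

def tf_idf_score(query_terms, doc_terms, corpus):
    """Score a document by summing TF * IDF over the query terms it contains."""
    n_docs = len(corpus)
    score = 0.0
    for term in query_terms:
        tf = doc_terms.count(term)
        if tf == 0:
            continue  # term absent from the doc: contributes nothing in any TF/IDF engine
        df = sum(1 for d in corpus if term in d)   # document frequency of the term
        idf = math.log(n_docs / df)                # rarer terms weigh more
        score += tf * idf
    return score

corpus = [
    ["renal", "calculi", "treatment"],            # relevant, but different vocabulary
    ["kidney", "bean", "recipes"],                # not relevant to the medical sense
    ["kidney", "stone", "treatment", "options"],  # relevant and vocabulary matches
]
query = ["kidney", "stone"]
for doc in corpus:
    print(doc, tf_idf_score(query, doc, corpus))  # the "renal calculi" doc scores 0.0
```

The same argument applies to a shared thesaurus or shared fuzzy-matching algorithms: if both engines expand or normalize terms identically, they tend to miss or mis-rank the same documents.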
Multilingual Search Evaluations
Non-Textual Search Evaluations
Code Implementations
Data Considerations
- General Considerations
- Character Encoding
- Simple record files vs. XML
- Interchange with Excel / OpenOffice / Numbers / Google Docs
- Interaction w/ Databases...
- Interaction w/ OpenSearch
- Version Control
- Specific Entities to Store
- Sample Documents
- Searches (AKA TREC Topics)
- Relevancy Judgments (AKA TREC qrels; see the sketch after this list)
- AB Preferences (click-throughs, explicit ratings, etc)
- Controlled Index vs. Federated Search (TODO: explain)
- Search Engine Results List Formats / APIs
- Textual vs. Non-Textual Data
- Corpus, Searches and Judgments
- See the Non-Textual Search Evaluations section above for the non-text discussion
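As a minimal storage sketch for the entities above (the four-column line layout follows the common TREC qrels convention of topic, iteration, document ID, and judgment; the topic numbers, document IDs, and file name are invented for the example):

```python
import csv

# Hypothetical qrels-style text, one judgment per line:
#   <topic_id> <iteration> <doc_id> <judgment>
QRELS_TEXT = """\
101 0 doc42 1
101 0 doc17 0
102 0 doc42 2
"""

def parse_qrels(text):
    """Parse whitespace-delimited qrels lines into a nested dict: topic -> doc -> grade."""
    judgments = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        topic, _iteration, doc_id, grade = line.split()
        judgments.setdefault(topic, {})[doc_id] = int(grade)
    return judgments

def write_csv_for_spreadsheet(judgments, path):
    """Export to CSV so the data can be reviewed in Excel / OpenOffice / Google Docs."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["topic", "doc_id", "grade"])
        for topic, docs in judgments.items():
            for doc_id, grade in docs.items():
                writer.writerow([topic, doc_id, grade])

judgments = parse_qrels(QRELS_TEXT)
print(judgments)  # {'101': {'doc42': 1, 'doc17': 0}, '102': {'doc42': 2}}
write_csv_for_spreadsheet(judgments, "judgments.csv")
```

Plain whitespace- or comma-delimited record files like this version-control cleanly and round-trip easily with Excel / OpenOffice / Google Docs; an XML layout trades that simplicity for richer structure.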