The intent of this page is to give a high-level overview of the subject and then break off to separate pages. If a section gets too big, please consider making a new page and linking to it.
Two General Types of Testing
- Absolute Truth / Matrix / Grid / TREC
- The correct answers for each search are known ahead of time
- Human judges often decide these correct answers
- Can be labor intensive to set up
- AB Testing / User Preference
- Tracks explicit or implicit preferences between engines A and B
- Often dispenses with the notion of the "correct" answer
- Can be easier to set up, but some fear the best answers will be missed by both engines (a rough data sketch of both styles follows this list)
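As a rough data-level sketch of the two styles (the record fields and document IDs below are invented for illustration, not any standard format), an absolute-truth test stores a judged answer key per search, while an A/B test stores only which engine a user preferred:

```python
# Absolute truth / TREC style: a human-judged answer key, built ahead of time.
relevance_judgments = [
    {"search": "kidney bean nutrition", "doc_id": "doc42", "relevant": True},
    {"search": "kidney bean nutrition", "doc_id": "doc17", "relevant": False},
]

# A/B / user-preference style: no answer key, just which engine the user favored,
# gathered explicitly (a rating) or implicitly (a click-through).
ab_preferences = [
    {"search": "kidney bean nutrition", "clicked_engine": "A"},
    {"search": "best football team in the UK", "clicked_engine": "B"},
]
```

The first style requires human judging up front; the second only needs the preference log, which is why it can be easier to set up but can miss answers that neither engine returned.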
Beyond Precision and Recall: How Engines are Judged
- Binary vs. Non-Binary Grading Systems
- Early TREC had binary judgments: only a Yes/No on whether each doc was relevant to a test search
- More choices were later added
- A system can use letter grades (A, B, C, D and F) or numeric grades
- Another style asks testers to sort documents in their preferred order
- Classic Measurements: Precision and Recall (see the sketch after this list)
- Recall: "Did I find all the documents I expected to get back? What percent?"
- Precision: "Did the system bring back other documents that weren't relevant? What percent were on target?"
- Newer Ideas:
- Rank: The order of documents that were returned
- Generally, a result list with only 1 match in 20 is better when that match is in the #1 spot than a list where 50% of the results match but all of the matches are on the second page.
- Interactivity: What navigators or visualizations were provided to help the user iteratively drill down and find what they were looking for
- Facets and sorting: Clickable filters and sort options
- Unsupervised Clustering: Related terms or phrases, or related searches
- Spelling and thesaurus suggestions
- Subject Disambiguation, Sentiment, Conflicting Information, crowd hints
- kidney bean or kidney cell
- "best football team in the UK"
- Mathematical assessments of rank are generally covered under Code Implementations.
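As a concrete illustration of the classic measurements and the rank point above, here is a minimal sketch (plain Python, not tied to any evaluation toolkit; the document IDs and judgments are invented) that computes recall, precision, and a simple precision-at-k from binary judgments for one test search:

```python
def recall(relevant, retrieved):
    """Fraction of the known relevant documents that were retrieved at all."""
    return len(set(relevant) & set(retrieved)) / len(relevant)

def precision(relevant, retrieved):
    """Fraction of the retrieved documents that are actually relevant."""
    return len(set(relevant) & set(retrieved)) / len(retrieved)

def precision_at_k(relevant, ranked_results, k):
    """Precision over only the top k results; rewards putting matches early."""
    top_k = ranked_results[:k]
    return len(set(relevant) & set(top_k)) / k

# Hypothetical binary judgments and one ranked results list for a single search.
relevant = ["doc3", "doc7", "doc9"]
ranked_results = ["doc3", "doc1", "doc7", "doc2", "doc5"]

print(recall(relevant, ranked_results))             # 0.667: found 2 of 3 expected docs
print(precision(relevant, ranked_results))          # 0.4: 2 of 5 results on target
print(precision_at_k(relevant, ranked_results, 1))  # 1.0: a match in the #1 spot
```

The precision-at-k variant is one simple way to reward a match in the #1 spot; fuller rank-aware formulas belong under Code Implementations.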
Sources of Variance, AKA "Problems"
- Different Goals
- Perfect/Human vs. Best vs. Acceptable vs. Better than X
- Constrained vs. Unconstrained Resources (time, cpu, storage)
- Sample Size
- Amount of Data
- Fixed set or growing over time
- Number of Testers (AB or Relevancy Judgments)
- Number of Searches
- Vertical vs. Horizontal Content
- One extreme: a specific demo may cover just one discipline, for example medical journals
- Other extreme: the Internet covers vastly disparate domains
- Vocabulary Variation / Mismatch: Search vs. Content
- Users
- Experienced vs. New Searcher
- Subject Expert vs. Novice
- Spelling, typing and computer proficiency
- Reading Level, Native Language, IQ
- Interface Medium (large visual display, small text display, audible, Braille, etc)
- Amount of Effort to understand Search
- Willingness to Iterate
- Searching for specific answer vs. General Exploration
- Type of Searches
- Length / 1 or 2 words
- Full question
- Sample text
- Internet Boolean
- Advanced Boolean / Syntax / Proximity
- Wildcards, etc.
- Abbreviations
- Punctuation
- Chemical
- Source Code
- Units of Measure
- Literal vs. Search Operator
- Popular vs. Outlier / Researcher
- Potential for Shared Search Engine Biases
- TF/IDF (see the sketch after this list)
- Shared Thesaurus
- Similar fuzzy matching (Snowball, Soundex, etc)
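To make the TF/IDF bullet concrete, here is a minimal sketch of textbook TF-IDF scoring (not any particular engine's implementation; the tiny corpus and query are invented). It shows how two engines that share the same term weighting can also share a blind spot: a relevant document that uses different vocabulary than the search scores zero in both, so an A/B comparison never surfaces it.

```python
import math

def tf_idf_score(query_terms, doc_terms, corpus):
    """Score a document by summing TF * IDF over the query terms it contains."""
    n_docs = len(corpus)
    score = 0.0
    for term in query_terms:
        tf = doc_terms.count(term)
        if tf == 0:
            continue  # term absent from the doc: contributes nothing in any TF/IDF engine
        df = sum(1 for d in corpus if term in d)   # document frequency of the term
        idf = math.log(n_docs / df)                # rarer terms weigh more
        score += tf * idf
    return score

corpus = [
    ["renal", "calculi", "treatment"],            # relevant, but different vocabulary
    ["kidney", "bean", "recipes"],                # not relevant to the medical sense
    ["kidney", "stone", "treatment", "options"],  # relevant and vocabulary matches
]
query = ["kidney", "stone"]
for doc in corpus:
    print(doc, tf_idf_score(query, doc, corpus))  # the "renal calculi" doc scores 0.0
```

The same argument applies to a shared thesaurus or shared fuzzy-matching algorithms: if both engines expand or normalize terms identically, they tend to miss or mis-rank the same documents.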
Multilingual Search Evaluations
Non-Textual Search Evaluations
Code Implementations
Data Considerations
- General Considerations
- Character Encoding
- Simple record files vs. XML
- Interchange with Excel / OpenOffice / Numbers / Google Docs
- Interaction w/ Databases...
- Interaction w/ OpenSearch
- Version Control
- Specific Entities to Store
- Sample Documents
- Searches (AKA TREC Topics)
- Relevancy Judgments (AKA TREC qrels; see the sketch after this list)
- AB Preferences (click-throughs, explicit ratings, etc)
- Controlled Index vs. Federated Search (TODO: explain)
- Search Engine Results List Formats / APIs
- Textual vs. Non-Textual Data
- Corpus, Searches and Judgments
- See the Non-Textual Search Evaluations section above for the non-text discussion
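As a minimal storage sketch for the entities above (the four-column line layout follows the common TREC qrels convention of topic, iteration, document ID, and judgment; the topic numbers, document IDs, and file name are invented for the example):

```python
import csv

# Hypothetical qrels-style text, one judgment per line:
#   <topic_id> <iteration> <doc_id> <judgment>
QRELS_TEXT = """\
101 0 doc42 1
101 0 doc17 0
102 0 doc42 2
"""

def parse_qrels(text):
    """Parse whitespace-delimited qrels lines into a nested dict: topic -> doc -> grade."""
    judgments = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        topic, _iteration, doc_id, grade = line.split()
        judgments.setdefault(topic, {})[doc_id] = int(grade)
    return judgments

def write_csv_for_spreadsheet(judgments, path):
    """Export to CSV so the data can be reviewed in Excel / OpenOffice / Google Docs."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["topic", "doc_id", "grade"])
        for topic, docs in judgments.items():
            for doc_id, grade in docs.items():
                writer.writerow([topic, doc_id, grade])

judgments = parse_qrels(QRELS_TEXT)
print(judgments)  # {'101': {'doc42': 1, 'doc17': 0}, '102': {'doc42': 2}}
write_csv_for_spreadsheet(judgments, "judgments.csv")
```

Plain whitespace- or comma-delimited record files like this version-control cleanly and round-trip easily with Excel / OpenOffice / Google Docs; an XML layout trades that simplicity for richer structure.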