(This was the initial proposal for the ORP)
The Open Relevance Project is an effort to collect and disseminate free, publicly available corpora, one or more sets of queries for the corpora and relevance judgments. We intend to leverage the power of open source and community to bootstrap this effort.
While TREC and other conferences provide corpora, query sets and judgments, none of them do so in a free and open way. Furthermore, while their distribution rights would allow the use of the corpora by committers at Apache, they do not allow for wider dissemination to the community as a whole, which severely hinders qualitative improvements in relevance in open source search projects. While we are starting this under the auspices of the Lucene project, we are by no means aiming to be Lucene-only. Eventually, this could become its own top-level project and go beyond just search to offer collections, etc. for machine learning and other tracks.
This project is not even a project yet, so our first step is to gauge interest and then make it a project. Assuming that goes forward, however, we see that there are at least three parts to moving forward, outlined in the subsections below.
We have started a preliminary crawl of Creative Commons content using Nutch. This is currently hosted on a private machine, but we would like to bring this "in house" to the ASF and have the ASF host both the crawling and the dissemination of the data. This, obviously, will need to be supported by the ASF infrastructure, as it is potentially quite burdensome in terms of disk space and bandwidth.
The crawled content will be collected with only minimal content filtering (e.g. to remove unhandled media types), so it will also include pages that might be regarded as spam, junk, spider traps, etc. The proposed methodology is to start from a seed list of Creative Commons sites (or subsets of sites) and exhaustively crawl all resources within each site that are linked below the seed root URL, provided such resources are covered by CC licenses (terminating on nodes that are not). As a side effect of using Nutch, the project will also provide an encoded web graph in the form of adjacency lists, both incoming and outgoing.
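A minimal sketch of the scoped crawl described above, assuming a toy in-memory link graph in place of real fetching. This is not how Nutch implements crawling; the names `in_scope`, `crawl`, `toy_fetch` and `TOY_WEB`, and the `cc.example` URLs, are hypothetical and exist only for illustration.

```python
from collections import defaultdict, deque
from urllib.parse import urlparse


def in_scope(url, seed):
    """True if url is on the seed's host and at or below the seed root path."""
    u, s = urlparse(url), urlparse(seed)
    root = s.path.rstrip("/") + "/"
    path = u.path.rstrip("/") + "/"
    return u.netloc == s.netloc and path.startswith(root)


def crawl(seeds, fetch):
    """Breadth-first crawl below each seed root.

    fetch(url) is a hypothetical stand-in returning (is_cc_licensed, outlinks).
    Pages that are not CC-licensed are not collected and not expanded.
    """
    outgoing = defaultdict(set)  # url -> urls it links to
    incoming = defaultdict(set)  # url -> urls linking to it
    collected, seen = [], set()
    for seed in seeds:
        if seed in seen:
            continue
        seen.add(seed)
        queue = deque([seed])
        while queue:
            url = queue.popleft()
            is_cc, links = fetch(url)
            if not is_cc:
                continue  # terminate on nodes not covered by a CC license
            collected.append(url)
            for link in links:
                # Record every edge, even if the target is never crawled.
                outgoing[url].add(link)
                incoming[link].add(url)
                if link not in seen and in_scope(link, seed):
                    seen.add(link)
                    queue.append(link)
    return collected, outgoing, incoming


# Toy data: page "b" is not CC-licensed, so "c" is never reached.
TOY_WEB = {
    "http://cc.example/": (True, ["http://cc.example/a",
                                  "http://cc.example/b",
                                  "http://other.example/"]),
    "http://cc.example/a": (True, ["http://cc.example/b"]),
    "http://cc.example/b": (False, ["http://cc.example/c"]),
    "http://cc.example/c": (True, []),
}


def toy_fetch(url):
    return TOY_WEB.get(url, (False, []))


collected, outgoing, incoming = crawl(["http://cc.example/"], toy_fetch)
```

In this sketch the two dictionaries play the role of the outgoing and incoming adjacency lists; a real crawl would persist them rather than hold them in memory.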
In addition to the usual developer resources (SVN, mailing lists, site) the project will need additional machine and bandwidth resources related to the process of collection, editing and distribution of corpora and relevance judgments.
We would like to collect at least 50 million pages in the Creative Commons crawl. Assuming that each page takes ~2 kB (compressed), and adding Lucene indexes and temporary space, the project would need no less than 250 GB of disk space. For Hadoop processing at least two machines would be needed, although initially a smaller corpus could be processed on a single machine. In terms of bandwidth, to get the initial data set we likely need to download ~500 GB (not all servers support gzip encoding), including redirects, aliases (the same content reachable via different URLs), etc.
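The arithmetic behind these estimates can be checked with a short calculation. This is only a sketch: it assumes decimal units (1 kB = 1,000 bytes, 1 GB = 10^9 bytes), and the split between corpus and overhead is implied by the figures above rather than stated in the proposal. All names are illustrative.

```python
PAGES = 50_000_000            # target crawl size from the text
BYTES_PER_PAGE = 2_000        # ~2 kB compressed per page (decimal kB assumed)
DISK_BUDGET_GB = 250          # minimum disk space from the text
DOWNLOAD_GB = 500             # estimated download volume from the text

# Compressed corpus size in GB.
corpus_gb = PAGES * BYTES_PER_PAGE / 1e9

# Space left for Lucene indexes and temporary files (implied, not stated).
overhead_gb = DISK_BUDGET_GB - corpus_gb

# Average bytes transferred per collected page, including redirects,
# aliases and uncompressed responses.
download_bytes_per_page = DOWNLOAD_GB * 1e9 / PAGES

print(f"compressed corpus: {corpus_gb:.0f} GB")
print(f"index and temp headroom: {overhead_gb:.0f} GB")
print(f"average transfer per page: {download_bytes_per_page / 1000:.0f} kB")
```

Under these assumptions the compressed corpus comes to about 100 GB, which leaves roughly 150 GB of the 250 GB budget for indexes and temporary space.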
Editing of relevance judgments can be performed through a web application, so the infrastructure needs to provide a servlet container. Search functionality will also be provided by a web application.
Distribution of the corpus is the most demanding aspect of this project. Due to its size (~100 GB) it's not practical to offer this corpus as a traditional download. ''(use P2P? create subsets? distribute on HDD?)''. Amazon S3 and EBS (via Amazon Public Datasets) are efficient and cheap options for distributing larger datasets. Uploading to a public S3 bucket is the easiest option, and automatically makes uploaded data available via torrent. Datasets up to 1 TB can also be distributed via free public EBS volumes.
We believe we can crowdsource the query effort simply by asking people to generate queries for the collection via a wiki page that anyone can edit. While this could result in gaming in the early stages, we believe over time the query set will stabilize. Depending on the user privacy agreement, it might be possible for Wikipedia to make a set of search query referrals available from server logs (without associated user information). Any personally identifiable information in the queries (SSNs, etc.) could be scrubbed, although it would be unlikely these queries would lead to search clicks on Wikipedia.
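As an illustration of the scrubbing step, here is a minimal sketch that redacts strings shaped like hyphenated US Social Security numbers from a query. The function name `scrub_query` and the placeholder text are hypothetical, and the pattern covers only this one format; real PII scrubbing would need many more rules and manual review.

```python
import re

# Matches only the hyphenated NNN-NN-NNNN form; other PII is not handled.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def scrub_query(query, placeholder="[REDACTED]"):
    """Replace SSN-shaped substrings in a query with a placeholder."""
    return SSN_PATTERN.sub(placeholder, query)


scrubbed = scrub_query("john smith 123-45-6789 address")
unchanged = scrub_query("lucene 2.4 release notes")
```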
Unlike TREC, we will focus only on relevance judgments for the top ten or twenty results. We will need to figure out a way to "pool" and/or validate the results. Again, a Wikipedia-style approach may work here. '''NOTE: There was a SIGIR poster/paper a little while ago about crowd-sourcing relevance judgments. Link to it here.'''
- ''Is this it?'': Relevance judgments between TREC and Non-TREC assessors (Alternate link)
- ''I think it's this one:'' Alonso, O., Rose, D. E., & Stewart, B. (2008). Crowdsourcing for relevance evaluation. SIGIR Forum, 42(2), 9-15. doi: 10.1145/1480506.1480508. http://doi.acm.org/10.1145/1480506.1480508
- ''Look out for ACM SIGIR 2009 paper on relevance judgements as a game.''
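One way the pooling and validation of top-ranked results mentioned above could work is sketched below. It assumes each search system contributes a ranked list of document IDs per query, and that several volunteers label each pooled document. The proposal does not prescribe this method; `build_pool`, `majority_label` and the sample data are hypothetical.

```python
from collections import Counter


def build_pool(runs, depth=10):
    """Union of the top-`depth` document IDs from every system, per query.

    runs maps system name -> {query: ranked list of doc IDs}.
    """
    pool = {}
    for rankings in runs.values():
        for query, ranked_docs in rankings.items():
            pool.setdefault(query, set()).update(ranked_docs[:depth])
    return pool


def majority_label(votes):
    """Most common label among the votes, or None if empty or tied."""
    counts = Counter(votes).most_common()
    if not counts:
        return None
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie: needs another assessor
    return counts[0][0]


runs = {
    "sysA": {"q1": ["d1", "d2", "d3"]},
    "sysB": {"q1": ["d2", "d4", "d5"]},
}
pool = build_pool(runs, depth=2)
label = majority_label(["relevant", "relevant", "not relevant"])
```

With `depth=10` or `depth=20`, only documents that some system ranks near the top are judged, which matches the focus on the top ten or twenty results.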
- Attract volunteers
- Get infrastructure backing
- Academic involvement? SIGIR? Others?
This topic has come up a number of times in the past. Grant Ingersoll has corresponded with the lead of TREC, Ellen Voorhees, and with the Univ. of Glasgow (keepers of some of the TREC documents, see http://ir.dcs.gla.ac.uk/test_collections/) about either obtaining TREC resources (even if the ASF has to pay) or creating a truly open collection. See http://www.lucidimagination.com/search/document/656d5ca50c8c9242/trec_collection_nist_and_lucene, specifically http://www.lucidimagination.com/search/document/656d5ca50c8c9242/trec_collection_nist_and_lucene#84e9e24ee9ff4779. On the Glasgow side, all conversations there were private, and thus not available. The gist of them is that the only possibility for distribution is to a limited number of committers within the ASF on a project by project basis (even though the ASF is purchasing) and that the burden is on us to maintain a list of people who have access to the documents. Ultimately, while the ASF was still willing to make the purchase, Grant felt it was too much of a burden and not beneficial to Open Source and has thus not proceeded with obtaining the collection.