Awesome Information Retrieval
- A curated list of information retrieval and web search resources from all around the web.
Contributing
Please feel free to send me pull requests or email (harshal.priyadarshi@utexas.edu) to add new links. I am very open to suggestions and corrections.
Table of Contents
Books
- Introduction to Information Retrieval - C.D. Manning, P. Raghavan, H. Schütze. Cambridge UP, 2008
- Search Engines: Information Retrieval in Practice - Bruce Croft, Don Metzler, and Trevor Strohman. Pearson Education, 2009
- Modern Information Retrieval - R. Baeza-Yates, B. Ribeiro-Neto. Addison-Wesley, 1999
- Mining the Web: Analysis of Hypertext and Semi Structured Data - S. Chakrabarti. Morgan Kaufmann, 2002
- Language Modeling for Information Retrieval - W.B. Croft, J. Lafferty. Springer, 2003
- Information Retrieval: A Survey - Ed Greengrass, 2000
Courses
- INF384H / CS395T / INF350E: Concepts of Information Retrieval (and Web Search) - Matthew Lease (University of Texas at Austin)
- CS 276 / LING 286: Information Retrieval and Web Search - Chris Manning and Pandu Nayak (Stanford University)
- CS 371R: Information Retrieval and Web Search - Raymond J. Mooney (University of Texas at Austin)
- CS 172: Introduction to Information Retrieval - Vagelis Hristidis (University of California - Riverside)
- SIMS 240: Principles of Information Retrieval - Ray R. Larson (UC Berkeley)
- 11-442 / 11-642: Search Engines - Jamie Callan (CMU)
- 600.466: Information Retrieval and Web Agents - David Yarowsky (Johns Hopkins University)
- CS 435: Information Retrieval, Discovery, and Delivery - Andrea LaPaugh (Princeton University)
- Information Retrieval and Data Mining - Dr. Jilles Vreeken, Prof. Dr. Gerhard Weikum (MPI)
Software
Datasets
Standard IR Collections
- Cranfield Collections - One of the first collections in the IR domain. The dataset is too small for statistical significance analysis, but it is nevertheless suitable for pilot runs.
- TREC Collections - TREC is the benchmark dataset used by most IR and Web search algorithms. It has several tracks, each of which consists of datasets for testing a specific task. The tracks are:
- Blog
- Chemical IR
- Clinical Decision Support
- Confusion
- Contextual Suggestion
- Crowdsourcing
- Enterprise
- Entity
- Filtering
- Federated Web Search
- Genomics
- High Accuracy Retrieval from Documents (HARD)
- Interactive Track
- Knowledge Base Acceleration
- Legal Track
- Medical Track
- Microblog Track
- Million Query Track
- Novelty Track
- Query Track
- Question Answering Track
- Relevance Feedback Track
- Robust Track
- Session Track
- Spam Track
- Spoken Document Retrieval Track
- Tasks Track
- Temporal Summarization Track
- Terabyte Track
- Web Track
- GOV2 Test Collection - One of the largest Web collections of documents, obtained from a crawl of government websites by Charlie Clarke and Ian Soboroff using NIST hardware and network, then formatted by Nick Craswell.
- NTCIR Test Collection - A wide variety of datasets, ranging from ad hoc collections, Chinese IR collections, and mobile clickthrough collections to medical collections. The focus is mostly on East Asian languages and cross-language information retrieval.
- CLIR Test Collections - This dataset can be used for cross-lingual IR between CJKE (Chinese-Japanese-Korean-English) languages. It is suitable for the following tasks:
- Multilingual CLIR
- Bilingual CLIR
- Single Language CLIR
- Cross Language Q&A (CLQA) dataset collection - It supports the following bilingual and monolingual language pairs:
- Bilingual
- Japanese to English
- Chinese to English
- English to Japanese
- English to Chinese
- Monolingual
- Chinese to Chinese
- Japanese to Japanese
- English to English
- Advanced Cross-Lingual Information Access (ACLIA) - This dataset is used for cross-lingual question answering, but the task is more complex than that of the CLQA dataset.
- Conference and Labs of the Evaluation Forum (CLEF) dataset - It contains a multilingual document collection. The test suite includes:
- AdHoc - News Test suite
- Domain Specific Test Suite - On collections of scientific articles
- Question Answering Test Suite
- Reuters Corpora - The corpora are now available through NIST and include the following:
- RCV1 (Reuters Corpus Volume 1) - Consists of only English-language news stories
- RCV2 (Reuters Corpus Volume 2) - Consists of stories in 13 languages (Dutch, French, German, Chinese, Japanese, Russian, Portuguese, Spanish, Latin American Spanish, Italian, Danish, Norwegian, and Swedish). Note that the stories are not parallel.
- TRC (Thomson Reuters Text Research Collection) - This is a fairly recent corpus consisting of 1,800,370 news stories covering the period from 2008-01-01 00:00:03 to 2009-02-28 23:54:14.
- 20 Newsgroups dataset - This data set consists of roughly 20,000 newsgroup posts taken from 20 newsgroup topics.
- English Gigaword Fifth Edition - This data set is a comprehensive archive of English newswire text data including headlines, datelines and articles.
- Document Understanding Conference (DUC) datasets - Past newswire/paper datasets (DUC 2001 - DUC 2007) are available upon request.
External Curation Links
Talks
Technical Talks
- Challenges in Building Large-Scale Information Retrieval Systems - Jeff Dean (WSDM Conference, 2009)
- Knowledge-based Information Retrieval with Wikipedia - David Milne (The University of Waikato, 2008)
- Music Information Retrieval Using Locality Sensitive Hashing - Steve Tjoa (Rackspace Developers) [This talk shows that IR is not just text and images]
Philosophical Talks
- The moral bias behind your search results - Andreas Ekström (Swedish Author & Journalist, TED Talk)
- Beware online "filter bubbles" - Eli Pariser (Author of The Filter Bubble, TED Talk)
- Think your email's private? Think again - Andy Yen (CERN, TED Talk) [This talk discusses privacy, which search engines intrude on, and how people can protect it]
Conferences
- Web Search and Data Mining Conference - WSDM
- Special Interest Group on Information Retrieval - SIGIR
- Text REtrieval Conference - TREC
- European Conference on Information Retrieval - ECIR
- World Wide Web Conference - WWW
- Conference on Information and Knowledge Management - CIKM
- Forum for Information Retrieval Evaluation - FIRE
- Conference and Labs of the Evaluation Forum - CLEF
- NII Testbeds and Community for Information access Research - NTCIR
Blogs
- Information Retrieval and the Web - Google Research
- IR Thoughts - Dr. Edel Garcia
License
To the extent possible under law, Harshal Priyadarshi has waived all copyright and related or neighboring rights to this work.