This repro contains raw data from a meta-analysis of papers that used the test collection from the TREC 2004 Robust Track (Robust04), as described in:
- Wei Yang, Kuang Lu, Peilin Yang, and Jimmy Lin. Critically Examining the "Neural Hype": Weak Baselines and the Additivity of Effectiveness Gains from Neural Ranking Models. Proceedings of the 42nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2019), July 2018, Paris, France.
We exhaustively examined every publication from 2005 to 2018 in the following venues to identify those that reported results on Robust04: SIGIR, CIKM, WWW, ICTIR, ECIR, KDD, WSDM, TOIS, IRJ, IPM, and JASIST. This was supplemented by Google Scholar searches to identify a few additional papers not in the venues indicated above. Our meta-analysis was conducted in January 2019, but after the paper acceptance we included a few more papers. A number of exclusion criteria were applied, best characterized as discarding corner cases: for example, papers that only used a subset of the topics or papers that had metrics plotted in a graph. In total, we examined 130 papers; of these, 109 papers contained extractable average precision values that formed the basis of the results reported below. Note that some papers did not report AP, and thus were excluded from consideration.
For each of the 109 papers, we extracted the highest average precision score achieved on Robust04 by the authors' proposed methods, regardless of experimental condition (ignoring oracle conditions and other unrealistic setups). We further categorized the papers into neural (18) and non-neural (91) approaches. Methods that used word embeddings but not neural networks directly in ranking were considered "neural" in our classification. From each paper we also extracted the authors' baseline: in most cases, these were explicitly defined; if multiple were presented, we selected the best. If the paper did not explicitly mention a baseline, we selected the best comparison condition using a method not by the authors (or based on previous work).
A visualization of our meta-analysis is presented above. For each paper, we show the baseline and the best result as an empty circle and a filled circle (respectively), connected by a line. All papers are grouped by their publication year. Neural approaches are shown in blue, and non-neural approaches in red. We also show two regression trendlines, for non-neural (red) as well as neural approaches (blue). A number of reference conditions are plotted as horizontal lines: the best submitted run at the TREC 2004 Robust Track (TREC best) at 0.333 AP is shown as a solid black line, and the median TREC run under the "title" condition at 0.258 AP is shown as a dotted black line (TREC median). Finally, we show the effectiveness of an untuned RM3 run (i.e., default parameters) from the Anserini system.
The follow results table is generated via the Python script json_to_md.py
, which summarizes the raw results in robust04_papers.json
.
The column "standard" indicates if the paper used a "standard" configuration of Robust04; a non-standard configuration might be an evaluation that uses only a subset of the topics.
For the "non-standard" papers we did not extract effectiveness metrics, and in some of the "standard" papers effectiveness metrics were not easily extractable (e.g., they are presented in a graph).
In both cases AP values in the table were left blank.