nkzhlee / MSMARCOV2

Utilities and Descriptions Related to the MSMARCO DATASET

MSMARCO V2

MS MARCO (Microsoft Machine Reading Comprehension) is a large-scale dataset focused on machine reading comprehension and question answering. All questions in MS MARCO were generated from real, anonymized Bing user queries, which grounds the dataset in a real-world problem and exposes researchers to the real constraints their models might face. The context passages, from which the answers in the dataset are derived, are extracted from real web documents using the most advanced version of the Bing search engine. The answers to the queries are human generated.

First released at NIPS 2016, MSMARCO contained 100,000 queries across a variety of domains. Since then, the MSMARCO team has been hard at work making the data bigger and better. Details about all the changes and improvements are below. If you have suggestions for the dataset, novel uses, or general feedback, please reach out to us at ms-marco@microsoft.com.

In its current form (V2.1), the dataset contains 1,010,916 unique real queries generated by sampling and anonymizing Bing usage logs. After sampling, we used Bing to extract the 10 most relevant passages for each query and asked a human judge to answer the query given that information. Not all queries had answers present in the passages, but we kept this data because it helps systems learn that not every question can be answered. Around 35% of all queries in the dataset have the answer 'No Answer Present.', meaning judges were unable to answer the question given the information provided by the 10 relevant passages.
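As a quick sanity check, the roughly 35% figure can be recomputed on any slice of the data. A minimal sketch, assuming the answers field described in the Data Format section below (the sample records here are fabricated for illustration):

```python
def no_answer_fraction(records):
    """Fraction of queries whose judges found no answer in the passages.

    `records` is an iterable of dataset entries, each a dict with an
    "answers" list as described in the Data Format section.
    """
    total = 0
    no_answer = 0
    for rec in records:
        total += 1
        if rec.get("answers") == ["No Answer Present."]:
            no_answer += 1
    return no_answer / total if total else 0.0

# Tiny inline demo with two fabricated records:
sample = [
    {"answers": ["No Answer Present."]},
    {"answers": ["A corporation is a legal entity."]},
]
print(no_answer_fraction(sample))  # 0.5
```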

Dataset Generation, Data Format, And Statistics

What is the difference between MSMARCO and other MRC datasets? We believe the advantages special to MSMARCO are:

  • Real questions: All questions have been sampled from real, anonymized Bing queries.
  • Real Documents: Most URLs from which we sourced the passages point to full web documents. These can be used as extra contextual information to improve systems or to compete in our expert task.
  • Human Generated Answers: All questions have an answer written by a human. If there was no answer in the passages, the judge wrote 'No Answer Present.'
  • Human Generated Well-Formed Answers: Some questions received extra human evaluation to create well-formed answers that could be used by intelligent agents like Cortana, Siri, Google Assistant, and Alexa.
  • Dataset Size: At over 1 million queries, the dataset is large enough to train the most complex systems, and it can also be sampled for specific applications.

Generation

The MSMARCO dataset is generated by a well-oiled pipeline optimized for the highest quality examples. The general process runs as follows.

  1. Bing logs are sampled, filtered, and anonymized to make sure the queries we collect are both useful to the research community and respectful of our Bing users and fans.
  2. Using the sampled and anonymized queries, Bing generates the 10 most relevant passages for each query.
  3. Highly trained judges read the query and its related passages; if an answer is present, the supporting passages are annotated and a natural language answer is generated.
  4. A smaller proportion of queries (~17% of the overall dataset, 182,887 unique queries) is then passed on to a second round of judges, who verify that the answer is correct and rewrite the answer (if possible) into a well-formed answer. These answers are designed to be understood without perfect context, with smart speakers and digital assistants in mind.

Data Format

Based on feedback from our community, the V2.1 dataset is now provided in JSONL format. Since some users might prefer regular JSON, we have included simple translators (JSON to JSONL and JSONL to JSON).

Each line/entry contains the following fields, described below: query_id, query_type, query, passages, answers, and wellFormedAnswers.

  1. query_id: A unique id for each query, used in evaluation.
  2. query: A unique query based on initial Bing usage.
  3. passages: A set of 10 passages, their URLs, and an annotation indicating whether each was used to formulate the answer. Two passages may come from the same URL; these passages were retrieved by Bing as the most relevant for the query. A passage marked is_selected:1 means the judge used it to formulate their answer; is_selected:0 means the judge did not. Queries whose answer is 'No Answer Present.' have all passages marked is_selected:0.
  4. query_type: A basic division of queries based on a trained classifier. The categories are {LOCATION, NUMERIC, PERSON, DESCRIPTION, ENTITY} and can be used to debug model performance or to make smaller, more focused datasets.
  5. answers: An array of answers produced by human judges; most contain a single answer, but ~1% contain more than one (an average of ~2 answers when there are multiple). These answers were generated by real people in their own words instead of selecting a span of text. The language used in an answer may be similar to, or match, the language in any of the passages.
  6. wellFormedAnswers: An array of rewritten answers; most contain a single answer, but ~1% contain more than one (an average of ~5 answers when there are multiple). These were generated by having a new judge read the answer and the query, and rewrite the answer if it (i) did not use proper grammar as a full sentence, (ii) did not make sense without the context of either the query or the passage, or (iii) had high overlap with exact portions of one of the context passages. This ensures that well-formed answers are true natural language and not just span selection. Well-formed answers are a more difficult form of question answering because they may contain words that are not present in either the question or any of the context passages.

Example

{
	"answers":["A corporation is a company or group of people authorized to act as a single entity and recognized as such in law."],
	"passages":[
		{
			"is_selected":0,
			"url":"http:\/\/www.wisegeek.com\/what-is-a-corporation.htm",
			"passage_text":"A company is incorporated in a specific nation, often within the bounds of a smaller subset of that nation, such as a state or province. The corporation is then governed by the laws of incorporation in that state. A corporation may issue stock, either private or public, or may be classified as a non-stock corporation. If stock is issued, the corporation will usually be governed by its shareholders, either directly or indirectly."},
		...
		}],
	"query":". what is a corporation?",
	"query_id":1102432,
	"query_type":"DESCRIPTION",
	"wellFormedAnswers":"[]"
}
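The fields above can be read straight from the JSONL file with Python's standard json module. A minimal sketch, using a record abbreviated from the example (the truncated passage text is illustrative):

```python
import json

# One line of the V2.1 JSONL file, abbreviated from the example above.
line = json.dumps({
    "answers": ["A corporation is a company ..."],
    "passages": [
        {"is_selected": 0,
         "url": "http://www.wisegeek.com/what-is-a-corporation.htm",
         "passage_text": "A company is incorporated in a specific nation ..."},
    ],
    "query": ". what is a corporation?",
    "query_id": 1102432,
    "query_type": "DESCRIPTION",
    "wellFormedAnswers": "[]",
})

entry = json.loads(line)
# Collect only the passages the judge actually used for the answer.
selected = [p["passage_text"] for p in entry["passages"] if p["is_selected"] == 1]
print(entry["query_id"], entry["query_type"], len(selected))
```

In a real script you would iterate over the file line by line, calling json.loads on each.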

Utilities, Stats and Related Content

Besides the main files containing judgments, we are releasing various utilities to help people explore the data and tailor it to their needs. They have only been tested with Python 3.5 and are provided as is. Usage is noted below. If you write any utilities you feel the community could use and enjoy, please submit them with a pull request.

File Conversion

Our community told us that they liked having the data in both JSON format for easy exploration and JSONL format to make running models easier. To help with the transition from one file format to the other, we have included tojson.py and tojsonl.py.

Convert a JSONL (V1 format) file to JSON (V2 format):

python3 tojson.py <your_jsonl_file> <target_json_filename>

Convert a JSON (V2 format) file to JSONL (V1 format):

python3 tojsonl.py <your_json_file> <target_jsonl_filename>
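The bundled scripts may differ in detail, but the JSONL-to-JSON direction amounts to collecting one object per line into a single array. A rough sketch:

```python
import json
import sys

def jsonl_to_json(jsonl_path, json_path):
    """Read one JSON object per line and write them back as one JSON array."""
    with open(jsonl_path, encoding="utf-8") as fin:
        entries = [json.loads(line) for line in fin if line.strip()]
    with open(json_path, "w", encoding="utf-8") as fout:
        json.dump(entries, fout)

if __name__ == "__main__" and len(sys.argv) == 3:
    jsonl_to_json(sys.argv[1], sys.argv[2])
```

The reverse direction simply iterates over the array and writes one json.dumps line per entry.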

Additionally, you can use converttowellformed.py to take an existing slice of the dataset and narrow it to only queries that have a well-formed answer. Usage below.

python3 converttowellformed.py <your_input_file(json)> <target_json_filename>
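The filtering itself is a one-line predicate over the wellFormedAnswers field. A sketch of the idea (the actual script may differ; note that entries without a well-formed answer carry the literal string "[]", as in the example above):

```python
def has_well_formed(entry):
    """True when judges produced at least one rewritten, well-formed answer.

    Entries without one store the string "[]" in wellFormedAnswers, so both
    the string sentinel and an empty list are treated as "no answer".
    """
    wfa = entry.get("wellFormedAnswers", "[]")
    return wfa not in ("[]", []) and len(wfa) > 0

# Fabricated two-entry slice for illustration:
sample = [
    {"query_id": 1, "wellFormedAnswers": "[]"},
    {"query_id": 2, "wellFormedAnswers": ["Yes, a full sentence answer."]},
]
kept = [e for e in sample if has_well_formed(e)]
print([e["query_id"] for e in kept])  # [2]
```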

Dataset Statistics

Statistics about the dataset were generated with exploredata.py and can be found in the Stats folder. You can use the same script to generate similar statistics on any slice of the dataset you create.

python3 exploredata.py <your_input_file(json)> <-p if you are using a dataslice without answers>

Tasks

In an effort to produce a dataset that continues to be challenging and rewarding, we have broken the MSMARCO dataset down into tasks of varying difficulty.

Novice Task

Given a query q and a set of passages P = p1, p2, p3, ..., p10, a successful machine reading comprehension system is expected to read and understand both the question and the passages. The system must then accurately decide whether the passages provide adequate information to answer the query, since not all queries have an answer. If there is not enough information, the system should respond 'No Answer Present.'. If there is enough information, the system should produce a quality answer a that is as close as possible to the human generated reference answers RA = ra1, ra2, ..., ram. Evaluation will be done using ROUGE-L, BLEU-1, and a to-be-announced metric. To ensure systems are robust and can adapt to queries without an answer, this task's score will be weighted. The weight is the average per-query success score over Q, the set of query-passage pairs. The success score for a query q is computed as follows (scripts to follow): given a query q and passages P, if the reference answer is 'No Answer Present.' and the system produces an answer, award a score of 0; if the reference answer is not 'No Answer Present.' and the system answers 'No Answer Present.', award a score of 0; all other cases receive a score of 1.
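The scoring rule above can be sketched directly (the official scripts are forthcoming, so the function and field names here are illustrative):

```python
NO_ANSWER = "No Answer Present."

def success_score(reference, candidate):
    """Per-query success score from the novice task description:
    0 when the reference says no answer exists but the system answered,
    0 when an answer exists but the system claimed none does,
    1 in all other cases."""
    if reference == NO_ANSWER and candidate != NO_ANSWER:
        return 0
    if reference != NO_ANSWER and candidate == NO_ANSWER:
        return 0
    return 1

def task_weight(pairs):
    """Average success score over (reference, candidate) pairs for Q."""
    return sum(success_score(r, c) for r, c in pairs) / len(pairs)
```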

Intermediate Task

Given a query q and a set of passages P = p1, p2, p3, ..., p10, a successful machine reading comprehension system is expected to read and understand both the question and the passages. For this task all queries have an answer, so systems do not need to handle no-answer queries. Using the relevant passages, a successful system should produce a candidate answer that is as close as possible to the human generated well-formed reference answers RA = wfra1, wfra2, ..., wfram. Evaluation will be done using ROUGE-L, BLEU-1, and a to-be-announced metric.

Expert Task

TBD

Evaluation

Evaluation of systems will be done using the industry standard BLEU and ROUGE-L metrics. These are far from perfect, but they are the best option we have found that scales. If you know of a better metric or want to brainstorm creating one, please contact us.
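For intuition, ROUGE-L scores a candidate against a reference by their longest common subsequence. A minimal dynamic-programming sketch of the F-measure variant (use the official evaluation script for any reported numbers):

```python
def rouge_l_f1(reference, candidate):
    """ROUGE-L F1 between two token lists, via longest common subsequence."""
    m, n = len(reference), len(candidate)
    # dp[i][j] = LCS length of reference[:i] and candidate[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if reference[i - 1] == candidate[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[m][n]
    if lcs == 0:
        return 0.0
    precision = lcs / n
    recall = lcs / m
    return 2 * precision * recall / (precision + recall)
```

For example, reference "a b c" against candidate "a c" shares an LCS of length 2, giving precision 1.0, recall 2/3, and F1 0.8.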

We have made the official evaluation script, along with a sample output file on the dev set, available for download so that you can evaluate your models. The evaluation script takes a reference file and a candidate output file as inputs. You can execute it as follows: ./run.sh

The current evaluation scripts do not yet include the discounted scores mentioned in the novice task, but these should be released soon.

Submissions

Once you have built a model that meets your expectations on evaluation with the dev set, you can submit your test results to get an official evaluation on the test set. To ensure the integrity of the official test results, we do not release the correct answers for the test set to the public. To submit your model for official evaluation on the test set, follow these steps:

  1. Run the evaluation script on the test set and generate the output results file for submission.
  2. Submit the following information by [contacting us](mailto:ms-marco@microsoft.com?subject=MS Marco Submission):
     • Individual/Team Name: name of the individual or team to appear on the leaderboard [Required]
     • Individual/Team Institution: institution of the individual or team to appear on the leaderboard [Optional]
     • Model information: name of the model/technique to appear on the leaderboard [Required]
     • Paper Information: name, citation, and URL of the paper if the model is from published work [Optional]

Please submit your results in either JSON or JSONL format, and ensure that each answer includes its reference query_id and query_text. If your submission lacks query_id and query_text, it is difficult or impossible to evaluate. To avoid "p-hacking", we discourage too many submissions from the same group in a short period of time. Because submissions do not require the final trained model, we also retain the right to request a model to validate the submitted results.

Feedback

MS MARCO has been designed not as a dataset to be beaten but as an effort to establish a large community of researchers working on machine comprehension. If you have any thoughts on things we could do better, ideas for how to use the dataset, or general questions, please don't hesitate to [reach out and ask](mailto:ms-marco@microsoft.com?subject=MS MARCO Feedback).

License

This project is licensed under the MIT License; see the LICENSE.md file for details.
