Schema-Guided Dialogue State Tracking (DSTC 8)

Organizers - Abhinav Rastogi, Xiaoxue Zang, Raghav Gupta, Srinivas Sunkara, Pranav Khaitan

Contact - schema-guided-dst@google.com

Important Dates

	Date
Task description released.	06/17/2019
Sample data released. Development phase begins.	06/18/2019
Single domain dataset (train + dev) released	07/07/2019
Multi domain dataset (train + dev) released	07/23/2019
Baseline Released	08/04/2019 (expected)
Test phase begins.	10/07/2019
Entry submission deadline.	10/13/2019
Objective evaluation completed.	10/20/2019

Updates

07/23/2019 - Train and dev sets for multi domain dataset have been released. The dataset files have been renamed to accommodate additional dialogues in the multi domain dataset. The expected date for baseline release has been updated. We apologize for any inconvenience caused by these delays resulting from factors beyond our control.

07/18/2019 - The fields that will be missing from the unlabelled test data have been added to the Dialogue Representation section.

07/07/2019 - Train and dev sets for the single domain dataset have been released. The baseline system will be released with the multi domain dataset on 07/19/2019. The sample dialogues released earlier in train/dialogues_1.json have been moved to train/dialogues_43.json.

06/30/2019 - Due to some unforeseen delays, the single domain dataset is expected to be released on 07/07/2019. The other dates are unchanged.

Introduction

Virtual assistants such as the Google Assistant, Alexa, Siri, Cortana etc. help users accomplish tasks by providing a natural language interface to service providers (backends/APIs). Such assistants need to support an ever-increasing number of services and APIs. Building a dialogue system that can seamlessly operate across all these services is the holy grail of conversational AI. The Schema-Guided State Tracking track in the 8th Dialogue System Technology Challenge explores dialogue state tracking in such a setting. This challenge is aimed to encourage research on multi-domain dialogue systems that can efficiently interface with a large number of services/APIs. We have taken a few other practical considerations (e.g, defining a large variety of slot types, not providing a list of all possible values for some slots etc.) while creating this dataset to closely mimic the real world use cases.

Under the schema-guided approach, the dialogue state representation is based on the schemas for the services under consideration (see figure below for an example). Each dialogue in the dataset is accompanied by schemas listing a set of user intents and slots, and a sentence describing their semantics in natural language. The dialogue state needs to be predicted over these intents and slots. Such a setup ensures that the system learns to recognize the semantics of intents and slots from their descriptions, rather than just treating them as labels, thus allowing zero-shot generalization to new schemas. Furthermore, this approach enables the system to recognize patterns across domains (e.g, slots for date or time are specified in a similar manner across different services) without providing an explicit alignment between slots of different services.

Data

The dataset consists of conversations between a virtual assistant and a user. These conversations have been generated with the help of a dialogue simulator and paid crowd-workers using an approach similar to that outlined here. The details and code for the dialogue simulator will be made public after the competition. The dataset is provided "AS IS" without any warranty, express or implied. Google disclaims all liability for any damages, direct or indirect, resulting from the use of the dataset.

Schema Representation

A schema is a normalized representation of the interface exposed by a service/API. The schemas have been manually generated by the dataset creators. This representation of schema has been chosen to reflect real world services/APIs by imposing the following restrictions:

Arbitrary calls are not allowed. A service/API only offers a limited number of ways to call it. E.g, a call to a restaurant search API may only be made once the location and cuisine are known. Although this doesn't impact the dialogue state tracking task directly, it has implications on the flow of the dialogue (e.g, the system must gather the value of all required slots from the user before making a service call), which participants may exploit.
The list of all possible entities is not accessible. Many services don't expose the set of all available entities (defined as an assignment of values to all slots or a single row in the underlying database), and for others this list could be very large or dynamic. So, we don't provide a list of all possible entities. This rules out approaches which represent dialogue state as a distribution over the set of all possible entities.
The list of all values taken by a slot is not provided for some slots. Obtaining the list of all possible values taken by some slots is not feasible because this could be very large (restaurant name, city etc.), unbounded (date, time, username etc.) or dynamic (movie, song etc.) This rules out approaches which represent dialogue state as set of distribution over the set of all possible values of a slot, or which iterate over all possible slot values. The list of all possible values is provided for some slots like price range, star rating, number of people etc, where it is natural to do so.

The schema for a service contains the following fields:

service_name - A unique name for the service.
description - A natural language description of the tasks supported by the service.
slots - A list of slots/attributes corresponding to the entities present in the service. Each slot contains the following fields:
- name - The name of the slot.
- description - A natural language description of the slot.
- is_categorical - A boolean value. If it is true, the slot has a fixed set of possible values.
- possible_values - List of possible values the slot can take. If the slot is a categorical slot, it is a complete list of all the possible values. If the slot is a non categorical slot, it is either an empty list or a small sample of all the values taken by the slot.
intents - The list of intents/tasks supported by the service. Each method contains the following fields:
- name - The name of the intent.
- description - A natural language description of the intent.
- is_transactional - A boolean value. If true, indicates that the underlying API call is transactional (e.g, a booking or a purchase), as opposed to a search call.
- required_slots - A list of slot names whose values must be provided before making a call to the service.
- optional_slots - A dictionary mapping slot names to the default value taken by the slot. These slots may be optionally specified by the user and the user may override the default value. An empty default value allows that slot to take any value by default, but the user may override it.

Dialogue Representation

The dialogue is represented as a list of turns, where each turn contains either a user or a system utterance. The annotations for a turn are grouped into frames, where each frame corresponds to a single service. Each turn in the single domain dataset contains exactly one frame. In multi-domain datasets, some turns may have multiple frames.

Each dialogue is represented as a json object with the following fields:

dialogue_id - A unique identifier for a dialogue.
services - A list of services present in the dialogue.
turns - A list of annotated system or user utterances.

Each turn consists of the following fields:

speaker - The speaker for the turn. Possible values are "USER" or "SYSTEM".
utterance - A string containing the natural language utterance.
frames - A list of frames, each frame containing annotations for a single service.

Each frame consists of the fields listed below. The fields marked with * will be excluded in the test data released to the participants.

service - The name of the service corresponding to the frame. The slots and intents used in the following fields are taken from the schema of this service.
slots* - A list of slot spans in the utterance, only provided for non-categorical slots. Each slot span contains the following fields:
- slot - The name of the slot.
- start - The index of the starting character in the utterance corresponding to the slot value.
- exclusive_end - The index of the character just after the last character corresponding to the slot value in the utterance. In python, utterance[start:exclusive_end] gives the slot value.
actions (system turns only) - A list of actions corresponding to the system. Each action has the following fields:
- act - The type of action. The list of all possible system acts is given below.
- slot (optional)- Aslot argument for some of the actions.
- values (optional)- A list of values assigned to the slot. If the values list is non-empty, then the slot must be present.
state* (user turns only) - The dialogue state corresponding to the service. It consists of the following fields:
- active_intent - The intent corresponding to the service of the frame which is currently being fulfilled by the system. It takes the value "NONE" if none of the intents are active.
- requested_slots - A list of slots requested by the user in the current turn.
- slot_values - A dictionary mapping slot name to a list of strings. For categorical slots, this list contains a single value assigned to the slot. For non-categorical slots, all the values in this list are spoken variations of each other and are equivalent (e.g, "6 pm", "six in the evening", "evening at 6" etc.).

List of possible system acts:

INFORM - Inform the value for a slot to the user. The slot and values fields in the corresponding action are always non-empty.
REQUEST - Request the value of a slot from the user. The corresponding action always contains a slot, but values are optional. When values are present, they are used as examples for the user e.g, "Would you like to eat indian or chinese food or something else?"
CONFIRM - Confirm the value of a slot before making a transactional service call.
OFFER - Offer a certain value for a slot to the user. The corresponding action always contains a slot and a list of values for that slot offered to the user.
NOTIFY_SUCCESS - Inform the user that their request was successful. Slot and values are always empty in the corresponding action.
NOTIFY_FAILURE - Inform the user that their request failed. Slot and values are always empty in the corresponding action.
INFORM_COUNT - Inform the number of items found that satisfy the user's request. The corresponding action always has "count" as the slot, and a single element in values for the number of results obtained by the system.
OFFER_INTENT - Offer a new intent to the user. Eg, "Would you like to reserve a table?". The corresponding action always has "intent" as the slot, and a single value containing the intent being offered. The offered intent belongs to the service corresponding to the frame.
REQ_MORE - Asking the user if they need anything else. Slot and values are always empty in the corresponding action.
GOODBYE - End the dialogue. Slot and values are always empty in the corresponding action.

The Dialogue State

The dialogue state is the system's estimate of the user's goal based on the dialogue context. It is used to identify the appropriate service call to make and to assign values to different slots required by the service. The state is also used by the system to generate the next actions. In our setup, a separate dialogue state is maintained for each service in the corresponding frame. All turns in a single domain dialogue have exactly one frame. However, turns in multi-domain dialogues may have more than one frame. More details about this will be added after the multi domain dataset is released, however no changes will be made to single domain dialogues.

The dialogue state is only available for user turns. Within a frame, the dialogue state contains information about the active intent for the corresponding service, the set of slots whose values are requested by the user in the current turn and the values assigned to different slots in the service based on the dialogue context. The values assigned to different slots are represented as a list. For categorical slots, this list takes exactly one value, and this value is listed in the schema. Arbitrary string values are supported for non-categorical slots. We constrain these values to be substrings in the user or the system utterance. For such slots, the list of values contains all variations for the same value over the dialogue seen till the current turn (e.g, "6 pm", "six in the evening", "evening at 6" are spoken variations for the same value). The state tracker can output any of these variations. This decision was made to avoid the additional complexity of obtaining the canonicalized version of a value on the dialogue state tracker. In the real world, this canonicalization can be done by the service being called or by a separate module. For date slots, some of the dialogues contain a relative quantifier (e.g, "today", "tomorrow", "next thursday" etc.). For these values, March 1st, 2019 has been treated as today's date for all dialogues.

Dataset Statistics

The dataset consists of two kinds of dialogues.

Type of Dialogue	Train files	Dev files
Single Domain	`dialogues_001.json` to `dialogues_043.json`	`dialogues_001.json` to `dialogues_007.json`
Multi Domain	`dialogues_044.json` to `dialogues_127.json`	`dialogues_008.json` to `dialogues_020.json`

The single domain dialogues involve interactions with a single service, possibly over multiple intents. The multi-domain dialogues have interactions involving intents belonging to two or more different services. The multi-domain dialogues also involve transfer of dialogue state values from one service to the other wherever such a transfer is deemed natural. Eg, if a user finds a restaurant and searches for a movie next, the dialogue state for movie service is already initialized with the location from the dialogue state for restaurant service.

The overall statistics of the train and dev sets are given below. The term informable slots refers to the slots over which the user can specify a constraint. For example, slots like phone_number are not informable.

	Train			Dev
	Single-domain	Multi-domain	Combined	Single-domain	Multi-domain	Combined
No. of dialogues	5,403	10,739	16,142	836	1,646	2,482
No. of turns	82,588	247,376	329,964	11,928	36,978	48,726
No. of tokens (lower-cased)	807,562	2,409,805	3,217,367	117,496	353,381	470,877
Average turns per dialogue	15.286	23.035	20.441	14.268	22.356	19.632
Average tokens per turn	9.778	9.741	9.751	9.850	9.603	9.664
Total unique tokens (lower-cased)	16,353	25,459	30,352	6,803	10,533	12,719
Total no. of slots	201	214	214	134	132	136
Total no. of informable slots	143	149	151	94	91	94
Total unique slot values (lower-cased)	7,070	11,635	14,139	2,418	4,182	5,101
Total unique informable slot values (lower-cased)	3,843	6,669	7,998	1,223	2,254	2,673
Total domains	14	16	16	16	15	16
Total services	24	26	26	17	16	17
Total intents	35	37	37	28	26	28

The following table shows how the dialogues and services are distributed among different domains for the train and dev sets. In this table, each multi-domain dialogue contirbutes to the count of every service present in the dialogue. Please note that a few domains like Travel and Weather are only present in the dev set. This is to test the generalization of models on unseen domains. The test set will similarly have some unseen domains which are neither present in the training nor in the dev set. Also, the number in parenthesis represents the number of unique services belonging to the corresponding domain.

In the first column, it indicates the number of unique services for the domain in Train and Dev datasets combined.
In the fourth column, it indicates the number of such unique services in the Train dataset only.
In the last column, it indicates the number of such unique services in the Dev dataset only.

	# Dialogues Train			# Dialogues Dev
	Single-domain	Multi-domain	Combined	Single-domain	Multi-domain	Combined
Alarm (1)	NA	NA	NA	37	NA	37 (1)
Banks (2)	207	520	727 (1)	42	252	294 (1)
Buses (2)	310	1,970	2,280 (2)	44	285	329 (1)
Calendar (1)	169	1,433	1,602 (1)	NA	NA	NA
Events (2)	788	2,721	3,509 (1)	73	345	418 (1)
Flights (3)	985	1,762	2,747 (2)	94	297	391 (1)
Homes (1)	268	579	847 (1)	81	99	180 (1)
Hotels (4)	457	2,896	3,353 (3)	56	521	577 (2)
Media (2)	281	832	1,113 (1)	46	133	179 (1)
Movies (2)	292	1,325	1,617 (1)	47	94	141 (1)
Music (2)	394	896	1,290 (2)	35	161	196 (1)
RentalCars (2)	215	1,370	1,585 (2)	39	342	381 (1)
Restaurants (2)	367	2052	2,419 (1)	73	263	336 (1)
RideSharing (2)	119	1,584	1,703 (2)	45	225	270 (1)
Services (4)	551	1,338	1,889 (3)	44	157	201 (1)
Travel (1)	NA	1,871	1,871 (1)	45	238	283 (1)
Weather (1)	NA	951	951 (1)	35	322	357 (1)

Evaluation

The following metrics are defined for evaluation of dialogue state tracking. A python script for computing these metrics will be released soon with additional details. The joint goal accuracy will be used as the primary metric for ranking submissions.

Active intent accuracy - The fraction of user turns for which the active intent has been correctly predicted.
Slot tagging F1 - The macro-averaged F1 score for tagging slot values for non-categorical slots. This metric is optional to report in the final paper if participants decide not to use slot tagging.
Requested slots F1 - The macro-averaged F1 score for requested slots over the turns. For a turn, if there are no requested slots in both the ground truth and the prediction, that turn is skipped. The reported number is the average F1 score for all un-skipped user turns. This metric is optional to report in the final paper.
Average goal accuracy - For each turn, participants must predict a single value for each slot present in the dialogue state. The slots which have a non-empty assignment in the ground truth dialogue state are only considered. This is the average accuracy of predicting the value of a slot correctly. A fuzzy matching based score is used for non-categorical slots. Note that we don't report per-slot metrics because of a large variety of slots in this dataset, and the set of slots present in the training, dev and test set are not identical.
Joint goal accuracy - This is the average accuracy of predicting all slot assignments for a turn correctly. A fuzzy matching based score is used for non-categorical slots. This is the primary evaluation metric used for ranking submissions. More details to follow with the evaluation script.

Rules

Participation is welcome from any team (academic, corporate, government etc.).
The identity of participants will not be made public by the organizers. It may be orally announced at the workshop chosen for communicating results. The participants may choose to identify their own team in the paper.
Participants are allowed to use any external datasets, resources or pre-trained models.
The developed systems should be feasible to use in live systems in terms of runtime.
Manual inspection of test set is not permitted. This is enforced on an honorary basis.
Participants may report results on either or both of the single domain and multi domain datasets.

Acknowledgements

We thank Amir Fayazi, Maria Wang, Ulrich Rueckert and Jindong Chen for their valuable suggestions and support in the formulation of this track and collection of this dataset.

goddoe / dstc8-schema-guided-dialogue