The FoodOrdering dataset is a task-oriented parsing dataset in the food-ordering domain, with utterances and annotations derived from the menus of five venues characteristic of that business vertical: burgers, burritos, coffees, pizzas, and subs.
For each restaurant, human-generated data was collected through Mechanical Turk, where the proposed task consisted of formulating a natural language request for an order from a provided menu, for one or multiple persons. The collected utterances were then manually annotated into machine-executable representations (EXR):
Here are examples from the 4 newly contributed restaurants/datasets: burger, burrito, coffee, and sub:
==> data/burger/dev.json <==
{
"SRC": "i would like a vegan burger with lettuce tomatoes and onions and a large order of sweet potato fries",
"EXR": "(MAIN_DISH_ORDER (NUMBER 1 ) (MAIN_DISH_TYPE vegan_burger ) (TOPPING lettuce ) (TOPPING tomato ) (TOPPING onion ) )
(SIDE_ORDER (NUMBER 1 ) (SIZE large ) (SIDE_TYPE sweet_potato_fries ) )"
}
==> data/burrito/dev.json <==
{
"SRC": "let me have a steak white rice and black bean burrito with red chili salsa a side of guacamole and a coke",
"EXR": "(SIDE_ORDER (NUMBER 1 ) (SIDE_TYPE guacamole ) )
(BURRITO_ORDER (NUMBER 1 ) (MAIN_FILLING steak ) (RICE_FILLING white_rice ) (BEAN_FILLING black_beans ) (SALSA_TOPPING red_chili_salsa ) )
(DRINK_ORDER (NUMBER 1 ) (DRINK_TYPE mexican_coca-cola ) )"
}
==> data/coffee/dev.json <==
{
"SRC": "i would like a regular latte cinnamon iced with one extra espresso shot",
"EXR": "(DRINK_ORDER (NUMBER 1 ) (SIZE regular ) (DRINK_TYPE latte ) (ROAST_TYPE cinnamon_roast ) (STYLE iced ) (TOPPING (ESPRESSO_SHOT 1 ) ) )"
}
==> data/sub/dev.json <==
{
"SRC": "i would like a cold cut combo with mayo pickles banana peppers tomato lettuce and pepper jack cheese",
"EXR": "(SANDWICH_ORDER (NUMBER 1 ) (BASE_SANDWICH cold_cut_combo ) (TOPPING regular_mayonnaise ) (TOPPING pickles ) (TOPPING banana_peppers ) (TOPPING tomatoes ) (TOPPING lettuce ) (TOPPING pepperjack ) )"
}
The 5th restaurant, pizza, comes from https://github.com/amazon-research/pizza-semantic-parsing-dataset:
==> data/pizza/dev.json <==
{
"SRC": "i want to order two medium pizzas with sausage and black olives and two medium pizzas with pepperoni and extra cheese and three large pizzas with pepperoni and sausage",
"EXR": "(PIZZAORDER (NUMBER 2 ) (SIZE medium ) (COMPLEX (QUANTITY extra ) (TOPPING cheese ) ) (TOPPING pepperoni ) )
(PIZZAORDER (NUMBER 2 ) (SIZE medium ) (TOPPING olives ) (TOPPING sausage ) )
(PIZZAORDER (NUMBER 3 ) (SIZE large ) (TOPPING pepperoni ) (TOPPING sausage ) )"
}
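The EXR strings above are whitespace-tokenized s-expressions (multi-intent annotations contain one tree per line), so they are easy to load programmatically. Below is a minimal parsing sketch; `parse_exr` is our own helper name, not part of the dataset tooling:

```python
import re

def parse_exr(exr: str):
    """Parse one EXR tree into nested lists of label/value strings."""
    # Tokens are parentheses or whitespace-delimited symbols.
    tokens = re.findall(r"\(|\)|[^\s()]+", exr)
    stack, root = [], None
    for tok in tokens:
        if tok == "(":
            node = []
            if stack:              # attach to the enclosing node, if any
                stack[-1].append(node)
            stack.append(node)
        elif tok == ")":
            root = stack.pop()     # the final pop yields the full tree
        else:
            stack[-1].append(tok)
    return root

tree = parse_exr("(SIDE_ORDER (NUMBER 1 ) (SIDE_TYPE guacamole ) )")
# tree == ['SIDE_ORDER', ['NUMBER', '1'], ['SIDE_TYPE', 'guacamole']]
```

For multi-intent annotations, apply the parser to each line of the EXR field separately.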
We are providing synthetic data for 3 of the 5 restaurants:
- For pizza, we sub-sample 10,000 utterances from the 2.5M provided in https://github.com/amazon-research/pizza-semantic-parsing-dataset.
- For the burrito and sub skills, we designed utterance templates such as:
please get me a {size} burrito with {topping1} and {topping2} but no {topping3}
and sampled the slot values (size, topping1, topping2, topping3 here) in the templates. The slot values were obtained from catalogs predefined for each restaurant menu. The templates and values were sampled to obtain 10,000 unique utterances for each of the sub, burrito, and pizza menus.
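As an illustration of the template-filling procedure, here is a small sketch. The catalogs below are hypothetical miniatures (the real slot values live under data/*/alias/), and `sample_utterances` is our own helper name:

```python
import random

# Hypothetical miniature catalogs; the real values live under data/*/alias/.
CATALOG = {
    "size": ["small", "regular", "large"],
    "topping": ["guacamole", "pico de gallo", "sour cream", "cheese"],
}
TEMPLATE = "please get me a {size} burrito with {topping1} and {topping2} but no {topping3}"

def sample_utterances(n: int, seed: int = 0) -> list:
    """Fill the template with randomly sampled slot values until n unique utterances exist."""
    rng = random.Random(seed)
    seen = set()
    while len(seen) < n:
        t1, t2, t3 = rng.sample(CATALOG["topping"], 3)   # three distinct toppings
        seen.add(TEMPLATE.format(size=rng.choice(CATALOG["size"]),
                                 topping1=t1, topping2=t2, topping3=t3))
    return sorted(seen)
```

The set enforces uniqueness, matching the "10,000 unique utterances" requirement; in practice the templates themselves are also sampled from a pool.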
Instead of using an executable representation for the target semantics, we use a format called TOP-Alias, which is reminiscent of the TOP-Decoupled format. See our publication for more details on how those representations differ. In the case of synthetically generated data, the two are identical, so we defer the details to the publication. Note that the EXR format can be directly obtained from the TOP-Alias format.
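As a rough illustration of that conversion: TOP-Alias leaves carry surface forms (spelled-out numbers, multi-word entity names), while EXR leaves carry canonical entities. The alias-to-canonical table below is a hypothetical miniature (the real catalogs are under data/*/alias/), and the fallback simply underscore-joins multi-word values:

```python
import re

# Hypothetical alias-to-canonical entries; the real catalogs are under data/*/alias/.
ALIAS_TO_CANONICAL = {
    "three": "3",
    "four": "4",
    "pecorino cheese": "pecorino_cheese",
}

def topalias_to_exr(topalias: str) -> str:
    """Rewrite slot values of a TOP-Alias string into canonical EXR entities."""
    def canon(m):
        value = m.group(2)
        return m.group(1) + ALIAS_TO_CANONICAL.get(value, value.replace(" ", "_")) + " "
    # Match a slot label, its (possibly multi-token) value, then the next parenthesis.
    return re.sub(r"([A-Z_]+ )([^()]+?) (?=[()])", canon, topalias)
```

For example, this maps "(TOPPING pecorino cheese )" to "(TOPPING pecorino_cheese )" and "(NUMBER three )" to "(NUMBER 3 )", while leaving intent labels and nesting untouched.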
Here are examples of the synthetically generated data:
==> data/burrito/train.json <==
{
"SRC": "i'd prefer four quesadillas with pork cauliflower black beans and grilled veggies with pico de gallo on top",
"TOPALIAS": "(QUESADILLA_ORDER (NUMBER four ) (MAIN_FILLING pork ) (RICE_FILLING cauliflower ) (BEAN_FILLING black beans ) (TOPPING grilled veggies ) (SALSA_TOPPING pico de gallo ) )"
}
==> data/pizza/train.json <==
{
"SRC": "three large pizzas with pecorino cheese and without tuna",
"TOPALIAS": "(PIZZAORDER (NUMBER three ) (SIZE large ) (TOPPING pecorino cheese ) (NOT (TOPPING tuna ) ) )"
}
==> data/sub/train.json <==
{
"SRC": "can you please order me four meatball marinara sandwiches not many honey mustard",
"TOPALIAS": "(SANDWICH_ORDER (NUMBER four ) (BASE_SANDWICH meatball marinara ) (COMPLEX (QUANTITY not many ) (TOPPING honey mustard ) ) )"
}
More details on the dataset conventions and construction can be found in the paper, but at a high level the semantics of each of the 5 datasets are composed of intents and slots:
- intent nodes - like DRINK_ORDER or MAIN_DISH_ORDER - root a subtree of semantics expressing one general intent in an overall multi-intent request, for example ordering a main dish and a side in one single order. These nodes have no parent nodes.
- slot nodes - like SIZE or TOPPING - have slot values as children (large, cream). They are combined with other slot values to qualify the semantics of the higher-level intent.
- slots can be negated: without cream will be expressed as (NOT (TOPPING whipped_cream ) )
- slots can be qualified: extra whipped cream will be expressed as (COMPLEX (QUANTITY extra ) (TOPPING whipped_cream ) )
- intents can have one or more slots as children nodes, but slots cannot have intents as children nodes.
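The last constraint can be checked mechanically. A sketch, assuming the intent labels are read from data/*/schema.json (hard-coded here as a hypothetical subset):

```python
import re

# Hypothetical subset of intent labels; the full lists live in data/*/schema.json.
INTENTS = {"PIZZAORDER", "DRINK_ORDER", "MAIN_DISH_ORDER", "SIDE_ORDER"}

def parse(exr: str):
    """Parse one parenthesized annotation into nested lists of tokens."""
    stack, root = [], None
    for tok in re.findall(r"\(|\)|[^\s()]+", exr):
        if tok == "(":
            node = []
            if stack:
                stack[-1].append(node)
            stack.append(node)
        elif tok == ")":
            root = stack.pop()
        else:
            stack[-1].append(tok)
    return root

def well_formed(node, under_slot: bool = False) -> bool:
    """Check that intent nodes never appear below a slot node."""
    if not isinstance(node, list):
        return True                       # a leaf slot value
    label, children = node[0], node[1:]
    if label in INTENTS and under_slot:
        return False
    in_slot = under_slot or label not in INTENTS
    return all(well_formed(c, in_slot) for c in children)
```

Here NOT and COMPLEX modifiers are simply treated as non-intent labels, which is all the constraint requires.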
In the first table we give high-level statistics of the skills' schemas in terms of the number of intents, slots, and slot values. The detailed schemas and catalogs can be found in ./data/*/schema.json and ./data/*/alias/*.
Dataset | # of intents | # of slots | # of slot values (entities) |
---|---|---|---|
burrito | 7 | 12 | 34 |
sub | 3 | 8 | 62 |
pizza (external) | 2 | 11 | 166 |
burger | 3 | 9 | 44 |
coffee | 1 | 10 | 43 |
In the table below, we give relevant utterance-level statistics describing the human-generated data:
Dataset | # utterances | # intents/utt | # slots/utt | Avg depth |
---|---|---|---|---|
burrito | 191 | 1.39 | 5.78 | 3.12 |
sub | 162 | 1.69 | 5.99 | 3.07 |
pizza (external) | 348 | 1.25 | 6.13 | 3.62 |
burger | 161 | 1.97 | 7.17 | 3.04 |
coffee | 101 | 1.05 | 5.34 | 3.20 |
and the synthetic data:
Dataset | # utterances | # intents/utt | # slots/utt | Avg depth |
---|---|---|---|---|
burrito | 9,982 | 1.57 | 6.50 | 3.48 |
sub | 10,000 | 1.79 | 6.24 | 3.37 |
pizza (external) | 10,000 | 1.77 | 5.77 | 3.44 |
NOTE1: No synthetic data was generated for burger and coffee, as they are used to demonstrate zero-shot learning.
NOTE2: Orders can be multi-intent (e.g., asking for a main dish as well as drinks); hence the depth computed above assumes the presence of a higher-level ORDER node encapsulating all intents, which is not explicitly present in the target semantic strings.
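Under that convention, depth can be computed as the bracket-nesting depth of the annotation plus one level for the implicit ORDER root. This is our reading of the convention (the paper's exact definition may differ); a sketch:

```python
def exr_depth(exr: str) -> int:
    """Max bracket-nesting depth of an annotation, plus one for the implicit ORDER root."""
    depth, max_depth = 0, 0
    for ch in exr:
        if ch == "(":
            depth += 1
            max_depth = max(max_depth, depth)
        elif ch == ")":
            depth -= 1
    return max_depth + 1   # the ORDER node adds one level above all intents
```

A plain intent-slot annotation thus has depth 3 (ORDER, intent, slot), and a NOT or COMPLEX wrapper pushes it to 4, consistent with the averages in the tables above.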
The repo structure is as follows:
FoodOrderingDataset
|
|____ data
| |
| |_____ burger
| | |____ dev.json # the human generated/annotated utterances
| | |
| | |____ schema.json # the schema of intents and slots
| | |
| | |____ alias # the catalog values for slots
| |
| |_____ burrito
| | |____ dev.json # the human generated/annotated utterances
| | |
| | |____ train.json # the synthetically generated data
| | |
| | |____ schema.json # the schema of intents and slots
| | |
| | |____ alias # the catalog values for slots
| |
| |_____ coffee
| | |____ dev.json # the human generated/annotated utterances
| | |
| | |____ schema.json # the schema of intents and slots
| | |
| | |____ alias # the catalog values for slots
| |
| |_____ pizza
| | |____ dev.json # [external] the dev set from https://github.com/amazon-research/pizza-semantic-parsing-dataset
| | |
| | |____ train.json # [external] a subset of 10,000 examples taken from training portion of https://github.com/amazon-research/pizza-semantic-parsing-dataset
| | |
| | |____ schema.json # the schema of intents and slots
| | |
| | |____ alias # the catalog values for slots, adapted from https://github.com/amazon-research/pizza-semantic-parsing-dataset
| |
| |_____ sub
| |____ dev.json # the human generated/annotated utterances
| |
| |____ train.json # the synthetically generated data
| |
| |____ schema.json # the schema of intents and slots
| |
| |____ alias # the catalog values for slots
|
|
|____ README.md
See CONTRIBUTING for more information.
This dataset is licensed under the CC-BY-NC-4.0 License.
If you use this dataset, please cite the following paper:
@inproceedings{a-rubino-etal-2022-cross,
title = "Cross-{TOP}: Zero-Shot Cross-Schema Task-Oriented Parsing",
author = "A. Rubino, Melanie and
Guenon des mesnards, Nicolas and
Shah, Uday and
Jiang, Nanjiang and
Sun, Weiqi and
Arkoudas, Konstantine",
booktitle = "Proceedings of the Third Workshop on Deep Learning for Low-Resource Natural Language Processing",
month = jul,
year = "2022",
address = "Hybrid",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.deeplo-1.6",
pages = "48--60",
}
and the original PIZZA dataset this work is derived from (see https://github.com/amazon-research/pizza-semantic-parsing-dataset):
@misc{pizzaDataset,
author = {Konstantine Arkoudas and
Nicolas Guenon des Mesnards and
Melanie Rubino and
Sandesh Swamy and
Saarthak Khanna and
Weiqi Sun},
title = {Pizza: a task-oriented semantic parsing dataset},
url = {https://github.com/amazon-research/pizza-semantic-parsing-dataset},
year = {2021}
}