The FoodOrdering dataset is a task-oriented parsing dataset in the food-ordering domain, with utterances and annotations derived from the menus of five venues characteristic of that business vertical: burgers, burritos, coffees, pizzas, and subs.
For each restaurant, human-generated data was collected through Mechanical Turk, where the proposed task consisted of formulating a natural language request for an order from a provided menu, for one or multiple persons. The collected utterances were then manually annotated into machine-executable representations (EXR):
Here are examples from the 4 newly contributed restaurants/datasets: burger, burrito, coffee, and sub:
==> data/burger/dev.json <==
{
"SRC": "i would like a vegan burger with lettuce tomatoes and onions and a large order of sweet potato fries",
"EXR": "(MAIN_DISH_ORDER (NUMBER 1 ) (MAIN_DISH_TYPE vegan_burger ) (TOPPING lettuce ) (TOPPING tomato ) (TOPPING onion ) )
(SIDE_ORDER (NUMBER 1 ) (SIZE large ) (SIDE_TYPE sweet_potato_fries ) )"
}
==> data/burrito/dev.json <==
{
"SRC": "let me have a steak white rice and black bean burrito with red chili salsa a side of guacamole and a coke",
"EXR": "(SIDE_ORDER (NUMBER 1 ) (SIDE_TYPE guacamole ) )
(BURRITO_ORDER (NUMBER 1 ) (MAIN_FILLING steak ) (RICE_FILLING white_rice ) (BEAN_FILLING black_beans ) (SALSA_TOPPING red_chili_salsa ) )
(DRINK_ORDER (NUMBER 1 ) (DRINK_TYPE mexican_coca-cola ) )"
}
==> data/coffee/dev.json <==
{
"SRC": "i would like a regular latte cinnamon iced with one extra espresso shot",
"EXR": "(DRINK_ORDER (NUMBER 1 ) (SIZE regular ) (DRINK_TYPE latte ) (ROAST_TYPE cinnamon_roast ) (STYLE iced ) (TOPPING (ESPRESSO_SHOT 1 ) ) )"
}
==> data/sub/dev.json <==
{
"SRC": "i would like a cold cut combo with mayo pickles banana peppers tomato lettuce and pepper jack cheese",
"EXR": "(SANDWICH_ORDER (NUMBER 1 ) (BASE_SANDWICH cold_cut_combo ) (TOPPING regular_mayonnaise ) (TOPPING pickles ) (TOPPING banana_peppers ) (TOPPING tomatoes ) (TOPPING lettuce ) (TOPPING pepperjack ) )"
}
The 5th restaurant, pizza, comes from https://github.com/amazon-research/pizza-semantic-parsing-dataset:
==> data/pizza/dev.json <==
{
"SRC": "i want to order two medium pizzas with sausage and black olives and two medium pizzas with pepperoni and extra cheese and three large pizzas with pepperoni and sausage",
"EXR": "(PIZZAORDER (NUMBER 2 ) (SIZE medium ) (COMPLEX (QUANTITY extra ) (TOPPING cheese ) ) (TOPPING pepperoni ) )
(PIZZAORDER (NUMBER 2 ) (SIZE medium ) (TOPPING olives ) (TOPPING sausage ) )
(PIZZAORDER (NUMBER 3 ) (SIZE large ) (TOPPING pepperoni ) (TOPPING sausage ) )"
}
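The EXR strings above are whitespace-tokenized s-expressions (multi-intent annotations contain one tree per line), so they are easy to load programmatically. Below is a minimal parsing sketch; `parse_exr` is our own helper name, not part of the dataset tooling:

```python
import re

def parse_exr(exr: str):
    """Parse one EXR tree into nested lists of label/value strings."""
    # Tokens are parentheses or whitespace-delimited symbols.
    tokens = re.findall(r"\(|\)|[^\s()]+", exr)
    stack, root = [], None
    for tok in tokens:
        if tok == "(":
            node = []
            if stack:              # attach to the enclosing node, if any
                stack[-1].append(node)
            stack.append(node)
        elif tok == ")":
            root = stack.pop()     # the final pop yields the full tree
        else:
            stack[-1].append(tok)
    return root

tree = parse_exr("(SIDE_ORDER (NUMBER 1 ) (SIDE_TYPE guacamole ) )")
# tree == ['SIDE_ORDER', ['NUMBER', '1'], ['SIDE_TYPE', 'guacamole']]
```

For multi-intent annotations, apply the parser to each line of the EXR field separately.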
We are providing synthetic data for 3 of the 5 restaurants:
- For pizza, we sub-sample 10,000 utterances from the 2.5M provided in https://github.com/amazon-research/pizza-semantic-parsing-dataset.
- For the burrito and sub skills, we designed utterance templates such as:
please get me a {size} burrito with {topping1} and {topping2} but no {topping3}
and sampled the slot values (size, topping1, topping2, topping3 here) in the templates. The slot values were obtained from catalogs predefined for each restaurant menu. The templates and values were sampled to obtain 10,000 unique utterances for each of the sub, burrito, and pizza menus.
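As an illustration of the template-filling procedure, here is a small sketch. The catalogs below are hypothetical miniatures (the real slot values live under data/*/alias/), and `sample_utterances` is our own helper name:

```python
import random

# Hypothetical miniature catalogs; the real values live under data/*/alias/.
CATALOG = {
    "size": ["small", "regular", "large"],
    "topping": ["guacamole", "pico de gallo", "sour cream", "cheese"],
}
TEMPLATE = "please get me a {size} burrito with {topping1} and {topping2} but no {topping3}"

def sample_utterances(n: int, seed: int = 0) -> list:
    """Fill the template with randomly sampled slot values until n unique utterances exist."""
    rng = random.Random(seed)
    seen = set()
    while len(seen) < n:
        t1, t2, t3 = rng.sample(CATALOG["topping"], 3)   # three distinct toppings
        seen.add(TEMPLATE.format(size=rng.choice(CATALOG["size"]),
                                 topping1=t1, topping2=t2, topping3=t3))
    return sorted(seen)
```

The set enforces uniqueness, matching the "10,000 unique utterances" requirement; in practice the templates themselves are also sampled from a pool.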
Instead of using an executable representation for the target semantics, we use a format called TOP-Alias, which is reminiscent of the TOP-Decoupled format. See our publication for more details on how those representations differ. In the case of synthetically generated data, the two are identical, so we defer the details to the publication. Note that the EXR format can be directly obtained from the TOP-Alias format.
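As a rough illustration of that conversion: TOP-Alias leaves carry surface forms (spelled-out numbers, multi-word entity names), while EXR leaves carry canonical entities. The alias-to-canonical table below is a hypothetical miniature (the real catalogs are under data/*/alias/), and the fallback simply underscore-joins multi-word values:

```python
import re

# Hypothetical alias-to-canonical entries; the real catalogs are under data/*/alias/.
ALIAS_TO_CANONICAL = {
    "three": "3",
    "four": "4",
    "pecorino cheese": "pecorino_cheese",
}

def topalias_to_exr(topalias: str) -> str:
    """Rewrite slot values of a TOP-Alias string into canonical EXR entities."""
    def canon(m):
        value = m.group(2)
        return m.group(1) + ALIAS_TO_CANONICAL.get(value, value.replace(" ", "_")) + " "
    # Match a slot label, its (possibly multi-token) value, then the next parenthesis.
    return re.sub(r"([A-Z_]+ )([^()]+?) (?=[()])", canon, topalias)
```

For example, this maps "(TOPPING pecorino cheese )" to "(TOPPING pecorino_cheese )" and "(NUMBER three )" to "(NUMBER 3 )", while leaving intent labels and nesting untouched.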
Here are examples of the synthetically generated data:
==> data/burrito/train.json <==
{
"SRC": "i'd prefer four quesadillas with pork cauliflower black beans and grilled veggies with pico de gallo on top",
"TOPALIAS": "(QUESADILLA_ORDER (NUMBER four ) (MAIN_FILLING pork ) (RICE_FILLING cauliflower ) (BEAN_FILLING black beans ) (TOPPING grilled veggies ) (SALSA_TOPPING pico de gallo ) )"
}
==> data/pizza/train.json <==
{
"SRC": "three large pizzas with pecorino cheese and without tuna",
"TOPALIAS": "(PIZZAORDER (NUMBER three ) (SIZE large ) (TOPPING pecorino cheese ) (NOT (TOPPING tuna ) ) )"
}
==> data/sub/train.json <==
{
"SRC": "can you please order me four meatball marinara sandwiches not many honey mustard",
"TOPALIAS": "(SANDWICH_ORDER (NUMBER four ) (BASE_SANDWICH meatball marinara ) (COMPLEX (QUANTITY not many ) (TOPPING honey mustard ) ) )"
}
More details on the dataset conventions and construction can be found in the paper, but at a high level the semantics of each of the 5 datasets are composed of intents and slots:
- intent nodes - like DRINK_ORDER or MAIN_DISH_ORDER - root a subtree of semantics expressing one general intent in an overall multi-intent request, for example ordering a main dish and a side in one single order. These nodes have no parent nodes.
- slot nodes - like SIZE or TOPPING - have slot values as children (large, cream). They are combined with other slot values to qualify the semantics of the higher-level intent.
- slots can be negated: without cream will be expressed as (NOT (TOPPING whipped_cream ) )
- slots can be qualified: extra whipped cream will be expressed as (COMPLEX (QUANTITY extra ) (TOPPING whipped_cream ) )
- intents can have one or more slots as children nodes, but slots cannot have intents as children nodes.
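The last constraint can be checked mechanically. A sketch, assuming the intent labels are read from data/*/schema.json (hard-coded here as a hypothetical subset):

```python
import re

# Hypothetical subset of intent labels; the full lists live in data/*/schema.json.
INTENTS = {"PIZZAORDER", "DRINK_ORDER", "MAIN_DISH_ORDER", "SIDE_ORDER"}

def parse(exr: str):
    """Parse one parenthesized annotation into nested lists of tokens."""
    stack, root = [], None
    for tok in re.findall(r"\(|\)|[^\s()]+", exr):
        if tok == "(":
            node = []
            if stack:
                stack[-1].append(node)
            stack.append(node)
        elif tok == ")":
            root = stack.pop()
        else:
            stack[-1].append(tok)
    return root

def well_formed(node, under_slot: bool = False) -> bool:
    """Check that intent nodes never appear below a slot node."""
    if not isinstance(node, list):
        return True                       # a leaf slot value
    label, children = node[0], node[1:]
    if label in INTENTS and under_slot:
        return False
    in_slot = under_slot or label not in INTENTS
    return all(well_formed(c, in_slot) for c in children)
```

Here NOT and COMPLEX modifiers are simply treated as non-intent labels, which is all the constraint requires.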
In the first table we give high-level statistics of the skills' schemas in terms of the number of intents, slots, and slot values. The detailed schemas and catalogs can be found in ./data/*/schema.json and ./data/*/alias/*.
Dataset | # of intents | # of slots | # of slot values (entities) |
---|---|---|---|
burrito | 7 | 12 | 34 |
sub | 3 | 8 | 62 |
pizza (external) | 2 | 11 | 166 |
burger | 3 | 9 | 44 |
coffee | 1 | 10 | 43 |
In the table below, we give relevant utterance-level statistics describing the human-generated data:
Dataset | # utterances | # intents/utt | # slots/utt | Avg depth |
---|---|---|---|---|
burrito | 191 | 1.39 | 5.78 | 3.12 |
sub | 162 | 1.69 | 5.99 | 3.07 |
pizza (external) | 348 | 1.25 | 6.13 | 3.62 |
burger | 161 | 1.97 | 7.17 | 3.04 |
coffee | 101 | 1.05 | 5.34 | 3.20 |
and the synthetic data:
Dataset | # utterances | # intents/utt | # slots/utt | Avg depth |
---|---|---|---|---|
burrito | 9,982 | 1.57 | 6.50 | 3.48 |
sub | 10,000 | 1.79 | 6.24 | 3.37 |
pizza (external) | 10,000 | 1.77 | 5.77 | 3.44 |
NOTE1: No synthetic data was generated for burger and coffee, as they are used to demonstrate zero-shot learning.
NOTE2: Orders can be multi-intent (e.g., asking for a main dish as well as drinks); hence the depth computed above assumes the presence of a higher-level ORDER node encapsulating all intents, which is not explicitly present in the target semantic strings.
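Under that convention, depth can be computed as the bracket-nesting depth of the annotation plus one level for the implicit ORDER root. This is our reading of the convention (the paper's exact definition may differ); a sketch:

```python
def exr_depth(exr: str) -> int:
    """Max bracket-nesting depth of an annotation, plus one for the implicit ORDER root."""
    depth, max_depth = 0, 0
    for ch in exr:
        if ch == "(":
            depth += 1
            max_depth = max(max_depth, depth)
        elif ch == ")":
            depth -= 1
    return max_depth + 1   # the ORDER node adds one level above all intents
```

A plain intent-slot annotation thus has depth 3 (ORDER, intent, slot), and a NOT or COMPLEX wrapper pushes it to 4, consistent with the averages in the tables above.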
The repo structure is as follows:
FoodOrderingDataset
|
|____ data
| |
| |_____ burger
| | |____ dev.json # the human generated/annotated utterances
| | |
| | |____ schema.json # the schema of intents and slots
| | |
| | |____ alias # the catalog values for slots
| |
| |_____ burrito
| | |____ dev.json # the human generated/annotated utterances
| | |
| | |____ train.json # the synthetically generated data
| | |
| | |____ schema.json # the schema of intents and slots
| | |
| | |____ alias # the catalog values for slots
| |
| |_____ coffee
| | |____ dev.json # the human generated/annotated utterances
| | |
| | |____ schema.json # the schema of intents and slots
| | |
| | |____ alias # the catalog values for slots
| |
| |_____ pizza
| | |____ dev.json # [external] the dev set from https://github.com/amazon-research/pizza-semantic-parsing-dataset
| | |
| | |____ train.json # [external] a subset of 10,000 examples taken from training portion of https://github.com/amazon-research/pizza-semantic-parsing-dataset
| | |
| | |____ schema.json # the schema of intents and slots
| | |
| | |____ alias # the catalog values for slots, adapted from https://github.com/amazon-research/pizza-semantic-parsing-dataset
| |
| |_____ sub
| |____ dev.json # the human generated/annotated utterances
| |
| |____ train.json # the synthetically generated data
| |
| |____ schema.json # the schema of intents and slots
| |
| |____ alias # the catalog values for slots
|
|
|____ README.md
See CONTRIBUTING for more information.
This dataset is licensed under the CC-BY-NC-4.0 License.
If you use this dataset, please cite the following paper:
@inproceedings{a-rubino-etal-2022-cross,
title = "Cross-{TOP}: Zero-Shot Cross-Schema Task-Oriented Parsing",
author = "A. Rubino, Melanie and
Guenon des mesnards, Nicolas and
Shah, Uday and
Jiang, Nanjiang and
Sun, Weiqi and
Arkoudas, Konstantine",
booktitle = "Proceedings of the Third Workshop on Deep Learning for Low-Resource Natural Language Processing",
month = jul,
year = "2022",
address = "Hybrid",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.deeplo-1.6",
pages = "48--60",
}
and the original PIZZA dataset this work is derived from (see https://github.com/amazon-research/pizza-semantic-parsing-dataset):
@misc{pizzaDataset,
author = {Konstantine Arkoudas and
Nicolas Guenon des Mesnards and
Melanie Rubino and
Sandesh Swamy and
Saarthak Khanna and
Weiqi Sun},
title = {Pizza: a task-oriented semantic parsing dataset},
url = {https://github.com/amazon-research/pizza-semantic-parsing-dataset},
year = {2021}
}