RasaHQ / rasa

💬 Open source machine learning framework to automate text- and voice-based conversations: NLU, dialogue management, connect to Slack, Facebook, and more - Create chatbots and voice assistants

Home Page: https://rasa.com/docs/rasa/


Pre-trained embeddings not used as feature for CRFEntityExtractor

liaeh opened this issue · comments

commented

Rasa version: 2.7.1

Rasa SDK version (if used & relevant): 2.7.0

Rasa X version (if used & relevant):

Python version: 3.8.8

Operating system (windows, osx, ...): Windows-10-10.0.19041-SP0

Issue:
In the docs for the CRFEntityExtractor component, it says:

If you want to pass custom features, such as pre-trained word embeddings, to CRFEntityExtractor, you can add any dense featurizer to the pipeline before the CRFEntityExtractor. CRFEntityExtractor automatically finds the additional dense features and checks if the dense features are an iterable of len(tokens), where each entry is a vector. A warning will be shown in case the check fails.

However, I get identical results when using different language models, or even no language model at all. I'm using Rasa NLU only, for a simple entity extraction task. This leads me to think that the pre-trained embeddings are not being passed on to the CRFEntityExtractor, even though LanguageModelFeaturizer generates dense features and no warning is shown to indicate that the pre-trained embeddings are not passed.

For example, when training a CRFEntityExtractor with config 1, 2, or 3 on the same training data and evaluating on the same test set, I get identical precision/recall/F1 results.

Error (including full traceback):

Command or request that led to error:

rasa train nlu -c config1 (or 2 or 3)
rasa test nlu -c config1  (or 2 or 3)

Content of configuration file (config.yml) (if relevant):

Config 1

language: en

pipeline:
  - name: LanguageModelTokenizer
  - name: LexicalSyntacticFeaturizer
    "features": [
      # features for the word preceding the word being evaluated
      [ "suffix2", "prefix2" ],
      # features for the word being evaluated
      [ "BOS", "EOS" ],
      # features for the word following the word being evaluated
      [ "suffix2", "prefix2" ]]
  - name: CRFEntityExtractor

Config 2

language: en

pipeline:
  - name: LanguageModelTokenizer
  - name: LanguageModelFeaturizer
    model_name: "roberta"
    model_weights: "roberta-base"
  - name: LexicalSyntacticFeaturizer
    "features": [
      # features for the word preceding the word being evaluated
      [ "suffix2", "prefix2" ],
      # features for the word being evaluated
      [ "BOS", "EOS" ],
      # features for the word following the word being evaluated
      [ "suffix2", "prefix2" ]]
  - name: CRFEntityExtractor

Config 3

language: en

pipeline:
  - name: LanguageModelTokenizer
  - name: LanguageModelFeaturizer
    model_name: "distilbert"
    model_weights: "distilbert-base-uncased"
  - name: LexicalSyntacticFeaturizer
    "features": [
      # features for the word preceding the word being evaluated
      [ "suffix2", "prefix2" ],
      # features for the word being evaluated
      [ "BOS", "EOS" ],
      # features for the word following the word being evaluated
      [ "suffix2", "prefix2" ]]
  - name: CRFEntityExtractor

Content of domain file (domain.yml) (if relevant):

Content of training data:
I am just using a few utterances from the SNIPS dataset. Here's a small sample of my training data.

version: '2.0'

nlu:
- intent: General
  examples: |
    - find a [restaurant](restaurant_type) for [marylou and i](party_size_description) [within walking distance](spatial_relation) of [my mum s hotel](poi)
    - book a table at a [bar](restaurant_type) in [cambodia](country) that serves [cheese fries](served_dish)
    - i m in [bowling green](poi) please book a [restaurant](restaurant_type) for [1](party_size_number) [close by](spatial_relation)
    - book a [restaurant](restaurant_type) at a [steakhouse](restaurant_type) [around](spatial_relation) [in town](poi) that serves [empanada](served_dish) for [me and my son](party_size_description)
    - book me a table for [me and my nephew](party_size_description) [near](spatial_relation) [my location](poi) at an [indoor](facility) [pub](restaurant_type)
    - book a table for [me and belinda](party_size_description) serving [minestra](served_dish) in a [bar](restaurant_type)
    - i need seating for [ten](party_size_number) people at a [bar](restaurant_type) that serves [czech](cuisine) cuisine
    - book a spot for [connie earline and rose](party_size_description) at an [oyster bar](restaurant_type) that serves [chicken fried bacon](served_dish) in [beauregard](city) [delaware](state)
    - reserve a table for [two](party_size_number) at a [restaurant](restaurant_type) which serves [creole](cuisine) [around](spatial_relation) here in [myanmar](country)
    - take me a [top-rated](sort) [restaurant](restaurant_type) for [nine](party_size_number) [close](spatial_relation) to [westfield](city) [delaware](state)
    - book a [joint restaurant](restaurant_type) for [four](party_size_number) with an [outdoor](facility) [area within the same area](spatial_relation) as [borough de denali](poi)
    - make reservations for [7](party_size_number) people at a [top-rated](sort) [brazilian](cuisine) [pub](restaurant_type) [around](spatial_relation) [rockaway park-beach 116th](poi)
    - need to book a table [downtown](poi) [within walking distance](spatial_relation) of me at [j g melon](restaurant_name)

Definition of done

  • Determine whether this is only a documentation issue by looking through 1.4, 1.5, and 2.x and asking the research team
  • If so, then we should update the docs and add warnings
  • Otherwise, create another issue for addressing this bug
  • Reviewed by @koernerfelicia

Thanks for raising this issue! @lty4 will get back to you about it soon ✨

Please also check out the docs and the forum in case your issue was raised there too 🤗

I'm checking this out right now. My gut feeling is that the LanguageModelTokenizer is meant to handle the byte-pair tokenizer that's inside of Hugging Face. You're using it as a tokenizer for Rasa, so I imagine that's where something goes awry.

Note that the LanguageModelTokenizer is also deprecated.

I might also ask: is there a reason you weren't using DIET?

Correction!

I was able to reproduce the issue. These two pipelines yield the same results.

pipeline:
  - name: WhitespaceTokenizer
  - name: LanguageModelFeaturizer
    model_name: "roberta"
    model_weights: "roberta-base"
  - name: CRFEntityExtractor

pipeline:
  - name: WhitespaceTokenizer
  - name: CRFEntityExtractor

Even the confidence values are the same (confirmed via rasa shell nlu).

commented


Great, glad you were able to reproduce it.

The reason I'm not using DIET is that I want to benchmark how the NLU pipeline performs with and without fine-tuning a transformer model.

In the meantime, you can turn off the transformer layers inside of DIET. That way you can still get your measurement.
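
For example, something along these lines (a minimal sketch, assuming stock DIETClassifier options; my understanding is that setting number_of_transformer_layers to 0 skips the transformer block entirely):

language: en

pipeline:
  - name: WhitespaceTokenizer
  - name: LanguageModelFeaturizer
    model_name: "roberta"
    model_weights: "roberta-base"
  - name: DIETClassifier
    # assumption: with a transformer depth of 0, DIET classifies directly
    # on top of the featurizer output instead of the transformer output
    number_of_transformer_layers: 0

That way you can compare a DIET with trainable transformer layers against one without, while keeping the rest of the pipeline fixed.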

This looks like an investigation issue where the definition of done would involve producing a simple example (possibly just the one that @koaning used, once he shares it), identifying the root cause, and creating a follow-up issue to implement and test the fix.

I am not completely sure, but it looks like a documentation issue. As far as I can remember, CRFEntityExtractor never used dense features (word embeddings) and always relied on syntactic features... I even cross-checked on the 1.10.x branch, and at a quick glance it doesn't seem to be using the word embeddings of tokens for training a model.
As I said, this is just speculation... We should check it thoroughly.

Keeping tabs on reproducing this issue across Rasa versions, starting from Rasa v1.4.0. I have used the full SNIPS data for reproduction. Note that the configs differ for older versions, as a lot of the functionality wasn't available in earlier releases (Rasa 1.4.0 is from June 2019!). All done with Python 3.7.6.

  • identical performance in rasa v1.4.0 (using a slightly different config)
  • identical performance in rasa v1.5.3 (using a slightly different config)
  • identical performance in rasa v1.6.1 (using a slightly different config)
  • struggling to install rasa 1.7.x but I assume it won't deviate from the rest

  • identical performance in rasa v1.8.3 (using a minimally different config)
  • identical performance in rasa v1.9.7 (using a minimally different config)
  • identical performance in rasa v1.10.26 (using a minimally different config)
  • identical performance in rasa v2.0.8 (using a minimally different config)
  • identical performance in rasa v2.1.3 (using a minimally different config)
  • identical performance in rasa v2.2.10 (using a minimally different config)
  • identical performance in rasa v2.3.5 (using a minimally different config)
  • identical performance in rasa v2.4.3 (using a minimally different config)
  • identical performance in rasa v2.5.2 (using a minimally different config)
  • identical performance in rasa v2.6.3 (using a minimally different config)
  • identical performance in rasa v2.7.2 (using a minimally different config)
  • identical performance in rasa v2.8.3 (using a minimally different config)

Configs used for rasa 1.4.0, 1.5.3, 1.6.1:

  • config 1:

    language: "en"
    
    pipeline:
    - name: "SpacyNLP"
    - name: "SpacyTokenizer"
    - name: "SpacyFeaturizer"
    - name: "CRFEntityExtractor"
  • config 2:

    language: "en"
    
    pipeline:
    - name: "SpacyNLP"
    - name: "SpacyTokenizer"
    - name: "CRFEntityExtractor"

Configs used for rasa 1.8.3, 1.9.7, 1.10.26, 2.0.8, 2.1.3, 2.2.10, 2.3.5, 2.4.3, 2.5.2, 2.6.3, 2.7.2, 2.8.3:

  • config 1:

    language: en
    
    pipeline:
      - name: HFTransformersNLP
      - name: LanguageModelTokenizer
      - name: LexicalSyntacticFeaturizer
        "features": [
          # features for the word preceding the word being evaluated
          [ "suffix2", "prefix2" ],
          # features for the word being evaluated
          [ "BOS", "EOS" ],
          # features for the word following the word being evaluated
          [ "suffix2", "prefix2" ]]
      - name: CRFEntityExtractor
  • config 2:

    language: en
    
    pipeline:
      - name: HFTransformersNLP
      - name: LanguageModelTokenizer
      - name: LanguageModelFeaturizer
        model_name: "roberta"
        model_weights: "roberta-base"
      - name: LexicalSyntacticFeaturizer
        "features": [
          # features for the word preceding the word being evaluated
          [ "suffix2", "prefix2" ],
          # features for the word being evaluated
          [ "BOS", "EOS" ],
          # features for the word following the word being evaluated
          [ "suffix2", "prefix2" ]]
      - name: CRFEntityExtractor
    

Given the reproduction above, this looks like a docs issue. CRFEntityExtractor has indeed never used any dense word embeddings.

I had a closer look into the code for CRFEntityExtractor, and what's slightly odd is that the feature preprocessing does extract dense features from whatever embedding model you specify and adds them to the CRFToken, but then ignores them again when building the X_train matrix that's used for training the eventual sklearn_crfsuite CRF. @dakshvar22, do you know anything more about the use of dense features in the CRFEntityExtractor?

Given that this is unused code, we should probably remove it? (@TyDunn might need an updated definition of done if we want to remove the unused code.)

I had another poke at the issue, and it is possible to make CRFEntityExtractor use dense embeddings. The 3 configs below all give different results. It looks like it's more of a documentation issue now, as it's not documented how to configure CRFEntityExtractor to use dense features (so the code in question from the above comment is very much in use after all). I am also not sure whether this is an intended or an accidental feature (@TyDunn or @dakshvar22 might know more?).

Config 1:

language: en

pipeline:
  - name: WhitespaceTokenizer
  - name: LanguageModelFeaturizer
    model_name: "roberta"
    model_weights: "roberta-base"
  - name: LexicalSyntacticFeaturizer
    "features": [
      # features for the word preceding the word being evaluated
      [ "suffix2", "prefix2" ],
      # features for the word being evaluated
      [ "BOS", "EOS" ],
      # features for the word following the word being evaluated
      [ "suffix2", "prefix2" ]]
  - name: CRFEntityExtractor
    "features": [["text_dense_features"]]

Config 2:

language: en

pipeline:
  - name: WhitespaceTokenizer
  - name: LanguageModelFeaturizer
    model_name: "roberta"
    model_weights: "roberta-base"
  - name: LexicalSyntacticFeaturizer
    "features": [
      # features for the word preceding the word being evaluated
      [ "suffix2", "prefix2" ],
      # features for the word being evaluated
      [ "BOS", "EOS" ],
      # features for the word following the word being evaluated
      [ "suffix2", "prefix2" ]]
  - name: CRFEntityExtractor

Config 3:

language: en

pipeline:
  - name: WhitespaceTokenizer
  - name: LexicalSyntacticFeaturizer
    "features": [
      # features for the word preceding the word being evaluated
      [ "suffix2", "prefix2" ],
      # features for the word being evaluated
      [ "BOS", "EOS" ],
      # features for the word following the word being evaluated
      [ "suffix2", "prefix2" ]]
  - name: CRFEntityExtractor
    "features": [["text_dense_features"]]

@tttthomasssss I am sure it's accidental that the documentation lacks information on how to use dense features. We should add it if it's not already there.

Merged with #9572.