RasaHQ / rasa

💬 Open source machine learning framework to automate text- and voice-based conversations: NLU, dialogue management, connect to Slack, Facebook, and more - Create chatbots and voice assistants

Home Page: https://rasa.com/docs/rasa/


Pre-trained embeddings not used as feature for CRFEntityExtractor

liaeh opened this issue · comments

commented

Rasa version: 2.7.1

Rasa SDK version (if used & relevant): 2.7.0

Rasa X version (if used & relevant):

Python version: 3.8.8

Operating system (windows, osx, ...): Windows-10-10.0.19041-SP0

Issue:
In the docs for the CRFEntityExtractor component, it says:

If you want to pass custom features, such as pre-trained word embeddings, to CRFEntityExtractor, you can add any dense featurizer to the pipeline before the CRFEntityExtractor. CRFEntityExtractor automatically finds the additional dense features and checks if the dense features are an iterable of len(tokens), where each entry is a vector. A warning will be shown in case the check fails.

However, I get identical results when using different language models, or even no language model at all. I'm using Rasa NLU only, for a simple entity extraction task. This leads me to think that the pre-trained embeddings are not being passed on to the CRFEntityExtractor, even though LanguageModelFeaturizer generates dense features and no warning is shown to indicate that the pre-trained embeddings are not passed.

For example, when training a CRFEntityExtractor with config 1, 2, or 3 on the same training data and evaluating on the same test set, I get identical precision/recall/F1 results.

Error (including full traceback):

Command or request that led to error:

rasa train nlu -c config1 (or 2 or 3)
rasa test nlu -c config1  (or 2 or 3)

Content of configuration file (config.yml) (if relevant):

Config 1

language: en

pipeline:
  - name: LanguageModelTokenizer
  - name: LexicalSyntacticFeaturizer
    "features": [
      # features for the word preceding the word being evaluated
      [ "suffix2", "prefix2" ],
      # features for the word being evaluated
      [ "BOS", "EOS" ],
      # features for the word following the word being evaluated
      [ "suffix2", "prefix2" ]]
  - name: CRFEntityExtractor

Config 2

language: en

pipeline:
  - name: LanguageModelTokenizer
  - name: LanguageModelFeaturizer
    model_name: "roberta"
    model_weights: "roberta-base"
  - name: LexicalSyntacticFeaturizer
    "features": [
      # features for the word preceding the word being evaluated
      [ "suffix2", "prefix2" ],
      # features for the word being evaluated
      [ "BOS", "EOS" ],
      # features for the word following the word being evaluated
      [ "suffix2", "prefix2" ]]
  - name: CRFEntityExtractor

Config 3

language: en

pipeline:
  - name: LanguageModelTokenizer
  - name: LanguageModelFeaturizer
    model_name: "distilbert"
    model_weights: "distilbert-base-uncased"
  - name: LexicalSyntacticFeaturizer
    "features": [
      # features for the word preceding the word being evaluated
      [ "suffix2", "prefix2" ],
      # features for the word being evaluated
      [ "BOS", "EOS" ],
      # features for the word following the word being evaluated
      [ "suffix2", "prefix2" ]]
  - name: CRFEntityExtractor

Content of domain file (domain.yml) (if relevant):

Content of training data:
I am just using a few utterances from the SNIPS dataset. Here's a small sample of my training data.

version: '2.0'

nlu:
- intent: General
  examples: |
    - find a [restaurant](restaurant_type) for [marylou and i](party_size_description) [within walking distance](spatial_relation) of [my mum s hotel](poi)
    - book a table at a [bar](restaurant_type) in [cambodia](country) that serves [cheese fries](served_dish)
    - i m in [bowling green](poi) please book a [restaurant](restaurant_type) for [1](party_size_number) [close by](spatial_relation)
    - book a [restaurant](restaurant_type) at a [steakhouse](restaurant_type) [around](spatial_relation) [in town](poi) that serves [empanada](served_dish) for [me and my son](party_size_description)
    - book me a table for [me and my nephew](party_size_description) [near](spatial_relation) [my location](poi) at an [indoor](facility) [pub](restaurant_type)
    - book a table for [me and belinda](party_size_description) serving [minestra](served_dish) in a [bar](restaurant_type)
    - i need seating for [ten](party_size_number) people at a [bar](restaurant_type) that serves [czech](cuisine) cuisine
    - book a spot for [connie earline and rose](party_size_description) at an [oyster bar](restaurant_type) that serves [chicken fried bacon](served_dish) in [beauregard](city) [delaware](state)
    - reserve a table for [two](party_size_number) at a [restaurant](restaurant_type) which serves [creole](cuisine) [around](spatial_relation) here in [myanmar](country)
    - take me a [top-rated](sort) [restaurant](restaurant_type) for [nine](party_size_number) [close](spatial_relation) to [westfield](city) [delaware](state)
    - book a [joint restaurant](restaurant_type) for [four](party_size_number) with an [outdoor](facility) [area within the same area](spatial_relation) as [borough de denali](poi)
    - make reservations for [7](party_size_number) people at a [top-rated](sort) [brazilian](cuisine) [pub](restaurant_type) [around](spatial_relation) [rockaway park-beach 116th](poi)
    - need to book a table [downtown](poi) [within walking distance](spatial_relation) of me at [j g melon](restaurant_name)

Definition of done

  • Determine whether this is only a documentation issue by looking through 1.4, 1.5, and 2.x and asking the research team
  • If so, then we should update the docs and add warnings
  • Otherwise, create another issue for addressing this bug
  • Reviewed by @koernerfelicia

Thanks for raising this issue! @lty4 will get back to you about it soon ✨

Please also check out the docs and the forum in case your issue was raised there too 🤗

I'm checking this out right now. My gut feeling is that the LanguageModelTokenizer is meant to handle the byte-pair tokenizer that's inside of Hugging Face. You're using it as a tokenizer for Rasa, so I imagine that's where something goes awry.

Note that the LanguageModelTokenizer is also deprecated.

I might also ask: is there a reason you weren't using DIET?

Correction!

I was able to reproduce the issue. These two pipelines yield the same results.

pipeline:
  - name: WhitespaceTokenizer
  - name: LanguageModelFeaturizer
    model_name: "roberta"
    model_weights: "roberta-base"
  - name: CRFEntityExtractor

pipeline:
  - name: WhitespaceTokenizer
  - name: CRFEntityExtractor

Even the confidence values are the same (confirmed via rasa shell nlu).

commented


Great, glad you were able to reproduce it.

The reason I'm not using DIET is that I want to benchmark how the NLU pipeline performs with and without fine-tuning a transformer model.

In the meantime, you can turn off the transformer layers inside of DIET. That way you can still get your measurement.
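
For example, something along these lines (a minimal sketch, assuming stock DIETClassifier options; my understanding is that setting number_of_transformer_layers to 0 skips the transformer block entirely):

language: en

pipeline:
  - name: WhitespaceTokenizer
  - name: LanguageModelFeaturizer
    model_name: "roberta"
    model_weights: "roberta-base"
  - name: DIETClassifier
    # assumption: with a transformer depth of 0, DIET classifies directly
    # on top of the featurizer output instead of the transformer output
    number_of_transformer_layers: 0

That way you can compare a DIET with trainable transformer layers against one without, while keeping the rest of the pipeline fixed.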

This looks like an investigation issue where the definition of done would involve producing a simple example (possibly just the one that @koaning used, once he shares it), identifying the root cause, and creating a follow-up issue to implement and test the fix.

I am not completely sure, but it looks like a documentation issue. As far as I can remember, CRFEntityExtractor never used dense features (word embeddings) and always relied on syntactic features... I even cross-checked on the 1.10.x branch, and at a quick glance it doesn't seem to be using the word embeddings of tokens for training a model.
As I said, this is just speculation... We should check it thoroughly.

Keeping tabs on reproducing this issue across Rasa versions, starting from Rasa v1.4.0. I have used the full SNIPS data for reproduction. Note that the configs differ for older versions, as a lot of the functionality wasn't available in earlier releases (Rasa 1.4.0 is from June 2019!). All done with Python 3.7.6.

  • identical performance in rasa v1.4.0 (using a slightly different config)
  • identical performance in rasa v1.5.3 (using a slightly different config)
  • identical performance in rasa v1.6.1 (using a slightly different config)
  • struggling to install rasa 1.7.x but I assume it won't deviate from the rest

  • identical performance in rasa v1.8.3 (using a minimally different config)
  • identical performance in rasa v1.9.7 (using a minimally different config)
  • identical performance in rasa v1.10.26 (using a minimally different config)
  • identical performance in rasa v2.0.8 (using a minimally different config)
  • identical performance in rasa v2.1.3 (using a minimally different config)
  • identical performance in rasa v2.2.10 (using a minimally different config)
  • identical performance in rasa v2.3.5 (using a minimally different config)
  • identical performance in rasa v2.4.3 (using a minimally different config)
  • identical performance in rasa v2.5.2 (using a minimally different config)
  • identical performance in rasa v2.6.3 (using a minimally different config)
  • identical performance in rasa v2.7.2 (using a minimally different config)
  • identical performance in rasa v2.8.3 (using a minimally different config)

Configs used for rasa 1.4.0, 1.5.3, 1.6.1:

  • config 1:

    language: "en"
    
    pipeline:
    - name: "SpacyNLP"
    - name: "SpacyTokenizer"
    - name: "SpacyFeaturizer"
    - name: "CRFEntityExtractor"
  • config 2:

    language: "en"
    
    pipeline:
    - name: "SpacyNLP"
    - name: "SpacyTokenizer"
    - name: "CRFEntityExtractor"

Configs used for rasa 1.8.3, 1.9.7, 1.10.26, 2.0.8, 2.1.3, 2.2.10, 2.3.5, 2.4.3, 2.5.2, 2.6.3, 2.7.2, 2.8.3:

  • config 1:

    language: en
    
    pipeline:
      - name: HFTransformersNLP
      - name: LanguageModelTokenizer
      - name: LexicalSyntacticFeaturizer
        "features": [
          # features for the word preceding the word being evaluated
          [ "suffix2", "prefix2" ],
          # features for the word being evaluated
          [ "BOS", "EOS" ],
          # features for the word following the word being evaluated
          [ "suffix2", "prefix2" ]]
      - name: CRFEntityExtractor
  • config 2:

    language: en
    
    pipeline:
      - name: HFTransformersNLP
      - name: LanguageModelTokenizer
      - name: LanguageModelFeaturizer
        model_name: "roberta"
        model_weights: "roberta-base"
      - name: LexicalSyntacticFeaturizer
        "features": [
          # features for the word preceding the word being evaluated
          [ "suffix2", "prefix2" ],
          # features for the word being evaluated
          [ "BOS", "EOS" ],
          # features for the word following the word being evaluated
          [ "suffix2", "prefix2" ]]
      - name: CRFEntityExtractor
    

Given the reproduction above, this looks like a docs issue. CRFEntityExtractor has indeed never used any dense word embeddings.

I had a closer look into the code for CRFEntityExtractor, and what's slightly odd is that the feature preprocessing does extract dense features from whatever embedding model you specify and adds them to the CRFToken, but then ignores them again when building the X_train matrix that's used for training the eventual sklearn_crfsuite CRF. @dakshvar22, do you know anything more about the use of dense features in the CRFEntityExtractor?

Given that this is unused code, we should probably remove it? (@TyDunn might need an updated definition of done if we want to remove the unused code.)

I had another poke at the issue, and it is possible to make CRFEntityExtractor use dense embeddings. The 3 configs below all give different results. It looks like it's more of a documentation issue now, as it's not documented how to configure CRFEntityExtractor to use dense features (so the code in question from the above comment is very much in use after all). I am also not sure whether this is an intended or an accidental feature (@TyDunn or @dakshvar22 might know more?).

Config 1:

language: en

pipeline:
  - name: WhitespaceTokenizer
  - name: LanguageModelFeaturizer
    model_name: "roberta"
    model_weights: "roberta-base"
  - name: LexicalSyntacticFeaturizer
    "features": [
      # features for the word preceding the word being evaluated
      [ "suffix2", "prefix2" ],
      # features for the word being evaluated
      [ "BOS", "EOS" ],
      # features for the word following the word being evaluated
      [ "suffix2", "prefix2" ]]
  - name: CRFEntityExtractor
    "features": [["text_dense_features"]]

Config 2:

language: en

pipeline:
  - name: WhitespaceTokenizer
  - name: LanguageModelFeaturizer
    model_name: "roberta"
    model_weights: "roberta-base"
  - name: LexicalSyntacticFeaturizer
    "features": [
      # features for the word preceding the word being evaluated
      [ "suffix2", "prefix2" ],
      # features for the word being evaluated
      [ "BOS", "EOS" ],
      # features for the word following the word being evaluated
      [ "suffix2", "prefix2" ]]
  - name: CRFEntityExtractor

Config 3:

language: en

pipeline:
  - name: WhitespaceTokenizer
  - name: LexicalSyntacticFeaturizer
    "features": [
      # features for the word preceding the word being evaluated
      [ "suffix2", "prefix2" ],
      # features for the word being evaluated
      [ "BOS", "EOS" ],
      # features for the word following the word being evaluated
      [ "suffix2", "prefix2" ]]
  - name: CRFEntityExtractor
    "features": [["text_dense_features"]]

@tttthomasssss I am sure it's accidental that the documentation lacks information on how to use dense features. We should add it if it's not already there.

Merged with #9572.