MAIF / melusine

📧 Melusine: Use Python to automate your email processing workflow

Home Page: https://maif.github.io/melusine

'NeuralModel' object has no attribute 'bert_tokenizer'

kmezhoud opened this issue · comments

Dear all,
I cannot run the following training code from an R environment.

from melusine.models.neural_architectures import cnn_model
from melusine.models.train import NeuralModel

nn_model = NeuralModel(architecture_function=rnn_model,
                        pretrained_embedding=pretrained_embedding,
                        text_input_column="clean_body",
                        meta_input_list=['extension', 'dayofweek','hour', 'min', 'attachment_type'],
                        n_epochs=10)
SyntaxError: unexpected EOF while parsing (<string>, line 2)
Error in py_call_impl(callable, dots$args, dots$keywords):
  AttributeError: 'NeuralModel' object has no attribute 'bert_tokenizer'

Detailed traceback:
  File "/usr/lib/python3.6/pprint.py", line 144, in pformat
    self._format(object, sio, 0, 0, {}, 0)
  File "/usr/lib/python3.6/pprint.py", line 161, in _format
    rep = self._repr(object, context, level)
  File "/usr/lib/python3.6/pprint.py", line 393, in _repr
    self._depth, level)
  File "/usr/lib/python3.6/pprint.py", line 405, in format
    return _safe_repr(object, context, maxlevels, level)
  File "/usr/lib/python3.6/pprint.py", line 555, in _safe_repr
    rep = repr(object)
  File "/home/kirus/.local/lib/python3.6/site-packages/sklearn/base.py", line 260, in __repr__
    repr_ = pp.pformat(self)
  File "/usr/lib/python3.6/pprint.py", line 144, in pformat
    self._format(object, sio, 0, 0, {}, 0)
  File "/usr/lib/python3.6/pprint.py", line 161, in _format
    rep = self._repr(object, context, level)
  File "/usr/lib/python3.6/pprint.py", line 393, in _repr
    self._dept

When I import the cnn_model class, I get:

from melusine.models.neural_architectures import cnn_model
2021-09-16 11:13:08.221051: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib:/usr/lib/R/lib::/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/default-java/lib/server
2021-09-16 11:13:08.221084: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.

Any ideas are welcome. Thanks.

I have libcudart 9.1 and not libcudart.so.10.1.

sudo apt list | grep libcudart
libcudart9.1/bionic,now 9.1.85-3ubuntu1 amd64  [installed]

I do not have a GPU set up, nor an NVIDIA card:

lspci | grep VGA
00:02.0 VGA compatible controller: Intel Corporation Iris Graphics 6100 (rev 09)
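As a side note (not from the maintainers): on a CPU-only machine the libcudart warning is harmless, and TensorFlow respects standard environment variables that silence it. A minimal sketch, assuming TensorFlow 2.x and that the variables are set before the import:

```python
import os

# Set these BEFORE importing TensorFlow:
# - TF_CPP_MIN_LOG_LEVEL="2" hides INFO and WARNING C++ log lines
#   (including the "Could not load dynamic library 'libcudart...'" one).
# - CUDA_VISIBLE_DEVICES="-1" tells TensorFlow not to look for a GPU at all.
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

# import tensorflow as tf  # imported after the variables are set, runs CPU-only
```

This does not change any behavior of Melusine; it only suppresses the noisy (and in this case irrelevant) GPU probing.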

Python version: 3.6

Melusine version: 2.3.1

Operating System: Ubuntu 18.04.5 LTS on a MacBook Pro Retina

Python Module versions

kirus@izis:~$ pip3.6 list
Package                 Version
----------------------- -------------------
absl-py                 0.13.0
altair                  4.1.0
appdirs                 1.4.4
apt-xapian-index        0.47
argon2-cffi             21.1.0
asn1crypto              0.24.0
astor                   0.8.1
astunparse              1.6.3
async-generator         1.10
attrs                   21.2.0
backcall                0.2.0
backports.zoneinfo      0.2.1
base58                  2.1.0
bleach                  4.1.0
blinker                 1.4
blis                    0.7.4
cachetools              4.2.2
catalogue               2.0.6
certifi                 2018.1.18
cffi                    1.14.4
chardet                 3.0.4
charset-normalizer      2.0.4
click                   7.1.2
coala-utils             0.7.0
colorama                0.3.9
command-not-found       0.3
contextvars             2.4
cryptography            2.1.4
cupshelpers             1.0
cycler                  0.10.0
cymem                   2.0.5
dataclasses             0.8
dbr                     8.0.1
decorator               5.0.9
defusedxml              0.7.1
dependency-management   0.4.0
distro-info             0.18ubuntu0.18.04.1
Django                  2.0.3
django-sslserver        0.20
en-core-web-sm          3.1.0
entrypoints             0.3
filelock                3.0.12
flashtext               2.7
fr-core-news-md         3.1.0
gast                    0.3.3
gensim                  4.1.0
gitdb2                  2.0.3
GitPython               2.1.9
google-auth             1.35.0
google-auth-oauthlib    0.4.6
google-pasta            0.2.0
grpcio                  1.40.0
h5py                    2.10.0
httplib2                0.11.3
idna                    2.6
immutables              0.16
importlib-metadata      4.8.1
importlib-resources     5.2.2
ipykernel               5.5.5
ipython                 7.16.1
ipython-genutils        0.2.0
ipywidgets              7.6.4
jedi                    0.18.0
Jinja2                  3.0.1
joblib                  1.0.1
jsonschema              3.2.0
jupyter-client          7.0.2
jupyter-core            4.7.1
jupyterlab-pygments     0.1.2
jupyterlab-widgets      1.0.1
Keras-Preprocessing     1.1.2
keyring                 10.6.0
keyrings.alt            3.0
kiwisolver              1.3.1
language-selector       0.1
libzbar-cffi            0.2.1
Markdown                3.3.4
MarkupSafe              2.0.1
matplotlib              3.3.4
melusine                2.3.1
mistune                 0.8.4
murmurhash              1.0.5
nbclient                0.5.4
nbconvert               6.0.7
nbformat                5.1.3
nest-asyncio            1.5.1
netifaces               0.10.4
nltk                    3.6.2
notebook                6.4.3
numpy                   1.18.5
oauthlib                3.1.1
olefile                 0.45.1
opt-einsum              3.3.0
packaging               21.0
pandas                  1.1.5
pandocfilters           1.4.3
parso                   0.8.2
pathy                   0.6.0
pbr                     4.0.3
pexpect                 4.2.1
pickleshare             0.7.5
Pillow                  8.3.2
pip                     21.2.4
plotly                  5.3.1
preshed                 3.0.5
prometheus-client       0.11.0
prompt-toolkit          3.0.20
protobuf                3.17.3
ptyprocess              0.7.0
pyarrow                 5.0.0
pyasn1                  0.4.8
pyasn1-modules          0.2.8
pycairo                 1.16.2
pycparser               2.20
pycrypto                2.6.1
pycups                  1.9.73
pydantic                1.8.2
pydeck                  0.6.2
Pygments                2.10.0
PyGObject               3.26.1
pyparsing               2.4.7
PyPrint                 0.2.6
pyrsistent              0.18.0
python-apt              1.6.5+ubuntu0.6
python-dateutil         2.8.2
python-debian           0.1.32
pytz                    2018.3
pyxdg                   0.25
PyYAML                  3.12
pyzbar                  0.1.7
pyzmq                   22.2.1
regex                   2021.8.28
reportlab               3.4.0
requests                2.26.0
requests-oauthlib       1.3.0
requests-unixsocket     0.1.5
rsa                     4.7.2
sacremoses              0.0.45
sarge                   0.1.6
scikit-learn            0.24.2
scipy                   1.5.4
scour                   0.36
SecretStorage           2.3.1
Send2Trash              1.8.0
sentencepiece           0.1.96
setuptools              58.0.4
simplemma               0.3.0
six                     1.16.0
smart-open              5.2.1
smmap                   3.0.4
smmap2                  3.0.1
spacy                   3.1.2
spacy-legacy            3.0.8
srsly                   2.4.1
streamlit               0.88.0
systemd-python          234
tenacity                8.0.1
tensorboard             2.6.0
tensorboard-data-server 0.6.1
tensorboard-plugin-wit  1.8.0
tensorflow              2.3.4
tensorflow-estimator    2.3.0
termcolor               1.1.0
terminado               0.12.1
testfixtures            5.3.1
testpath                0.5.0
thinc                   8.0.10
threadpoolctl           2.1.0
tokenizers              0.9.2
toml                    0.10.2
toolz                   0.11.1
tornado                 6.1
tqdm                    4.62.2
traitlets               4.3.3
transformers            3.4.0
typer                   0.3.2
typing-extensions       3.10.0.2
tzlocal                 3.0
ubuntu-advantage-tools  27.2
ubuntu-drivers-common   0.0.0
ufw                     0.36
unattended-upgrades     0.1
Unidecode               1.3.1
urllib3                 1.22
usb-creator             0.3.3
validators              0.18.2
vboxapi                 1.0
virtualenv              16.2.0
vulture                 0.10
wasabi                  0.8.2
watchdog                2.1.5
wcwidth                 0.2.5
webencodings            0.5.1
Werkzeug                2.0.1
wheel                   0.30.0
whitenoise              3.3.1
widgetsnbextension      3.5.1
wordcloud               1.8.1
wrapt                   1.12.1
xkit                    0.0.0
zipp                    3.5.0

When I run tutorial 08 from a script, I get this:

python3.6 full_pipeline_melusine.py 

16/09 12:16 - melusine.nlp_tools.phraser - INFO                               - Start training for colocation detector
16/09 12:16 - melusine.nlp_tools.phraser - INFO                               - Done.
2021-09-16 12:16:14.114440: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory
2021-09-16 12:16:14.114489: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Traceback (most recent call last):
  File "full_pipeline_melusine.py", line 122, in <module>
    nn_model.fit(X,y)
  File "/home/kirus/.local/lib/python3.6/site-packages/melusine/models/train.py", line 241, in fit
    X_input_train, y_categorical_train = self._prepare_data(X_train, y_train)
  File "/home/kirus/.local/lib/python3.6/site-packages/melusine/models/train.py", line 494, in _prepare_data
    self._get_embedding_matrix()
  File "/home/kirus/.local/lib/python3.6/site-packages/melusine/models/train.py", line 385, in _get_embedding_matrix
    self.vocabulary = pretrained_embedding.embedding.index_to_key
AttributeError: 'NoneType' object has no attribute 'index_to_key'

Hi Kirus, thank you for this detailed issue.
The team plans to brainstorm next week to handle the open issues.
We will take time to go through yours and provide a solution.

For now, on the second point, I think I already encountered this issue and fixed it by replacing line 385
self.vocabulary = pretrained_embedding.embedding.index_to_key
with
self.vocabulary = pretrained_embedding.embedding
This requires modifying the Melusine code in your local install. It is due to a Gensim upgrade.
Let us know if this works. It is just a quick thought, but we will go through it more deeply soon to provide a complete solution.
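For readers hitting the same Gensim API change: Gensim 4 renamed the KeyedVectors vocabulary accessor from index2word to index_to_key. A version-agnostic sketch (a hypothetical helper, not part of Melusine):

```python
def get_vocabulary(keyed_vectors):
    """Return the embedding vocabulary list across Gensim versions.

    Gensim 4 renamed KeyedVectors.index2word to index_to_key;
    this falls back to the old attribute for Gensim 3.x objects.
    """
    if hasattr(keyed_vectors, "index_to_key"):
        return list(keyed_vectors.index_to_key)  # Gensim >= 4
    return list(keyed_vectors.index2word)        # Gensim 3.x
```

Something like this would let train.py work with either Gensim version instead of hard-coding one attribute name.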

Thank you for your help and patience.
Best regards.

awk '{if(NR==385) print $0}' /home/kirus/.local/lib/python3.6/site-packages/melusine/models/train.py 
        self.vocabulary = pretrained_embedding.embedding
python3.6 full_pipeline_melusine.py 

16/09 03:47 - melusine.nlp_tools.phraser - INFO                               - Start training for colocation detector
16/09 03:47 - melusine.nlp_tools.phraser - INFO                               - Done.
2021-09-16 15:47:51.936383: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory
2021-09-16 15:47:51.936420: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Traceback (most recent call last):
  File "full_pipeline_melusine.py", line 122, in <module>
    nn_model.fit(X,y)
  File "/home/kirus/.local/lib/python3.6/site-packages/melusine/models/train.py", line 241, in fit
    X_input_train, y_categorical_train = self._prepare_data(X_train, y_train)
  File "/home/kirus/.local/lib/python3.6/site-packages/melusine/models/train.py", line 494, in _prepare_data
    self._get_embedding_matrix()
  File "/home/kirus/.local/lib/python3.6/site-packages/melusine/models/train.py", line 386, in _get_embedding_matrix
    vocab_size = len(self.vocabulary)
TypeError: object of type 'NoneType' has no len()

Thank you for your feedback.
We need to investigate more deeply then. We will come back to you soon.
Best,

Hello,

I tried to reproduce your bug as follows:

  • Clone the melusine repository
  • Create a conda env with python 3.6
  • Install dependencies ("pip install ." from inside the repo)
  • Install Jupyterlab
  • Run Tutorial 8

I was not able to reproduce your bug :/

From the error you posted, it seems like the embedding attribute of your pretrained_embedding object has a None value, which is surprising.

Could you run the following checks?

print(type(pretrained_embedding))  # This should be melusine.nlp_tools.embedding.Embedding
print(type(pretrained_embedding.embedding))  # This should be gensim.models.keyedvectors.KeyedVectors
print(type(pretrained_embedding.embedding.index_to_key))  # This should be list

If the Gensim object is None, your bug probably comes from the embedding training.

I hope this helps, let us know how it goes :)

Thanks!
I will focus on tutorial 04 to check the embedding process against my install.
I added the three print statements to the .py file and reran the code. It gives the same error, without any output from the three prints.
Thanks
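A side note on why the prints produced no output: any statement placed after the failing nn_model.fit(X, y) call never executes, because the uncaught exception aborts the script first. A minimal stand-alone illustration (a hypothetical fit, not Melusine's):

```python
log = []

def fit():
    # Stand-in for the failing nn_model.fit(X, y) call
    raise AttributeError("'NoneType' object has no attribute 'index_to_key'")

try:
    log.append("diagnostic before fit")  # runs: place checks BEFORE the failing call
    fit()
    log.append("diagnostic after fit")   # never reached: the exception skips it
except AttributeError:
    pass
```

So the three type() checks should be placed just before the fit call to be useful.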

Here is the file:

#!/usr/bin/env python3.6

from melusine.data.data_loader import load_email_data
import ast

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder
from melusine.utils.multiprocessing import apply_by_multiprocessing
from melusine.utils.transformer_scheduler import TransformerScheduler

from melusine.prepare_email.manage_transfer_reply import check_mail_begin_by_transfer
from melusine.prepare_email.manage_transfer_reply import update_info_for_transfer_mail
from melusine.prepare_email.manage_transfer_reply import add_boolean_transfer
from melusine.prepare_email.manage_transfer_reply import add_boolean_answer

from melusine.prepare_email.build_historic import build_historic
from melusine.prepare_email.mail_segmenting import structure_email
from melusine.prepare_email.body_header_extraction import extract_last_body
from melusine.prepare_email.cleaning import clean_body
from melusine.prepare_email.cleaning import clean_header

from melusine.nlp_tools.phraser import Phraser
from melusine.nlp_tools.phraser import phraser_on_body
from melusine.nlp_tools.phraser import phraser_on_header
from melusine.nlp_tools.tokenizer import Tokenizer
from melusine.nlp_tools.embedding import Embedding
from melusine.summarizer.keywords_generator import KeywordsGenerator

from melusine.prepare_email.metadata_engineering import MetaExtension
from melusine.prepare_email.metadata_engineering import MetaDate
from melusine.prepare_email.metadata_engineering import MetaAttachmentType
from melusine.prepare_email.metadata_engineering import Dummifier

df_emails = load_email_data()
df_emails['attachment'] = df_emails['attachment'].apply(ast.literal_eval)


ManageTransferReplyTransformer = TransformerScheduler(
    functions_scheduler=[
        (check_mail_begin_by_transfer, None, ['is_begin_by_transfer']),
        (update_info_for_transfer_mail, None, None),
        (add_boolean_answer, None, ['is_answer']),
        (add_boolean_transfer, None, ['is_transfer'])
    ]
)

df_emails = ManageTransferReplyTransformer.fit_transform(df_emails)


SegmentingTransformer = TransformerScheduler(
    functions_scheduler=[
        (build_historic, None, ['structured_historic']),
        (structure_email, None, ['structured_body'])
    ]
)

df_emails = SegmentingTransformer.fit_transform(df_emails)


LastBodyHeaderCleaningTransformer = TransformerScheduler(
    functions_scheduler=[
        (extract_last_body, None, ['last_body']),
        (clean_body, None, ['clean_body'])
    ]
)
df_emails = LastBodyHeaderCleaningTransformer.fit_transform(df_emails)


phraser = Phraser()
phraser.train(df_emails)

PhraserTransformer = TransformerScheduler(
    functions_scheduler=[
        (phraser_on_body, (phraser,), ['clean_body'])
    ]
)

df_emails = PhraserTransformer.fit_transform(df_emails)


tokenizer = Tokenizer(input_column="clean_body")
df_emails = tokenizer.fit_transform(df_emails)


# Pipeline to extract dummified metadata
MetadataPipeline = Pipeline([
    ('MetaExtension', MetaExtension()),
    ('MetaDate', MetaDate()),
    ('MetaAttachmentType', MetaAttachmentType()),
    ('Dummifier', Dummifier())
])

df_meta = MetadataPipeline.fit_transform(df_emails)

keywords_generator = KeywordsGenerator(n_max_keywords=4)
df_emails = keywords_generator.fit_transform(df_emails)


pretrained_embedding = Embedding(input_column='clean_body',
                                 workers=1,
                                 min_count=5)
                                 
print(type(pretrained_embedding))                                  
import pandas as pd 
from sklearn.preprocessing import LabelEncoder

X = pd.concat([df_emails['clean_body'],df_meta],axis=1)
y = df_emails['label']
le = LabelEncoder()
y = le.fit_transform(y)


from melusine.models.neural_architectures import cnn_model
from melusine.models.train import NeuralModel

nn_model = NeuralModel(architecture_function=cnn_model,
                        pretrained_embedding=pretrained_embedding,
                        text_input_column="clean_body",
                        meta_input_list=['extension', 'dayofweek','hour', 'min', 'attachment_type'],
                        n_epochs=10)

nn_model.fit(X,y)

y_res = nn_model.predict(X)
y_res = le.inverse_transform(y_res)
y_res
print(y_res)

print(type(pretrained_embedding))  # This should be melusine.nlp_tools.embedding.Embedding
print(type(pretrained_embedding.embedding))  # This should be gensim.models.keyedvectors.KeyedVectors
print(type(pretrained_embedding.embedding.index_to_key))  # This should be list

import numpy as np
import pandas as pd
prediction = pd.DataFrame(y_res, columns=['predictions']).to_csv('prediction.csv')

Hello @kmezhoud,
I think I found where your problem comes from.
Have you trained your embedding object?

When you instantiate a (Melusine) Embedding object, the Embedding.embedding attribute is initialized to None.
(This seems to be your problem.)
When you run the train method, the Embedding.embedding attribute is set to a gensim.models.keyedvectors.KeyedVectors object.
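The lifecycle described above can be sketched with a minimal stand-in (not the real Melusine class, which trains a Gensim model internally):

```python
class EmbeddingSketch:
    """Illustrative stand-in: .embedding stays None until train() is called."""

    def __init__(self, input_column):
        self.input_column = input_column
        self.embedding = None  # accessing .index_to_key on this None raises AttributeError

    def train(self, tokens):
        # The real class stores a gensim KeyedVectors object; a dict stands in here.
        self.embedding = {"vocab": sorted(set(tokens))}

emb = EmbeddingSketch(input_column="clean_body")
trained_before = emb.embedding is not None  # False: not trained yet
emb.train(["hello", "world", "hello"])
trained_after = emb.embedding is not None   # True only after train()
```

In the script above, pretrained_embedding is instantiated but pretrained_embedding.train(df_emails) is never called, which matches the NoneType errors in both tracebacks.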

Hope it helps

Hi,
I guess training the embedding before instantiating a NeuralModel should fix the bug.
I'll close the issue; feel free to re-open if needed.