'NeuralModel' object has no attribute 'bert_tokenizer'
kmezhoud opened this issue · comments
Dear all,
I cannot run the following training code from an R environment.
from melusine.models.neural_architectures import cnn_model
from melusine.models.train import NeuralModel
nn_model = NeuralModel(architecture_function=cnn_model,
                       pretrained_embedding=pretrained_embedding,
                       text_input_column="clean_body",
                       meta_input_list=['extension', 'dayofweek', 'hour', 'min', 'attachment_type'],
                       n_epochs=10)
SyntaxError: unexpected EOF while parsing (<string>, line 2)
Error in py_call_impl(callable, dots$args, dots$keywords):
AttributeError: 'NeuralModel' object has no attribute 'bert_tokenizer'
Detailed traceback:
File "/usr/lib/python3.6/pprint.py", line 144, in pformat
self._format(object, sio, 0, 0, {}, 0)
File "/usr/lib/python3.6/pprint.py", line 161, in _format
rep = self._repr(object, context, level)
File "/usr/lib/python3.6/pprint.py", line 393, in _repr
self._depth, level)
File "/usr/lib/python3.6/pprint.py", line 405, in format
return _safe_repr(object, context, maxlevels, level)
File "/usr/lib/python3.6/pprint.py", line 555, in _safe_repr
rep = repr(object)
File "/home/kirus/.local/lib/python3.6/site-packages/sklearn/base.py", line 260, in __repr__
repr_ = pp.pformat(self)
File "/usr/lib/python3.6/pprint.py", line 144, in pformat
self._format(object, sio, 0, 0, {}, 0)
File "/usr/lib/python3.6/pprint.py", line 161, in _format
rep = self._repr(object, context, level)
File "/usr/lib/python3.6/pprint.py", line 393, in _repr
self._dept
When I import the cnn_model class, I get:
from melusine.models.neural_architectures import cnn_model
2021-09-16 11:13:08.221051: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib:/usr/lib/R/lib::/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/default-java/lib/server
2021-09-16 11:13:08.221084: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Any idea is welcome. Thanks
I have libcudart9.1, not libcudart.so.10.1.
sudo apt list | grep libcudart
libcudart9.1/bionic,now 9.1.85-3ubuntu1 amd64 [installed]
I do not have a GPU set up, nor an NVIDIA card.
lspci | grep VGA
00:02.0 VGA compatible controller: Intel Corporation Iris Graphics 6100 (rev 09)
Python version : 3.6
Melusine version : 2.3.1
Operating System : Ubuntu 18.04.5 LTS on a MacBook Pro Retina
Python Module versions
kirus@izis:~$ pip3.6 list
Package Version
----------------------- -------------------
absl-py 0.13.0
altair 4.1.0
appdirs 1.4.4
apt-xapian-index 0.47
argon2-cffi 21.1.0
asn1crypto 0.24.0
astor 0.8.1
astunparse 1.6.3
async-generator 1.10
attrs 21.2.0
backcall 0.2.0
backports.zoneinfo 0.2.1
base58 2.1.0
bleach 4.1.0
blinker 1.4
blis 0.7.4
cachetools 4.2.2
catalogue 2.0.6
certifi 2018.1.18
cffi 1.14.4
chardet 3.0.4
charset-normalizer 2.0.4
click 7.1.2
coala-utils 0.7.0
colorama 0.3.9
command-not-found 0.3
contextvars 2.4
cryptography 2.1.4
cupshelpers 1.0
cycler 0.10.0
cymem 2.0.5
dataclasses 0.8
dbr 8.0.1
decorator 5.0.9
defusedxml 0.7.1
dependency-management 0.4.0
distro-info 0.18ubuntu0.18.04.1
Django 2.0.3
django-sslserver 0.20
en-core-web-sm 3.1.0
entrypoints 0.3
filelock 3.0.12
flashtext 2.7
fr-core-news-md 3.1.0
gast 0.3.3
gensim 4.1.0
gitdb2 2.0.3
GitPython 2.1.9
google-auth 1.35.0
google-auth-oauthlib 0.4.6
google-pasta 0.2.0
grpcio 1.40.0
h5py 2.10.0
httplib2 0.11.3
idna 2.6
immutables 0.16
importlib-metadata 4.8.1
importlib-resources 5.2.2
ipykernel 5.5.5
ipython 7.16.1
ipython-genutils 0.2.0
ipywidgets 7.6.4
jedi 0.18.0
Jinja2 3.0.1
joblib 1.0.1
jsonschema 3.2.0
jupyter-client 7.0.2
jupyter-core 4.7.1
jupyterlab-pygments 0.1.2
jupyterlab-widgets 1.0.1
Keras-Preprocessing 1.1.2
keyring 10.6.0
keyrings.alt 3.0
kiwisolver 1.3.1
language-selector 0.1
libzbar-cffi 0.2.1
Markdown 3.3.4
MarkupSafe 2.0.1
matplotlib 3.3.4
melusine 2.3.1
mistune 0.8.4
murmurhash 1.0.5
nbclient 0.5.4
nbconvert 6.0.7
nbformat 5.1.3
nest-asyncio 1.5.1
netifaces 0.10.4
nltk 3.6.2
notebook 6.4.3
numpy 1.18.5
oauthlib 3.1.1
olefile 0.45.1
opt-einsum 3.3.0
packaging 21.0
pandas 1.1.5
pandocfilters 1.4.3
parso 0.8.2
pathy 0.6.0
pbr 4.0.3
pexpect 4.2.1
pickleshare 0.7.5
Pillow 8.3.2
pip 21.2.4
plotly 5.3.1
preshed 3.0.5
prometheus-client 0.11.0
prompt-toolkit 3.0.20
protobuf 3.17.3
ptyprocess 0.7.0
pyarrow 5.0.0
pyasn1 0.4.8
pyasn1-modules 0.2.8
pycairo 1.16.2
pycparser 2.20
pycrypto 2.6.1
pycups 1.9.73
pydantic 1.8.2
pydeck 0.6.2
Pygments 2.10.0
PyGObject 3.26.1
pyparsing 2.4.7
PyPrint 0.2.6
pyrsistent 0.18.0
python-apt 1.6.5+ubuntu0.6
python-dateutil 2.8.2
python-debian 0.1.32
pytz 2018.3
pyxdg 0.25
PyYAML 3.12
pyzbar 0.1.7
pyzmq 22.2.1
regex 2021.8.28
reportlab 3.4.0
requests 2.26.0
requests-oauthlib 1.3.0
requests-unixsocket 0.1.5
rsa 4.7.2
sacremoses 0.0.45
sarge 0.1.6
scikit-learn 0.24.2
scipy 1.5.4
scour 0.36
SecretStorage 2.3.1
Send2Trash 1.8.0
sentencepiece 0.1.96
setuptools 58.0.4
simplemma 0.3.0
six 1.16.0
smart-open 5.2.1
smmap 3.0.4
smmap2 3.0.1
spacy 3.1.2
spacy-legacy 3.0.8
srsly 2.4.1
streamlit 0.88.0
systemd-python 234
tenacity 8.0.1
tensorboard 2.6.0
tensorboard-data-server 0.6.1
tensorboard-plugin-wit 1.8.0
tensorflow 2.3.4
tensorflow-estimator 2.3.0
termcolor 1.1.0
terminado 0.12.1
testfixtures 5.3.1
testpath 0.5.0
thinc 8.0.10
threadpoolctl 2.1.0
tokenizers 0.9.2
toml 0.10.2
toolz 0.11.1
tornado 6.1
tqdm 4.62.2
traitlets 4.3.3
transformers 3.4.0
typer 0.3.2
typing-extensions 3.10.0.2
tzlocal 3.0
ubuntu-advantage-tools 27.2
ubuntu-drivers-common 0.0.0
ufw 0.36
unattended-upgrades 0.1
Unidecode 1.3.1
urllib3 1.22
usb-creator 0.3.3
validators 0.18.2
vboxapi 1.0
virtualenv 16.2.0
vulture 0.10
wasabi 0.8.2
watchdog 2.1.5
wcwidth 0.2.5
webencodings 0.5.1
Werkzeug 2.0.1
wheel 0.30.0
whitenoise 3.3.1
widgetsnbextension 3.5.1
wordcloud 1.8.1
wrapt 1.12.1
xkit 0.0.0
zipp 3.5.0
When I run Tutorial 08 from a script, I get this:
python3.6 full_pipeline_melusine.py
16/09 12:16 - melusine.nlp_tools.phraser - INFO - Start training for colocation detector
16/09 12:16 - melusine.nlp_tools.phraser - INFO - Done.
2021-09-16 12:16:14.114440: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory
2021-09-16 12:16:14.114489: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Traceback (most recent call last):
File "full_pipeline_melusine.py", line 122, in <module>
nn_model.fit(X,y)
File "/home/kirus/.local/lib/python3.6/site-packages/melusine/models/train.py", line 241, in fit
X_input_train, y_categorical_train = self._prepare_data(X_train, y_train)
File "/home/kirus/.local/lib/python3.6/site-packages/melusine/models/train.py", line 494, in _prepare_data
self._get_embedding_matrix()
File "/home/kirus/.local/lib/python3.6/site-packages/melusine/models/train.py", line 385, in _get_embedding_matrix
self.vocabulary = pretrained_embedding.embedding.index_to_key
AttributeError: 'NoneType' object has no attribute 'index_to_key'
Hi Kirus, thank you for this detailed issue.
The team plans on brainstorming next week to handle the open issues.
We will take time to go through yours and provide a solution.
For now, on the second point, I think I already encountered this issue and fixed it by replacing line 385
self.vocabulary = pretrained_embedding.embedding.index_to_key
with
self.vocabulary = pretrained_embedding.embedding
This requires modifying the Melusine code locally. It is due to a Gensim upgrade.
Let us know if this works. It is just a quick thought, but we will go through it more deeply soon and provide a complete solution.
Thank you for your help and patience.
Best regards.
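As a hedged sketch, the one-line change suggested above can be expressed as a sed substitution (shown here on the literal line rather than on the installed file, as an illustration only; back up your local train.py before editing it):

```shell
# Demonstrate the suggested edit: strip the trailing ".index_to_key"
# from line 385 of melusine/models/train.py.
echo 'self.vocabulary = pretrained_embedding.embedding.index_to_key' \
  | sed 's/\.index_to_key$//'
# prints: self.vocabulary = pretrained_embedding.embedding
```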
awk '{if(NR==385) print $0}' /home/kirus/.local/lib/python3.6/site-packages/melusine/models/train.py
self.vocabulary = pretrained_embedding.embedding
python3.6 full_pipeline_melusine.py
16/09 03:47 - melusine.nlp_tools.phraser - INFO - Start training for colocation detector
16/09 03:47 - melusine.nlp_tools.phraser - INFO - Done.
2021-09-16 15:47:51.936383: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory
2021-09-16 15:47:51.936420: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Traceback (most recent call last):
File "full_pipeline_melusine.py", line 122, in <module>
nn_model.fit(X,y)
File "/home/kirus/.local/lib/python3.6/site-packages/melusine/models/train.py", line 241, in fit
X_input_train, y_categorical_train = self._prepare_data(X_train, y_train)
File "/home/kirus/.local/lib/python3.6/site-packages/melusine/models/train.py", line 494, in _prepare_data
self._get_embedding_matrix()
File "/home/kirus/.local/lib/python3.6/site-packages/melusine/models/train.py", line 386, in _get_embedding_matrix
vocab_size = len(self.vocabulary)
TypeError: object of type 'NoneType' has no len()
Thank you for your feedback.
We need to investigate more deeply, then. We will come back to you soon.
Best,
Hello,
I tried to reproduce your bug as follows:
- Clone the melusine repository
- Create a conda env with python 3.6
- Install dependencies ("pip install ." from inside the repo)
- Install Jupyterlab
- Run Tutorial 8
I was not able to reproduce your bug :/
From the error you posted, it seems like the embedding attribute of your pretrained_embedding object has a None value, which is surprising.
Could you run the following checks?
print(type(pretrained_embedding)) # This should be melusine.nlp_tools.embedding.Embedding
print(type(pretrained_embedding.embedding)) # This should be gensim.models.keyedvectors.KeyedVectors
print(type(pretrained_embedding.embedding.index_to_key)) # This should be list
If the Gensim object is None, your bug probably comes from the embedding training.
I hope this helps, let us know how it goes :)
Thanks!
I will focus on Tutorial 04 to check the embedding process against my install.
I added the three prints to the .py file and reran the code. It gives the same error, without printing anything for the three objects.
Thanks. Here is the file:
#!/usr/bin/env python3.6
from melusine.data.data_loader import load_email_data
import ast
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder
from melusine.utils.multiprocessing import apply_by_multiprocessing
from melusine.utils.transformer_scheduler import TransformerScheduler
from melusine.prepare_email.manage_transfer_reply import check_mail_begin_by_transfer
from melusine.prepare_email.manage_transfer_reply import update_info_for_transfer_mail
from melusine.prepare_email.manage_transfer_reply import add_boolean_transfer
from melusine.prepare_email.manage_transfer_reply import add_boolean_answer
from melusine.prepare_email.build_historic import build_historic
from melusine.prepare_email.mail_segmenting import structure_email
from melusine.prepare_email.body_header_extraction import extract_last_body
from melusine.prepare_email.cleaning import clean_body
from melusine.prepare_email.cleaning import clean_header
from melusine.nlp_tools.phraser import Phraser
from melusine.nlp_tools.phraser import phraser_on_body
from melusine.nlp_tools.phraser import phraser_on_header
from melusine.nlp_tools.tokenizer import Tokenizer
from melusine.nlp_tools.embedding import Embedding
from melusine.summarizer.keywords_generator import KeywordsGenerator
from melusine.prepare_email.metadata_engineering import MetaExtension
from melusine.prepare_email.metadata_engineering import MetaDate
from melusine.prepare_email.metadata_engineering import MetaAttachmentType
from melusine.prepare_email.metadata_engineering import Dummifier
df_emails = load_email_data()
df_emails['attachment'] = df_emails['attachment'].apply(ast.literal_eval)
ManageTransferReplyTransformer = TransformerScheduler(
functions_scheduler=[
(check_mail_begin_by_transfer, None, ['is_begin_by_transfer']),
(update_info_for_transfer_mail, None, None),
(add_boolean_answer, None, ['is_answer']),
(add_boolean_transfer, None, ['is_transfer'])
]
)
df_emails = ManageTransferReplyTransformer.fit_transform(df_emails)
SegmentingTransformer = TransformerScheduler(
functions_scheduler=[
(build_historic, None, ['structured_historic']),
(structure_email, None, ['structured_body'])
]
)
df_emails = SegmentingTransformer.fit_transform(df_emails)
LastBodyHeaderCleaningTransformer = TransformerScheduler(
functions_scheduler=[
(extract_last_body, None, ['last_body']),
(clean_body, None, ['clean_body'])
]
)
df_emails = LastBodyHeaderCleaningTransformer.fit_transform(df_emails)
phraser = Phraser()
phraser.train(df_emails)
PhraserTransformer = TransformerScheduler(
functions_scheduler=[
(phraser_on_body, (phraser,), ['clean_body'])
]
)
df_emails = PhraserTransformer.fit_transform(df_emails)
tokenizer = Tokenizer(input_column="clean_body")
df_emails = tokenizer.fit_transform(df_emails)
# Pipeline to extract dummified metadata
MetadataPipeline = Pipeline([
('MetaExtension', MetaExtension()),
('MetaDate', MetaDate()),
('MetaAttachmentType', MetaAttachmentType()),
('Dummifier', Dummifier())
])
df_meta = MetadataPipeline.fit_transform(df_emails)
keywords_generator = KeywordsGenerator(n_max_keywords=4)
df_emails = keywords_generator.fit_transform(df_emails)
pretrained_embedding = Embedding(input_column='clean_body',
workers=1,
min_count=5)
print(type(pretrained_embedding))
import pandas as pd
from sklearn.preprocessing import LabelEncoder
X = pd.concat([df_emails['clean_body'],df_meta],axis=1)
y = df_emails['label']
le = LabelEncoder()
y = le.fit_transform(y)
from melusine.models.neural_architectures import cnn_model
from melusine.models.train import NeuralModel
nn_model = NeuralModel(architecture_function=cnn_model,
pretrained_embedding=pretrained_embedding,
text_input_column="clean_body",
meta_input_list=['extension', 'dayofweek','hour', 'min', 'attachment_type'],
n_epochs=10)
nn_model.fit(X,y)
y_res = nn_model.predict(X)
y_res = le.inverse_transform(y_res)
y_res
print(y_res)
print(type(pretrained_embedding)) # This should be melusine.nlp_tools.embedding.Embedding
print(type(pretrained_embedding.embedding)) # This should be gensim.models.keyedvectors.KeyedVectors
print(type(pretrained_embedding.embedding.index_to_key)) # This should be list
import numpy as np
import pandas as pd
prediction = pd.DataFrame(y_res, columns=['predictions']).to_csv('prediction.csv')
Hello @kmezhoud,
I think I found where your problem comes from. Have you trained your embedding object?
When you instantiate a (Melusine) Embedding object, the Embedding.embedding attribute is initialized to None. (This seems to be your problem.)
When you run the train method, the Embedding.embedding attribute is set to a gensim.models.keyedvectors.KeyedVectors object.
Hope it helps
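The lifecycle described above can be illustrated with a minimal stand-in class (hypothetical, not the real Melusine implementation): the embedding attribute stays None until train() is called.

```python
# Minimal stand-in for Melusine's Embedding lifecycle (illustration only):
# the real class trains a gensim model; here we fake a word -> index table.
class Embedding:
    def __init__(self, input_column):
        self.input_column = input_column
        self.embedding = None  # stays None until train() is called

    def train(self, tokens):
        # Stand-in for the gensim training step
        self.embedding = {w: i for i, w in enumerate(dict.fromkeys(tokens))}

emb = Embedding("clean_body")
assert emb.embedding is None                # freshly instantiated: no vectors yet
emb.train(["bonjour", "madame", "bonjour"])
assert emb.embedding is not None            # set only after train()
print(sorted(emb.embedding))                # ['bonjour', 'madame']
```

Passing the untrained object straight to NeuralModel is exactly what produces the 'NoneType' errors in the tracebacks above.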
Hi,
I guess training the embedding before instantiating a NeuralModel should fix the bug.
I'll close the issue, feel free to re-open if needed.