Quickstart documentation doesn't describe setting KAZU_MODEL_PACK

Kik099 opened this issue 10 months ago · comments

I am getting this error after following the steps here: https://astrazeneca.github.io/KAZU/quickstart.html

Cell In[15], line 11
8 import os
10 # the hydra config is kept in the model pack
---> 11 cdir = Path(os.environ["KAZU_MODEL_PACK"]).joinpath("conf")
14 @hydra.main(
15 version_base=HYDRA_VERSION_BASE, config_path=str(cdir), config_name="config"
16 )
17 def kazu_test(cfg):
18 pipeline: Pipeline = instantiate(cfg.Pipeline)

File :679, in __getitem__(self, key)

KeyError: 'KAZU_MODEL_PACK'

Elliot Ford · Answer 1 · Wed Oct 18 2023 16:36:52 GMT+0800 (China Standard Time)

Hi,

Thanks for trying KAZU, and for raising this issue.

Yes, we haven't been clear here, our apologies. You need to set the path to your downloaded Model Pack as the environment variable KAZU_MODEL_PACK before you run KAZU. For example, on macOS, Linux or within Windows Subsystem for Linux:

export KAZU_MODEL_PACK=/Users/me/some/path/to/where/I/downloaded/kazu_model_pack_public-v1.1.2

For Windows, if you're using the standard CMD, it will be:

set KAZU_MODEL_PACK=C:\Users\me\some\path\to\where\I\downloaded\kazu_model_pack_public-v1.1.2

and for Windows PowerShell it's:

$Env:KAZU_MODEL_PACK = 'C:\Users\me\some\path\to\where\I\downloaded\kazu_model_pack_public-v1.1.2'

Thank you for letting us know about this, we will update the documentation so that this is clear. Please let us know if you are able to get it working!
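
For anyone hitting the same KeyError, here is a minimal sketch of checking the variable before building the config path, so that a missing value fails with a clearer message. The helper name get_conf_dir and the example path are hypothetical and not part of Kazu; only the KAZU_MODEL_PACK variable and the "conf" subdirectory come from the quickstart code above.

```python
import os
from pathlib import Path


def get_conf_dir(env=os.environ):
    """Return the 'conf' directory inside the model pack.

    Hypothetical helper (not part of Kazu): raises a descriptive error
    instead of a bare KeyError when KAZU_MODEL_PACK is missing or empty.
    """
    model_pack = env.get("KAZU_MODEL_PACK")
    if not model_pack:
        raise RuntimeError(
            "KAZU_MODEL_PACK is not set; point it at your downloaded model pack directory"
        )
    return Path(model_pack).joinpath("conf")


# Example with an explicit mapping, so it runs even if the variable is unset.
example_env = {"KAZU_MODEL_PACK": "/path/to/kazu_model_pack_public-v1.1.2"}
print(get_conf_dir(example_env))
```

In a real script you would call get_conf_dir() with no arguments so that it reads the actual environment.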

Rodrigo Saraiva · Answer 2 · Wed Oct 18 2023 18:12:26 GMT+0800 (China Standard Time)

Thank you so much, I just have one final question. How can I run this in a Jupyter notebook, given that the kernel is already running?

Elliot Ford · Answer 3 · Wed Oct 18 2023 18:15:22 GMT+0800 (China Standard Time)

Do you mean in terms of making sure the environment variable is set within the notebook? I'm not sure I'm understanding you correctly.

If that's what you mean, you can either:

Start the Jupyter notebook from the command line, in a shell where you've already set KAZU_MODEL_PACK to an appropriate value
Set KAZU_MODEL_PACK in the notebook using os.environ["KAZU_MODEL_PACK"] = "/your/model/pack/path" in a cell early on, before you try to load the Kazu config or pipeline
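
The second option can be sketched roughly as below, as a cell near the top of the notebook. This is only a sketch: set_model_pack is a hypothetical helper (not part of Kazu), the path is a placeholder, and the check for a "conf" directory assumes the model pack layout used by the quickstart code.

```python
import os
from pathlib import Path


def set_model_pack(path):
    """Set KAZU_MODEL_PACK for this process and return the expected config directory.

    Hypothetical helper, not part of Kazu.
    """
    model_pack = Path(path).expanduser()
    os.environ["KAZU_MODEL_PACK"] = str(model_pack)
    return model_pack / "conf"


# Placeholder path: replace with the location of your downloaded model pack.
conf_dir = set_model_pack("/your/model/pack/path")
if not conf_dir.is_dir():
    # A missing conf directory usually means the path is wrong.
    print(f"Warning: {conf_dir} does not exist; check the model pack path")
```

Because os.environ only affects the current process, this cell has to run before any cell that reads KAZU_MODEL_PACK.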

Rodrigo Saraiva · Answer 4 · Wed Oct 18 2023 18:17:36 GMT+0800 (China Standard Time)

I did that, but then this happens:

UnicodeDecodeError Traceback (most recent call last)
File ~\anaconda3\envs\Tese\Lib\site-packages\hydra\_internal\instantiate\_instantiate2.py:92, in _call_target(_target_, _partial_, args, kwargs, full_key)
91 try:
---> 92 return _target_(*args, **kwargs)
93 except Exception as e:

File ~\anaconda3\envs\Tese\Lib\site-packages\kazu\steps\joint_ner_and_linking\memory_efficient_string_matching.py:36, in MemoryEfficientStringMatchingStep.__init__(self, parsers)
35 def __init__(self, parsers: Iterable[OntologyParser]):
---> 36 super().__init__(parsers)
37 self.parsers = parsers

File ~\anaconda3\envs\Tese\Lib\site-packages\kazu\steps\step.py:52, in ParserDependentStep.__init__(self, parsers)
51 for parser in parsers:
---> 52 parser.populate_databases(force=False)

File ~\anaconda3\envs\Tese\Lib\site-packages\kazu\ontology_preprocessing\base.py:1189, in OntologyParser.populate_databases(self, force, return_curations)
1187 kazu_disk_cache.delete(cache_key)
-> 1189 maybe_curations, metadata, final_syn_terms = self._populate_databases(self.name)
1191 if self.name not in self.synonym_db.loaded_parsers:

File ~\anaconda3\envs\Tese\Lib\site-packages\diskcache\core.py:1875, in Cache.memoize.<locals>.decorator.<locals>.wrapper(*args, **kwargs)
1874 if result is ENOVAL:
-> 1875 result = func(*args, **kwargs)
1876 if expire is None or expire > 0:

File ~\anaconda3\envs\Tese\Lib\site-packages\kazu\ontology_preprocessing\base.py:1156, in OntologyParser._populate_databases(self, parser_name)
1155 logger.info("populating database for %s from source", self.name)
-> 1156 metadata = self.export_metadata(self.name)
1157 # metadata db needs to be populated before call to export_synonym_terms

....

I am getting more errors

Rodrigo Saraiva · Answer 5 · Wed Oct 18 2023 18:19:28 GMT+0800 (China Standard Time)

If I run it on Mac it works, but it gives me this output:

...

[2023-10-18 11:12:58,928][kazu.pipeline.pipeline][INFO] - profiling not configured
/usr/local/Caskroom/miniconda/base/envs/kazu/lib/python3.11/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:224: PossibleUserWarning: The dataloader, predict_dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument (try 4 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  rank_zero_warn(
/usr/local/Caskroom/miniconda/base/envs/kazu/lib/python3.11/site-packages/pytorch_lightning/loops/epoch/prediction_epoch_loop.py:173: UserWarning: Lightning couldn't infer the indices fetched for your dataloader.
  warning_cache.warn("Lightning couldn't infer the indices fetched for your dataloader.")
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
Asking to pad to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no padding.
EGFR mutations are often implicated in lung cancer

This is not the same as on this website: kazu.korea.ac.kr

Elliot Ford · Answer 6 · Wed Oct 18 2023 18:19:33 GMT+0800 (China Standard Time)

That looks like an issue with a Unicode character in one of the files in the model pack. Can I check which version of Python you are using? Kazu expects Python 3.8+.
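
A quick way to answer the version question is a small diagnostic like the sketch below. The 3.8 minimum comes from the comment above; the function name python_is_supported is hypothetical, and printing the preferred text encoding is only an assumption about what might be relevant to a UnicodeDecodeError, not a confirmed cause of the error in this thread.

```python
import locale
import sys

MIN_VERSION = (3, 8)  # taken from the comment above: "Kazu expects Python 3.8+"


def python_is_supported(version_info=sys.version_info, minimum=MIN_VERSION):
    """Return True if the given interpreter version meets the minimum (major, minor)."""
    return tuple(version_info[:2]) >= minimum


print("Python version:", sys.version.split()[0])
print("Meets 3.8+ requirement:", python_is_supported())
# Assumption: the platform's default text encoding may differ between Windows and
# macOS, which could matter when reading files containing non-ASCII characters.
print("Preferred text encoding:", locale.getpreferredencoding(False))
```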

Rodrigo Saraiva · Answer 7 · Wed Oct 18 2023 18:20:02 GMT+0800 (China Standard Time)

The first message was on Windows, the second on Mac.

Rodrigo Saraiva · Answer 8 · Wed Oct 18 2023 18:21:02 GMT+0800 (China Standard Time)

This one is on macOS:

...

[2023-10-18 11:12:58,928][kazu.pipeline.pipeline][INFO] - profiling not configured
/usr/local/Caskroom/miniconda/base/envs/kazu/lib/python3.11/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:224: PossibleUserWarning: The dataloader, predict_dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument (try 4 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  rank_zero_warn(
/usr/local/Caskroom/miniconda/base/envs/kazu/lib/python3.11/site-packages/pytorch_lightning/loops/epoch/prediction_epoch_loop.py:173: UserWarning: Lightning couldn't infer the indices fetched for your dataloader.
  warning_cache.warn("Lightning couldn't infer the indices fetched for your dataloader.")
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
Asking to pad to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no padding.
EGFR mutations are often implicated in lung cancer

The other was on Windows, using the Jupyter notebook.

Rodrigo Saraiva · Answer 9 · Wed Oct 18 2023 18:22:55 GMT+0800 (China Standard Time)

On Mac it looks like it is working, but how would I get the details as they appear on this page: kazu.korea.ac.kr

Elliot Ford · Answer 10 · Wed Oct 18 2023 18:28:20 GMT+0800 (China Standard Time)

You mean you want it as the highlighted HTML?

To do that you would need to set up the frontend and run the server - you need to prepare the frontend as described here: https://github.com/AstraZeneca/KAZU/blob/main/frontend/README.md#running-locally-with-ray

and then run the Kazu server with:

python -m kazu.web.server --config-path "${KAZU_MODEL_PACK}/conf/" hydra.run.dir="." ray=local_with_frontend +ray.init.num_cpus=1 +ray.init.object_store_memory=1000000000

This is an area we could improve our docs considerably on - we expect most users of Kazu to want to use it as a python library, so haven't focused on running the frontend in our docs.

Rodrigo Saraiva · Answer 11 · Wed Oct 18 2023 18:30:34 GMT+0800 (China Standard Time)

I wanted to get this response:

In the code, and thanks for the help

Elliot Ford · Answer 12 · Wed Oct 18 2023 18:34:18 GMT+0800 (China Standard Time)

Ah I see, sorry for my misunderstanding. If you do:

minified_doc = doc.as_minified_dict()

you will get a document with that same structure (although that example response has a list of them, so to get exactly the same structure it would be minified_docs = [doc.as_minified_dict()] or minified_docs = [d.as_minified_dict() for d in [doc]]).

Rodrigo Saraiva · Answer 13 · Wed Oct 18 2023 18:36:38 GMT+0800 (China Standard Time)

like this then:

import hydra
from hydra.utils import instantiate

from kazu.data.data import Document
from kazu.pipeline import Pipeline
from kazu.utils.constants import HYDRA_VERSION_BASE
from pathlib import Path
import os

# the hydra config is kept in the model pack
cdir = Path(os.environ["KAZU_MODEL_PACK"]).joinpath("conf")


@hydra.main(
    version_base=HYDRA_VERSION_BASE, config_path=str(cdir), config_name="config"
)
def kazu_test(cfg):
    pipeline: Pipeline = instantiate(cfg.Pipeline)
    text = "EGFR mutations are often implicated in lung cancer"
    doc = Document.create_simple_document(text)
    pipeline([doc])
    print(f"{doc.sections[0].text}")

    # new line added
    minified_doc = doc.as_minified_dict()


if __name__ == "__main__":
    kazu_test()

Rodrigo Saraiva · Answer 14 · Wed Oct 18 2023 18:43:12 GMT+0800 (China Standard Time)

It worked, thanks a lot

Elliot Ford · Answer 15 · Wed Oct 18 2023 18:45:03 GMT+0800 (China Standard Time)

Great! I'll leave this issue open until the docs are improved to describe setting KAZU_MODEL_PACK. And I'll make a separate issue for the Unicode encoding error on Windows - that may take me a little time to look into as my work laptop is a Mac (though I have a personal Windows machine so should be able to test there).

Elliot Ford · Answer 16 · Thu Oct 19 2023 20:15:52 GMT+0800 (China Standard Time)

The quickstart has been updated now to explicitly explain that the KAZU_MODEL_PACK variable needs setting, and how to do so: https://astrazeneca.github.io/KAZU/quickstart.html#model-pack

So I'll close this issue, but we've still got the issue for the error you got on Windows open, so that's recorded.