GRAAL-Research / deepparse

Deepparse is a state-of-the-art library for parsing multinational street addresses using deep learning

Home Page: https://deepparse.org/


[FEATURE] cache handling and offline parsing handling

fbougares opened this issue · comments

Is your feature request related to a problem? Please describe.
BPEmbEmbeddingsModel (deepparse/embeddings_models/bpemb_embeddings_model.py) uses the default cache_dir from the BPEmb class. This is blocking when we want to override the cache_dir path, which defaults to ~/.cache/bpemb.

Describe the solution you'd like
It would be better to add an embeddings_path parameter to the BPEmb class instantiation (as is done for FastTextEmbeddingsModel).

The BPEmbEmbeddingsModel __init__ function would then look like this, for example:

import warnings
from pathlib import Path

from bpemb import BPEmb


def __init__(self, embeddings_path: str, verbose: bool = True) -> None:
    super().__init__(verbose=verbose)
    with warnings.catch_warnings():
        # silence the noisy scipy.sparsetools and boto warnings
        warnings.filterwarnings("ignore")
        # default BPEmb parameters, with a configurable cache directory
        model = BPEmb(lang="multi", vs=100000, dim=300, cache_dir=Path(embeddings_path))
    self.model = model

@fbougares What is the objective to change the cache dir?

@davebulaval It gives more flexibility when we want to change the path of the BPE model, e.g. when using the library in a containerized application.

@fbougares, I see! So I guess you change the CACHE_PATH variable for the embeddings model?
If so, I think adding a cache_dir argument to the AddressParser class would also be interesting. That way, it would be more straightforward to change the behaviour.
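A minimal sketch of how such an argument could be threaded from the parser class down to the embeddings model, assuming the structure discussed above. The class internals and the default path here are illustrative stubs, not deepparse's actual implementation:

```python
from pathlib import Path
from typing import Optional

# Hypothetical default, mirroring the CACHE_PATH mentioned later in the thread.
DEFAULT_CACHE = Path.home() / ".cache" / "deepparse"


class EmbeddingsModelStub:
    def __init__(self, embeddings_path: Path) -> None:
        # The real model would forward this as BPEmb(..., cache_dir=embeddings_path).
        self.embeddings_path = embeddings_path


class AddressParserStub:
    def __init__(self, cache_dir: Optional[str] = None) -> None:
        # Fall back to the default cache when no override is given.
        path = Path(cache_dir) if cache_dir is not None else DEFAULT_CACHE
        self.embeddings_model = EmbeddingsModelStub(embeddings_path=path)


parser = AddressParserStub(cache_dir="/opt/models/deepparse")
print(parser.embeddings_model.embeddings_path)  # /opt/models/deepparse
```

The point of routing the argument through the top-level class is that a containerized deployment only has to set one constructor parameter instead of patching module-level constants.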

@fbougares, I have added the cache_dir argument. You can test it in the dev branch and circle back to me if it works properly.

To install the dev version: pip install -U git+https://github.com/GRAAL-Research/deepparse.git@dev

@davebulaval Thank you for your help.

This works for setting the directory of the BPE models, but deepparse still downloads the model checkpoint (bpemb.ckpt and bpemb.version) even when I have the model pre-downloaded in the cache_dir directory.

I think the CACHE_PATH variable in tools.py needs to be updated as well.
The path is currently set up as follows:

CACHE_PATH = os.path.join(os.path.expanduser("~"), ".cache", "deepparse")

Maybe it would be better to make CACHE_PATH configurable via an environment variable. Any thoughts?

(P.S the goal is to avoid any download when using deepparse)

Thanks a lot.
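The environment-variable suggestion above could be sketched as follows; the variable name DEEPPARSE_CACHE is a hypothetical choice, not an existing setting in the library:

```python
import os

# Allow an environment variable to override the cache location, falling back
# to the current hard-coded default (~/.cache/deepparse) when it is unset.
CACHE_PATH = os.environ.get(
    "DEEPPARSE_CACHE",
    os.path.join(os.path.expanduser("~"), ".cache", "deepparse"),
)
```

This keeps the existing behaviour for everyone while letting containerized setups point the cache at a pre-populated, read-only volume without touching code.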

Ahhhhh true, I hadn't thought about that. Will fix it this week.

@fbougares I've pushed a fix to properly handle the cache_path for the pretrained model weights. I completely forgot about that case. It should now work as expected. Please test it if you can.

I tested it and it works properly. Thank you for your speed and reliability.

Any idea about the date of the next release?

@fbougares nice! Glad to help!

I have another feature I would like to include in the next release. I think I can handle the rest of the work today and release it today or next week. Will be included in release 0.7.5.

It is released. Had to push it to 0.7.6 because I'm sometimes dumb and forgot to merge dev into master before release.

@fbougares No need, I will update from this feature.

I think I will add an HTTP failure case that lets the process continue when a local model is present but the version cannot be verified. The failure case will also show a warning message recommending a check for the newest model version. That way, it will not change the behaviour for other users but will handle the offline case cleanly (without adding another bool flag argument).

What is the HTTP error in that case?

Let me know if you have another idea for the implementation or concerns about this one.
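The warn-and-continue fallback described above could look roughly like this; the function names are illustrative, not deepparse's actual API:

```python
import warnings


def resolve_version(fetch_remote_version, local_version):
    """Return the remote model version, or fall back to the local one offline."""
    try:
        return fetch_remote_version()
    except OSError:  # network unreachable, DNS failure, timeout, ...
        warnings.warn(
            "Could not verify the latest model version; parsing will proceed "
            "with the local model. Check manually for a newer version."
        )
        return local_version


def offline_fetch():
    # Stand-in for a version check running in an offline environment.
    raise OSError("no network")


print(resolve_version(offline_fetch, local_version="1.0"))  # 1.0 (plus a warning)
```

Because the fallback only triggers when the remote check fails, online users see no behaviour change, and no extra boolean flag is needed.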

@fbougares, I have pushed a minor fix here if you would like to test if it works as you expect https://github.com/GRAAL-Research/deepparse/tree/offline_parsing.

@fbougares No, it is not the responsibility of this function to handle the raised errors. Moreover, if I added it there, the download of model weights would also be handled differently, which it should not be.

It is really latest_version's responsibility to handle it.

Is it possible for you to give me the HTTP error raised?
It may not be a 4xx or a 5xx, which is the error range I would expect.

Here is the error:

urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='graal.ift.ulaval.ca', port=443): Max retries exceeded with url: /public/deepparse/bpemb.version (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7ffd209288b0>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known'))
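For context, urllib3's MaxRetryError derives from plain Exception, not OSError, so an `except OSError` fallback would miss it. With only the standard library, the equivalent connection failure surfaces as urllib.error.URLError (which is an OSError subclass) and can be caught directly; this sketch is illustrative and not deepparse's actual download code:

```python
import urllib.error
import urllib.request
from typing import Optional


def fetch_version(url: str) -> Optional[str]:
    """Fetch a remote version file, returning None when the host is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.read().decode("utf-8")
    except urllib.error.URLError:
        # DNS failure or connection refusal: the caller can fall back
        # to the locally cached version file.
        return None


# ".invalid" is a reserved TLD, so this DNS lookup always fails.
print(fetch_version("https://no-such-host.invalid/bpemb.version"))  # None
```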

That was not the error I expected!

I've just pushed a new commit with a fix for it. Please test it if you can.

@fbougares, it is now in the dev version. I'll release it in the next release (probably in a month or so).

Feel free to reach out for more improvements and bug fixes.