GRAAL-Research / deepparse

Deepparse is a state-of-the-art library for parsing multinational street addresses using deep learning

Home Page: https://deepparse.org/


[FEATURE] cache handling and offline parsing handling

fbougares opened this issue · comments

Is your feature request related to a problem? Please describe.
BPEmbEmbeddingsModel (deepparse/embeddings_models/bpemb_embeddings_model.py) uses the default cache_dir from the BPEmb class. This is blocking when we want to override the cache_dir path, which defaults to ~/.cache/bpemb.

Describe the solution you'd like
It would be better to add an embeddings_path parameter to the BPEmb class instantiation (as is done for FastTextEmbeddingsModel).

The BPEmbEmbeddingsModel __init__ function would then look like this, for example:

import warnings
from pathlib import Path

from bpemb import BPEmb


def __init__(self, embeddings_path: str, verbose: bool = True) -> None:
    super().__init__(verbose=verbose)
    with warnings.catch_warnings():
        # silence the noisy scipy.sparsetools and boto warnings
        warnings.filterwarnings("ignore")
        # default BPEmb parameters, with a configurable cache directory
        model = BPEmb(lang="multi", vs=100000, dim=300, cache_dir=Path(embeddings_path))
    self.model = model

@fbougares What is the objective to change the cache dir?

@davebulaval It gives more flexibility when we want to change the path of the BPE model, e.g. when using the library in a containerized application.

@fbougares, I see! So I guess you change the CACHE_PATH variable for the embeddings model?
If so, I think adding a cache_dir argument to the AddressParser class would also be interesting. That way, it would be more straightforward to change the behaviour.
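A minimal sketch of how such an argument could be threaded from the parser class down to the embeddings model, assuming the structure discussed above. The class internals and the default path here are illustrative stubs, not deepparse's actual implementation:

```python
from pathlib import Path
from typing import Optional

# Hypothetical default, mirroring the CACHE_PATH mentioned later in the thread.
DEFAULT_CACHE = Path.home() / ".cache" / "deepparse"


class EmbeddingsModelStub:
    def __init__(self, embeddings_path: Path) -> None:
        # The real model would forward this as BPEmb(..., cache_dir=embeddings_path).
        self.embeddings_path = embeddings_path


class AddressParserStub:
    def __init__(self, cache_dir: Optional[str] = None) -> None:
        # Fall back to the default cache when no override is given.
        path = Path(cache_dir) if cache_dir is not None else DEFAULT_CACHE
        self.embeddings_model = EmbeddingsModelStub(embeddings_path=path)


parser = AddressParserStub(cache_dir="/opt/models/deepparse")
print(parser.embeddings_model.embeddings_path)  # /opt/models/deepparse
```

The point of routing the argument through the top-level class is that a containerized deployment only has to set one constructor parameter instead of patching module-level constants.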

@fbougares, I have added the cache_dir argument. You can test it in the dev branch and circle back to me if it works properly.

To install the dev version: pip install -U git+https://github.com/GRAAL-Research/deepparse.git@dev

@davebulaval Thank you for your help.

This works for setting the directory of the BPE models, but deepparse still downloads the model checkpoint (bpemb.ckpt and bpemb.version) even when I have the model pre-downloaded in the cache_dir directory.

I think the CACHE_PATH variable in tools.py needs to be updated as well.
The path is currently set up as follows:

CACHE_PATH = os.path.join(os.path.expanduser("~"), ".cache", "deepparse")

Maybe it would be better to make CACHE_PATH configurable via an environment variable. Any thoughts?

(P.S the goal is to avoid any download when using deepparse)

Thanks a lot.
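The environment-variable suggestion above could be sketched as follows; the variable name DEEPPARSE_CACHE is a hypothetical choice, not an existing setting in the library:

```python
import os

# Allow an environment variable to override the cache location, falling back
# to the current hard-coded default (~/.cache/deepparse) when it is unset.
CACHE_PATH = os.environ.get(
    "DEEPPARSE_CACHE",
    os.path.join(os.path.expanduser("~"), ".cache", "deepparse"),
)
```

This keeps the existing behaviour for everyone while letting containerized setups point the cache at a pre-populated, read-only volume without touching code.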

Ahhhhh true, I hadn't thought about that. Will fix it this week.

@fbougares I've pushed a fix to properly handle the cache_path for the pretrained model weights. I completely forgot about that case. It should now work as expected. Please test it if you can.

I tested it and it works properly. Thank you for your speed and reliability.

Any idea about the date of the next release?

@fbougares nice! Glad to help!

I have another feature I would like to include in the next release. I think I can handle the rest of the work today and release it today or next week. Will be included in release 0.7.5.

It is released. Had to push it to 0.7.6 because I'm sometimes dumb and forgot to merge dev into master before release.

@fbougares No need, I will update from this feature.

I think I will add an HTTP failure case that lets the process continue when a local model is present but the version cannot be verified. The failure case will also show a warning message recommending a check for the newest model version. That way, it will not change the behaviour for other users but will handle the offline case cleanly (without adding another bool flag argument).

What is the HTTP error in that case?

Let me know if you have another idea for the implementation or concerns about this one.
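The warn-and-continue fallback described above could look roughly like this; the function names are illustrative, not deepparse's actual API:

```python
import warnings


def resolve_version(fetch_remote_version, local_version):
    """Return the remote model version, or fall back to the local one offline."""
    try:
        return fetch_remote_version()
    except OSError:  # network unreachable, DNS failure, timeout, ...
        warnings.warn(
            "Could not verify the latest model version; parsing will proceed "
            "with the local model. Check manually for a newer version."
        )
        return local_version


def offline_fetch():
    # Stand-in for a version check running in an offline environment.
    raise OSError("no network")


print(resolve_version(offline_fetch, local_version="1.0"))  # 1.0 (plus a warning)
```

Because the fallback only triggers when the remote check fails, online users see no behaviour change, and no extra boolean flag is needed.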

@fbougares, I have pushed a minor fix here if you would like to test if it works as you expect https://github.com/GRAAL-Research/deepparse/tree/offline_parsing.

@fbougares No, it is not the responsibility of this function to handle the raised errors. Moreover, if I added it there, the download of model weights would also be handled differently, which it should not be.

It is really latest_version's responsibility to handle it.

Is it possible for you to give me the HTTP error raised?
It may not be a 4xx or a 5xx, which is the error range I would expect.

Here is the error:

urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='graal.ift.ulaval.ca', port=443): Max retries exceeded with url: /public/deepparse/bpemb.version (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7ffd209288b0>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known'))
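For context, urllib3's MaxRetryError derives from plain Exception, not OSError, so an `except OSError` fallback would miss it. With only the standard library, the equivalent connection failure surfaces as urllib.error.URLError (which is an OSError subclass) and can be caught directly; this sketch is illustrative and not deepparse's actual download code:

```python
import urllib.error
import urllib.request
from typing import Optional


def fetch_version(url: str) -> Optional[str]:
    """Fetch a remote version file, returning None when the host is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.read().decode("utf-8")
    except urllib.error.URLError:
        # DNS failure or connection refusal: the caller can fall back
        # to the locally cached version file.
        return None


# ".invalid" is a reserved TLD, so this DNS lookup always fails.
print(fetch_version("https://no-such-host.invalid/bpemb.version"))  # None
```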

That was not the error I expected!

I've just pushed a new commit with a fix for it. Please test it if you can.

@fbougares, it is now in the dev version. I'll release it in the next release (probably in a month or so).

Feel free to reach out for more improvements and bug fixes.