GRAAL-Research / deepparse

Deepparse is a state-of-the-art library for parsing multinational street addresses using deep learning

Home Page: https://deepparse.org/


Model Deployment in SageMaker

Gayathri2993 opened this issue · comments

What is the best way to deploy deepparse in AWS SageMaker? I downloaded the model (fasttext) directly and stored it in S3, since I do not want to download the model every time I run it. I passed the S3 path as path_to_retrained_model, but somehow the function is not able to read the file from S3. The error says no such file exists. I am positive that the file path is correct and that the model exists in S3. Is there something I am missing, or could you let me know if there is an efficient way to deploy the model? I have millions of records to run this model on.

  1. If I were to retrain the model for another country in the future, I would have to save the model. What is the best way to deploy the model in that case?

Thank you for your interest in improving Deepparse.

About a URL as a path: it is not a feature yet, but I can definitely see the value. I will look at possible solutions to integrate such a feature with an S3 bucket.

If you retrain a model, I think the same S3 bucket approach is interesting. If the feature is developed, it will be easy for you; all you will need to do is save the model after training. I guess boto3 with a predefined URL pattern would allow you to export to S3 and later update a .txt file with the latest URL, or something like that. (Here is a possible approach.)
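The boto3 idea above could be sketched like this; the bucket name, key pattern, and pointer-file name are all hypothetical illustrations, not part of deepparse:

```python
# Hypothetical sketch of the boto3 export idea: upload a retrained
# checkpoint under a predictable key, then record the latest URI in a
# small pointer file. Bucket and key pattern are placeholders.
from datetime import date


def build_checkpoint_uri(bucket: str, country: str, day: date) -> str:
    """Build a predictable S3 URI for a retrained checkpoint."""
    return f"s3://{bucket}/deepparse/{country}/fasttext_{day.isoformat()}.ckpt"


def upload_checkpoint(local_path: str, bucket: str, country: str) -> str:
    """Upload the checkpoint and return its URI (needs AWS credentials)."""
    import boto3  # pip install boto3

    uri = build_checkpoint_uri(bucket, country, date.today())
    key = uri.split(f"s3://{bucket}/", 1)[1]
    boto3.client("s3").upload_file(local_path, bucket, key)
    with open("latest_checkpoint.txt", "w") as f:  # pointer to newest model
        f.write(uri)
    return uri
```

A predictable key pattern lets downstream jobs resolve the newest checkpoint without listing the bucket.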

I found a library that can help handle AWS S3 buckets. I have implemented a feature and only need to test whether it works properly. I will try to finish it today but cannot promise it.

You can test the feature using this dev branch:

pip install -U git+https://github.com/GRAAL-Research/deepparse.git@aws_s3_uri

I also think it would be possible to use CloudPathLib to save a model directly into an S3-like bucket when retraining.
I will look into it after your feedback on the prototype and when I have more time to test it.

Hi Dave,

Thanks for the prompt response. I did try the feature you developed using CloudPathLib, but I ended up with an error. Could you please help me with this?

Code:
!pip install cloudpathlib
from cloudpathlib import CloudPath

address_parser = AddressParser(model_type="fasttext", device=0, path_to_retrained_model=CloudPath("s3://s3_path/fasttext.ckpt"))

address_parser("350 rue des Lilas Ouest Québec Québec G1L 1B6")

Error:

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/deepparse/parser/address_parser.py:240, in AddressParser.__init__(self, model_type, attention_mechanism, device, rounding, verbose, path_to_retrained_model, cache_dir, offline)
237 seq2seq_kwargs = {} # Empty for default settings
239 if path_to_retrained_model is not None:
--> 240 if "s3://" in path_to_retrained_model:
241 if CloudPath is None:
242 raise ImportError(
243 "cloudpathlib needs to be installed to use a S3-like " "URI as path_to_retrained_model."
244 )

TypeError: argument of type 'S3Path' is not iterable
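For what it's worth, the TypeError indicates the membership test received an S3Path object rather than a string; a type-tolerant check (a hypothetical sketch, not the library's actual fix) would be:

```python
# Sketch of a type-tolerant S3 URI check: `"s3://" in path` raises a
# TypeError when path is an S3Path object, but converting to str first
# works for both plain strings and CloudPath/S3Path objects.
def is_s3_uri(path) -> bool:
    return str(path).startswith("s3://")
```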

I'm on vacation for the next two weeks. I will take a look at it in mid-May.

cloudpathlib maintainer here. @Gayathri2993, it may be the case that you need to additionally install the S3 dependencies, so instead of !pip install cloudpathlib it should be !pip install cloudpathlib[s3]

@pjbull thanks for the tip. I will take a look at this.

@Gayathri2993 Try this:

address_parser = AddressParser(model_type="fasttext", device=0, path_to_retrained_model="s3://s3_path/fasttext.ckpt")

The path does not need to be a CloudPath; we handle the conversion to it in the code base. I have added a catch for this kind of behaviour.

And I also added support to allow the use of a CloudPath directly as the argument instead of a string.

LMK if it works after updating using the branch. I've pushed some modifications.

Thank you, guys, for looking into this issue. I tried two methods to load the model directly from S3. Please find the details of the two methods below.

Method 1:

Code:
!pip install cloudpathlib[s3]
from cloudpathlib import CloudPath

address_parser = AddressParser(model_type="fasttext", device=0, path_to_retrained_model=CloudPath("s3://s3_path/fasttext.ckpt"))

address_parser("350 rue des Lilas Ouest Québec Québec G1L 1B6")

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/deepparse/parser/address_parser.py:225, in AddressParser.__init__(self, model_type, attention_mechanism, device, rounding, verbose, path_to_retrained_model, cache_dir, offline)
222 seq2seq_kwargs = {} # Empty for default settings
224 if path_to_retrained_model is not None:
--> 225 checkpoint_weights = torch.load(path_to_retrained_model, map_location="cpu")
226 if checkpoint_weights.get("model_type") is None:
227 # Validate if we have the proper metadata, it has at least the parser model type
228 # if no other thing have been modified.
229 raise RuntimeError(
230 "You are not using the proper retrained checkpoint. "
231 "When we retrain an AddressParser, by default, we create a "
(...)
234 "See AddressParser.retrain for more details."
235 )

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/torch/serialization.py:791, in load(f, map_location, pickle_module, weights_only, **pickle_load_args)
788 if 'encoding' not in pickle_load_args.keys():
789 pickle_load_args['encoding'] = 'utf-8'
--> 791 with _open_file_like(f, 'rb') as opened_file:
792 if _is_zipfile(opened_file):
793 # The zipfile reader is going to advance the current file position.
794 # If we want to actually tail call to torch.jit.load, we need to
795 # reset back to the original position.
796 orig_position = opened_file.tell()

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/torch/serialization.py:276, in _open_file_like(name_or_buffer, mode)
274 return _open_buffer_writer(name_or_buffer)
275 elif 'r' in mode:
--> 276 return _open_buffer_reader(name_or_buffer)
277 else:
278 raise RuntimeError(f"Expected 'r' or 'w' in mode but got {mode}")

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/torch/serialization.py:261, in _open_buffer_reader.__init__(self, buffer)
259 def __init__(self, buffer):
260 super().__init__(buffer)
--> 261 _check_seekable(buffer)

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/torch/serialization.py:357, in _check_seekable(f)
355 return True
356 except (io.UnsupportedOperation, AttributeError) as e:
--> 357 raise_err_msg(["seek", "tell"], e)
358 return False

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/torch/serialization.py:350, in _check_seekable.<locals>.raise_err_msg(patterns, e)
346 if p in str(e):
347 msg = (str(e) + ". You can only torch.load from a file that is seekable."
348 + " Please pre-load the data into a buffer like io.BytesIO and"
349 + " try to load from it instead.")
--> 350 raise type(e)(msg)
351 raise e

AttributeError: 'S3Path' object has no attribute 'seek'. You can only torch.load from a file that is seekable. Please pre-load the data into a buffer like io.BytesIO and try to load from it instead.
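As the error message itself suggests, one workaround (a sketch, assuming cloudpathlib[s3] is installed and AWS credentials are configured) is to pre-load the checkpoint into a seekable in-memory buffer before calling torch.load:

```python
# Workaround sketch: torch.load needs a seekable file-like object, so
# read the S3 object's bytes into an io.BytesIO buffer first.
import io


def load_checkpoint_from_s3(uri: str):
    from cloudpathlib import CloudPath  # pip install cloudpathlib[s3]
    import torch

    buffer = io.BytesIO(CloudPath(uri).read_bytes())  # seekable buffer
    return torch.load(buffer, map_location="cpu")
```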

Method 2: I tried the method below as well, but I got a 'no such file exists' error. I am quite certain that the model exists in S3 and that the file path is correct.

address_parser = AddressParser(model_type="fasttext", device=0, path_to_retrained_model="s3://s3_path/fasttext.ckpt")

Error:
File ~/anaconda3/envs/python3/lib/python3.10/site-packages/torch/serialization.py:791, in load(f, map_location, pickle_module, weights_only, **pickle_load_args)
788 if 'encoding' not in pickle_load_args.keys():
789 pickle_load_args['encoding'] = 'utf-8'
--> 791 with _open_file_like(f, 'rb') as opened_file:
792 if _is_zipfile(opened_file):
793 # The zipfile reader is going to advance the current file position.
794 # If we want to actually tail call to torch.jit.load, we need to
795 # reset back to the original position.
796 orig_position = opened_file.tell()

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/torch/serialization.py:271, in _open_file_like(name_or_buffer, mode)
269 def _open_file_like(name_or_buffer, mode):
270 if _is_path(name_or_buffer):
--> 271 return _open_file(name_or_buffer, mode)
272 else:
273 if 'w' in mode:

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/torch/serialization.py:252, in _open_file.__init__(self, name, mode)
251 def __init__(self, name, mode):
--> 252 super().__init__(open(name, mode))

FileNotFoundError: [Errno 2] No such file or directory: 's3://<s3_path>/fasttext.ckpt'

I also tried loading the model directly from the library in my SageMaker instance, but I get duplicated results every time I pass multiple addresses. Surprisingly, the same code works as expected in SageMaker Studio Lab (a free ML environment platform). Here's a snippet from the SageMaker instance. Can you please let me know why I am getting duplicated results here?
[Screenshot from 2023-05-08 showing duplicated parsing results in the SageMaker instance]

I will take a look next week; I have a busy week.

Uhmm, strange behaviour for the duplicate.

Can I get a restricted secret key to the S3 bucket? It will be easier for me to test. You can forward it to me using my email david.beauchemin@ift.ulaval.ca.

Dave, I am not sure if I can share a restricted secret key because I am using the S3 bucket for official purposes. Is there any other way to test this?

@Gayathri2993 No problem, it was more that I didn't want to take the time to set up all that in my personal account.
Right now, I plan to look at it next Thursday since I have other deadlines the following week.

@davebulaval It would be super helpful if you could prioritise the duplicate-records issue whenever you start working on this. Since I have millions of records to run, I would like to run deepparse directly in AWS SageMaker.

Hi @davebulaval. I just wanted to kindly check in and see if there have been any updates regarding the issues we discussed. I understand that you have other deadlines this week, and I appreciate your time. When it's convenient for you, I would appreciate an update on the progress or any further steps that need to be taken.

Do the following to see if it works now.

  1. Update to the newest branch version using pip install -U git+https://github.com/GRAAL-Research/deepparse.git@dev.
  2. Test using something like:
uri = "s3://<path>/fasttext.ckpt"

address_parser = AddressParser(model_type="fasttext", path_to_retrained_model=uri)
parse_address = address_parser("350 rue des Lilas Ouest Quebec city Quebec G1L 1B6")

or

uri = CloudPath("s3://deepparse/fasttext.ckpt")

address_parser = AddressParser(model_type="fasttext", path_to_retrained_model=uri)
parse_address = address_parser("350 rue des Lilas Ouest Quebec city Quebec G1L 1B6")

We now support both approaches.

On my side, I was able to use a URI to download a model for parsing, and also to upload a model after retraining to a URI directory bucket using S3.

I also tried loading the model directly from the library in my SageMaker instance, but I get duplicated results every time I pass multiple addresses. Surprisingly, the same code works as expected in SageMaker Studio Lab (a free ML environment platform). Here's a snippet from the SageMaker instance. Can you please let me know why I am getting duplicated results here? [Screenshot from 2023-05-08 showing duplicated parsing results in the SageMaker instance]

For this, I have no idea how to investigate that and the reason for this behaviour.

I found the problem. It is fixed in commit 9943f04 on the branch.

I have merged all the content into dev since the content seems to be ready for a release.

@davebulaval Thank you so much for your efforts. Both issues are resolved: the code is working in AWS SageMaker, and I am not getting duplicate records anymore.

I just had a quick question: do you have any idea of the maximum number of records it can parse in one go?

Great!

If you use a GPU, it depends on your GPU memory size. From our experimentation with batch processing in Deepparse, the optimal batch size is around 256. Beyond that, it can take longer per address due to IO and data preparation. To alleviate that (we have not tested with more workers), you can increase `num_workers` to something like 4; too many is not better.

If you do not use a GPU, it depends on the number of CPUs and the amount of RAM.

A GPU is the fastest option, of course.
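Putting the advice above together, a chunked-parsing sketch might look like this; the chunk size is arbitrary, and `batch_size`/`num_workers` are the values suggested above, assuming the parser call accepts them:

```python
# Sketch of parsing millions of addresses in bounded-memory chunks,
# using the batch_size and num_workers values suggested above.
def parse_in_chunks(address_parser, addresses, chunk_size=100_000):
    parsed = []
    for start in range(0, len(addresses), chunk_size):
        chunk = addresses[start:start + chunk_size]
        parsed.extend(address_parser(chunk, batch_size=256, num_workers=4))
    return parsed
```

Chunking keeps peak memory bounded, so the job scales to millions of records regardless of instance size.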

Release in 0.9.7 #195