activeloopai / deeplake

Database for AI. Store Vectors, Images, Texts, Videos, etc. Use with LLMs/LangChain. Store, query, version, & visualize any AI data. Stream data in real-time to PyTorch/TensorFlow. https://activeloop.ai

Benchmarked dataset iteration speed lower than expected

cgebbe opened this issue

Hello!

I'm really excited about the features of deeplake (streaming directly from S3, dataset versioning, and filtering). However, a preliminary benchmark showed significantly lower dataset iteration speed compared to local file storage when iterating over (256, 256, 3) uint8 PNGs:

  • local dataset using tf.io loader: ~70-1000 batches/s
  • local dataset using PIL loader: ~25 batches/s
  • local dataset using deeplake dataset: ~5 batches/s
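
For illustration, the deeplake part of the benchmark boils down to a timing loop roughly like the one below (the S3 path, batch size, and number of sampled batches are placeholders, not the exact values from the repo):

import time
import deeplake

ds = deeplake.load("s3://<bucket>/<dataset>")  # placeholder S3 path
tf_ds = ds.tensorflow().batch(32).prefetch(2)  # thin tf.data wrapper around the dataset

start, n_batches = time.perf_counter(), 0
for _batch in tf_ds.take(100):                 # time a fixed number of batches
    n_batches += 1
print(f"{n_batches / (time.perf_counter() - start):.1f} batches/s")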

Sidenote: When iterating over the deeplake dataset, download speed was ~50MB/s. When downloading a single 3GB file from S3, average download speed was ~300MB/s.

The small benchmark code is stored here: https://github.com/cgebbe/benchmark_deeplake/tree/50621dd28a08208fe70deb07d451d01474687b54

Are these numbers to be expected? Am I using the library wrong, or are higher speeds only available via Activeloop storage and not S3? I had hoped that the iteration speed would be at least as fast as the local PIL loader.

Hey @cgebbe! Thanks for raising this issue. I see from the benchmarks that you have used the tensorflow integration of deeplake for this. This is a very thin wrapper and is not optimized right now. We have two other dataloaders, available via ds.pytorch() and ds.dataloader() (the latter is currently an enterprise feature, built in C++); both should give significantly better performance. Could you try using those and let us know if the issue persists?
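
As a rough usage sketch (exact keyword arguments can differ between releases, so treat this as illustrative rather than as the definitive API):

import deeplake

ds = deeplake.load("hub://<org>/<dataset>")  # placeholder path

# 1) built-in PyTorch integration
torch_loader = ds.pytorch(batch_size=32, num_workers=2, shuffle=False)

# 2) enterprise C++ dataloader, configured via chained calls
fast_loader = ds.dataloader().batch(32).pytorch()

for batch in torch_loader:
    pass  # iterate like a regular PyTorch dataloader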

Thanks for the super quick answer, I'll try this out right away.

//EDIT: @AbhinavTuli :

deeplake_ds = deeplake.load(DEEPLAKE_PATH)
deeplake_ds.dataloader()

yields AttributeError: '<class 'deeplake.core.dataset.dataset.Dataset'>' object has no attribute 'dataloader'.

I did run pip install deeplake[enterprise] and libdeeplake is installed. Does ds.dataloader() also work with the S3 backend?

@cgebbe Yes, it does work with S3. Looking at the attribute error, though, it seems you're on an old version of Deep Lake. Can you confirm the version?

Also, since it is an enterprise feature, even once the attribute error is resolved it won't work without a paid plan. cc @istranic
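
For reference, a quick way to check what's installed (assuming standard pip installs):

import deeplake
print(deeplake.__version__)  # Deep Lake package version
# from the shell: python3 -m pip show deeplake libdeeplake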

@AbhinavTuli : Updating to the latest deeplake version worked, thanks!

  • deeplake_ds.dataloader().tensorflow() yields AttributeError: 'DeepLakeDataLoader' object has no attribute 'tensorflow'. Is it also compatible with TensorFlow?
  • the PyTorch loader does indeed seem faster than the TensorFlow one, reaching ~10 batches/s and higher download speeds. Is this the expected maximum?
  • I haven't observed a significant improvement using ds.dataloader().pytorch() over ds.pytorch(). We are currently on a paid plan (trial period). Do I need to activate something apart from pip install deeplake[enterprise]?

Current state of benchmark code including results: https://github.com/cgebbe/benchmark_deeplake/tree/c3623c7b21a2700e7105a4411b95294eab01e60b

@cgebbe Glad you got it to work. Some tips:

  • You might want to use num_workers=0 with .dataloader().pytorch(). We use C++ threads for fetching the data and Python processes for transforming it. In your benchmarks, since there is no transformation, specifying num_workers won't improve performance and will actually slow things down due to IPC.
  • If you do specify a transform, num_workers will speed things up. Setting decode_method for the images tensor to PIL should further speed things up and reduce IPC when num_workers is specified.
  • Currently .dataloader() doesn't support TensorFlow, but it's one of the next steps on our roadmap.

With the first tip, I believe you should get a much more significant speedup.
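
To make the first tip concrete (purely illustrative; the argument names follow this thread, not a specific release):

import deeplake

ds = deeplake.load("hub://<org>/<dataset>")  # placeholder path

# no transform in the pipeline: let the C++ threads fetch, avoid spawning Python workers
loader = ds.dataloader().batch(32).pytorch(num_workers=0)
# extra workers only pay off once a Python transform is attached to the loader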

@AbhinavTuli : Thanks again for the quick support!

  • Unfortunately, I don't see any speedup using num_workers=0 (rather a small slow-down), but the loading time for each batch is more consistent.
  • Specifying decode_method=PIL yields TypeError: default_collate: batch must contain tensors, numpy arrays, numbers, dicts or lists; found <class 'PIL.PngImagePlugin.PngImageFile'> in my case. However, I do see that it can be beneficial when performing PIL image augmentations first and only then converting to a numpy array.
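
A possible workaround might be to do the conversion inside the transform, so that default_collate only ever sees numpy arrays (the tensor name "images" and the per-sample transform signature are assumptions on my side):

import numpy as np

# deeplake_ds loaded as above via deeplake.load(DEEPLAKE_PATH)

def pil_to_array(sample):
    # with decode_method={"images": "pil"} the image arrives as a PIL.Image
    sample["images"] = np.asarray(sample["images"])
    return sample

loader = (
    deeplake_ds.dataloader()
    .transform(pil_to_array)
    .batch(32)
    .pytorch(num_workers=4, decode_method={"images": "pil"})
)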

Lastly, could you comment on my earlier question:

Do I need to activate something apart from pip install deeplake[enterprise]?

P.S.: I noticed that when increasing num_workers, the download traffic increases significantly (nearly proportionally) while the iteration speed stays roughly the same. I initially thought my measurement method was incorrect, but the print statements agree with the measurements made by bmon. Do you have any insight into this observation?

@cgebbe
Is your benchmark repo up to date? There are a few things in the repo that seem unusual.

  • .dataloader().pytorch() doesn't support pin_memory=True, which you seem to be passing.
  • You're using local and S3 paths with ds.dataloader(), which aren't supported; only "hub://" paths are. If you're using deeplake 3.1.1 or later and the dataset is in S3, you should have had to connect the dataset to Deep Lake before being able to use ds.dataloader(). Did you already do that, or did it work out of the box without connecting the dataset? If it worked without connecting, there might be an issue on our end, but it won't affect your benchmarks.
  • decode_method should be a dictionary, something like {"image":"pil"}.

I have a suspicion that somehow only ds.pytorch is being tested in your benchmarks. I believe that might also be the reason for the unusual download traffic; it's a known issue in ds.pytorch, which downloads chunks multiple times across workers without making range requests.

You don't need to activate anything else.
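
Putting those points together, a corrected call would look roughly like this (dataset path and tensor name are placeholders; treat it as a sketch, not the exact API of a specific release):

import numpy as np
import deeplake

ds = deeplake.load("hub://<org>/pngs")  # hub:// path, i.e. a dataset connected to Deep Lake

def to_array(sample):
    sample["image"] = np.asarray(sample["image"])  # PIL -> numpy before collation
    return sample

loader = (
    ds.dataloader()                                # enterprise C++ loader
    .transform(to_array)
    .batch(32)
    .pytorch(                                      # note: no pin_memory argument
        num_workers=2,
        decode_method={"image": "pil"},            # dictionary form, keyed by tensor name
    )
)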

You are completely right: it never used the C++ dataloader due to a dumb enum mistake, my bad. After fixing this bug, connecting the dataset to Deep Lake, and removing all unsupported pytorch arguments, I now get the following error:

hub://.../pngs loaded successfully.
This dataset can be visualized in Jupyter Notebook by ds.visualize() or at https://app.activeloop.ai/.../pngs
Segmentation fault (core dumped)

Maybe we can also discuss this later in the chat. Current code: https://github.com/cgebbe/benchmark_deeplake/tree/61ece41c1bf90dc366098947b6bb4153890dec6b

I'll follow up on the discussion here so that others can see it, too.

Running python3 -m pip uninstall libdeeplake; python3 -m pip install libdeeplake==0.0.32 fixed the segmentation fault, thanks a lot!

As promised, the optimized dataloader is slightly faster than the TensorFlow dataloader that uses PIL:

  • using PIL: ~15-25 batches/s
  • using deeplake's optimized dataloader with torch on an r6i.xlarge instance: ~20 batches/s (at ~150MB/s)
  • using deeplake's optimized dataloader with torch on a p3.16xlarge instance: ~30 batches/s (at ~250MB/s)

@AbhinavTuli : I believe you mentioned in the discussion that you still achieve significantly higher download speeds; is this correct?

Next steps for us are to...

  • benchmark the example dataset using local TFRecord files
  • run an actual training on realistic data and monitor GPU utilization. For this, we likely need to wait until the C++ loader supports TensorFlow. Thanks again for the support!

Current code: https://github.com/cgebbe/benchmark_deeplake/blob/8543d1eabdb0e6c0bebd7a4700e7f5c88555c04f/README.md