Try using `polars` for faster data loading or even use Hugging Face Datasets
mrdbourke opened this issue · comments
Right now the `data_loader.py` script works with a pandas DataFrame.
This is fine with ~25,000 images but perhaps not ideal for larger datasets.
Alternatives are:
- `polars`, a DataFrame-like library built in Rust (very fast) - https://pola-rs.github.io/polars/py-polars/html/index.html
- Hugging Face Datasets, also very fast apparently, see the docs - https://huggingface.co/docs/datasets/index
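As a rough starting point for comparing these options, here is a minimal sketch that loads the same annotations file with the standard-library `csv` module and, if they happen to be installed, with pandas and `polars`. The file layout (`image_path` and `label` columns) and the helper names `write_sample_csv` and `count_rows` are made up for illustration and are not taken from `data_loader.py`.

```python
import csv
import os
import tempfile


def write_sample_csv(path, n_rows=1000):
    """Write a small, made-up annotations file (hypothetical columns)."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["image_path", "label"])
        for i in range(n_rows):
            writer.writerow([f"images/{i}.jpg", i % 10])


def count_rows(path):
    """Load the same CSV with each available library and return row counts."""
    counts = {}
    # Standard-library baseline, always available.
    with open(path, newline="") as f:
        counts["csv"] = sum(1 for _ in csv.DictReader(f))
    try:
        import pandas as pd
        counts["pandas"] = len(pd.read_csv(path))
    except ImportError:
        pass  # pandas not installed, skip it
    try:
        import polars as pl
        counts["polars"] = pl.read_csv(path).height
    except ImportError:
        pass  # polars not installed, skip it
    return counts


with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "annotations.csv")
    write_sample_csv(sample_path)
    row_counts = count_rows(sample_path)
print(row_counts)
```

Every loader should report the same number of rows, which is a quick sanity check before swapping one for another. Hugging Face Datasets can read the same kind of file with `load_dataset("csv", data_files=...)`, which is left out here to keep the sketch free of extra downloads and caching.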
Can also test the speed of data loading using:
- PyTorch profiler - https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html
- PyTorch trace analysis - https://pytorch.org/blog/trace-analysis-for-masses/
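Before reaching for the profiler, a plain wall-clock measurement of how long each batch takes to arrive can give a quick baseline. The sketch below is not the PyTorch profiler; it only uses `time.perf_counter`, and `fake_loader` is a hypothetical stand-in for whatever `DataLoader` the script builds.

```python
import time


def time_batches(batches, max_batches=50):
    """Return per-batch wait times (seconds) for any iterable of batches."""
    waits = []
    iterator = iter(batches)
    for _ in range(max_batches):
        start = time.perf_counter()
        try:
            next(iterator)  # time spent waiting for the next batch
        except StopIteration:
            break
        waits.append(time.perf_counter() - start)
    return waits


def fake_loader(n_batches=5, delay=0.01):
    """Hypothetical stand-in for a DataLoader: sleeps to mimic loading work."""
    for i in range(n_batches):
        time.sleep(delay)
        yield [i] * 4


waits = time_batches(fake_loader())
print(f"batches: {len(waits)}, mean wait: {sum(waits) / len(waits):.4f}s")
```

Passing a real `DataLoader` in place of `fake_loader()` should work the same way, since `time_batches` only needs an iterable. Running it once per backend (pandas, `polars`, Hugging Face Datasets) gives numbers that can be compared directly.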