mosaicml / streaming

A Data Streaming Library for Efficient Neural Network Training

Home Page: https://streaming.docs.mosaicml.com

Reuse S3 session

wouterzwerink opened this issue

🚀 Feature Request

Currently, when I use S3 with an IAM role, I see StreamingDataset fetch new credentials for every shard:
[image]
After this, there is a never-ending stream of credential logs.

That's quite inefficient, since getting credentials from IAM roles is not that fast. It would be nicer to reuse credentials until they expire.

Motivation

Faster is better!

[Optional] Implementation

I think it would work to just reuse the S3 Session object per thread
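A minimal sketch of what per-thread reuse could look like, assuming the download code can call a shared helper instead of building a new session each time. The names `get_s3_client` and `_new_boto3_client` are hypothetical and are not part of the streaming library. The factory argument exists only so the caching logic can be exercised without real AWS credentials.

```python
import threading
from typing import Any, Callable

# Each thread sees its own attributes on this object.
_thread_state = threading.local()


def _new_boto3_client() -> Any:
    """Build an S3 client from a fresh boto3 session (boto3 imported lazily)."""
    import boto3

    return boto3.session.Session().client("s3")


def get_s3_client(factory: Callable[[], Any] = _new_boto3_client) -> Any:
    """Return this thread's cached S3 client, creating it on first use."""
    client = getattr(_thread_state, "s3_client", None)
    if client is None:
        # First call on this thread: pay the session/credential cost once.
        client = factory()
        _thread_state.s3_client = client
    return client
```

With this in place, every shard download on a given thread would call `get_s3_client()` and share one client. Credentials obtained from IAM roles are normally refreshed by botocore as they approach expiry, so a long-lived client should keep working, but that is worth verifying for the credential provider in use.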

Additional context

Hey! If it's not too much of a hassle, mind submitting a PR with your proposed change? I'd be happy to review

Sure! I made a fix for this that worked earlier, but will need to clean it up a bit before submitting. Will take a look sometime next week.

Perfect, thank you @wouterzwerink! Feel free to tag me when the PR is up.

@wouterzwerink Hey, just wanted to follow up on this, mind submitting a quick PR if/when you have some time? Thanks!!

I am interested in this issue (we actually need it for a potential performance improvement). I think the question is at which level we want to keep a boto3 session. Maybe keep one session per stream? If so, I propose creating an S3 client in the stream and reusing it whenever download_file() is triggered from Stream._download_file(). Any comments?
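The per-stream variant might be sketched roughly as follows. `S3StreamDownloader` and its `client_factory` parameter are hypothetical names used for illustration, not the library's actual Stream API. The only boto3 call assumed is the S3 client's `download_file(Bucket, Key, Filename)`.

```python
from typing import Any, Callable, Dict, Optional
from urllib.parse import urlparse


def _default_client_factory() -> Any:
    """Create a real S3 client (boto3 imported lazily)."""
    import boto3

    return boto3.client("s3")


class S3StreamDownloader:
    """Hypothetical helper holding one lazily created S3 client per stream."""

    def __init__(self, client_factory: Callable[[], Any] = _default_client_factory):
        self._client_factory = client_factory
        self._client: Optional[Any] = None

    @property
    def client(self) -> Any:
        # Create the client on first use, then reuse it for every shard.
        if self._client is None:
            self._client = self._client_factory()
        return self._client

    def download_file(self, remote: str, local: str) -> None:
        """Download an s3://bucket/key URL to a local path."""
        parsed = urlparse(remote)
        if parsed.scheme != "s3":
            raise ValueError(f"Expected an s3:// URL, got: {remote}")
        self.client.download_file(parsed.netloc, parsed.path.lstrip("/"), local)

    def __getstate__(self) -> Dict[str, Any]:
        # Drop the client when pickling; it is rebuilt lazily afterwards.
        state = self.__dict__.copy()
        state["_client"] = None
        return state
```

One design question this raises: boto3 clients cannot be pickled, so if stream objects are copied into worker processes the client has to be dropped and rebuilt there, which is what `__getstate__` does in this sketch. Whether that matters depends on how streams are actually passed to workers.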