clovaai / donut

Official Implementation of OCR-free Document Understanding Transformer (Donut) and Synthetic Document Generator (SynthDoG), ECCV 2022

Home Page: https://arxiv.org/abs/2111.15664

json2token performance

benjaminfh opened this issue

Hi team, thanks for all the effort that's gone into this model – it looks super promising and powerful and is a great contribution. For context, I've been using it as a learning exercise to get to grips with the latest model architectures (so this issue might have some built-in naiveté!).

I've prepped a dataset for my specific use case that I'd like to train Donut on; I have close to 10k images to throw at it if I wish. At that scale I noticed that tokenizing the gt_parse data is noticeably slow – it would cost me about 2 hours of GPU time just to do the tokenizing. I traced it back to this line in utils.py:

self.gt_token_sequences.append(

It calls a fairly expensive tokenizer method in transformers/tokenization_utils_base (https://github.com/huggingface/transformers/blob/c9e72f55b2dc4b9be4edb986dce0552582b328f2/src/transformers/tokenization_utils_base.py#L873) – about 50 ms per call, and I need to call it about 20 times per row of training data, i.e. over a second per row. 99.999% of the time the method reports that 0 special tokens were added: they were of course already added while processing a previous row, but it still takes ~50 ms to determine that.
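To make the call pattern concrete, here's a rough sketch of the json2token traversal (simplified from donut/utils.py – the real method also handles lists, key sorting, and an update flag, so treat this as approximate rather than the actual implementation). The point is that every JSON key triggers another add_special_tokens call, on every single row:

```python
# Simplified sketch of Donut's gt_parse -> token-sequence conversion.
# Approximate only; see donut/utils.py for the real json2token.
def json2token(obj, add_special_tokens):
    if isinstance(obj, dict):
        output = ""
        for key, value in obj.items():
            # Every key goes through the tokenizer again, for every row of
            # training data, even if the key has been registered before.
            add_special_tokens([fr"<s_{key}>", fr"</s_{key}>"])
            output += fr"<s_{key}>" + json2token(value, add_special_tokens) + fr"</s_{key}>"
        return output
    return str(obj)
```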

Since special tokens are not modified by the tokenizer (as I understand it), why do we need to go through the tokenizer to test whether a given token still needs to be added? Genuine question – I'm conscious of my potential ignorance of some nuance here. Could we instead maintain a set of special tokens that have already been successfully tokenized and added in donut/utils, remove those from future calls to the tokenizer, and, crucially, skip the call altogether when there are no new special tokens for it to consider?
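Roughly what I have in mind (just a sketch – the names here are made up for illustration, not existing Donut APIs; the call would ultimately land in the decoder's existing add_special_tokens method):

```python
# Sketch only: ADDED_SPECIAL_TOKENS and add_special_tokens_cached are
# hypothetical names, not part of the current Donut codebase.
ADDED_SPECIAL_TOKENS = set()

def add_special_tokens_cached(decoder, list_of_tokens):
    """Only forward special tokens the tokenizer hasn't already seen."""
    new_tokens = [t for t in list_of_tokens if t not in ADDED_SPECIAL_TOKENS]
    if not new_tokens:
        return  # skip the ~50 ms tokenizer call entirely
    decoder.add_special_tokens(new_tokens)  # existing (expensive) path
    ADDED_SPECIAL_TOKENS.update(new_tokens)
```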

Happy to create a PR if this sounds sensible / I'm not missing something special that the tokenizer call is doing for <token_x> the nth time over :)

Cheers!

I think this should do the trick: https://github.com/clovaai/donut/pull/268/files
In fact, I think there may have been a bug all along: for each piece of training data read in, any JSON keys were (rightly) added to the decoder's special tokens, but a few layers down the tokenizer's add_special_tokens method has a replace flag that defaults to True, so it was dropping any special tokens added in a previous iteration.

By fixing this flag, we can then check (successfully!) whether the current iteration's special tokens are already in the vocab and skip the tokenizer call if so. This saves potentially hours of calls to the tokenizer method, all of which were in vain anyway because the tokens they added would soon be overwritten :)
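For reference, this is the shape of the change (a sketch only, assuming the flag in question is transformers' replace_additional_special_tokens argument – see the PR itself for the actual diff):

```python
# Sketch of the fix: check the vocab first, and don't let new additions
# replace special tokens registered on earlier rows.
def add_special_tokens(tokenizer, model, list_of_tokens):
    new_tokens = [t for t in set(list_of_tokens) if t not in tokenizer.get_vocab()]
    if not new_tokens:
        return 0  # everything is already registered; skip the expensive call
    added = tokenizer.add_special_tokens(
        {"additional_special_tokens": sorted(new_tokens)},
        replace_additional_special_tokens=False,  # keep earlier additions
    )
    if added > 0:
        model.resize_token_embeddings(len(tokenizer))
    return added
```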

Ben