deepseek-ai / DeepSeek-V2

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

Drop Token

Richie-yan opened this issue · comments

Hello @DeepSeekDDM @luofuli,
I have some questions about the token-dropping strategy in DeepSeek-V2.
Is the capacity in the token-dropping strategy defined per expert or per device?
If it is per expert, is the capacity calculated as `capacity = math.ceil(num_tokens * topk / num_experts) * capacity_factor`? Each expert would then process its own routed tokens, dropping the lowest-scoring ones when the token count exceeds the capacity and padding when it falls below it.
If it is per device, is the capacity calculated as `capacity = math.ceil(num_tokens * topk / num_groups) * capacity_factor`? In that case, how is the token dropping actually executed?
Because the paper describes device-level token dropping, I am unsure which interpretation is correct (a sketch of both is below).
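
For concreteness, here is a minimal sketch of the two interpretations as I understand them. All names (`num_tokens`, `topk`, `num_experts`, `num_groups`, `capacity_factor`) and the keep/drop logic are my own assumptions for illustration, not taken from the DeepSeek-V2 code.

```python
import math

import torch


def expert_level_capacity(num_tokens: int, topk: int, num_experts: int,
                          capacity_factor: float) -> int:
    # Interpretation 1 (my assumption): each expert independently keeps at most
    # `capacity` of the tokens routed to it.
    return int(math.ceil(num_tokens * topk / num_experts) * capacity_factor)


def device_level_capacity(num_tokens: int, topk: int, num_groups: int,
                          capacity_factor: float) -> int:
    # Interpretation 2 (my assumption): the budget is shared by all experts
    # placed in one device group, rather than enforced per expert.
    return int(math.ceil(num_tokens * topk / num_groups) * capacity_factor)


def drop_lowest_scored(scores: torch.Tensor, expert_ids: torch.Tensor,
                       num_experts: int, capacity: int) -> torch.Tensor:
    # Keep-mask under interpretation 1: within each expert, keep the `capacity`
    # highest-scoring routed tokens and drop the rest.
    keep = torch.zeros_like(scores, dtype=torch.bool)
    for e in range(num_experts):
        idx = (expert_ids == e).nonzero(as_tuple=True)[0]
        if idx.numel() > capacity:
            order = scores[idx].argsort(descending=True)
            keep[idx[order[:capacity]]] = True
        else:
            keep[idx] = True
    return keep
```

Under interpretation 2, I assume the same keep/drop step would instead run once per device group over all tokens routed to any expert on that device, which is exactly the part I am unsure about.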

Adding another question: how should I understand the statement in the paper that "we ensure that the tokens belonging to approximately 10% of the training sequences will never be dropped"? Is there a specific strategy applied during token dropping to enforce this? (One guess is sketched below.)
@DeepSeekDDM @luofuli
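
To make the question concrete, here is a minimal sketch of one way such an exemption could be enforced: pre-mark the tokens of a protected fraction of sequences so the dropping step must keep them. This is purely my guess, and `never_drop_mask` and `protected_frac` are hypothetical names, not the actual implementation.

```python
import torch


def never_drop_mask(seq_ids: torch.Tensor, protected_frac: float = 0.10) -> torch.Tensor:
    # Hypothetical: pick ~10% of the sequences in the batch and mark every token
    # belonging to them, so the dropping step is forced to keep those tokens
    # regardless of the expert/device capacity.
    unique_seqs = seq_ids.unique()
    num_protected = max(1, int(protected_frac * unique_seqs.numel()))
    protected = unique_seqs[torch.randperm(unique_seqs.numel())[:num_protected]]
    return torch.isin(seq_ids, protected)  # True => this token must never be dropped
```

Is the actual mechanism something along these lines, or is it implemented differently?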

Refer to this issue: #5