deepseek-ai / DeepSeek-V2

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

`V-MoE` token dropping and `MoD`

liyucheng09 opened this issue

This token dropping method, as indicated by the citation, is based on the V-MoE method.

How is this different from the recent MoD? They look like very similar techniques.

Our token-dropping strategy is just token-wise dropping with respect to the routing probability. It is more like the token dropping in conventional MoE models such as Switch Transformer. It is totally different from MoD, so I do not understand your question. Could you give me more information about your understanding of our token-dropping strategy and MoD? Maybe we can find where the misunderstanding lies.
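For readers unfamiliar with this family of techniques, here is a minimal sketch of capacity-based, probability-ranked token dropping in the Switch Transformer style that the answer alludes to. The tensor layout, function name, and top-1 routing assumption are illustrative only, not DeepSeek-V2's implementation:

```python
import math
import torch

def capacity_token_drop(routing_prob: torch.Tensor, capacity_factor: float = 1.0) -> torch.Tensor:
    """Illustrative sketch of Switch-Transformer-style token dropping.

    routing_prob: [num_tokens, num_experts], holding each token's routing
    probability for the expert it was assigned to and 0 elsewhere (top-1
    routing assumed; for top-k the budget would scale by k).
    Returns a bool mask of the same shape; True = assignment is kept.
    """
    num_tokens, num_experts = routing_prob.shape
    # Per-expert budget: tokens shared evenly, scaled by the capacity factor.
    capacity = math.ceil(num_tokens / num_experts * capacity_factor)

    keep = torch.zeros_like(routing_prob, dtype=torch.bool)
    for e in range(num_experts):
        assigned = routing_prob[:, e].nonzero(as_tuple=True)[0]
        if len(assigned) > capacity:
            # Keep the highest-probability tokens; the rest are dropped and
            # flow through the residual connection only.
            top = routing_prob[assigned, e].topk(capacity).indices
            assigned = assigned[top]
        keep[assigned, e] = True
    return keep
```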

@DeepSeekDDM @luofuli
Is the capacity in the token-dropping strategy computed along the expert dimension or the device dimension?
If it is the expert dimension, then the capacity is calculated as `capacity = math.ceil(num_tokens * topk / num_experts * capacity_factor)`, each expert processes its own tokens, dropping the lowest-scored ones when the token count exceeds the capacity and padding when it falls short.
If it is the device dimension, is the capacity calculated as `capacity = math.ceil(num_tokens * topk / num_groups * capacity_factor)`? How is the token dropping executed in that case?
I am confused because the paper mentions device-level token dropping.
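For concreteness, here is a small worked example of the two capacity definitions from the question, using made-up numbers rather than DeepSeek-V2's actual configuration:

```python
import math

# Hypothetical configuration, for illustration only.
num_tokens = 8192        # tokens in the batch
topk = 6                 # routed experts per token
num_experts = 64         # total routed experts
num_groups = 8           # devices (expert groups), 8 experts per device
capacity_factor = 1.0

# Expert-level: each expert caps the tokens it may process.
capacity_expert = math.ceil(num_tokens * topk / num_experts * capacity_factor)  # 768

# Device-level: each device caps the total assignments it may receive.
capacity_device = math.ceil(num_tokens * topk / num_groups * capacity_factor)   # 6144
```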

Adding another question: How should I understand the statement from the paper that "we ensure that the tokens belonging to approximately 10% of the training sequences will never be dropped"? Is there a specific strategy implemented during token dropping to enforce this?
@DeepSeekDDM @luofuli

> @DeepSeekDDM @luofuli Is the capacity in the token-dropping strategy computed along the expert dimension or the device dimension? If it is the expert dimension, then the capacity is calculated as `capacity = math.ceil(num_tokens * topk / num_experts * capacity_factor)`, each expert processes its own tokens, dropping the lowest-scored ones when the token count exceeds the capacity and padding when it falls short. If it is the device dimension, is the capacity calculated as `capacity = math.ceil(num_tokens * topk / num_groups * capacity_factor)`? How is the token dropping executed in that case? I am confused because the paper mentions device-level token dropping.

A to Q1: Mainly on the device dimension.
A to Q2: Yes.
A to Q3: Also drop tokens with the lowest prob.
A to Q4 & Q5: Yes, we implement a specific strategy to ensure this.

@DeepSeekDDM To confirm: DeepSeek-V2 implements token dropping along the device dimension.
For device-level dropping, are the scores across all experts on the current device sorted together, and the lowest ones then dropped?

> @DeepSeekDDM To confirm: DeepSeek-V2 implements token dropping along the device dimension. For device-level dropping, are the scores across all experts on the current device sorted together, and the lowest ones then dropped?

Yes. The actual dropping strategy is a little complex, but the main idea is what you just described.
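Combining the confirmed answers, a device-level variant might look like the sketch below: all routing scores on one device are pooled into a single ranking, and the lowest-scored assignments are dropped once the device budget is exceeded. This is an assumed reconstruction, not the actual implementation, which the maintainers say contains additional tricks:

```python
import math
import torch

def device_level_token_drop(device_probs: torch.Tensor, num_tokens: int,
                            topk: int, num_groups: int,
                            capacity_factor: float = 1.0) -> torch.Tensor:
    """Assumed sketch of device-level dropping (not the actual implementation).

    device_probs: [num_tokens, experts_on_device], routing probability of each
    (token, expert) assignment sent to this device, 0 where unassigned.
    Returns a bool mask of kept assignments with the same shape.
    """
    # One shared budget for the whole device rather than one per expert.
    capacity = math.ceil(num_tokens * topk / num_groups * capacity_factor)

    # Pool every assignment on the device into a single ranking.
    flat = device_probs.flatten()
    assigned = flat.nonzero(as_tuple=True)[0]
    keep = torch.zeros_like(flat, dtype=torch.bool)
    if len(assigned) > capacity:
        # Keep the `capacity` highest-probability assignments device-wide.
        top = flat[assigned].topk(capacity).indices
        keep[assigned[top]] = True
    else:
        keep[assigned] = True
    return keep.view_as(device_probs)
```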

@DeepSeekDDM Could you briefly explain the actual dropping strategy? I am curious.

> @DeepSeekDDM Could you briefly explain the actual dropping strategy? I am curious.

Just some additional tricks to ensure computational efficiency. They are not the key technique of DeepSeekMoE, and the details will not prevent you from reproducing DeepSeekMoE.