facebookresearch / audiocraft

Audiocraft is a library for audio processing and generation with deep learning. It features the state-of-the-art EnCodec audio compressor / tokenizer, along with MusicGen, a simple and controllable music generation LM with textual and melodic conditioning.
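A minimal usage sketch (roughly following the project README; the checkpoint name and generation parameters below are illustrative and may differ by audiocraft version):

```python
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

# Checkpoint name is illustrative; older releases used e.g. 'small'.
model = MusicGen.get_pretrained('facebook/musicgen-small')
model.set_generation_params(duration=8)  # generate 8 seconds of audio

descriptions = ['lo-fi hip hop beat with mellow piano']
wav = model.generate(descriptions)       # batch of waveforms [B, C, T]

for idx, one_wav in enumerate(wav):
    # saves under {idx}.wav with loudness normalization
    audio_write(f'{idx}', one_wav.cpu(), model.sample_rate, strategy='loudness')
```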

Ambiguity in MusicGen architecture

rtavasso1 opened this issue

I have 3 discrepancies between what is described in the paper and what I see in the code/blog posts.

  1. The recent MMD publication includes a figure (architecture diagram; image not reproduced here) showing a concatenation between the audio embeddings and the output of the cross-attention. I cannot find this operation in the code for the LM.

  2. There is no linear layer after the cross attention block that I can see in the code.

  3. The config for the small model calls for 24 layers, dim 1024, and 16 heads, which, when initialized, comes to ~420M parameters. Is the config incorrect? (A rough back-of-the-envelope count is sketched below.)
    Thanks!
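For point 3, a rough back-of-the-envelope count that lands in the same ballpark (my own arithmetic, assuming a decoder layer with self-attention, cross-attention, and a 4x feed-forward, plus per-codebook embeddings and output heads; not a walk over audiocraft's actual modules):

```python
# Back-of-the-envelope parameter count for the "small" config
# (bias terms and layer norms ignored; all structural numbers are assumptions).
d = 1024          # model dimension
n_layers = 24
n_codebooks = 4   # EnCodec codebooks used by MusicGen
card = 2048       # codebook cardinality

self_attn  = 4 * d * d        # Wq, Wk, Wv, Wo
cross_attn = 4 * d * d        # same shapes for the cross-attention block
ffn        = 2 * d * (4 * d)  # two projections, hidden size 4*d

transformer = n_layers * (self_attn + cross_attn + ffn)
embeddings  = n_codebooks * card * d   # input token embeddings
heads       = n_codebooks * card * d   # per-codebook output projections

print(f"~{(transformer + embeddings + heads) / 1e6:.0f}M parameters")  # ~419M
```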

  1. I think the EnCodec-MMD is still on this branch: https://github.com/jmlemercier/audiocraft/blob/encodec-mmd/docs/MMD.md
  2. Look at this stack trace:
  3. I have a similar question about the small model here: #169 (comment)

If I may, I have a similar question about the architecture:

  • I've noticed that the papers state that conditioning is done through cross-attention over the conditioning text, but when going through the code (for the melody variant) it seems that the text is instead just prepended to the sequence when generating the first token. For the small variant, the description does match conditioning through cross-attention. (A toy sketch of the two conditioning styles I have in mind is at the end of this comment.)
  • From what I understand, autoregressive models depend on z_0 through z_{t-1} to generate z_t; however, in this implementation only z_{t-1} is passed to the LM when generating the tokens:
```python
curr_sequence = gen_sequence[..., prev_offset:offset]
curr_mask = mask[None, ..., prev_offset:offset].expand(B, -1, -1)
if check:
    # check coherence between mask and sequence
    assert (curr_sequence == torch.where(curr_mask, curr_sequence, self.special_token_id)).all()
    # should never happen as gen_sequence is filled progressively
    assert not (curr_sequence == unknown_token).any()
# sample next token from the model, next token shape is [B, K, 1]
next_token = self._sample_next_token(
    curr_sequence, cfg_conditions, unconditional_state, use_sampling, temp, top_k, top_p,
    cfg_coef=cfg_coef, two_step_cfg=two_step_cfg)
```

Here prev_offset:offset always has length 1. Is this because of what the authors claim in the original MusicGen paper, i.e. modelling the different codebooks as conditionally independent (which MusicGen-MMD later tries to improve on)? Or am I not getting this architecture right? The only differences I can see between consecutive token generations are the positional embedding changing from one step to the next (and, obviously, the codebook indices/embeddings corresponding to the previous token). In other words, if, say, the 13th token is [A, B, C, D], it seems to me that the 14th token's probability distribution will always be the same, irrespective of what tokens 1 through 12 were! A toy sketch of the fully autoregressive step I would have expected is below.
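For reference, this is the kind of fully autoregressive step I had in mind (a toy PyTorch sketch with a made-up model interface, not audiocraft code):

```python
import torch

# Toy illustration (not audiocraft code): a "vanilla" autoregressive loop in
# which the model is re-fed the *entire* prefix z_0..z_{t-1} at every step,
# so the distribution over z_t can depend on all earlier tokens.
# `model` is assumed to map [B, T] token ids to [B, T, vocab] logits.
def naive_generate(model, prompt: torch.Tensor, num_steps: int) -> torch.Tensor:
    sequence = prompt                                        # [B, T0]
    for _ in range(num_steps):
        logits = model(sequence)                             # full prefix every time
        next_token = logits[:, -1].argmax(-1, keepdim=True)  # greedy, for simplicity
        sequence = torch.cat([sequence, next_token], dim=-1)
    return sequence
```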

Any clarification about the generation process would be highly appreciated!
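P.S. For the cross-attention vs. prepending point above, this is the contrast I have in mind (illustrative PyTorch only, not the audiocraft implementation):

```python
import torch
import torch.nn as nn

def prepend_conditioning(tokens_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    # Text embeddings are simply concatenated in front of the audio-token
    # embeddings, so the decoder sees them through its regular self-attention.
    return torch.cat([text_emb, tokens_emb], dim=1)          # [B, T_text + T_audio, D]

class CrossAttnConditioning(nn.Module):
    # The audio-token states query the text embeddings through a dedicated
    # cross-attention block (toy version: single block, residual connection).
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        attended, _ = self.cross_attn(query=tokens_emb, key=text_emb, value=text_emb)
        return tokens_emb + attended
```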