facebookresearch / audiocraft

Audiocraft is a library for audio processing and generation with deep learning. It features the state-of-the-art EnCodec audio compressor / tokenizer, along with MusicGen, a simple and controllable music generation LM with textual and melodic conditioning.
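A minimal usage sketch (roughly following the project README; the checkpoint name and generation parameters below are illustrative and may differ by audiocraft version):

```python
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

# Checkpoint name is illustrative; older releases used e.g. 'small'.
model = MusicGen.get_pretrained('facebook/musicgen-small')
model.set_generation_params(duration=8)  # generate 8 seconds of audio

descriptions = ['lo-fi hip hop beat with mellow piano']
wav = model.generate(descriptions)       # batch of waveforms [B, C, T]

for idx, one_wav in enumerate(wav):
    # saves under {idx}.wav with loudness normalization
    audio_write(f'{idx}', one_wav.cpu(), model.sample_rate, strategy='loudness')
```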

Ambiguity in MusicGen architecture

rtavasso1 opened this issue

I have 3 discrepancies between what is described in the paper and what I see in the code/blog posts.

  1. The recent MMD publication includes a figure (architecture diagram; image not reproduced here) showing a concatenation between the audio embeddings and the output of the cross-attention. I cannot find this operation in the code for the LM.

  2. There is no linear layer after the cross attention block that I can see in the code.

  3. The config for the small model calls for 24 layers, dim 1024, and 16 heads, which, when initialized, comes to ~420M parameters. Is the config incorrect? (A rough back-of-the-envelope count is sketched below.)
    Thanks!
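For point 3, a rough back-of-the-envelope count that lands in the same ballpark (my own arithmetic, assuming a decoder layer with self-attention, cross-attention, and a 4x feed-forward, plus per-codebook embeddings and output heads; not a walk over audiocraft's actual modules):

```python
# Back-of-the-envelope parameter count for the "small" config
# (bias terms and layer norms ignored; all structural numbers are assumptions).
d = 1024          # model dimension
n_layers = 24
n_codebooks = 4   # EnCodec codebooks used by MusicGen
card = 2048       # codebook cardinality

self_attn  = 4 * d * d        # Wq, Wk, Wv, Wo
cross_attn = 4 * d * d        # same shapes for the cross-attention block
ffn        = 2 * d * (4 * d)  # two projections, hidden size 4*d

transformer = n_layers * (self_attn + cross_attn + ffn)
embeddings  = n_codebooks * card * d   # input token embeddings
heads       = n_codebooks * card * d   # per-codebook output projections

print(f"~{(transformer + embeddings + heads) / 1e6:.0f}M parameters")  # ~419M
```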

  1. I think the EnCodec-MMD is still on this branch: https://github.com/jmlemercier/audiocraft/blob/encodec-mmd/docs/MMD.md
  2. Look at this stack trace:
  3. I have a similar question about the small model here: #169 (comment)

If I may, I have a similar question about the architecture:

  • I've noticed that the papers state that conditioning is done through cross-attention over the conditioning text, but when going through the code (for the melody variant) it seems that the text is instead just prepended to the sequence when generating the first token. For the small variant, the description does match conditioning through cross-attention. (A toy sketch of the two conditioning styles I have in mind is at the end of this comment.)
  • From what I understand, autoregressive models depend on z_0 through z_{t-1} to generate z_t; however, in this implementation only z_{t-1} is passed to the LM when generating the tokens:
```python
curr_sequence = gen_sequence[..., prev_offset:offset]
curr_mask = mask[None, ..., prev_offset:offset].expand(B, -1, -1)
if check:
    # check coherence between mask and sequence
    assert (curr_sequence == torch.where(curr_mask, curr_sequence, self.special_token_id)).all()
    # should never happen as gen_sequence is filled progressively
    assert not (curr_sequence == unknown_token).any()
# sample next token from the model, next token shape is [B, K, 1]
next_token = self._sample_next_token(
    curr_sequence, cfg_conditions, unconditional_state, use_sampling, temp, top_k, top_p,
    cfg_coef=cfg_coef, two_step_cfg=two_step_cfg)
```

Here prev_offset:offset always has length 1. Is this because of what the authors claim in the original MusicGen paper, i.e. modelling the different codebooks as conditionally independent (which MusicGen-MMD later tries to improve on)? Or am I not getting this architecture right? The only differences I can see between consecutive token generations are the positional embedding changing from one step to the next (and, obviously, the codebook indices/embeddings corresponding to the previous token). In other words, if, say, the 13th token is [A, B, C, D], it seems to me that the 14th token's probability distribution will always be the same, irrespective of what tokens 1 through 12 were! A toy sketch of the fully autoregressive step I would have expected is below.
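For reference, this is the kind of fully autoregressive step I had in mind (a toy PyTorch sketch with a made-up model interface, not audiocraft code):

```python
import torch

# Toy illustration (not audiocraft code): a "vanilla" autoregressive loop in
# which the model is re-fed the *entire* prefix z_0..z_{t-1} at every step,
# so the distribution over z_t can depend on all earlier tokens.
# `model` is assumed to map [B, T] token ids to [B, T, vocab] logits.
def naive_generate(model, prompt: torch.Tensor, num_steps: int) -> torch.Tensor:
    sequence = prompt                                        # [B, T0]
    for _ in range(num_steps):
        logits = model(sequence)                             # full prefix every time
        next_token = logits[:, -1].argmax(-1, keepdim=True)  # greedy, for simplicity
        sequence = torch.cat([sequence, next_token], dim=-1)
    return sequence
```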

Any clarification about the generation process would be highly appreciated!
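P.S. For the cross-attention vs. prepending point above, this is the contrast I have in mind (illustrative PyTorch only, not the audiocraft implementation):

```python
import torch
import torch.nn as nn

def prepend_conditioning(tokens_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    # Text embeddings are simply concatenated in front of the audio-token
    # embeddings, so the decoder sees them through its regular self-attention.
    return torch.cat([text_emb, tokens_emb], dim=1)          # [B, T_text + T_audio, D]

class CrossAttnConditioning(nn.Module):
    # The audio-token states query the text embeddings through a dedicated
    # cross-attention block (toy version: single block, residual connection).
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        attended, _ = self.cross_attn(query=tokens_emb, key=text_emb, value=text_emb)
        return tokens_emb + attended
```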