causal decoder based on convolutions only (no attention): can be applied to unbounded sequence lengths; the prediction of the next token depends only on previous tokens inside the receptive field, which is finite but grows with depth and dilation; allows autoregressive sampling; highly GPU-parallelizable; trained with teacher forcing;
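A minimal NumPy sketch of the core building block: a causal 1D convolution that left-pads the input so each output position depends only on the current and earlier inputs (the function name `causal_conv1d` is illustrative, not from any specific library).

```python
import numpy as np

def causal_conv1d(x, w):
    """Causal 1D convolution: output[t] depends only on x[:t+1].

    Left-pads with zeros so the output has the same length as the
    input; stacking such layers grows the receptive field with depth.
    """
    k = len(w)
    xp = np.concatenate([np.zeros(k - 1), x])  # zero-pad on the left only
    return np.array([xp[t:t + k] @ w for t in range(len(x))])

x = np.array([1.0, 2.0, 3.0, 4.0])
w = np.array([0.5, 0.5])          # kernel of size 2
y = causal_conv1d(x, w)           # y[t] = 0.5*x[t-1] + 0.5*x[t], with x[-1] = 0
```

Because every output depends only on past inputs, all time steps can be computed in parallel during teacher-forced training, while sampling still proceeds one token at a time.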