facebookresearch/adaptive-span: Transformer training code for sequential tasks
Stargazers: 605 · Watchers: 17 · Issues: 21 · Forks: 60
facebookresearch/adaptive-span Issues
The way you preprocess data is different from that of Transformer-XL · Closed 5 years ago · 5 comments
A question about parameter z_t · Closed 2 years ago · 9 comments
Understanding adaptive-span loss · Closed 2 years ago · 7 comments (see the sketch below this list)
Generate text · Closed 2 years ago · 1 comment
What does batch-size mean when using distributed training? · Closed 2 years ago · 1 comment
Accept a mask to remove padding in batch · Closed 2 years ago · 1 comment
Confuse · Closed 2 years ago · 1 comment
What does cache_size mean? · Closed 2 years ago · 1 comment
Where to find the pretrained checkpoint? · Closed 2 years ago · 1 comment
Why does the hyper-parameter --batch-sz affect the bpc during evaluation? · Closed 2 years ago · 3 comments
Please convert to a permissive license · Updated 4 years ago
Understanding graphs from papers · Updated 4 years ago
BPC · Closed 4 years ago · 6 comments
Warning with PyTorch 1.4 · Closed 4 years ago · 4 comments
Queries about adaptive span · Closed 4 years ago · 1 comment
Compute attention span of individual attention heads · Closed 4 years ago · 1 comment
Will adaptive-span have faster predictive speeds than GPT-2? · Closed 4 years ago · 2 comments
Why not compare other local attention methods? · Closed 5 years ago · 2 comments
Did you try to start with the maximum possible cache size? · Closed 5 years ago · 2 comments
Question: How to reduce the memory in this project · Closed 5 years ago · 7 comments
Can using a mask reduce FLOPs? · Closed 5 years ago · 2 comments
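Several of the issues above (A question about parameter z_t, Understanding adaptive-span loss, Compute attention span of individual attention heads) circle around the soft span mask from the Adaptive Attention Span paper that this repository implements. Below is a minimal PyTorch sketch of that mask, m_z(x) = clamp((R + z - x) / R, 0, 1), applied to post-softmax attention weights and renormalized; the class and parameter names (SoftSpanMask, max_span, ramp, lambda_span) are illustrative assumptions, not the repository's actual API.

import torch
import torch.nn as nn

class SoftSpanMask(nn.Module):
    """Soft span mask m_z(x) = clamp((R + z - x) / R, 0, 1).

    Illustrative sketch of the technique from the paper; names are
    assumptions, not the adaptive-span repository's actual API.
    """

    def __init__(self, max_span: int, ramp: int = 32):
        super().__init__()
        self.max_span = max_span  # S: hard upper limit on the span
        self.ramp = ramp          # R: width of the soft ramp from 1 down to 0
        # z: learnable span; the full model learns one per attention head
        self.z = nn.Parameter(torch.zeros(1))

    def forward(self, attn: torch.Tensor) -> torch.Tensor:
        # attn: (..., span) post-softmax weights over the last `span` keys,
        # ordered oldest to newest, so the current position has distance 0
        span = attn.size(-1)
        x = torch.arange(span - 1, -1, -1, dtype=attn.dtype, device=attn.device)
        z = self.z.clamp(0, self.max_span)
        mask = ((self.ramp + z - x) / self.ramp).clamp(0, 1)  # m_z(x)
        attn = attn * mask
        return attn / (attn.sum(dim=-1, keepdim=True) + 1e-8)  # renormalize

    def span_penalty(self) -> torch.Tensor:
        # L1 term on the learned span, added to the training loss
        return self.z.clamp(0, self.max_span).mean()

Per the paper, the adaptive-span loss asked about above is just this L1 penalty on the learned spans, added to the task loss as lambda_span * mask.span_penalty() with a small coefficient, so each head keeps only as much context as it actually needs.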