This bug is in /dev/forward/attention_forward.cu, kernel function 'attention_softmax_kernel1':
Both 'preatt' and 'att' have shape (B, NH, T, T), and the total number of threads is 'B * NH * T', so the head index 'h' should be: h = (idx / T) % NH
And the time index 't' should be: t = idx % T
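
To double-check the proposed indexing without a GPU, here is a minimal host-side sketch in plain C++. It assumes a row-major (B, NH, T, T) layout and one thread per row of length T. The names RowIndex, decompose, row_offset and rows_are_contiguous are hypothetical helpers for illustration, not part of the kernel, and the batch index b = idx / (NH * T) is my assumption for the remaining coordinate.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper type: coordinates of one attention row.
struct RowIndex { int b; int h; int t; };

// Decompose a flat thread index over B * NH * T rows.
// t varies fastest, then h, then b, matching a row-major (B, NH, T, T) layout.
inline RowIndex decompose(int idx, int NH, int T) {
    assert(idx >= 0 && NH > 0 && T > 0);
    RowIndex r;
    r.t = idx % T;          // time index, as proposed above
    r.h = (idx / T) % NH;   // head index, as proposed above
    r.b = idx / (NH * T);   // batch index (assumed; not stated in the report)
    return r;
}

// Offset of the first element of row (b, h, t) in a row-major (B, NH, T, T) array.
inline std::size_t row_offset(const RowIndex& r, int NH, int T) {
    return ((static_cast<std::size_t>(r.b) * NH + r.h) * T + r.t) * T;
}

// Returns true if every thread index idx maps to the row starting at idx * T,
// i.e. consecutive threads process consecutive rows of the array.
inline bool rows_are_contiguous(int B, int NH, int T) {
    for (int idx = 0; idx < B * NH * T; ++idx) {
        RowIndex r = decompose(idx, NH, T);
        if (row_offset(r, NH, T) != static_cast<std::size_t>(idx) * T) {
            return false;
        }
    }
    return true;
}
```

For example, with NH = 3 and T = 4, decompose(17, 3, 4) gives b = 1, h = 1, t = 1, and the row offset is ((1 * 3 + 1) * 4 + 1) * 4 = 68 = 17 * 4, so thread 17 reads the 18th row of the array.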