Some tests failing in the M3 Max
dgazzoni opened this issue · comments
This is the output of make test on an Apple M3 Max:
Testing AMX_LDX... Failed on iteration 0.0 (operand 0xe7ee80da3d4b9e09)
Testing AMX_LDY... Failed on iteration 0.0 (operand 0xe7ee80da3d4b9e09)
Testing AMX_LDZ... OK
Testing AMX_LDZI... OK
Testing AMX_STX... OK
Testing AMX_STY... OK
Testing AMX_STZ... OK
Testing AMX_STZI... OK
Testing AMX_EXTRX... OK
Testing AMX_EXTRY... OK
Testing AMX_MAC16... OK
Testing AMX_FMA16... OK
Testing AMX_FMA32... OK
Testing AMX_FMA64... OK
Testing AMX_FMS16... OK
Testing AMX_FMS32... OK
Testing AMX_FMS64... OK
Testing AMX_VECINT... OK
Testing AMX_VECFP... OK
Testing AMX_MATINT... Failed on iteration 1.575 (operand 0xd060b15ad5995f37)
Testing AMX_MATFP... OK
Testing AMX_GENLUT... OK
This suggests either that Apple broke compatibility with previous versions, or that there are new features using some of the previously ignored bits in the parameters to these instructions.
I think the former is unlikely, as I have been writing lots of AMX code lately, with excellent test coverage, and I have yet to see any unexplained failures in my software tests, e.g. something that behaves differently on the M3 than on the M1 I also have here (using your documented M1 features, and also some of the documented M2 features, which work as expected on the M3). So hopefully there are new features in the M3.
I investigated this a bit by changing the random values, and I see that for AMX_LDX and AMX_LDY, out of the previously ignored bits (63, 61 and 59), only bit 61 is always set in case of a test error; bits 63 and 59 are sometimes set and sometimes not (indeed, I've seen an error for which bits 63 and 59 were not set, only 61 was).
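For anyone who wants to repeat this check on other failing operands, something along these lines should do. It is only a sketch: the helper name is made up for illustration, and the operand is the one from the make test output above.

```python
# Hypothetical helper (not part of the test suite): report which of the
# previously ignored bits are set in a failing operand.
PREVIOUSLY_IGNORED_BITS = (63, 61, 59)

def set_bits(operand, bits=PREVIOUSLY_IGNORED_BITS):
    """Return the subset of `bits` that are set in `operand`."""
    return [b for b in bits if (operand >> b) & 1]

# Operand reported for the AMX_LDX / AMX_LDY failures above.
print(set_bits(0xe7ee80da3d4b9e09))  # [63, 61]
```

In that particular operand bit 59 is clear while bits 63 and 61 are set, which is consistent with bit 61 being the one that matters.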
So I wrote a small program to investigate this, and found that bit 61 represents a strided load: when loading pairs, the stride is 4 (that is, if you start at X0, it loads to X0 and X4), whereas when loading 4 at a time, the stride is 2 (e.g. X0, X2, X4, X6). I will attach a test program and its output on my M3. For AMX_LDY, results are identical.
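To make the observed pattern concrete, here is a rough model of which registers get written when bit 61 is set. This is a sketch, not the attached test program: the function name is hypothetical, and wrapping the register index modulo 8 is an assumption that I have not measured here.

```python
NUM_REGISTERS = 8  # X and Y each hold 8 registers of 64 bytes

def strided_targets(start, count):
    """Registers written by a multi-register load with bit 61 set.

    count == 2 uses stride 4 and count == 4 uses stride 2, as observed
    above. Wrapping modulo 8 is an assumption, not a measured result.
    """
    stride = {2: 4, 4: 2}[count]
    return [(start + i * stride) % NUM_REGISTERS for i in range(count)]

print(strided_targets(0, 2))  # [0, 4]
print(strided_targets(0, 4))  # [0, 2, 4, 6]
```

In both cases the stride equals 8 divided by the number of registers loaded, so the targets are spread evenly across the register file.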
As for AMX_MATINT, I collected a bunch of values where the tests fail:
0xd060b15ad5995f37
0xd26ab256885620e0
0xd060b15ad5995f37
0x0e61b375b73c8104
0xbc7870a58e4864bc
0xce6b3046d4af6812
0xa069335db08b4b0e
0x5a71b34bf47fe485
0xee7172c6ce0a04ec
0xa87ef14a1baca54d
0xb662b045bc40cdb0
0xc4697074e454ab6f
ANDing these together, the common theme is bits 44, 45, 53 and 54 set. I see that having bits 53 and 54 set means an indexed load in ALU mode 8. For that mode, there are two lane width modes (i.e. bits 45:42): 10 or any other value. However, having bits 44 and 45 set would correspond to 12.
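The AND can be reproduced with a few lines of Python. This is just a sketch over the operands listed above; the variable names are mine, and the only field it decodes is the lane width mode at bits 45:42.

```python
# Failing AMX_MATINT operands collected above.
failing_operands = [
    0xd060b15ad5995f37, 0xd26ab256885620e0, 0xd060b15ad5995f37,
    0x0e61b375b73c8104, 0xbc7870a58e4864bc, 0xce6b3046d4af6812,
    0xa069335db08b4b0e, 0x5a71b34bf47fe485, 0xee7172c6ce0a04ec,
    0xa87ef14a1baca54d, 0xb662b045bc40cdb0, 0xc4697074e454ab6f,
]

# AND everything together to find the bits set in every failing operand.
common = 0xFFFFFFFFFFFFFFFF
for op in failing_operands:
    common &= op

common_bits = [b for b in range(64) if (common >> b) & 1]

# Lane width mode is the 4-bit field at bits 45:42.
lane_width_modes = {(op >> 42) & 0xF for op in failing_operands}

print(hex(common))       # 0x60300000000000
print(common_bits)       # [44, 45, 53, 54]
print(lane_width_modes)  # {12}
```

Every failing operand decodes to lane width mode 12, not just the ANDed value, since bits 43 and 42 are clear in all of them.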
If you'd like to investigate, but don't have access to an M3, I can run any tests you need; just let me know.
I had been wondering whether there were any additions/enhancements in M3, and seemingly the answer is yes. I don't have any M3 hardware of my own at the moment, and to do the reverse engineering myself I'd need SSH access to an M3 machine for a few days.
Thanks for the reply.
I can gladly create an account for ssh access on my M3. However, it is my personal laptop, so it may occasionally be offline, though it is even rarer for me to shut it down completely. So using tmux/screen should be a workable solution.
If you’re up for this, we just need a way to contact each other. I don’t see an email address listed either in your repository or on your website. How can I contact you? I’m not familiar with any GitHub features like direct messages. Failing that, I could reply here with my email address.
For matint, it looks like M3 adds the configuration of ALU mode 8 lane width mode 12, interpreted as 8-bit X and 16-bit Y and 32-bit Z (similar to vecint lane width mode 12). I think e159758 covers it, along with the ldx/ldy extension you described.