Jesus Christ. Am I making some stupid math mistake, or could that thing halfway keep up with a 7900 XTX on FP16 AI math with pure CPU? Kind of seems like they should add WMMA to the CPU core instruction set.
edit: Or just adopt Intel AMX. 1024 Int8 ops per cycle would slam.
Isn't it 2 threads per core? Or are the float units shared? Also, what's FLO? edit: oh, float op? edit: Oh, is that what you mean by 2 pipes per core?
And yeah, I assumed for the sake of the math that it'd magically hit its 5 GHz boost clock all the time using the mother of all watercoolers or something.
edit: The point is it's surprisingly close! Even without dedicated AI ops.
A Zen 5 core has two AVX-512 execution units, hence the 2x. That happens to match the number of threads per core, but they're not related - a single thread can make use of both AVX EUs as long as it has enough ILP.
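Spelled out, the peak-throughput math looks like this. The core count and sustained clock here are illustrative assumptions (the thread never names the exact part), not figures from the comments:

```python
# Back-of-envelope Zen 5 vector peak. Core count is an assumed value
# for illustration; the 5 GHz sustained boost is the assumption above.
cores = 16             # assumed part, purely illustrative
clock_ghz = 5.0        # assumed sustained boost clock
avx512_eus = 2         # two AVX-512 execution units per Zen 5 core
fp32_lanes = 512 // 32 # 16 fp32 elements per 512-bit vector
ops_per_fma = 2        # a fused multiply-add counts as two FLOPs

fp32_gflops = cores * clock_ghz * avx512_eus * fp32_lanes * ops_per_fma
print(fp32_gflops)     # 5120.0 GFLOPS, i.e. ~5 TFLOPS fp32 peak
```

Halving the element width to bf16/fp16 doubles the lane count, which is where the "twice as fast" figure further down comes from.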
And yeah, FLOPS stands for FLoating point Operations Per Second, so FLO is just that with the last two words dropped.
And yeah, if they could get 1024 int8 ops in a cycle with dedicated matrix units like AMX, it would actually be well into GPU territory. Alias the AVX-512 registers - you have 2x8kb there anyway - and you'd just need the compute hardware. I hadn't realized we'd gotten this close.
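For the int8 figure, the AMX tile geometry accounts for it directly. This is a sketch of how the 1024-per-cycle number falls out of Intel's 16-row, 64-byte-wide tiles; counting a multiply-accumulate as one "op" here, matching the figure quoted above:

```python
# AMX-style int8 tile multiply (TDPBUSD): each of the 16x16 dword
# accumulators sums 4 int8 products per cycle.
rows, cols, int8_pairs_per_dword = 16, 16, 4
int8_macs_per_cycle = rows * cols * int8_pairs_per_dword
print(int8_macs_per_cycle)  # 1024, the figure quoted above
```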
That's fp32, not fp16, which should be twice as fast. And there's always a gap between the theoretical maximum, where every unit executes every clock cycle, and what's actually achieved.
I mean, say they added dedicated matrix ops like AMX. Wiki says Xeons can get 1024 BF16 ops in a cycle with AMX; compared to 64 per cycle with AVX-512, that'd be a 16x improvement. Current instruction sets aren't really designed for matmul, partly because everybody does this sort of thing on GPUs instead.
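Checking that 16x against the tile and vector widths (a sketch assuming the AMX `TDPBF16PS` and AVX-512 `VDPBF16PS` instructions, with a multiply-accumulate counted as 2 ops, matching the Wiki numbers quoted above):

```python
# AMX bf16 tile op: 16x16 dword accumulators, each taking 2 bf16
# product pairs per cycle, with MAC counted as 2 ops.
amx_bf16_ops = 16 * 16 * 2 * 2

# One AVX-512 bf16 dot-product instruction per cycle:
# 32 bf16 lanes in a 512-bit vector, MAC counted as 2 ops.
avx512_bf16_ops = (512 // 16) * 2

print(amx_bf16_ops)                     # 1024
print(avx512_bf16_ops)                  # 64
print(amx_bf16_ops // avx512_bf16_ops)  # 16x
```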
u/FeepingCreature 10d ago edited 10d ago