
GSdx v0.1.16: AVX or SSE2

You can't really say which instruction set is faster, because that depends on the implementation: SSE2 instructions vary greatly in speed between different CPUs. Furthermore, the AVX instruction set isn't actually different from SSE(4). It's exactly the same instructions, just extended to 256 bit (for floats only; 256-bit integer operations have to wait for AVX2 on Haswell). The instructions are mostly just slightly different with AVX because the VEX encoding has a non-destructive three-operand syntax: it makes the instructions slightly larger, but saves most register-to-register move instructions, which should be good for a small performance improvement.

AVX with integers is thus only minimally faster than SSE4 on the same CPU (the only advantage comes from fewer move instructions), and with floats it's a bit more than twice as fast in theory (except for divisions on Sandy Bridge, where the divide unit is only 4-wide; Ivy Bridge "fixed" that). This assumes, though, that your algorithm really can be adjusted to use 8-wide floats trivially, and further assumes no load/store bottlenecks (Sandy Bridge can load two 128-bit values and store one 128-bit value per clock), not to mention other limits such as memory bandwidth and latency, which stay the same either way.

I don't know much about VMX. I believe it has somewhat better support for horizontal operations and shuffles, but whether you can benefit from such instructions can't be said in general. About VMX on Xenon, I have absolutely no idea what the throughput of even the basic operations (float vector multiply and add) is. Just because the instructions are 4-wide doesn't tell you much about what the CPU can do per clock, and I'm not sure that information was ever published for Xenon (it might be that, just like older CPUs that supported SSE2, it really only has 2-wide rather than 4-wide execution units).
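To make the encoding difference concrete, here is a minimal sketch (my illustration, not from the original discussion) of the same multiply in SSE and AVX intrinsics:

```c
#include <immintrin.h>

/* SSE: mulps is destructive (dst = dst * src), so when 'a' must stay
   live the compiler has to emit an extra movaps to preserve it. */
__m128 mul4(__m128 a, __m128 b) {
    return _mm_mul_ps(a, b);    /* 4 floats per instruction */
}

/* AVX: vmulps is VEX-encoded with three operands (dst = src1 * src2),
   which saves that copy, and a ymm register holds 8 floats, not 4. */
__m256 mul8(__m256 a, __m256 b) {
    return _mm256_mul_ps(a, b); /* 8 floats per instruction */
}
```

The two bodies are identical in shape; only the width and the encoding change, which is exactly why integer AVX gains so little over SSE4.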

For doing math on single elements, yeah, sure, but that's not all that much faster than x87. SIMD does not make the operations any faster; it lets you do more of them at the same time. So instead of loading two individual elements, you load two vectors of 4 or 8 and multiply each element of one vector with the corresponding element of the other. That means not only do you need special instructions; pre-AVX2 you also have to lay your data out so that you can load consecutive (16-byte-aligned) elements from memory. And since cross-lane operations are slow, you ideally want each vector to hold the same field from different objects. So instead of putting the value in the object, you have to build an array that has one value from each object, for each value in said objects (position and speed = two 4-element vectors).

You can probably see why this gets hairy fast. It's hard to do by hand, and nigh impossible for a compiler to do automatically. There is some downright heroic work on the subject by the Intel and GCC teams, but even they don't get that much speedup from autovectorized code, so today only the things that are absolutely trivial tend to get optimized. AVX2 brings gather instructions, which are basically vectorized loads: they take a base address and a vector full of offsets, and fill the target register with the values found at those offsets. This should make vector instructions useful in a lot of places they weren't before, because a lot of loops can then be trivially vectorized by the compiler.
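The data-layout point is easier to see in code. Below is a hypothetical particle example (the names and sizes are mine): first the natural array-of-structures layout, then the structure-of-arrays layout described above, where the same field of consecutive objects sits contiguously and loads straight into a vector register:

```c
#include <immintrin.h>

#define N 1024

/* Array of structures: the natural layout, but the x positions of
   consecutive particles are 32 bytes apart, so eight of them cannot
   be fetched with one contiguous vector load. */
struct ParticleAoS { float x, y, z, w, vx, vy, vz, vw; };

/* Structure of arrays: one array per field, one value from each
   object, so 8 consecutive x positions fill a ymm register. */
struct ParticlesSoA {
    _Alignas(32) float x[N], y[N], z[N];
    _Alignas(32) float vx[N], vy[N], vz[N];
};

/* position += speed for the x component, 8 particles per iteration. */
void integrate_x(struct ParticlesSoA *p) {
    for (int i = 0; i < N; i += 8) {
        __m256 x = _mm256_load_ps(&p->x[i]);    /* aligned 8-wide load */
        __m256 v = _mm256_load_ps(&p->vx[i]);
        _mm256_store_ps(&p->x[i], _mm256_add_ps(x, v));
    }
}
```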
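And the gather instruction itself, again as a sketch: _mm256_i32gather_ps is the real AVX2 intrinsic, while the loop around it is my example of the indexed-load shape (out[i] = table[idx[i]]) that compilers could not vectorize before gather existed:

```c
#include <immintrin.h>

/* out[i] = table[idx[i]], 8 elements at a time. */
void gather_loop(float *out, const float *table, const int *idx, int n) {
    for (int i = 0; i + 8 <= n; i += 8) {
        /* a vector full of offsets... */
        __m256i off = _mm256_loadu_si256((const __m256i *)&idx[i]);
        /* ...plus a base address; the scale 4 = sizeof(float) */
        __m256 v = _mm256_i32gather_ps(table, off, 4);
        _mm256_storeu_ps(&out[i], v);
    }
}
```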

The reverse will happen. Note that a Haswell quad-core will be capable of 500 GFLOPS, while today's 22 nm HD 4000 can only do about 300 GFLOPS. GPUs also still have a lot of catching up to do to support complex code without choking on latency and bandwidth. So you can't get rid of the CPU's SIMD units any time soon, and the GPU is evolving into a CPU architecture to support more complex generic code.

Eventually it will make sense to just move all programmable throughput computing to the CPU; the GPU and CPU are converging. AVX2 will already be perfectly suitable for graphics shaders. The only remaining deal breaker is the higher power consumption, and the VEX encoding already supports extending the registers to 1024 bit: by executing such instructions on the 256-bit units over four cycles, the CPU's front end and scheduler will have four times less switching activity, dramatically lowering power consumption.
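As a sanity check on that 500 GFLOPS figure (assuming Haswell's two 256-bit FMA ports per core and a clock around 3.9 GHz; the clock is my assumption, not stated above):

\[
4~\text{cores} \times 2~\text{FMA ports} \times 8~\text{lanes} \times 2~\tfrac{\text{FLOP}}{\text{FMA}} \times 3.9~\text{GHz} \approx 500~\text{GFLOPS}.
\]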












