amd/OpenCL-caffe

This is an experimental OpenCL version of Caffe by AMD Research. We now recommend using the official BVLC Caffe OpenCL branch at https://github.com/BVLC/caffe/tree/opencl

Does opencl-caffe support fp16?

sixsamuraisoldier opened this issue · comments

Hi there, I'm trying to figure out if this branch supports FP16 compute for the RX 480.
Thanks in advance

@sixsamuraisoldier
Only NVIDIA has an experimental FP16 branch at the moment.
However, I will add FP16 support with OpenCL to this branch in the near future: https://github.com/BVLC/caffe/tree/opencl
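Whether a given device can use such FP16 kernels at all can be checked up front via the cl_khr_fp16 extension. A minimal host-side sketch (hypothetical helper, plain OpenCL C API):

    #include <CL/cl.h>
    #include <string.h>

    /* Hypothetical helper: returns nonzero if the device advertises the
       cl_khr_fp16 extension required for half-precision kernels. */
    int device_supports_fp16(cl_device_id dev) {
        char ext[8192];
        /* CL_DEVICE_EXTENSIONS returns a space-separated extension list. */
        if (clGetDeviceInfo(dev, CL_DEVICE_EXTENSIONS,
                            sizeof(ext), ext, NULL) != CL_SUCCESS)
            return 0;
        return strstr(ext, "cl_khr_fp16") != NULL;
    }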

@sixsamuraisoldier
Note that this branch hasn't been updated in over 7 months. I am not sure it is still maintained by anyone, as @gujunli and the other developers of this branch have since left AMD, and AMD is focusing on HIP/ROCm/SPIR-V based approaches instead.
The official BVLC Caffe OpenCL branch is over at https://github.com/BVLC/caffe/tree/opencl

As naibaf7 stated, this was an experimental branch of Caffe. We recommend using the BVLC version with OpenCL support moving forward.

One correction to naibaf7's comment: we are still going to support OpenCL. All future work from the Radeon Compute Team will be in support of the work up at BVLC. Right now the team is working on a machine-learning solver that is more optimized for the AMD GPU architecture than a generic math library.

ROCm is a new driver foundation for Linux compute that supports multiple languages:

  • Single-source C++ via HCC

  • HIP for device-focused C++ with a C-style runtime to simplify CUDA porting

  • OpenCL: this will be out this fall; we are working on a new foundation which supports a much richer set of capabilities. At the same time we are bringing RX 480 support to ROCm as well.

    I will be changing the README to point people to the OpenCL port at BVLC.

@gstoner
Hey! Nice to hear from you :) great to have some signs from AMD again!
Did you see my latest email to you about 2-3 months back?

Right now we are heads down working on bringing out new capabilities. For example:

The following families of solutions support single-rate Float16:

  • Fiji-class hardware: Radeon R9 Nano, R9 Fury, R9 Fury X, FirePro S9300 x2
  • Tonga: R9 380X
  • Polaris family: RX 480, RX 470, RX 460

Here are examples of some of the instructions supported in the new GCN native ISA compiler, where we are working hard to expose Float16:

• V_FREXP_EXP_I16_F16 Returns the exponent of a half-precision float input, such that the original half float = significand * (2 ** exponent).

• V_CVT_F16_F32 Float32 to Float16.

• V_ADD_F16 D.f16 = S0.f16 + S1.f16. Supports denormals, round mode, exception flags, saturation.

• V_SUB_F16 D.f16 = S0.f16 - S1.f16. Supports denormals, round mode, exception flags, saturation. SQ translates to V_ADD_F16.

• V_MAC_F16 16-bit floating-point multiply-accumulate.

• V_FMA_F16 Fused half-precision multiply-add.

• V_MAD_F16 Floating point multiply-add (MAD). Gives same result as ADD after MUL_IEEE. Uses IEEE rules for 0*anything.

• V_MADAK_F16 16-bit floating-point multiply-add with constant add operand.

• V_MADMK_F16 16-bit floating-point multiply-add with multiply operand immediate.

• V_COS_F16 Cosine function.

• V_SIN_F16 Sine function.

• V_EXP_F16 Base-2 exponential function.

• V_LOG_F16 Base-2 logarithm function.

• V_SQRT_F16 if(S0.f16 == 1.0f) D.f16 = 1.0f; else D.f16 = ApproximateSqrt(S0.f16).

• V_FRACT_F16 Floating point ‘fractional’ part of S0.f.

• V_RCP_F16 if (S0.f16 == 1.0f), D.f16 = 1.0f; else D.f16 = ApproximateRecip(S0.f16).

• V_RSQ_F16 if(S0.f16 == 1.0f) D.f16 = 1.0f; else D.f16 = ApproximateRecipSqrt(S0.f16).

• V_RNDNE_F16 Floating-point Round-to-Nearest-Even Integer.

• V_TRUNC_F16 Floating point ‘integer’ part of S0.f. D.f16 = trunc(S0.f16). Round-to-zero semantics.

• V_LDEXP_F16 D.f16 = S0.f16 * (2 ** S1.i16); half-precision ldexp.

• V_CEIL_F16 Floating point ceiling function.

• V_FLOOR_F16 Floating-point floor function.

• V_MAX_F16 D.f16 = max(S0.f16, S1.f16). IEEE compliant. Supports denormals, round mode, exception flags, saturation.


• V_MIN_F16 D.f16 = min(S0.f16, S1.f16). IEEE compliant. Supports denormals, round mode, exception flags, saturation.

• V_CVT_PKRTZ_F16_F32 Convert two float32 numbers into a single register holding two packed 16-bit floats, rounding toward zero.

• V_DIV_FIXUP_F16 Given a numerator, denominator, and quotient from a divide, this opcode detects and applies special case numerics, modifies the quotient if necessary. This opcode also generates invalid, denorm, and divide by zero exceptions caused by the division.

• V_SUBREV_F16 D.f16 = S1.f16 - S0.f16. Supports denormals, round mode, exception flags, saturation. SQ translates to V_ADD_F16.
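To make the mapping concrete, here is a small hypothetical OpenCL C kernel (assuming a device with the cl_khr_fp16 extension) whose built-ins line up with several of the instructions above:

    #pragma OPENCL EXTENSION cl_khr_fp16 : enable

    /* Hypothetical example: half-precision built-ins that the compiler
       can lower to the V_*_F16 instructions listed above. */
    __kernel void h16_ops(__global const half *a,
                          __global const half *b,
                          __global half *out) {
        size_t i = get_global_id(0);
        half s = a[i] + b[i];          /* V_ADD_F16 */
        half m = fma(a[i], b[i], s);   /* V_FMA_F16 / V_MAC_F16 */
        half e = exp2(m);              /* V_EXP_F16 (base-2 exponential) */
        out[i] = fmax(e, sqrt(s));     /* V_MAX_F16, V_SQRT_F16 */
    }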

Also, did you know the GCN 3 architecture supports 32-bit and 16-bit integer math?

• V_ADD_U16 D.u16 = S0.u16 + S1.u16. Supports saturation (unsigned 16-bit integer domain).

• V_SUB_U16 D.u16 = S0.u16 - S1.u16. Supports saturation (unsigned 16-bit integer domain).

• V_MAD_I16 Signed integer multiply-add.

• V_MAD_U16 Unsigned integer multiply-add.

• V_SAD_U16 Sum of absolute differences with accumulation.

• V_MAX_I16 D.i[15:0] = max(S0.i[15:0], S1.i[15:0]).

• V_MAX_U16 D.u[15:0] = max(S0.u[15:0], S1.u[15:0]).

• V_MIN_I16 D.i[15:0] = min(S0.i[15:0], S1.i[15:0]).

• V_MIN_U16 D.u[15:0] = min(S0.u[15:0], S1.u[15:0]).

• V_MUL_LO_U16 D.u16 = S0.u16 * S1.u16. Supports saturation (unsigned 16-bit integer domain).

• V_CVT_F16_U16 D.f16 = uint16_to_flt16(S.u16). Supports denormals, rounding, exception flags and saturation.

• V_CVT_F16_I16 D.f16 = int16_to_flt16(S.i16). Supports denormals, rounding, exception flags and saturation.

• V_SUBREV_U16 D.u16 = S1.u16 - S0.u16. Supports saturation (unsigned 16-bit integer domain). SQ translates this to V_SUB_U16 with reversed operands.
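Similarly, a hypothetical OpenCL C snippet using 16-bit integer types that maps onto the instructions above:

    /* Hypothetical example: 16-bit integer built-ins that can lower to the
       V_*_U16 instructions listed above. */
    __kernel void i16_ops(__global const ushort *a,
                          __global const ushort *b,
                          __global ushort *out) {
        size_t i = get_global_id(0);
        ushort s = add_sat(a[i], b[i]);         /* V_ADD_U16 (saturating) */
        ushort m = mad_sat(a[i], b[i], s);      /* V_MAD_U16 multiply-add */
        out[i] = max(m, abs_diff(a[i], b[i]));  /* V_MAX_U16; |a-b| as in SAD */
    }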

You can find out more on Float16 in the GCN3 ISA manual: http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/07/AMD_GCN3_Instruction_Set_Architecture.pdf

Also, we have now added disassembler/assembler support to the compiler, and soon inline assembly support, so you will be able to tune your code even further.


Thanks everyone for the information, I will post this question on the OpenCL branch of Caffe.

@gstoner
One quick question: does Polaris (the RX 480) support FP16 at a 2:1 rate?
Thanks


I guess I should have bolded the rate: it is 1x rate for this generation of GPUs. Remember, the base instructions are part of the GFX8 GPU family. We have more stuff coming.
The following families of solutions support single-rate Float16:

  • Fiji-class hardware: Radeon R9 Nano, R9 Fury, R9 Fury X, FirePro S9300 x2
  • Tonga: R9 380X
  • Polaris family: RX 480, RX 470, RX 460


@gstoner
Excited for the next generation then :)

Follow-up on BVLC Caffe here: BVLC/caffe#4515

Our development branch of the LLVM AMDGPU compiler will be supporting native Float16 and Int16 instructions, instead of emulating FP16/Int16 with up-convert and down-convert instructions to go from FP16/Int16 to Float and back. We are now plumbing this through the tools.

These are FP16 tests on Fiji hardware successfully executing a matrix multiplication with half types, once with conversion and once with native instructions.

Original, conversion-based:

    flat_load_ushort v8, v[6:7]
    flat_load_ushort v9, v[4:5]
    v_cvt_f32_f16_e32 v8, v8
    v_cvt_f32_f16_e32 v9, v9
    v_mac_f32_e32 v3, v9, v8

New, native Float16:

    flat_load_ushort v8, v[6:7]
    flat_load_ushort v9, v[4:5]
    v_mac_f16_e32 v3, v9, v8
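For context, a kernel of roughly this shape (a hypothetical OpenCL C sketch, not the actual test source; assumes cl_khr_fp16) produces that inner loop:

    #pragma OPENCL EXTENSION cl_khr_fp16 : enable

    /* Each work-item computes one element of y = A * x with half operands.
       Conversion-based codegen lowers the accumulation to
       v_cvt_f32_f16 + v_mac_f32; native codegen emits v_mac_f16. */
    __kernel void hgemv_naive(const int K,
                              __global const half *A,  /* M x K, row-major */
                              __global const half *x,  /* K */
                              __global half *y) {      /* M */
        size_t row = get_global_id(0);
        half acc = (half)0;
        for (int k = 0; k < K; ++k)
            acc += A[row * (size_t)K + k] * x[k];
        y[row] = acc;
    }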

One more thing: Eigen has been ported over to AMD GPUs via HIP.

Float16 and Int16 support for native GFX8.x-based GPUs is in the LLVM 4.0 source tree: llvm-mirror/llvm@9027123