TEEC_InvokeCommand(forward) failed when running fl_tee_layerwise.sh on HiKey 960

Question

HenryHu2000 opened this issue a year ago

Hello @mofanv,
I attempted to run fl_tee_layerwise.sh on a HiKey 960, the same board used in the original PPFL paper, "PPFL: Privacy-preserving Federated Learning with Trusted Execution Environments". However, the script fails with TEEC_InvokeCommand(forward) failed 0xffff3024 origin 0x3, the same error reported in mofanv/darknetz#14 and mofanv/darknetz#29. Other scripts, such as fl_tee_standard_noss.sh and fl_tee_standard_ss.sh, run correctly.

The files greedy-cnn-aux.cfg, greedy-cnn-layer1.cfg, greedy-cnn-layer2.cfg, greedy-cnn-layer3.cfg and mnist_greedy-cnn.cfg, which fl_tee_layerwise.sh requires, do not exist under tz_datasets/cfg, so I copied them manually from PPFL/server_side_sgx/cfg.
Error log:

  ============= initialization =============
  ============= layer 1 =============
  ============= round 1 =============
  ============= copy weights server -> client 1 =============
  Warning: Permanently added '[127.0.0.1]:8888' (ECDSA) to the list of known hosts.
  
  real    0m1.711s
  user    0m0.008s
  sys     0m0.000s
  tee weights: 82356 Bytes
  ============= ssh to the client and local training =============
  layer     filters    size              input                output
      0 conv_TA    2  3 x 3 / 1    32 x  32 x   3   ->    32 x  32 x   2  0.000 BFLOPs
      1 conv_TA    2  3 x 3 / 1    32 x  32 x   2   ->    32 x  32 x   2  0.000 BFLOPs
      2 connected_TA                         2048  ->    10
  Prepare session with the TA
  Begin darknet
  mnist_greedy-cnn
  1
  workspace_size=110592
      3 softmax_TA                                       10
      4 cost_TA                                          10
  Loading weights from /root/models/mnist/mnist_greedy-cnn_global.weights...Done!
  Learning Rate: 0.01, Momentum: 0.9, Decay: 5e-05
  3000
  32 28
  output file: /media/results/train_mnist_greedy-cnn_pps0_ppe4.txt
  current_batch=10 
  Loaded: 0.003913 seconds
  darknetp: TEEC_InvokeCommand(forward) failed 0xffff3024 origin 0x3
  
  real    0m1.594s
  user    0m0.003s
  sys     0m0.005s

I checked mofanv/darknetz#14 and mofanv/darknetz#29 and tried increasing TA_STACK_SIZE and TA_DATA_SIZE in ta/include/user_ta_header_defines.h to the values below, but I still get the error. I cannot increase them further, because that causes a TEEC_Opensession failed with code 0xffff000c origin 0x3 error, as described in mofanv/darknetz#32.

/* Provisioned stack size */
#define TA_STACK_SIZE			(1 * 1024 * 1024)

/* Provisioned heap size for TEE_Malloc() and friends */
#define TA_DATA_SIZE			(12 * 1024 * 1024)
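
For reference, the two result codes quoted above can be read against the constants of the GlobalPlatform TEE Client API. The sketch below redefines those constants locally (with a hypothetical EX_ prefix) so that it compiles without the OP-TEE headers; the helper names teec_result_name and teec_origin_name are illustrative and are not part of darknetz or OP-TEE. Under that reading, 0xffff3024 is TEEC_ERROR_TARGET_DEAD (the TA died, for example after a panic), 0xffff000c is TEEC_ERROR_OUT_OF_MEMORY, and origin 0x3 is TEEC_ORIGIN_TEE. This is consistent with a memory problem inside the TA but does not by itself prove one.

```c
#include <stdint.h>

/* Values as specified by the GlobalPlatform TEE Client API.
 * Redefined here under an EX_ prefix (an assumption for this sketch)
 * so the file builds without tee_client_api.h. */
#define EX_TEEC_SUCCESS              0x00000000u
#define EX_TEEC_ERROR_OUT_OF_MEMORY  0xFFFF000Cu
#define EX_TEEC_ERROR_TARGET_DEAD    0xFFFF3024u

/* Hypothetical helper: map a TEEC result code to its symbolic name.
 * Only the codes that appear in this thread are covered. */
const char *teec_result_name(uint32_t res)
{
    switch (res) {
    case EX_TEEC_SUCCESS:             return "TEEC_SUCCESS";
    case EX_TEEC_ERROR_OUT_OF_MEMORY: return "TEEC_ERROR_OUT_OF_MEMORY";
    case EX_TEEC_ERROR_TARGET_DEAD:   return "TEEC_ERROR_TARGET_DEAD";
    default:                          return "unknown";
    }
}

/* Hypothetical helper: map the "origin" field to the layer that
 * produced the error, following the same specification. */
const char *teec_origin_name(uint32_t origin)
{
    switch (origin) {
    case 0x1: return "TEEC_ORIGIN_API";
    case 0x2: return "TEEC_ORIGIN_COMMS";
    case 0x3: return "TEEC_ORIGIN_TEE";
    case 0x4: return "TEEC_ORIGIN_TRUSTED_APP";
    default:  return "unknown";
    }
}
```

Calling teec_result_name(0xffff3024) and teec_origin_name(0x3) on the values from the log gives the names discussed above.
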

I isolated the failing command, darknetp classifier train -pp_start_f 0 -pp_end 4 -ss 2 "cfg/mnist.dataset" "cfg/mnist_greedy-cnn.cfg" "/root/models/mnist/mnist_greedy-cnn_global.weights", and ran it manually on the client. With -pp_start_f 0 -pp_end 4 it fails, but with -pp_start_f 0 -pp_end 3 it runs. It seems that layer 4 is the one that cannot fit into TEE memory.

Do you know what configuration was originally used in "PPFL: Privacy-preserving Federated Learning with Trusted Execution Environments"? Thank you!

Mo, Fan Vincent · Answer 1 · Thu Apr 20 2023 11:46:13 GMT+0800 (China Standard Time)

Hi @HenryHu2000, the TEEC_InvokeCommand(forward) failed 0xffff3024 origin 0x3 error is typically caused by secure memory limits: the TA runs out of memory when a layer's weight matrix is allocated during the forward pass. However, the layers in your test are quite small and do not seem large enough to trigger this problem.
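
The point that the layers are small can be checked with some arithmetic on the figures in the error log. The sketch below is a rough estimate and makes assumptions: the convolutional layers have no batch normalization, the weights file starts with a 20-byte header (three 4-byte version integers plus an 8-byte counter, as in upstream Darknet), and the convolution workspace follows upstream Darknet's out_h * out_w * size * size * channels * sizeof(float) formula. All function names are hypothetical. Under these assumptions the results match the logged values tee weights: 82356 Bytes and workspace_size=110592, both far below a TA_DATA_SIZE of 10 to 12 MiB.

```c
#include <stddef.h>

/* Weights plus biases of a conv layer, assuming no batch normalization. */
size_t conv_params(size_t filters, size_t ksize, size_t in_channels)
{
    return filters * ksize * ksize * in_channels + filters;
}

/* Weights plus biases of a fully connected layer. */
size_t connected_params(size_t inputs, size_t outputs)
{
    return inputs * outputs + outputs;
}

/* im2col-style workspace of a conv layer, following the formula
 * used by upstream Darknet (assumed to carry over to darknetz). */
size_t conv_workspace_bytes(size_t out_h, size_t out_w,
                            size_t ksize, size_t in_channels)
{
    return out_h * out_w * ksize * ksize * in_channels * sizeof(float);
}

/* Estimated size of the weights file for the three parameterized
 * layers printed in the log (layers 0, 1 and 2). */
size_t model_file_bytes(void)
{
    const size_t header_bytes = 20; /* assumed Darknet-style header */
    size_t params = conv_params(2, 3, 3)          /* layer 0: 32x32x3 -> 32x32x2 */
                  + conv_params(2, 3, 2)          /* layer 1: 32x32x2 -> 32x32x2 */
                  + connected_params(2048, 10);   /* layer 2: 2048 -> 10 */
    return header_bytes + params * sizeof(float);
}
```

This estimate only covers stored weights and the layer 0 workspace. It does not account for activations, gradient buffers, or anything allocated because of the -ss 2 flag, so it cannot rule out a memory problem elsewhere in the TA.
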

Mo, Fan Vincent · Answer 2 · Thu Apr 20 2023 11:47:57 GMT+0800 (China Standard Time)

The cfg files you mentioned are not in tz_datasets/cfg but in server_side_sgx/cfg. You could try running again with those cfg files in place.

Henry Hu · Answer 3 · Thu Apr 20 2023 21:36:59 GMT+0800 (China Standard Time)

> The cfg files you mentioned are not in tz_datasets/cfg, but server_side_sgx/cfg. You may try run again with these cfg files inside

Hi @mofanv, thanks for your reply. Yes, I used the cfg files in server_side_sgx/cfg but was still getting these errors. Without these cfg files, fl_tee_layerwise.sh doesn't run.

Henry Hu · Answer 4 · Thu Apr 20 2023 21:43:03 GMT+0800 (China Standard Time)

> Hi @HenryHu2000 , the TEEC_InvokeCommand(forward) failed 0xffff3024 origin 0x3 error is typically caused by the secure memory limits. When one layer's weight matrix is created during the forward pass, out-of-memory happens. But I found in your test, the layer is quite small, and seems not large enough to trigger this problem?

I followed exactly the same configuration as in the paper. I tried the following three configurations with fl_tee_layerwise.sh, but none of them worked:

Device=HiKey 960, TA_STACK_SIZE=1 * 1024 * 1024, TA_DATA_SIZE=10 * 1024 * 1024 (default settings from the repo)
Device=HiKey 960, TA_STACK_SIZE=1 * 1024 * 1024, TA_DATA_SIZE=12 * 1024 * 1024
Device=Raspberry Pi 3, TA_STACK_SIZE=1 * 1024 * 1024, TA_DATA_SIZE=6 * 1024 * 1024

However, other scripts, such as fl_tee_standard_noss.sh and fl_tee_standard_ss.sh, do run correctly. Changing the flag -ss 2 to -ss 1 also seems to avoid the error, but I guess that defeats the intended purpose.