TEEC_InvokeCommand(forward) failed when running fl_tee_layerwise.sh on HiKey 960
HenryHu2000 opened this issue · comments
Hello @mofanv,
I attempted to run fl_tee_layerwise.sh on a HiKey 960, the same board used in the original PPFL paper, "PPFL: Privacy-preserving Federated Learning with Trusted Execution Environments". However, I'm getting TEEC_InvokeCommand(forward) failed 0xffff3024 origin 0x3 when running fl_tee_layerwise.sh, the same error as in mofanv/darknetz#14 and mofanv/darknetz#29. Other scripts like fl_tee_standard_noss.sh and fl_tee_standard_ss.sh run correctly.
Since the greedy-cnn-aux.cfg, greedy-cnn-layer1.cfg, greedy-cnn-layer2.cfg, greedy-cnn-layer3.cfg and mnist_greedy-cnn.cfg files required by fl_tee_layerwise.sh do not exist under the tz_datasets/cfg folder, I manually copied them from PPFL/server_side_sgx/cfg.
Error log:
============= initialization =============
============= layer 1 =============
============= round 1 =============
============= copy weights server -> client 1 =============
Warning: Permanently added '[127.0.0.1]:8888' (ECDSA) to the list of known hosts.
real 0m1.711s
user 0m0.008s
sys 0m0.000s
tee weights: 82356 Bytes
============= ssh to the client and local training =============
layer filters size input output
0 conv_TA 2 3 x 3 / 1 32 x 32 x 3 -> 32 x 32 x 2 0.000 BFLOPs
1 conv_TA 2 3 x 3 / 1 32 x 32 x 2 -> 32 x 32 x 2 0.000 BFLOPs
2 connected_TA 2048 -> 10
Prepare session with the TA
Begin darknet
mnist_greedy-cnn
1
workspace_size=110592
3 softmax_TA 10
4 cost_TA 10
Loading weights from /root/models/mnist/mnist_greedy-cnn_global.weights...Done!
Learning Rate: 0.01, Momentum: 0.9, Decay: 5e-05
3000
32 28
output file: /media/results/train_mnist_greedy-cnn_pps0_ppe4.txt
current_batch=10
Loaded: 0.003913 seconds
darknetp: TEEC_InvokeCommand(forward) failed 0xffff3024 origin 0x3
real 0m1.594s
user 0m0.003s
sys 0m0.005s
I checked mofanv/darknetz#14 and mofanv/darknetz#29 and attempted to increase TA_STACK_SIZE and TA_DATA_SIZE in ta/include/user_ta_header_defines.h, but I am still getting the error. I cannot increase them further because that causes a TEEC_Opensession failed with code 0xffff000c origin 0x3 error, as in mofanv/darknetz#32. My current values:
/* Provisioned stack size */
#define TA_STACK_SIZE (1 * 1024 * 1024)
/* Provisioned heap size for TEE_Malloc() and friends */
#define TA_DATA_SIZE (12 * 1024 * 1024)
I isolated the failing command, darknetp classifier train -pp_start_f 0 -pp_end 4 -ss 2 "cfg/mnist.dataset" "cfg/mnist_greedy-cnn.cfg" "/root/models/mnist/mnist_greedy-cnn_global.weights", and ran it manually on the client. -pp_start_f 0 -pp_end 4 fails, but -pp_start_f 0 -pp_end 3 runs. It seems that layer 4 is the one that cannot fit into TEE memory.
Do you know what the original configuration used in PPFL: Privacy-preserving Federated Learning with Trusted Execution Environments was? Thank you!
Hi @HenryHu2000, the TEEC_InvokeCommand(forward) failed 0xffff3024 origin 0x3 error is typically caused by secure memory limits: an out-of-memory condition occurs when a layer's weight matrix is created during the forward pass. But in your test the layers are quite small and do not seem large enough to trigger this problem.
The cfg files you mentioned are not in tz_datasets/cfg, but in server_side_sgx/cfg. You may try running again with these cfg files in place.
> The cfg files you mentioned are not in tz_datasets/cfg, but in server_side_sgx/cfg. You may try running again with these cfg files in place.
Hi @mofanv, thanks for your reply. Yes, I used the cfg files in server_side_sgx/cfg but was still getting these errors. Without these cfg files, fl_tee_layerwise.sh doesn't run.
> Hi @HenryHu2000, the TEEC_InvokeCommand(forward) failed 0xffff3024 origin 0x3 error is typically caused by secure memory limits: an out-of-memory condition occurs when a layer's weight matrix is created during the forward pass. But in your test the layers are quite small and do not seem large enough to trigger this problem.
I followed exactly the same configuration as in the paper. I tried the following three configurations with fl_tee_layerwise.sh, but none of them worked:
- Device=HiKey 960, TA_STACK_SIZE=1 * 1024 * 1024, TA_DATA_SIZE=10 * 1024 * 1024 (default settings from the repo)
- Device=HiKey 960, TA_STACK_SIZE=1 * 1024 * 1024, TA_DATA_SIZE=12 * 1024 * 1024
- Device=Raspberry Pi 3, TA_STACK_SIZE=1 * 1024 * 1024, TA_DATA_SIZE=6 * 1024 * 1024
However, other scripts like fl_tee_standard_noss.sh and fl_tee_standard_ss.sh do run correctly. Changing the flag -ss 2 to -ss 1 also seems to avoid the error, but I guess that defeats the intended purpose.