Exception after a few hours of training with Villain
eassa opened this issue
I am using an ASUS TUF Gaming Radeon™ RX 7900 XT OC Edition 20GB GDDR6
on Windows 10.
I followed these instructions:
https://forum.faceswap.dev/app.php/faqpage?sid=47859b5acaac6c66cf49a85c70d6b1bd#f1r1
https://forum.faceswap.dev/viewtopic.php?t=20
DirectML installation
While training the Villain model with a batch size of 20, I get this error after a few hours of training.
I have already hit this error multiple times:
2024-04-08 03:15:23.769735: F tensorflow/c/logging.cc:43] HRESULT failed with 0x887a0005: chunk->resource->Map(0, nullptr, &upload_heap_data)
2024-04-08 03:15:23.769986: F tensorflow/c/logging.cc:43] HRESULT failed with 0x887
I always get the exception noted in the issue, but this time, after 10 hours of training, I got this exception as well:
2024-04-08 13:32:47.839589: F tensorflow/c/logging.cc:43] HRESULT failed with 0x887a0001: dml_device_->GetDeviceRemovedReason()
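For anyone reading these logs, the HRESULT values can be mapped to their symbolic DXGI names. The following is a minimal sketch, not part of faceswap or DirectML: the `decode_hresults` helper and the `DXGI_ERRORS` table are hypothetical names made up for illustration, and the table is only a small subset of the codes defined in the Windows SDK. Truncated codes are reported as unknown rather than guessed.

```python
import re

# Small, non-exhaustive subset of DXGI error codes from the Windows SDK.
# The table name and the helper below are illustrative assumptions only.
DXGI_ERRORS = {
    0x887A0001: "DXGI_ERROR_INVALID_CALL",
    0x887A0005: "DXGI_ERROR_DEVICE_REMOVED",
    0x887A0006: "DXGI_ERROR_DEVICE_HUNG",
    0x887A0007: "DXGI_ERROR_DEVICE_RESET",
}

# Matches the "HRESULT failed with 0x..." fragment seen in the log lines.
HRESULT_RE = re.compile(r"HRESULT failed with (0x[0-9a-fA-F]+)")


def decode_hresults(log_text):
    """Return (raw_code, symbolic_name) pairs for each HRESULT in the log."""
    results = []
    for match in HRESULT_RE.finditer(log_text):
        raw = match.group(1)
        name = DXGI_ERRORS.get(int(raw, 16), "unknown")
        results.append((raw, name))
    return results


if __name__ == "__main__":
    sample = (
        "F tensorflow/c/logging.cc:43] HRESULT failed with 0x887a0005: "
        "chunk->resource->Map(0, nullptr, &upload_heap_data)\n"
        "F tensorflow/c/logging.cc:43] HRESULT failed with 0x887a0001: "
        "dml_device_->GetDeviceRemovedReason()\n"
    )
    for raw, name in decode_hresults(sample):
        print(raw, name)
```

Run against the two complete lines above, this reports 0x887a0005 as DXGI_ERROR_DEVICE_REMOVED and 0x887a0001 as DXGI_ERROR_INVALID_CALL.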
Unfortunately this issue is upstream from us and comes from a timeout within DirectML. See below for more information and potential mitigation steps: