ucbrise / actnn

ActNN: Reducing Training Memory Footprint via 2-Bit Activation Compressed Training


How to avoid memory fragmentation in ActNN?

Jack47 opened this issue

May I know how you implemented the defragmentation in ActNN?
[screenshot]

In my model-training experience, a smaller MAX_SPLIT_SIZE gives worse performance, while a bigger MAX_SPLIT_SIZE eventually results in an OOM.
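
For reference, if MAX_SPLIT_SIZE here refers to PyTorch's `max_split_size_mb` knob (an assumption; the comment above does not name the exact setting), it is tuned through the `PYTORCH_CUDA_ALLOC_CONF` environment variable:

```python
import os

# Assumption: MAX_SPLIT_SIZE above means PyTorch's `max_split_size_mb`.
# Blocks larger than this value (in MB) are not split by the caching
# allocator; a smaller value reduces fragmentation at the cost of more
# raw cudaMalloc calls. Must be set before the first CUDA allocation.
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'max_split_size_mb:128'
```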

The corresponding code is here:

```python
elif level == 'L5':  # 2-bit + swap + defragmentation
    config.swap = True
    os.environ['PYTORCH_CACHE_THRESHOLD'] = '256000000'
    warnings.warn("The defragmentation at L5 requires modification of the c++ "
                  "code of PyTorch. You need to compile this special fork of "
                  "PyTorch: https://github.com/merrymercy/pytorch/tree/actnn_exp")
```

Great, thanks, got it: use malloc instead of the caching allocator for large allocations.
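
For completeness, a hedged usage sketch, assuming the snippet above comes from `actnn.set_optimization_level` (per the optimization levels described in the ActNN README):

```python
import actnn

# Enable the highest optimization level (2-bit + swap + defragmentation).
# Per the warning above, this requires the patched PyTorch fork.
actnn.set_optimization_level("L5")
```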