[BUG] Wrong behavior in warmup baseline
LTluttmann opened this issue
Describe the bug
The warmup baseline should simply return the evaluation results of the "normal" baseline (`self.baseline`) once the number of warmup epochs is exceeded. However, the `alpha` attribute keeps growing beyond 1 for all subsequent epochs, leading to weird baseline values.
rl4co/rl4co/models/rl/reinforce/baselines.py
Line 128 in fd58215
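The intended behavior can be sketched as follows (a minimal sketch with assumed names, not the actual rl4co code): the blend weight should saturate at 1 after warmup instead of growing without bound.

```python
def warmup_alpha(epoch: int, n_warmup_epochs: int) -> float:
    """Blend weight for the warmup wrapper:
    0 -> pure warmup baseline, 1 -> pure inner baseline.

    Clamping with min() is the fix: without it, alpha exceeds 1 once
    epoch >= n_warmup_epochs, so the combination below is no longer
    convex and the warmup term gets a negative weight.
    """
    return min(1.0, (epoch + 1) / n_warmup_epochs)
```

For example, with 5 warmup epochs this yields 0.2, 0.4, ..., 1.0 and then stays at 1.0, so the wrapper reduces to `self.baseline` after warmup.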
Moreover, a misplaced parenthesis results in a wrong combination of the exponential and the actual baseline loss
rl4co/rl4co/models/rl/reinforce/baselines.py
Lines 121 to 123 in fd58215
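For reference, a correct convex combination of the two baselines would look like the sketch below (variable names are assumed for illustration; `v, l` are the inner baseline's value and loss, `vw, lw` the warmup baseline's). A parenthesis slip such as `alpha * l + (1 - alpha * lw)` instead adds a constant 1 and scales `lw` by `-alpha` rather than `(1 - alpha)`.

```python
def blend(alpha: float, v: float, l: float, vw: float, lw: float):
    # Convex combination: weight alpha on the inner baseline,
    # (1 - alpha) on the warmup (exponential) baseline.
    value = alpha * v + (1 - alpha) * vw
    loss = alpha * l + (1 - alpha) * lw
    return value, loss
```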
To Reproduce
Not a breaking bug. See the results posted in wouterkool/attention-learn-to-route#51
Checklist
- I have checked that there is no similar issue in the repo (required)
- I have provided a minimal working example to reproduce the bug (required)