Monte Carlo sampling of actions and states: why is the temp variable accumulated?

Question

ChengchengDu opened this issue a year ago

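In a Markov decision process, when Monte Carlo is used to estimate state values, why is temp accumulated while sampling actions and states? The code in question is the temp+=1 in the sample function in Section 3.5.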