MorvanZhou / Reinforcement-learning-with-tensorflow

Simple Reinforcement learning tutorials, 莫烦Python (Chinese-language AI tutorials)

https://mofanpy.com/tutorials/machine-learning/reinforcement-learning/

In Simple_PPO, should the value of the last state be 0?

YingxiaoKong opened this issue 4 years ago · comments

YingxiaoKong commented 4 years ago

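Hi Morvan, I read through your code and remembered that our teacher said in class that the Q-value of the last state should be 0. I also looked at a few other people's implementations, and some of them set the return of the last state to 0 as well. Is this really necessary?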

Dr-Tuski commented 4 years ago

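When the episode terminates, you have to assign Q some value, right? Otherwise how would the program proceed?
For a real problem, follow its actual physical meaning; the value does not have to be 0.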

Morvan commented 4 years ago

Yes, you need to decide what value to assign based on the actual situation being simulated.

YingxiaoKong commented 4 years ago

There is a reward at termination, but that reward belongs to the state before the last one (the state from which the last state is reached), not to the last state itself. state(T-1) = reward; state(T) is the terminal state and has no next state. Rewards are always given based on the next state, so the last state has no reward and therefore no cumulative reward, which is why its Q-value should be 0. Is this the right way to think about it?
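To make the argument above concrete, here is a minimal sketch of a backward discounted-return computation in which the terminal state contributes a value of 0. The function name `discounted_returns`, the parameter `bootstrap_value`, and the sample rewards are illustrative assumptions, not code taken from this repository.

```python
def discounted_returns(rewards, bootstrap_value, gamma=0.9):
    """Compute discounted return targets for one trajectory.

    rewards[t] is the reward received for the transition out of state t.
    bootstrap_value is the value assigned to the state reached after the
    last reward; for a terminal state this is 0.0 (assumption of this sketch).
    """
    returns = []
    running = bootstrap_value
    # Walk backwards so each target is r_t + gamma * (target of t+1).
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


# Hypothetical 3-step episode that ends in a terminal state.
rewards = [1.0, 1.0, 10.0]
targets = discounted_returns(rewards, bootstrap_value=0.0, gamma=0.5)
print(targets)  # [4.0, 6.0, 10.0]
```

With a terminal value of 0, the target for state(T-1) is just its own reward (10.0 here), which matches the reasoning that nothing accumulates after the last state.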

Morvan commented 4 years ago

You can understand it that way. However, in some cases the environment never ends, so there is no last state either.
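One common way to handle both cases is to seed the backward computation with 0 only when the episode truly terminated, and with a critic's value estimate when the rollout was merely cut off. The sketch below assumes a hypothetical `value_fn` standing in for a learned critic and a `done` flag from the environment; these names are illustrative and not confirmed to match the repository's code.

```python
def segment_targets(rewards, next_state, done, value_fn, gamma=0.9):
    """Discounted targets for a rollout segment.

    If done is True, the state after the last reward is terminal and is
    worth 0. Otherwise the segment was cut off in an ongoing environment,
    so the remaining return is approximated by value_fn(next_state).
    """
    running = 0.0 if done else value_fn(next_state)
    targets = []
    for r in reversed(rewards):
        running = r + gamma * running
        targets.append(running)
    targets.reverse()
    return targets


def toy_critic(state):
    # Hypothetical stand-in for a learned value network.
    return 8.0


cut_off = segment_targets([1.0, 1.0], next_state=None, done=False,
                          value_fn=toy_critic, gamma=0.5)
terminal = segment_targets([1.0, 1.0], next_state=None, done=True,
                           value_fn=toy_critic, gamma=0.5)
print(cut_off)   # [3.5, 5.0]
print(terminal)  # [1.5, 1.0]
```

The only difference between the two calls is the seed value, which is why the choice matters mainly for the last few steps of a segment when `gamma` is small, and for many more steps when `gamma` is close to 1.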

YingxiaoKong commented 4 years ago

Yes, thank you!