karpathy / karpathy.github.io

my blog

https://www.youtube.com/watch?v=c3b-JASoPi0

yuedajiong opened this issue · comments

Andrej Karpathy: Current AI systems are imitation learners, but for superhuman AIs we will need better reinforcement learning, as in AlphaGo. The model should self-play and be in the loop with itself and its own psychology to achieve superhuman levels of intelligence.

"We've got these next word prediction things. Do you think there's a path towards building a physicist or a Von Neumann type model that has a mental model of physics that's self-consistent and can generate new ideas for how do you actually do Fusion?

“我们已经有了这些下一个单词预测的东西。你认为有一条途径可以建立一个物理学家或冯·诺依曼型模型,该模型具有自洽的物理心智模型,并且可以为你如何实际进行融合产生新的想法?
How do you get faster than light if it's even possible? Is there any path towards that or is it a fundamentally different Vector in terms of these AI model developments?"

"I think it's fundamentally different in one aspect.
如果有可能的话,如何才能超越光速呢?是否有任何途径可以实现这一目标,或者就这些人工智能模型的开发而言,它是一个根本不同的向量吗?”

“我认为这在一方面有根本的不同。
I guess what you're talking about maybe is just capability question because the current models are just not good enough and I think there are big rocks to be turned here and I think people still haven't really seen what's possible in the space at all and roughly speaking I think we've done step one of AlphaGo.
我想你所说的可能只是能力问题,因为当前的模型还不够好,我认为这里还有很大的困难,我认为人们仍然没有真正看到这个领域的可能性粗略地说,我认为我们已经完成了 AlphaGo 的第一步。
We've done imitation learning part, there's step two of AlphaGo which is the RL and people haven't done that yet and I think it's going to fundamentally be the part that is actually going to make it work for something superhuman.
我们已经完成了模仿学习部分,AlphaGo 的第二步是强化学习,人们还没有做到这一点,我认为这将从根本上成为使其真正适用于超人的部分。
I think there's big rocks in capability to still be turned over here and the details of that are kind of tricky but I think this is it, we just haven't done step two of AlphaGo. Long story short we've just done imitation.
我认为这里的能力仍然有很大的障碍,而且细节有点棘手,但我认为就是这样,我们只是还没有完成 AlphaGo 的第二步。长话短说,我们刚刚进行了模仿。
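A purely schematic sketch of the "step one" versus "step two" contrast; this is not code from the talk, and every object and method name below is a placeholder that only marks where the training signal comes from:

```python
# Schematic only: none of these objects or methods belong to a real library;
# they just mark where the training signal comes from in each step.

def step_one_imitation(model, human_traces):
    """Step one (today's pipelines): supervised learning on human-written
    demonstrations -- the model is trained to reproduce the trace."""
    for prompt, human_solution in human_traces:
        model.maximize_likelihood(prompt, human_solution)

def step_two_rl(model, problems, check_outcome):
    """Step two (the AlphaGo-style part that is largely missing): the model
    generates its own attempts and is reinforced by whether they actually
    worked, not by how closely they resemble a human trace."""
    for prompt in problems:
        attempt = model.sample(prompt)            # the model's own rollout
        reward = check_outcome(prompt, attempt)   # verifiable outcome, not vibes
        model.update_from_reward(prompt, attempt, reward)
```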

I don't think people appreciate, for example, number one, how terrible the data collection is for things like ChatGPT. Say you have a problem, some prompt that is some kind of mathematical problem, and a human comes in and gives the ideal solution to that problem. The problem is that the human psychology is different from the model psychology. What's easy or hard for the human is different from what's easy or hard for the model. So the human writes out some kind of trace that arrives at the solution, but some parts of that are trivial to the model and some parts are a massive leap the model doesn't understand, so you're kind of just losing it, and then everything else is polluted by that later.

So fundamentally what you need is for the model to practice itself how to solve these problems. It needs to figure out what works for it and what doesn't work for it. Maybe it's not very good at four-digit addition, so it's going to fall back and use a calculator, but it needs to learn that for itself based on its own capability and its own knowledge. So that's number one: that's totally broken, I think, but it's a good initializer for something agent-like.
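To make that concrete, here is a toy numerical sketch (an illustration added here, not code from the talk): the "model" is just a softmax choice between doing four-digit addition in its head or calling a calculator tool. Imitation copies the human preference for mental arithmetic; a bare-bones REINFORCE loop on outcome reward lets the model discover that the tool is what works for it. All names, the 30% success rate, and the hyperparameters are made up for illustration.

```python
import math
import random

random.seed(0)

STRATEGIES = ["mental_arithmetic", "use_calculator"]
logits = {s: 0.0 for s in STRATEGIES}

def probs():
    z = sum(math.exp(v) for v in logits.values())
    return {s: math.exp(v) / z for s, v in logits.items()}

def attempt(strategy):
    """Outcome reward: 1 if the four-digit sum comes out right, else 0.
    Pretend this model is unreliable at mental addition (30% success,
    an invented number), while the calculator tool always works."""
    if strategy == "use_calculator":
        return 1.0
    return 1.0 if random.random() < 0.3 else 0.0

# "Step one": imitation. Human demonstrators do the addition in their head
# (easy for them), so the supervised data pushes the model toward that strategy.
logits["mental_arithmetic"] += 2.0

# "Step two": the model practices on its own and reinforces whatever works
# for *it* -- a bare-bones REINFORCE update on the strategy choice.
LR, BASELINE = 0.5, 0.5
for _ in range(2000):
    p = probs()
    s = random.choices(STRATEGIES, weights=[p[x] for x in STRATEGIES])[0]
    reward = attempt(s)
    for x in STRATEGIES:
        grad = (1.0 if x == s else 0.0) - p[x]    # d log pi(s) / d logit_x
        logits[x] += LR * (reward - BASELINE) * grad

print({s: round(q, 3) for s, q in probs().items()})
# Ends up strongly preferring the calculator -- something pure imitation would
# never teach it, because the humans never needed the tool.
```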

And then the other thing is, we're doing reinforcement learning from human feedback, but that's a super weak form of reinforcement learning; it doesn't even count as reinforcement learning. I think the equivalent of RLHF in AlphaGo would be what I call a vibe check.

Imagine if you wanted to train AlphaGo with RLHF. You would give two people two boards, ask which one they prefer, take those labels, train a reward model, and then RL against that. What are the issues with that? Number one, it's just the vibes of the board that you're training against. Number two, if the reward model is a neural net, then it's very easy for the model you're optimizing to overfit to that reward model, and it's going to find all these spurious ways of hacking that massive model. That's the problem.

AlphaGo gets around these problems because it has a very clear objective function you can RL against. RLHF is nowhere near RL; it's silly. And the other thing is, imitation learning is super silly. RLHF is a nice improvement, but it's still silly.
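A deliberately tiny illustration of that failure mode (again an added sketch, not from the talk): the "reward model" below is a hand-written proxy with spurious preferences for length and rigorous-sounding wording, standing in for a learned neural-net preference model. Even best-of-N selection against it already picks a confidently wrong answer, while a verifiable, AlphaGo-style objective would not.

```python
# Toy task: 1234 + 4321 (= 5555). The "reward model" is a hand-written proxy
# standing in for a learned neural-net preference model; the point is only the
# failure shape: optimize a proxy for "good" hard enough and you get outputs
# that score high on the proxy but poorly on the real objective.

CORRECT = "5555"

def true_objective(answer: str) -> float:
    """AlphaGo-style verifiable reward: did the answer actually come out right?"""
    return 1.0 if CORRECT in answer else 0.0

def proxy_reward_model(answer: str) -> float:
    """Vibe check: prefers long, rigorous-sounding, number-filled answers,
    which merely correlates with being right."""
    score = 0.1 * len(answer.split())                  # spurious: longer looks better
    score += 1.0 if "therefore" in answer else 0.0     # spurious: sounds rigorous
    score += 1.0 if any(ch.isdigit() for ch in answer) else 0.0
    return score

candidates = [
    "5555",
    "1234 + 4321 = 5555, therefore the answer is 5555.",
    "Let us reason very carefully step by step; therefore, after a detailed "
    "and rigorous analysis of 1234 and 4321, the answer is clearly 9999.",
]

# "Optimizing against the reward model" here is just best-of-N selection, the
# weakest optimizer imaginable -- and it already picks the wrong answer.
best = max(candidates, key=proxy_reward_model)
print("proxy picks :", best)
print("proxy reward:", round(proxy_reward_model(best), 2))
print("true reward :", true_objective(best))
```

A real RLHF reward model is far better than this caricature, but the same dynamic is what lets a strong optimizer find spurious high-reward outputs.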
I think people need to look for better ways of training these models, so that the model is in the loop with itself and its own psychology, and I think there will probably be unlocks in that direction."