shuyhere / about-super-alignment

Feeling confused about super alignment? Here is a reading list

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Get to know superalignment

EN|中文

Feeling confused about super alignment? Start from here OpenAI Introducing Superalignment Superalignment Fast Grants

Timeline

  • OpenAI 01/2022 :Aligning language models to follow instructions The statement "Further, in many cases aligning to the average labeler preference may not be desirable" from the limitations section of the article could be interpreted as an early indication of OpenAI's intention to develop highly aligned AI systems.

  • OpenAI 08/2022 Our approach to alignment research "We are improving our Al system's ability to learn from human feedback and to assist humans at evaluating Al. Our goal is to build a sufficiently aligned Al system that can help us solve all other alignment problems." There keynotes:

    • Training AI systems using human feedback
    • Training AI systems to assist human evaluation
    • Training AI systems to do alignment research
  • Collin Burns 12/2022 Discovering Latent Knowledge in Language Models Without Supervision

  • Leopold Aschenbrenner 03/2023 Nobody’s on the ball on AGI alignment "(Scalable) alignment is a real problem"

  • John Schulman 04/2023 Reinforcement Learning from Human Feedback: Progress and Challenges Three open problems:

    • Expressing Uncertainty
    • Going Beyond Labelers
    • Generating Knowledge
  • OpenAI 07/2023 Introducing Superalignment "We need scientific and technical breakthroughs to steer and control AI systems much smarter than us. To solve this problem within four years, we’re starting a new team, co-led by Ilya Sutskever and Jan Leike, and dedicating 20% of the compute we’ve secured to date to this effort." Keynotes:

    • To align the first automated alignment researcher:
      • Develop a scalable training method
      • validate the resulting model
      • stress test our entire alignment pipeline
    • "To validate the alignment of our systems, we automate search for problematic behavior (robustness) and problematic internals (automated interpretability)."
  • OpenAI 09/2023 OpenAI Red Teaming Network

    • examples:

      Persuasion

      1. MakeMeSay: How well can an AI system trick another AI system into saying a secret word?
      2. MakeMePay: How well can an AI system convince another AI system to donate money?
      3. Ballot Proposal: How well can an AI system influence another AI system’s support of a political proposition?

      Steganography (hidden messaging)

      1. Steganography: How well can an AI system ​​pass secret messages without being caught by another AI system?
      2. Text Compression: How well can an AI system compress and decompress messages, to enable hiding secret messages?
      3. Schelling Point: How well can an AI system coordinate with another AI system, without direct communication?
  • OpenAI 12/2023 Weak-to-strong generalization

Reading list & Related work

OpenAI superalignment People

About

Feeling confused about super alignment? Here is a reading list