rmoehn / farlamp

IDA with RL and overseer failures

Farlamp

Note 2021-07-31: I decided to leave AI alignment research last year (see https://www.lesswrong.com/posts/HDXLTFnSndhpLj2XZ/i-m-leaving-ai-alignment-you-better-stay). This repo shows the state of my research at the time I left.

BLUF: Read Overseer failures in SupAmp and ReAmp, then the interesting parts of Training a tiny SupAmp model on easy tasks.

Project definition (CoR):

I'm studying the impact of overseer failure on RL-based IDA,
    because I want to know under what conditions amplification and distillation
    increase or decrease the failure rate,
        in order to help my reader understand whether explicit reliability
        amplification is necessary for IDA to work in practice.

In this project I will:

  1. Take the implementation of iterated distillation and amplification from Christiano et al.'s ‘Supervising strong learners by amplifying weak experts’, introduce overseer failures and see how they influence the overall failure rate.
  2. Adapt the system to reinforcement learning. (It uses supervised learning now.)
  3. Introduce overseer failures in the RL setting and see how they influence the overall failure rate.
  4. Write a paper about the results.
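As a toy illustration of what "introduce overseer failures and see how they influence the overall failure rate" could look like, here is a minimal sketch. It is not the CSASupAmp code: the parity task, the boolean answers, and the names `make_faulty_overseer` and `failure_rate` are all hypothetical, chosen only to show the idea of injecting failures with a fixed probability and then measuring how often the result disagrees with a reliable reference.

```python
import random


def make_faulty_overseer(overseer, failure_prob, rng):
    """Wrap `overseer` so it returns a wrong answer with probability `failure_prob`."""
    def faulty(question):
        answer = overseer(question)
        if rng.random() < failure_prob:
            return not answer  # injected failure: flip the boolean answer
        return answer
    return faulty


def failure_rate(overseer, reference, questions):
    """Fraction of questions on which `overseer` disagrees with `reference`."""
    wrong = sum(overseer(q) != reference(q) for q in questions)
    return wrong / len(questions)


def is_even(n):
    # Hypothetical stand-in task, answered by a perfectly reliable overseer.
    return n % 2 == 0


rng = random.Random(0)  # fixed seed so the run is reproducible
questions = list(range(10_000))
faulty = make_faulty_overseer(is_even, failure_prob=0.1, rng=rng)
rate = failure_rate(faulty, is_even, questions)
print(f"observed failure rate: {rate:.3f}")
```

In the real project the interesting question is how such a per-call failure probability propagates through amplification and distillation; this sketch only covers the injection and measurement step for a single overseer.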

Overseer failures in SupAmp and ReAmp contains a more extensive introduction, as well as an explanation of the relevant terms, concepts etc.

For the code see rmoehn/amplification, which is a fork of paulfchristiano/amplification.

Repository contents

There are more files, but they are only useful to me. The code won't be published here, because it will be based on the code from CSASupAmp, which is subject to a strict publication policy.

Glossary

| Term | Definition |
| --- | --- |
| CoR | Booth et al.: The Craft of Research |
| CSASupAmp | Christiano et al.: Supervising strong learners by amplifying weak experts |
| Est. 5 % | 5th percentile of my estimated duration distribution/leftmost point in triangle distribution |
| Est. mode | mode of my estimated duration distribution |
| Est. 95 % | 95th percentile of my estimated duration distribution/rightmost point in triangle distribution |
| Farlamp | Failures in RL-based amplification (I just had to come up with a short project name.) |
| Draft Basis | A template derived from CoR, p. 175, which, when filled in completely, provides all the information necessary for planning a draft. Includes the structure of the argument. |
| LW | LessWrong |
| MxD | MIRIxDiscord |
| RL | reinforcement learning |
| ReAmp | SupAmp adapted to RL |
| SL | supervised learning |
| SupAmp | The system from CSASupAmp for iterated distillation and amplification using supervised learning |
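
The three "Est." entries describe a triangle distribution over task durations. A minimal sketch of how such an estimate can be turned into an expected duration follows, assuming the three estimates are used directly as the triangle's leftmost point, mode, and rightmost point. The numbers and the helper `triangular_mean` are hypothetical and do not come from the project's planning files.

```python
import random


def triangular_mean(low, mode, high):
    """Mean of a triangular distribution with the given corner points."""
    return (low + mode + high) / 3


# Hypothetical estimate for one task, in hours.
low, mode, high = 2.0, 4.0, 9.0

rng = random.Random(0)  # fixed seed so the run is reproducible
# Note the argument order of the standard-library call: (low, high, mode).
samples = [rng.triangular(low, high, mode) for _ in range(10_000)]
sample_mean = sum(samples) / len(samples)

print(f"analytic mean: {triangular_mean(low, mode, high):.2f} h")
print(f"sampled mean:  {sample_mean:.2f} h")
```

Because the distribution is skewed to the right in this example, the expected duration is larger than the mode, which is the usual reason for recording all three points rather than a single estimate.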

For detailed bibliographical information see references.bib.

Thanks

Thanks to Paul Christiano for funding this project and giving me advice. Thanks also to William Saunders for providing his version of the CSASupAmp code.

Licence

CC0
To the extent possible under law, Richard Möhn has waived all copyright and related or neighboring rights to Farlamp documentation. This work is published from: Japan.

Languages

TeX 96.0%, Python 3.2%, Shell 0.9%