Zhang-Yihao / Adversarial-Representation-Engineering

Official implementation repository for the paper Towards General Conceptual Model Editing via Adversarial Representation Engineering.

Home Page: https://arxiv.org/abs/2404.13752

Adversarial Representation Engineering

This is the official implementation repository for the paper Towards General Conceptual Model Editing via Adversarial Representation Engineering (https://arxiv.org/abs/2404.13752). See details below.

Authors: Yihao Zhang, Zeming Wei, Jun Sun, Meng Sun.

Introduction

This minimal demo is still in the testing phase. It provides the implementation for two sections of the paper: Section 5.1, Alignment: To Generate (Harmful Responses) or Not to Generate, and Section 5.2, Hallucination: To Hallucinate or Not to Hallucinate.

Setup

Parameters are currently hardcoded in main.py. To change them, edit main.py directly. We plan to add argparse support soon.
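
Until command-line support is available, one way the hardcoded values could be exposed is sketched below. This is only an illustration and not the repository's code: apart from model_path, which this README mentions, every option name and default here (--epochs, --lr, the placeholder path) is an assumption and may not correspond to the parameters main.py actually defines.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a CLI parser for values that are currently edited by hand.

    Only `model_path` is named in the README; `epochs` and `lr` are
    hypothetical examples of parameters that could be exposed.
    """
    parser = argparse.ArgumentParser(
        description="Hypothetical CLI wrapper for the demo's parameters."
    )
    parser.add_argument(
        "--model-path",
        default="path/to/model",  # placeholder, not a real default
        help="Local path or hub identifier of the model to edit.",
    )
    parser.add_argument(
        "--epochs", type=int, default=1,
        help="Number of training epochs (hypothetical parameter).",
    )
    parser.add_argument(
        "--lr", type=float, default=1e-4,
        help="Learning rate (hypothetical parameter).",
    )
    return parser


if __name__ == "__main__":
    # argparse converts "--model-path" into the attribute `model_path`.
    args = build_parser().parse_args()
    print(vars(args))
```

With a parser like this, the values would be passed on the command line instead of being edited in the source file, for example `python main.py --model-path my/model --epochs 3`.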

Execution

Currently, you can run the program by executing:

python main.py

You can change the model by modifying model_path in main.py. Note that the default parameters may not be suitable for larger models and may need to be adjusted for the model you use. A demo for reducing hallucination is provided in hallucination.ipynb.

Dependencies

Install the required libraries:

transformers
torch>=2.0
numpy
datasets
peft
pandas
tqdm
scikit-learn (imported as sklearn)
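
Before running main.py, it can help to confirm that the libraries listed above are importable. The helper below is a minimal sketch and not part of the repository; the function name missing_packages and the REQUIRED mapping are made up for illustration. It maps each pip package name to its import name, since the two differ for scikit-learn, and it does not verify the torch>=2.0 version constraint.

```python
import importlib.util

# pip package name -> import name. They are identical except for
# scikit-learn, which is imported as `sklearn`.
REQUIRED = {
    "transformers": "transformers",
    "torch": "torch",
    "numpy": "numpy",
    "datasets": "datasets",
    "peft": "peft",
    "pandas": "pandas",
    "tqdm": "tqdm",
    "scikit-learn": "sklearn",
}


def missing_packages(required: dict[str, str]) -> list[str]:
    """Return the pip names whose import name cannot be located."""
    return [
        pip_name
        for pip_name, module_name in required.items()
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(module_name) is None
    ]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages; try: pip install " + " ".join(missing))
    else:
        print("All listed dependencies are importable.")
```

The check only locates modules without importing them, so it stays fast even for heavy packages such as torch.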

Additional Information

More code and details will be available upon publication of our paper. The code for processing the TruthfulQA dataset is partly borrowed from This Repo.

Citation

@article{zhang2024towards,
  title={Towards General Conceptual Model Editing via Adversarial Representation Engineering},
  author={Zhang, Yihao and Wei, Zeming and Sun, Jun and Sun, Meng},
  journal={arXiv preprint arXiv:2404.13752},
  year={2024}
}

License: MIT License


Languages

Python 67.8%, Jupyter Notebook 32.2%