
AutoData

An Automated Framework to Construct Datasets for Assessing Knowledge Editing or Multi-Hop Reasoning Capability of Language Models.

Table of Contents
  1. About The Project
  2. Getting Started
  3. Overview
  4. Contributing

About The Project

The knowledge stored in language models (LMs) quickly becomes obsolete, and retraining these models from the ground up is often not feasible. Recently, various methods (e.g., SERAC, IKE, MEND, KE, ROME, MEMIT, FT-L) have been developed to inject new knowledge into LMs.

Current methods mostly perform well at editing single atomic facts, but they fail catastrophically when tested on the ripple effects of the edited knowledge. For example, if we edit a model to state that the current President of the USA is Trump, then the answer to "Who is married to Trump?" should also change accordingly. While many datasets for evaluating knowledge editing of LMs exist, they predominantly focus on facts from Wikidata, primarily relating to people and events.

In other words, the data in these datasets is homogeneous and lacks diversity. Moreover, this kind of dataset construction pipeline typically relies on manual annotation and crowdsourcing, which incurs significant time and financial costs. Therefore, I implemented AutoData, a framework that automatically constructs datasets containing various types of data according to specific needs.

Getting Started

Prerequisites

You should have at least one API key for a large language model provider, preferably OpenAI.
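
OpenAI clients and LangChain read the key from the OPENAI_API_KEY environment variable. A minimal, purely illustrative check in Python:

# Verify that an OpenAI API key is available before running the framework.
import os

if "OPENAI_API_KEY" not in os.environ:
    raise RuntimeError("Set the OPENAI_API_KEY environment variable first.")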

Installation

git clone https://github.com/Leo-Lsc/AutoData.git
cd AutoData
conda create -n AutoData python=3.11.8
conda activate AutoData
pip install -r requirements.txt

Overview

AutoData is a framework built on the LangChain library and OpenAI's API that automatically constructs customized datasets. It consists of five modules: SubjectGenerator, QA_Generator, TripleExtractor, Interrupter, and TwoHopQuestionGenerator.
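
As a rough illustration of how such a pipeline can be wired together, the sketch below chains subject generation, single-hop QA generation, and two-hop question composition using LangChain's ChatOpenAI. The function names and prompts are assumptions made for illustration only and are not AutoData's actual module interfaces.

# Hypothetical sketch of an AutoData-style pipeline (not the repository's real API).
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.7)  # assumes OPENAI_API_KEY is set

def generate_subjects(topic: str, n: int = 5) -> list[str]:
    # SubjectGenerator-style step: propose diverse subjects for a topic.
    prompt = ChatPromptTemplate.from_template(
        "List {n} distinct subjects related to {topic}, one per line."
    )
    reply = llm.invoke(prompt.format_messages(n=n, topic=topic))
    return [line.strip() for line in reply.content.splitlines() if line.strip()]

def generate_qa(subject: str) -> str:
    # QA_Generator-style step: write one single-hop factual question and answer.
    return llm.invoke(f"Write one factual question and its answer about {subject}.").content

def compose_two_hop(qa_first: str, qa_second: str) -> str:
    # TwoHopQuestionGenerator-style step: merge two single-hop QA pairs.
    return llm.invoke(
        "Combine these two question-answer pairs into one two-hop question "
        f"whose answer requires both facts:\n1) {qa_first}\n2) {qa_second}"
    ).content

if __name__ == "__main__":
    subjects = generate_subjects("chemistry", n=2)
    qa_pairs = [generate_qa(s) for s in subjects]
    print(compose_two_hop(qa_pairs[0], qa_pairs[1]))

In the actual framework, additional modules such as TripleExtractor and Interrupter sit alongside these steps; see the source code for the real interfaces.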

Contributing

Any contributions you make are greatly appreciated. If you have a suggestion that would make this project better, please fork the repo and create a pull request. Don't forget to give the project a star. Thanks!

Contributors

Leo-Lsc

Citation

Please use the following citation if you intend to use AutoData:

@misc{AutoDataFramework,
  title        = {AutoData},
  author       = {Sicheng Lai},
  year         = {2024},
  publisher    = {GitHub},
  journal      = {GitHub repository},
  howpublished = {\url{https://github.com/Leo-Lsc/AutoData}},
}

License

Distributed under the MIT License.

