
Synthetic LLM data

Code for the paper "AI 'News' Content Farms are easy to make and hard to detect: a case study in Italian"

Model Fine-Tuning

All of our fine-tuning is done through llm-foundry on the change-it dataset.

The folder foundry_yamls contains the YAML configs used to fine-tune the models; see the llm-foundry documentation for how to set up fine-tuning. In particular, the change_it subfolder contains all the files used to fine-tune on Italian.
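
For reference, an llm-foundry fine-tuning run is typically launched with the composer launcher and one of these YAML configs. A rough sketch (the train.py path and YAML filename below are illustrative placeholders; use your llm-foundry checkout and the actual files in foundry_yamls/change_it):

# illustrative only; see the llm-foundry docs for environment setup
composer llm-foundry/scripts/train/train.py foundry_yamls/change_it/<model>_change_it.yaml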

Synthetic Text Detection

The experiments folder contains the sbatch files needed to run all the experiments in the paper, once fine-tuning has been completed.

  • To generate the synthetic datasets, use the scripts in ita_data_generate.
  • To run the detection experiments with proxy models, use the scripts in proxy_models_comparisons (see the submission example right after this list).
  • The supervised detection experiments can be found in supervised_detection.
  • For the replication experiments on XSum, the fine-tuning YAMLs are available in xsum and the experiments with fine-tuned models are available in xsum_experiments, with a structure similar to that of the Italian ones.
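
Each of these folders contains sbatch files that are submitted directly with sbatch, for example (the filename below is a placeholder; pick an actual file from proxy_models_comparisons):

sbatch experiments/proxy_models_comparisons/<experiment>.sbatch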

As an example, after fine-tuning LLaMA on the change-it dataset, one can generate the synthetic texts by running

sbatch experiments/ita_data_generate/ita_dat_generate_hf_llama/llama-7b_change_it.sbatch

possibly adjusting the experiment file experiments/ita_data_generate/ita_dat_generate_hf_llama/llama-7b_change_it.sbatch to point to the fine-tuned model and the dataset by setting the values of

--name-or-path
--modifier-model
--model-name
--data-path

where --name-or-path is the path to the model that generates the synthetic texts, --modifier-model is the path to the model used to create text variations as in DetectGPT (we always use it5 for Italian), --model-name is the path to the model that computes the likelihood, and --data-path is the path to the dataset.
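
For instance, the relevant arguments inside the sbatch file would look roughly as follows (all paths below are illustrative placeholders, not the actual values used in the paper):

--name-or-path /path/to/fine-tuned-llama-7b-change_it \
--modifier-model /path/to/it5 \
--model-name /path/to/likelihood-model \
--data-path /path/to/change_it/test.jsonl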

Note
The experiments are meant to be run on a Slurm cluster, although with minor changes (e.g. removing the srun commands) they should also work on a local machine.
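
For example, after removing (or commenting out) the srun prefixes inside the script, the same file can typically be executed directly as a shell script:

bash experiments/ita_data_generate/ita_dat_generate_hf_llama/llama-7b_change_it.sbatch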
