CanaryGAN

A PyTorch reimplementation of Pagliarini et al. (2021), "What does the Canary Say? Low-Dimensional GAN Applied to Birdsong".

Installation

First, clone the repository on your computer.

We recommend using a virtual environment with this tool, for instance virtualenv or pyenv. If using conda, pay attention when performing the next steps, as some package requirements may differ.

The code should run with Python >= 3.9 and <= 3.11.

Using pip (recommended)

After cloning the repository and creating a virtual environment, open a terminal and place yourself at the repository root. Activate your virtual environment (this step may differ from one virtual environment manager to another).

Now, run:

pip install -e .

This will install canarygan along with its dependencies and add canarygan to your PATH. You will then be able to use the canarygan command line interface.

Manually from requirements

In some cases, you might want to install requirements manually. This is required if you need a specific version of Pytorch to run on your machine. Package requirements may be found in the requirements.txt and the pyproject.toml files.

You can install requirements by running the following command within the repository and a virtual environment:

pip install -r requirements.txt

Modify this file, or use pip or conda if you wish to install packages differently.

You may still try to run pip install -e . after this step to add canarygan to your PATH. If it does not work, replace all following invocations of the canarygan command line interface with python -m canarygan.

Note on Pytorch

We deliberately keep the torch package requirement loose, but we cannot guarantee that this tool will work on every machine and operating system.

This tool was developed using PyTorch 2.0.3 and runs on various Linux systems equipped with different hardware. It worked using Nvidia GPUs (Quadro 4000TX, P100, A100) with CUDA 11.8.

Command line interface (CLI)

canarygan provides a CLI to perform major operations, such as training the GAN and generating sounds.

You can display a short description of the interface by running:

canarygan --help

You should get the following output:

❯ canarygan --help               
Usage: canarygan [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  build-decoder-dataset  Preprocess dataset for decoder training.
  generate               Generate canary syllables using a trained GAN...
  inception              Compute inception score.
  sample                 Randomly sample GAN latent space and save...
  train-decoders         ESN, kNN and SVM decoders training.
  train-gan              Train a CanaryGAN instance.
  train-inception        Distributed canaryGAN inception scorer training...
  umap                   Make many plots displaying UMAP projections of...

Note: if you installed canarygan manually, you may have to type python -m canarygan --help instead.

Dataset requirements

Two datasets are required: one to train the GAN and the other to train the decoders.

Both datasets must consist of WAV files containing 1 second of audio, sampled at 16000 Hz or higher. If the sampling rate is higher, it will automatically be reduced to 16000 Hz. Each 1-second clip must contain a single birdsong syllable rendition. Original results were obtained on a dataset of 16 different syllable types, sampled from a single canary individual.

Audio files must be organized in folders named after the canary syllable labels. We recommend that each folder contain the same number of audio samples. In the original paper, 1000 samples per type of syllable were used to train the GAN and the decoders.

The dataset structure hence resembles this:

data_dir/
    |
    |- label_1/
    |   |- ...
    |- label_2/
    |   |- ...
    ...
    |- label_n/
        |- audio_1.wav
        |- audio_2.wav
        |- audio_3.wav
        ...
        |- audio_m.wav
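
As a quick sanity check, you can verify this layout and count how many WAV files each label folder contains. Below is a minimal Python sketch (data_dir is a placeholder for your own dataset root):

from pathlib import Path

data_dir = Path("data_dir")  # placeholder: your dataset root

# List each label folder and the number of WAV samples it contains
for label_dir in sorted(p for p in data_dir.iterdir() if p.is_dir()):
    n_files = len(list(label_dir.glob("*.wav")))
    print(f"{label_dir.name}: {n_files} samples")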

GAN dataset: The GAN training dataset must contain real samples only.

Decoder dataset: The decoder training dataset must include the GAN training dataset, plus GAN-generated sounds. In the original work by Pagliarini et al., 5 classes of audio were added: samples generated by the GAN at training epochs 15, 30, 45 and at the last epoch, and white noise samples. These samples were labeled "EARLY15", "EARLY30", "EARLY45", "OT" (Over Training), and "WN" (White Noise). They may also sometimes be referred to collectively as the "X" class.
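
If you need to build the "WN" (white noise) class yourself, the sketch below shows one possible way to do it. It is not part of canarygan's CLI; the output folder name and the soundfile dependency are assumptions for this example.

from pathlib import Path

import numpy as np
import soundfile as sf  # assumed available; install it separately if needed

sr = 16000  # sampling rate used throughout this README
out_dir = Path("decoder_data_dir/WN")  # hypothetical decoder dataset folder
out_dir.mkdir(parents=True, exist_ok=True)

rng = np.random.default_rng(seed=0)
for i in range(1000):  # e.g. 1000 samples, as for the other classes
    noise = rng.uniform(-1.0, 1.0, size=sr).astype(np.float32)  # 1 s of white noise
    sf.write(str(out_dir / f"noise_{i:04d}.wav"), noise, sr)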

1. Train the GAN

The GAN training loop is implemented with Lightning, which enables distributed training strategies using multiple GPUs on multiple compute nodes. However, the training loop should also run locally on a reasonably powerful modern computer.

From your terminal, run canarygan train-gan --help to display all options of GAN training.

Training locally

To train the GAN on a single machine equipped with a single GPU or CPU, you may simply launch:

canarygan train-gan -d data_dir/ -s save_dir/

The -d option is used to specify the dataset root directory, which must be structured as previously explained. The -s option specifies the save directory, where all model checkpoints and training logs will be saved during training. If this directory does not exist, it will be created at runtime.

Distributed training

When training in a distributed setup, several options can be used to allocate compute resources to canarygan.

canarygan train-gan -N 1 -G 2 -c 12 -d data_dir/ -s save_dir/

The -N option defines the number of compute nodes allocated to this GAN training process. This is only useful when training on a cluster. If using a single machine, such as your personal computer, keep this value at 1.

The -G option defines the number of GPU devices that may be used for training, per node. Here, if we consider training on a machine equipped with 2 GPUs, we set -G to 2.

The -c option sets the number of CPU processes attached to the training loop. These processes mainly handle data loading to and from the GPUs. Here, we launch 12 processes per node.

Distributed training may dramatically speed up training. Using 4 Nvidia P100 GPUs on 2 compute nodes, 1000 epochs of training on a 16000-sample dataset takes approximately 30 hours.

Logging

By default, logs are written every 100 training steps. Logs may be displayed using Tensorboard:

tensorboard --logdir save_dir/logs/tensorboard

Tensorboard is part of canarygan's requirements and is installed automatically alongside it. Logs are also saved as CSV files in save_dir/logs/csv.

You may change the logging frequency using the --log-every-n-steps option.
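
If you prefer working without Tensorboard, the CSV logs can be inspected directly. The sketch below assumes pandas is available and simply lists the logged metrics; the exact file layout under save_dir/logs/csv depends on the instance version, so adapt the path to your own run.

from pathlib import Path

import pandas as pd  # assumed available; install it separately if needed

# Find every CSV log file produced during training and list the logged metrics
for csv_file in sorted(Path("save_dir/logs/csv").rglob("*.csv")):
    df = pd.read_csv(csv_file)
    print(csv_file, df.columns.tolist())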

Checkpointing

By default, model checkpoints are saved to disk every 15 epochs. You may change the checkpointing frequency using the --save-every-n-epochs option.

Two kinds of checkpoints are produced: save_dir/checkpoints/all holds all training checkpoints, saved every N epochs, while save_dir/checkpoints/last holds a copy of the most recent checkpoint.

Resuming training

The last checkpoint saved may be used to resume training after an interruption, using the --resume flag:

canarygan train-gan -d data_dir/ -s save_dir/ --resume

Versioning

When training several instances and saving them under the same save_dir directory, each instance will be automatically identified by an integer ID, or an ID provided by the user using the --version option.

By default, --version=infer: instances are identified by an integer ID that is automatically incremented each time a new training process is launched, unless --resume is used, in which case training resumes on the last trained instance.

2. Generate syllables

Once a trained GAN instance is available, syllables can be generated by providing latent vectors or randomly sampling the GAN latent space.

Providing latent vectors

We recommend saving the sampled GAN latent space vectors to disk to improve reproducibility. These vectors must be stored in an $n \times d$ matrix saved as a Numpy archive (.npy), where $n$ is the number of samples and $d$ is the dimension of the GAN latent space (3 by default).

To generate these samples, you may use:

canarygan sample -s save_dir/ -n 10000 -d 3

This will create a .npy file in save_dir/ containing 10000 3-dimensional vectors. By default, the vector values are uniformly distributed between -1 and 1.

You may change the distribution parameters using the --dist and --dist-params options. Run canarygan sample --help to access documentation.
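
Alternatively, you can build the latent vector matrix yourself and skip canarygan sample. Below is a minimal equivalent sketch (the file name z_vectors.npy is only an example):

import numpy as np

n, d = 10000, 3  # number of samples, latent space dimension
rng = np.random.default_rng(seed=42)  # fix the seed for reproducibility
z = rng.uniform(-1.0, 1.0, size=(n, d))  # uniform in [-1, 1], as in the CLI default
np.save("save_dir/z_vectors.npy", z)  # usable later with `canarygan generate -z`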

Generate samples

To generate canary syllable samples, run:

canarygan generate -x path/to/gan.ckpt -n 10000 -s save_dir/ 

The -x option is required and must point to a GAN checkpoint file obtained through training. The -s option is also required and provides an output directory for the generated audio. Generated sounds are stored as compressed Numpy archives (.npz) in this directory. Each archive contains the audio signal (in the subfile x.npy) and metadata such as the corresponding latent vector (z.npy). An archive can be loaded with d = numpy.load(archive_path), and its subfiles accessed with d["x"] or d["z"].

The -n option is necessary if you do not wish to provide pre-computed latent vectors to the script. In that case, it specifies the number of latent vectors to randomly sample, and thus the number of generated audio files.

If you used canarygan sample and wish to generate sounds from precomputed latent vectors, use:

canarygan generate -x path/to/gan.ckpt -z path/to/vectors.npy -s save_dir/ 

The -z flag must point towards the Numpy archive storing the latent vectors on disk.
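
Once generation is done, each archive can be inspected as described above. For instance (the file name is a placeholder; use the actual .npz files found in save_dir/):

import numpy as np

d = np.load("save_dir/generation_000.npz")  # placeholder file name
x = d["x"]  # generated audio signal
z = d["z"]  # latent vector that produced it
print(x.shape, z.shape)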

3. Train decoders

A decoder is used to infer the syllable type of a given audio sample. It is mainly used to classify the GAN's productions and assess its ability to produce realistic bird sounds.

Three lightweight decoder classes are provided in canarygan: Echo State Networks (ESN), k-Nearest Neighbors (kNN) and Support Vector Machines (SVM). These classifiers perform well at sorting single syllables from the GAN training dataset while remaining simple, fast, and easy to train. They operate on preprocessed representations of audio signals. Several preprocessing methods are available, all based on extracting spectral features from the sound. We recommend the deltas method, which computes the first and second derivatives of the audio's MFCCs, a compressed time-frequency representation of the sound.
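
For reference, the sketch below gives a rough illustration of what a "deltas" representation looks like, using librosa. It is only an illustration: the parameters and implementation used by canarygan's own preprocessing may differ.

import librosa
import numpy as np

y, sr = librosa.load("audio_1.wav", sr=16000)  # one 1-second syllable clip
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # compressed time-frequency features
delta = librosa.feature.delta(mfcc)  # first derivative
delta2 = librosa.feature.delta(mfcc, order=2)  # second derivative
features = np.concatenate([mfcc, delta, delta2], axis=0)
print(features.shape)  # (3 * n_mfcc, n_frames)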

Preparing the training dataset

The decoder training dataset usually contains samples generated during the GAN's early training steps. These samples act as a "garbage class", into which we expect all poorly realistic sounds to be sorted. Determining these classes may depend on your GAN's performance. The notebooks provided in this repository can be used to graphically assess GAN quality.

We recommend using samples from epochs 15, 30, and 45 as garbage examples, as we can safely assume the GAN has not reached convergence within its first 50 epochs of training. In addition, we also added white noise samples to discard sounds with too much noise or entropy.

As preprocessing might take time, we also recommend performing data transformations once before training, using:

canarygan build-decoder-dataset -d data_dir/ -s save_dir/  

This command will take the dataset in data_dir/ and output its preprocessed representation in save_dir.

Running canarygan build-decoder-dataset --help will display all available preprocessing options. Default options are set to the ones giving the best results in our setup.

The preprocessed dataset will be saved as a Numpy archive storing training and test data (the train/test split happens at this step and can be adjusted with the --split option), alongside a YAML file recording all preprocessing parameters.

Training the decoders

Once the dataset has been preprocessed, you may run the decoders training loop using:

canarygan train-decoders -s save_dir/ -p preprocessed_dir/

where save_dir/ will hold the trained model checkpoints, saved as pickled files, alongside training and testing metrics, and preprocessed_dir/ holds the preprocessed dataset (Numpy archive and YAML file).

By default, all available decoders will be trained. If you wish to train only a subset of decoders, you can do so by using the -m flag:

canarygan train-decoders -s save_dir/ -p preprocessed_dir/ -m esn -m knn

This will only train an ESN and a KNN decoder.

If you have not preprocessed data beforehand, you may also point the -d flag towards the raw audio dataset, and use all other options to change the preprocessing parameters. This, however, is not recommended.

4. TODO: Decode

After producing trained decoders, GAN-generated sounds can be labeled.

5. Generate and decode

You can also generate and decode sounds at the same time by giving decoder checkpoints as input to canarygan generate using the -y option:

canarygan generate -x path/to/gan.ckpt -z path/to/vectors.npy -y path/to/decoder1 -y path/to/decoder2 -s save_dir/

This will add y fields to the generated sound Numpy archives. As preprocessing happens at the same time as decoding, you may provide the decoders' preprocessing parameters via the YAML file produced by canarygan build-decoder-dataset, using the -p option:

canarygan generate -x path/to/gan.ckpt -z path/to/vectors.npy \
    -y path/to/decoder1 \
    -y path/to/decoder2 \
    -s save_dir/ \
    -p path/to/preprocessing.yml

6. UMAP projection and analysis

For another perspective on GAN generation quality, UMAP projection can be applied to generated sounds to obtain an unsupervised view of sound plausibility.

canarygan umap -d data_dir/ -g generated_audio_dir/ --epoch x --version y -s save_dir/

This will produce various plots displaying UMAP projections of real and generated canary syllables, colored by inferred class. Syllable classes are also computed using HDBSCAN clustering on the UMAP projections, and saved into the generated sound Numpy archives.

Here, the -d option may be used to point toward the GAN training dataset (containing ground truth samples of syllables), and the -g option must point towards the directory holding all generated sounds (the save_dir of the canarygan generate script). This directory may hold many different generations from different GAN instances at different training epochs. You can choose which version and epoch to plot using the --epoch and --version parameters, where x and y are integers for training epoch and version ID. All plots will then be saved in the save_dir/ directory.
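
For reference, the kind of unsupervised analysis performed by this command can be sketched as follows with umap-learn and hdbscan. The feature matrix X is a random placeholder here; the actual features, parameters, and plots produced by canarygan umap may differ.

import hdbscan
import matplotlib.pyplot as plt
import numpy as np
import umap

# Placeholder feature matrix: replace with real sound features (e.g. flattened spectrograms)
X = np.random.default_rng(0).normal(size=(500, 128))

embedding = umap.UMAP(n_components=2, random_state=42).fit_transform(X)  # 2D projection
labels = hdbscan.HDBSCAN(min_cluster_size=25).fit_predict(embedding)  # unsupervised classes

plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, s=2, cmap="tab20")
plt.title("UMAP projection with HDBSCAN clusters")
plt.show()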

License: MIT

