ondrzel / ml-gw-search


ml-gw-search

This repository contains the code used for the MLGWSC-1 mock data challenge [1] in the submission titled TPI FSU Jena, as well as follow-up projects, authored by Ondřej Zelenka, Bernd Brügmann, and Frank Ohme.

MLGWSC-1 submission

The submission is in the directory mlgwsc-1 and contains:

  • split_noise_file.py: Randomly splits an HDF5 file containing noise (originally meant to be used with real_noise_file.hdf supplied by the MDC) into two, preserving the individual datasets. Not strictly required but useful for generation of training and validation data with real noise.
  • slice_real_noise.py: Whitens and slices an HDF5 file containing noise (meant for the first output file of the previous script, but it can also be used with real_noise_file.hdf). Necessary for generation of training and validation data with real noise.
  • gen.py: Generates training and validation data. Can use Gaussian noise (generated at runtime) or real noise provided through the output of the previous script. Can use spinless IMRPhenomD or generic-spin IMRPhenomXPHM to generate injections.
  • train.py: Trains the CNN using data generated by the previous script.
  • apply.py: Applies the CNN trained by the previous script to test data and saves an HDF5 list of events.
  • best_state_dict.pt: State dictionary with the trained weights submitted to the mock data challenge.
  • whiten.py: Whitens an HDF5 file of test data. Useful when the same data is analyzed multiple times (e.g. for CNN architecture optimization) without changing the whitening parameters; apply.py accepts whitened data if the --white argument is supplied.

Minimal usage to reproduce the network training:

python split_noise_file.py <MDC_REPO_PATH>/real_noise_file.hdf rnoise1.hdf rnoise2.hdf
python slice_real_noise.py rnoise1.hdf -o <TRAINING_DATA_PATH> -d 2 --chunk-size 24000
tr_paths=""
for i in {0000..0049}
do
    new_path=<TRAINING_DATA_PATH>/training_data_"$i".hdf 
    tr_paths=$tr_paths" "$new_path
    python gen.py -o $new_path -a IMRPhenomXPHM -d 2 --training-samples 10000 10000 --validation-samples 2000 2000 \
                  --real-noise-file <TRAINING_DATA_PATH>/sliced_noise_"$i".hdf
done
python train.py -d $tr_paths -o <OUTPUT> -s 7. 20. --train-device <TRAIN_DEVICE> --store-device <STORE_DEVICE> \
                --epochs 250 --learning-rate 4.e-6

The training and validation data are split into 50 files because smaller files are more practical to handle and some file systems impose file-size limits. The overall runtime can also be shortened by running several of the gen.py processes in parallel. One should substitute paths to (ideally empty) directories for <TRAINING_DATA_PATH> and <OUTPUT>, and either cuda or cpu for <TRAIN_DEVICE> and <STORE_DEVICE>; if a CUDA-compatible GPU is installed and available to PyTorch, cuda can be used for the training device. For the data storage device, cpu is the safer option, but if the GPU has enough VRAM, cuda can also be used for a small performance boost.
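The parallel execution of the gen.py processes mentioned above can be sketched in Python. The chunk count and command-line arguments mirror the usage example; the TRAINING_DATA_PATH value, the pool size, and the dispatch via subprocess are illustrative assumptions, not part of the repository.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Placeholder; substitute the real training-data directory.
TRAINING_DATA_PATH = "training_data"

def gen_command(i: int) -> list:
    """Build the gen.py call for chunk i, mirroring the shell loop above."""
    idx = f"{i:04d}"
    return [
        "python", "gen.py",
        "-o", f"{TRAINING_DATA_PATH}/training_data_{idx}.hdf",
        "-a", "IMRPhenomXPHM", "-d", "2",
        "--training-samples", "10000", "10000",
        "--validation-samples", "2000", "2000",
        "--real-noise-file", f"{TRAINING_DATA_PATH}/sliced_noise_{idx}.hdf",
    ]

def run_all(n_chunks: int = 50, n_parallel: int = 4) -> None:
    # Each gen.py process is independent, so a thread pool that merely
    # dispatches subprocesses is enough to overlap the work.
    with ThreadPoolExecutor(max_workers=n_parallel) as pool:
        list(pool.map(lambda cmd: subprocess.run(cmd, check=True),
                      (gen_command(i) for i in range(n_chunks))))
```

The pool size should be chosen to match the available CPU cores and memory; each gen.py process carries its own waveform-generation overhead.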

After training, the <OUTPUT> directory contains the network states (stored as 'state dictionaries') after each training epoch, as well as best_state_dict.pt for the epoch with the lowest validation loss, and losses.txt with the training and validation loss values throughout the training. Experiments during development suggest that the best performance on test dataset 4 is achieved by choosing one of the local minima of the validation loss that occur earlier in the training than the global minimum. The chosen network is applied to test data to produce events by running:

python apply.py <TEST_INPUT_PATH> <EVENT_OUTPUT_PATH> -w <STATE_DICTIONARY> --device <DEVICE>

For <DEVICE>, one should again use cuda if available. For computational efficiency, it is also beneficial to set --num-workers to the number of physical CPU cores.
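The epoch selection and the apply.py invocation above can be sketched as follows. The three-column layout of losses.txt (epoch, training loss, validation loss) is an assumption about the file format, and os.cpu_count() reports logical rather than physical cores (physical cores would need e.g. psutil.cpu_count(logical=False)).

```python
import os

def best_epochs(lines):
    """Return epochs sorted by validation loss, lowest first.

    Assumes each losses.txt line holds whitespace-separated
    'epoch  training_loss  validation_loss' values; the actual
    file layout may differ.
    """
    rows = []
    for line in lines:
        parts = line.split()
        if len(parts) >= 3:
            rows.append((float(parts[2]), int(float(parts[0]))))
    return [epoch for _, epoch in sorted(rows)]

def apply_command(test_input, event_output, state_dict, device="cuda"):
    """Build the apply.py call with --num-workers set from the CPU count."""
    return ["python", "apply.py", test_input, event_output,
            "-w", state_dict, "--device", device,
            "--num-workers", str(os.cpu_count() or 1)]
```

Inspecting the full sorted list (rather than only the first entry) makes it easy to pick one of the earlier local minima mentioned above instead of the global minimum.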

Evaluation time decomposition

The directory mlgwsc-1/timed contains a modified version of the submission to study evaluation times of different parts of the network, and example histograms.

  • apply_timed.py: Performs the same task as the apply.py of the submission. In addition, if a filename is supplied through the --times-output argument, the script measures the evaluation times of the convolutional part, the flattening layer, and the fully connected part of the network for each batch and saves them as a text file: the first three columns hold these times, respectively, and the fourth column holds the corresponding batch sizes.
  • day_ds<dataset number>, month_ds<dataset number>: Example time outputs of the modified submission applied to the 4 test datasets of lengths one day and one month, respectively. They have been evaluated on the machine used to develop the submission, using a GeForce RTX 3090 GPU. The files are named times_for.txt and times_bac.txt for the foreground and background evaluation, respectively.
  • hist_day.pdf, hist_month.pdf: Example histograms of the evaluation times above.
  • plot_hist.py: Script to plot histograms of the times in the apply_timed.py output.
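Given the four-column layout of the --times-output file described above, the mean per-sample time of each network part can be computed with a short helper. The parsing details here are a sketch based on that description, not code taken from the repository.

```python
def mean_times_per_sample(text):
    """Average per-sample evaluation time of each network part.

    Expects the four-column apply_timed.py output described above:
    convolutional, flattening, and fully connected times per batch,
    followed by the batch size, on each row.
    """
    sums = [0.0, 0.0, 0.0]
    n_samples = 0
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 4:
            continue  # skip malformed or empty lines
        conv, flat, fc, batch = (float(p) for p in parts)
        for k, t in enumerate((conv, flat, fc)):
            sums[k] += t
        n_samples += int(batch)
    # Normalize by the total sample count, since batch sizes may vary.
    return [s / n_samples for s in sums]
```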

Correction

The directory correction contains the code and results of the corrected search based on the MLGWSC-1 submission, presented in [2]. This includes:

  • apply.py: The updated search algorithm; the only difference from mlgwsc-1/apply.py is the removal of the batch normalization layer.
  • train.py: The updated training script; the only difference from mlgwsc-1/train.py is that it loads only half the waveforms in each supplied training data file. The noise samples meant for injection of the unloaded waveforms are used as pure noise.
  • state_dicts: Directory containing trained network state dictionaries, one selected from each of 6 training runs as the most sensitive on dataset 4 at a false-alarm rate of 1 per month. Files are named R<run number>_<four-digit epoch number>.pt.
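When iterating over the state dictionaries in batch, the naming convention R<run number>_<four-digit epoch number>.pt can be parsed with a small helper; the pattern follows the text above, while the function itself is illustrative.

```python
import re

# Pattern for the naming convention described above.
NAME_RE = re.compile(r"R(\d+)_(\d{4})\.pt$")

def parse_state_dict_name(filename):
    """Return (run, epoch) extracted from a filename, or None if it
    does not follow the R<run>_<epoch>.pt convention."""
    m = NAME_RE.search(filename)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2))
```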

Application to O3b data

The directory correction/O3b contains the results of the application of the networks contained in state_dicts to O3b data.

  • downsample.py: Script used to download O3b data in segments where data of sufficient quality is available in both LIGO detectors. To reproduce the data analyzed in [2], run:
python downsample.py --output <output filename> --minimum-duration 60
  • events: Events returned by the 6 applied searches at first-level trigger threshold 0. Files are named R<run number>_<four-digit epoch number>.hdf.
  • specgrams: Q-transform spectrograms of the loudest 128 events returned by the 6 searches. Contains 6 directories with the same naming convention as the event files. Spectrograms are sorted in descending loudness, i.e. specgram_plot_0000.pdf is the loudest event returned by the given search.

Extended mass range experiment

The directory extended_mass contains additional data regarding the experiment of Appendix A in [2]. This includes:

  • gen.py: Modified training data generation script, differing from that of the MLGWSC-1 submission merely by using the mass range $\left[7M_\odot,~ 50M_\odot\right]$ instead of $\left[10M_\odot,~ 50M_\odot\right]$.
  • state_dicts: Directory containing trained network state dictionaries. The networks were trained and the states selected in the same way as in the corrected experiment, except for the regenerated training dataset; files are named E<run number>_<four-digit epoch number>.pt.
  • O3b/events: Events from the O3b observing run returned by the 6 searches at first-level trigger threshold 0. Files are named E<run number>_<four-digit epoch number>.pt.

References

[1] M. Schäfer, O. Zelenka, P. Müller, and A. Nitz, gwastro/ml-mock-data-challenge-1: MLGWSC-1 Release v1.2 (2021).

[2] O. Zelenka, Applications of Machine Learning to Gravitational Waves, PhD thesis, Friedrich-Schiller-Universität Jena (2023).

About

License: Apache License 2.0

