This repository contains the code used for the submission titled `TPI FSU Jena` to the MLGWSC-1 mock data challenge [1], as well as follow-up projects, authored by Ondřej Zelenka, Bernd Brügmann, and Frank Ohme.
The submission is in the directory `mlgwsc-1` and contains:

- `split_noise_file.py`: Randomly splits an HDF5 file containing noise (originally meant to be used with `real_noise_file.hdf` supplied by the MDC) into two, preserving the individual datasets. Not strictly required, but useful for generating training and validation data with real noise.
- `slice_real_noise.py`: Whitens and slices an HDF5 file containing noise (meant for the first output file of the previous script; can also be used with `real_noise_file.hdf`). Necessary for generating training and validation data with real noise.
- `gen.py`: Generates training and validation data. Can use Gaussian noise (generated at runtime) or real noise provided through the output of the previous script. Can use the spinless `IMRPhenomD` or generic-spin `IMRPhenomXPHM` waveform model to generate injections.
- `train.py`: Trains the CNN using data generated by the previous script.
- `apply.py`: Applies the CNN trained by the previous script to test data and saves an HDF5 list of events.
- `best_state_dict.pt`: State dictionary with the trained weights submitted to the mock data challenge.
- `whiten.py`: Whitens an HDF5 file of test data. Useful in cases where the same data is analyzed multiple times (e.g. during CNN architecture optimization) without changing the whitening parameters; `apply.py` accepts whitened data when the `--white` argument is supplied.
Minimal usage to reproduce the network training:
```shell
python split_noise_file.py <MDC_REPO_PATH>/real_noise_file.hdf rnoise1.hdf rnoise2.hdf
python slice_real_noise.py rnoise1.hdf -o <TRAINING_DATA_PATH> -d 2 --chunk-size 24000

tr_paths=""
for i in {0000..0049}
do
    new_path=<TRAINING_DATA_PATH>/training_data_"$i".hdf
    tr_paths=$tr_paths" "$new_path
    python gen.py -o $new_path -a IMRPhenomXPHM -d 2 --training-samples 10000 10000 --validation-samples 2000 2000 \
        --real-noise-file <TRAINING_DATA_PATH>/sliced_noise_"$i".hdf
done

python train.py -d $tr_paths -o <OUTPUT> -s 7. 20. --train-device <TRAIN_DEVICE> --store-device <STORE_DEVICE> \
    --epochs 250 --learning-rate 4.e-6
```
The training and validation data are split into 50 files because smaller files are more practical to handle and some file systems limit file sizes. An overall shorter runtime can also be achieved by running several of the `gen.py` processes in parallel. One should substitute paths to directories (ideally empty) for `<TRAINING_DATA_PATH>` and `<OUTPUT>`, and either `cuda` or `cpu` for `<TRAIN_DEVICE>` and `<STORE_DEVICE>`; if a CUDA-compatible GPU is installed and available to PyTorch, `cuda` can be used for the training device. For the data storage device, `cpu` is the safer option, but if the GPU has enough VRAM, `cuda` can also be used for a small performance boost.
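The parallel data generation mentioned above can be sketched as follows. This is an illustrative Python driver (not part of the repository) that runs the same 50 `gen.py` invocations as the serial shell loop, a few at a time; the `run_all` and `gen_command` helpers and the `TRAINING_DATA_PATH` value are hypothetical and should be adapted to your setup.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Hypothetical: substitute the real <TRAINING_DATA_PATH> here.
TRAINING_DATA_PATH = "training_data"

def gen_command(i):
    """Command line for one gen.py run, mirroring the serial loop above."""
    return ["python", "gen.py",
            "-o", os.path.join(TRAINING_DATA_PATH, f"training_data_{i:04d}.hdf"),
            "-a", "IMRPhenomXPHM", "-d", "2",
            "--training-samples", "10000", "10000",
            "--validation-samples", "2000", "2000",
            "--real-noise-file",
            os.path.join(TRAINING_DATA_PATH, f"sliced_noise_{i:04d}.hdf")]

def run_all(commands, max_parallel=4):
    """Run the given command lines, at most max_parallel at a time.

    Returns the list of exit codes in input order."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(lambda cmd: subprocess.run(cmd).returncode, commands))

if __name__ == "__main__" and os.path.isdir(TRAINING_DATA_PATH):
    run_all([gen_command(i) for i in range(50)], max_parallel=4)
```

The degree of parallelism should be chosen according to the available memory, since each `gen.py` process holds its generated samples in memory before writing them out.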
After training, the `<OUTPUT>` directory contains the network states (stored as 'state dictionaries') after each training epoch, as well as `best_state_dict.pt` for the epoch with the lowest validation loss, and `losses.txt` with the training and validation loss values throughout the training. Experiments during development suggest that the best performance on test dataset 4 is achieved by choosing one of the local minima of the validation loss that occur earlier in the training than the global minimum. The chosen network is applied to test data to produce events by running:
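Locating those earlier local minima of the validation loss can be sketched as follows. This is a minimal illustration, assuming `losses.txt` has one row per epoch with the training and validation loss in two whitespace-separated columns; the actual layout written by `train.py` may differ, and the helper names are hypothetical.

```python
import numpy as np

def local_minima_epochs(val_losses):
    """Return 1-based epoch numbers where the validation loss is a local
    minimum, i.e. strictly smaller than both neighbouring epochs."""
    v = np.asarray(val_losses, dtype=float)
    idx = [i for i in range(1, len(v) - 1) if v[i] < v[i - 1] and v[i] < v[i + 1]]
    return [i + 1 for i in idx]

def candidate_epochs(losses_file):
    """Assumed format: one row per epoch, columns [train_loss, val_loss]."""
    data = np.loadtxt(losses_file, ndmin=2)
    return local_minima_epochs(data[:, -1])
```

The returned epoch numbers can then be matched against the per-epoch state dictionaries saved in `<OUTPUT>`.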
```shell
python apply.py <TEST_INPUT_PATH> <EVENT_OUTPUT_PATH> -w <STATE_DICTIONARY> --device <DEVICE>
```
For `<DEVICE>`, one should again use `cuda` if available. For computational efficiency, it is also beneficial to specify `--num-workers` with the number of physical cores.
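As a convenience, the command above can be assembled programmatically. The helper below is hypothetical (not part of the repository) and only illustrates picking a `--num-workers` default: Python's `os.cpu_count()` reports logical cores, so on machines with SMT/hyper-threading it is roughly twice the physical core count.

```python
import os

def apply_command(test_input, event_output, state_dict, device="cuda"):
    """Hypothetical helper assembling the apply.py call shown above.

    os.cpu_count() counts *logical* cores; halving it is a rough default
    for the number of physical cores requested by --num-workers."""
    workers = max(1, (os.cpu_count() or 1) // 2)
    return ["python", "apply.py", test_input, event_output,
            "-w", state_dict, "--device", device,
            "--num-workers", str(workers)]
```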
The directory `mlgwsc-1/timed` contains a modified version of the submission used to study the evaluation times of different parts of the network, together with example histograms.
- `apply_timed.py`: Performs the same task as the `apply.py` of the submission. In addition, if a filename is supplied through the `--times-output` argument, the script computes and saves, as a text file, the evaluation times of the convolutional part, the flattening layer, and the fully connected part of the network over the individual batches. These are saved in the first three columns of the output file, respectively; the fourth column contains the sizes of the respective batches.
- `day_ds<dataset number>`, `month_ds<dataset number>`: Example time outputs of the modified submission applied to the 4 test datasets of lengths one day and one month, respectively. They were evaluated on the machine used to develop the submission, using a GeForce RTX 3090 GPU. The files are named `times_for.txt` and `times_bac.txt` for the foreground and background evaluation, respectively.
- `hist_day.pdf`, `hist_month.pdf`: Example histograms of the evaluation times above.
- `plot_hist.py`: Script to plot histograms of the times in the `apply_timed.py` output.
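For custom analyses beyond `plot_hist.py`, the times files can be read directly. The sketch below assumes the four-column whitespace-separated layout described above; the helper names are illustrative, not part of the repository.

```python
import numpy as np

def load_times(path):
    """Parse an apply_timed.py --times-output file.

    Assumed layout, per the description above: one row per batch with
    columns [conv_time, flatten_time, fc_time, batch_size].
    Returns (times, batch_sizes)."""
    data = np.loadtxt(path, ndmin=2)
    return data[:, :3], data[:, 3].astype(int)

def mean_time_per_sample(times, batch_sizes):
    """Mean evaluation time per input sample for the convolutional,
    flattening and fully connected parts, respectively."""
    return times.sum(axis=0) / batch_sizes.sum()
```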
The directory `correction` contains the code and results of the corrected search based on the MLGWSC-1 submission presented in [2]. This includes:
- `apply.py`: The updated search algorithm; the only difference from `mlgwsc-1/apply.py` is the removal of the batch normalization layer.
- `train.py`: The updated training script; the only difference from `mlgwsc-1/train.py` is that it loads only half the waveforms in each supplied training data file. The noise samples meant for injection of the unloaded waveforms are used as pure noise.
- `state_dicts`: Directory containing trained network state dictionaries, selected from 6 training runs as the most sensitive of each run on dataset 4 at 1 false alarm per month. Files are named `R<run number>_<four-digit epoch number>.pt`.
The directory `correction/O3b` contains the results of applying the networks contained in `state_dicts` to O3b data.

- `downsample.py`: Script used to download O3b data in segments where data of sufficient quality is available in both LIGO detectors. To reproduce the data analyzed in [2], run as

  ```shell
  python downsample.py --output <output filename> --minimum-duration 60
  ```

- `events`: Events returned by the 6 applied searches at a first-level trigger threshold of 0. Files are named `R<run number>_<four-digit epoch number>.hdf`.
- `specgrams`: Q-transform spectrograms of the loudest 128 events returned by the 6 searches. Contains 6 directories following the same naming convention as the event files. Spectrograms are sorted in descending loudness, i.e. `specgram_plot_0000.pdf` is the loudest event returned by the given search.
The directory `extended_mass` contains additional data regarding the experiment of App. A in [2]. This includes:

- `gen.py`: Modified training data generation script, differing from that of the MLGWSC-1 submission merely by using the mass range $\left[7M_\odot,~50M_\odot\right]$ instead of $\left[10M_\odot,~50M_\odot\right]$.
- `state_dicts`: Directory containing trained network state dictionaries. The experiments were trained and states were selected the same way as in the corrected experiment, except for the regenerated dataset, and are named `E<run number>_<four-digit epoch number>.pt`.
- `O3b/events`: Events from the O3b observing run returned by the 6 searches at first-level trigger threshold 0. Files are named `E<run number>_<four-digit epoch number>.pt`.
[1] M. Schäfer, O. Zelenka, P. Müller, and A. Nitz, gwastro/ml-mock-data-challenge-1: MLGWSC-1 Release v1.2 (2021).

[2] O. Zelenka, "Applications of Machine Learning to Gravitational Waves", PhD thesis, Friedrich-Schiller-Universität Jena, 2023.