passaglia / AstroCLIP

Multimodal contrastive pretraining for astronomical data

AstroCLIP

The goal of this project is to demonstrate that contrastive pre-training between two astronomical data modalities (multi-band imaging and optical spectra) yields a meaningful embedding space that is shared between both modalities and captures physical information about galaxies.
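As a rough illustration of what contrastive pre-training between two modalities optimizes, here is a minimal NumPy sketch of a CLIP-style symmetric InfoNCE loss. It is an assumption about the general technique, not the project's training code: the function names, the `temperature` value, and the synthetic embeddings are all hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along one axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_info_nce(image_emb, spectrum_emb, temperature=0.1):
    """CLIP-style loss; row i of each array is assumed to be the same galaxy."""
    img = l2_normalize(np.asarray(image_emb, dtype=float))
    spec = l2_normalize(np.asarray(spectrum_emb, dtype=float))
    logits = img @ spec.T / temperature  # (n, n) scaled cosine similarities
    diag = np.arange(logits.shape[0])
    # Image -> spectrum: each row should concentrate on its diagonal entry.
    loss_i2s = -log_softmax(logits, axis=1)[diag, diag].mean()
    # Spectrum -> image: the same, normalizing over columns instead.
    loss_s2i = -log_softmax(logits, axis=0)[diag, diag].mean()
    return 0.5 * (loss_i2s + loss_s2i)

# Synthetic stand-ins for encoder outputs (hypothetical data).
rng = np.random.default_rng(0)
shared = rng.normal(size=(8, 16))
aligned_loss = symmetric_info_nce(shared, shared + 0.01 * rng.normal(size=(8, 16)))
random_loss = symmetric_info_nce(shared, rng.normal(size=(8, 16)))
print(f"aligned: {aligned_loss:.3f}, random: {random_loss:.3f}")
```

Matched pairs sit on the diagonal of the similarity matrix, so the loss falls as each image embedding moves closer to its own spectrum embedding than to any other in the batch.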

Results

We encourage you to take a look at our NeurIPS 2023 AI4Science submission (still under review) for a longer-form description of our results, but here are the main takeaways:

  • Both image and spectra encoders are able to extract meaningful physical information from the input data.
  • The embeddings of both images and spectra are well aligned, allowing us to retrieve spectra that correspond to a given image, and vice-versa.
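Cross-modal retrieval of this kind can be sketched as a nearest-neighbour search by cosine similarity in the shared embedding space. The snippet below is a minimal sketch on synthetic arrays; `retrieve_top_k` and the embedding shapes are hypothetical and do not come from this repository.

```python
import numpy as np

def retrieve_top_k(query_emb, candidate_emb, k=3):
    """Return, for each query row, the indices of the k most similar candidates."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    c = candidate_emb / np.linalg.norm(candidate_emb, axis=1, keepdims=True)
    similarity = q @ c.T  # (n_queries, n_candidates) cosine similarities
    # argsort is ascending, so negate to rank by descending similarity.
    return np.argsort(-similarity, axis=1)[:, :k]

# Synthetic stand-ins: image embeddings are noisy copies of spectrum embeddings.
rng = np.random.default_rng(0)
spectrum_emb = rng.normal(size=(100, 32))
image_emb = spectrum_emb + 0.05 * rng.normal(size=(100, 32))

# For each image, find the three closest spectra.
top = retrieve_top_k(image_emb, spectrum_emb, k=3)
print(top[:5])
```

Swapping the two arguments gives the reverse direction, retrieving images for a given spectrum.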

The notebook used to generate the plots in the paper can be found here.

Below is a visualization of the learned embeddings, obtained by taking the first two PCA components of the spectra and image embeddings. As one can see, images and spectra discover similar main factors of variation.

[Figure: emb_pca]
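A projection onto the first two PCA components can be computed with a singular value decomposition, as in the sketch below. The synthetic embeddings and the `pca_2d` helper are hypothetical, and whether the figure fits PCA on each modality separately or jointly is not stated here.

```python
import numpy as np

def pca_2d(embeddings):
    """Project rows onto their first two principal components."""
    centered = embeddings - embeddings.mean(axis=0)
    # Rows of vt are the principal directions, ordered by singular value.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:2].T
    explained = singular_values[:2] ** 2 / np.sum(singular_values ** 2)
    return projected, explained

# Synthetic embeddings with two dominant directions plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2)) * np.array([5.0, 2.0])
mixing = rng.normal(size=(2, 16))
embeddings = latent @ mixing + 0.1 * rng.normal(size=(200, 16))

projected, explained = pca_2d(embeddings)
print(projected.shape, explained.round(3))
```

The `explained` array gives the fraction of total variance carried by each of the two components, which is a quick check on how much structure a 2D plot can show.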

Visualizing the structure of the latent space by UMAP dimensionality reduction further highlights some of its information content. Below is an example of a UMAP of the spectra embeddings:

[Figure: UMAP of the spectra embeddings]

Products: Datasets and Trained Models

Dataset

As part of this project, we compile and make available a combined dataset of DESI Legacy Survey g, r, z images and DESI Early Data Release spectra. These images are a subset of the ssl-legacysurvey sample compiled by @georgestein from the Legacy Survey DR9. Scripts used to match these datasets are available here.

For convenience, we provide a Hugging Face Datasets loading script which will automatically download the data needed and prepare the dataset on your computer.

from datasets import load_dataset

# This downloads about 60 GB of data
dset = load_dataset('astroclip/datasets/legacy_survey.py')

To get started with this dataset, for example to predict redshift from the spectra, take a look at this notebook.
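To give a feel for what a redshift-prediction baseline might look like, here is a toy k-nearest-neighbour regression on synthetic features. It does not load the dataset, and it is not the method used in the notebook: the feature array, the redshift relation, and `knn_regress` are all hypothetical, and k-NN is chosen only for brevity.

```python
import numpy as np

def knn_regress(train_x, train_y, test_x, k=5):
    """Predict each test target as the mean target of its k nearest training rows."""
    # Squared Euclidean distances, shape (n_test, n_train).
    d2 = ((test_x[:, None, :] - train_x[None, :, :]) ** 2).sum(axis=-1)
    # argpartition places the k smallest distances in the first k columns.
    nearest = np.argpartition(d2, k, axis=1)[:, :k]
    return train_y[nearest].mean(axis=1)

# Synthetic stand-ins for per-spectrum features and redshifts (hypothetical).
rng = np.random.default_rng(0)
features = rng.normal(size=(500, 3))
redshift = 0.3 + 0.05 * features.sum(axis=1)

# Hold out the last 100 rows for evaluation.
pred = knn_regress(features[:400], redshift[:400], features[400:], k=5)
rmse = np.sqrt(np.mean((pred - redshift[400:]) ** 2))
baseline = redshift[400:].std()  # error of always predicting the mean
print(f"k-NN RMSE: {rmse:.4f}, constant-prediction baseline: {baseline:.4f}")
```

With the real data, the features would come from the loaded spectra or from encoder embeddings; the field names needed for that are not listed here, so they are left out.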

Training scripts and model weights

[Coming soon]

Requirements

This repo should only have basic PyTorch and Hugging Face requirements. The following should install all that is needed (when run from the root of this repository):

pip install .

About

License: MIT License