Zero-Shot Context-Based Object Segmentation using SLIP (SAM+CLIP)
The goal of this project is to enhance the capabilities of SAM (Segment Anything Model [1]) by incorporating text prompts using CLIP (Contrastive Language-Image Pretraining [2]). This integration, called SLIP (SAM with CLIP), enables object segmentation from a natural-language prompt without any prior training on specific classes or categories.
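As a sketch of how the two models compose: SAM proposes class-agnostic masks, and CLIP ranks each masked region against the text prompt. The snippet below is a minimal illustration of that pipeline under those assumptions, not the repository's exact code; the checkpoint path, model variants, and the `segment_by_text` helper are illustrative.

```python
import numpy as np
import torch
import clip
from PIL import Image
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

device = "cuda" if torch.cuda.is_available() else "cpu"

# SAM proposes class-agnostic masks for everything in the image.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")  # placeholder checkpoint path
mask_generator = SamAutomaticMaskGenerator(sam.to(device))

# CLIP scores each masked region against the text prompt.
clip_model, preprocess = clip.load("ViT-B/32", device=device)

def segment_by_text(image: np.ndarray, prompt: str) -> np.ndarray:
    """Return the SAM mask whose bounding-box crop best matches the prompt."""
    masks = mask_generator.generate(image)  # dicts with "segmentation" (HxW bool) and "bbox" (XYWH)
    text = clip.tokenize([prompt]).to(device)
    with torch.no_grad():
        text_feat = clip_model.encode_text(text)
        text_feat /= text_feat.norm(dim=-1, keepdim=True)
    best_mask, best_score = None, float("-inf")
    for m in masks:
        x, y, w, h = (int(v) for v in m["bbox"])
        crop = Image.fromarray(image[y:y + h, x:x + w])
        crop_in = preprocess(crop).unsqueeze(0).to(device)
        with torch.no_grad():
            img_feat = clip_model.encode_image(crop_in)
            img_feat /= img_feat.norm(dim=-1, keepdim=True)
        score = (img_feat @ text_feat.T).item()  # cosine similarity
        if score > best_score:
            best_mask, best_score = m["segmentation"], score
    return best_mask

# Usage (hypothetical image and prompt):
# img = np.array(Image.open("test.png").convert("RGB"))
# mask = segment_by_text(img, "a photo of Pikachu")
```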
- SLIP demo/
  - zero_shot_finetuned.ipynb - Zero-shot segmentation demo after finetuning CLIP.
  - zero_shot_pretrained.ipynb - Zero-shot segmentation demo using the pretrained CLIP.
- assests/
  - Images for plots, model architecture, and test images.
- baseline classifier/
  - classifier output/
    - ResNet18_pokemon_output - Text file with the output of training ResNet18 on the Pokemon dataset.
    - VGG_pokemon_output - Text file with the output of training VGG on the Pokemon dataset.
  - models/
    - ResNet18.py - ResNet18 model.
    - VGG.py - VGG model.
  - run_resnet.sbatch - Script to train ResNet18.
  - run_vgg.sbatch - Script to train VGG.
- evaluation/
  - ResNet_eval.ipynb - ResNet evaluation on the Pokemon dataset.
  - SLIP_segment_eval.ipynb - Evaluation of SLIP after finetuning CLIP, on the Pokemon dataset.
  - make_evalutaion_dataset.py - Creates the evaluation dataset.
  - pokedex.csv - Maps each image index to its image class.
  - pretrained_eval_segment.ipynb - Evaluation of SLIP using the pretrained CLIP, on the Pokemon dataset.
- finetuned CLIP/
  - captions.csv - Captions for CLIP finetuning.
  - clip_grid_search.py - Runs a grid search for CLIP hyperparameter tuning (sketched in the code example after this listing).
  - clip_grid_search_output - Output of the grid search.
  - convert_txt_to_csv.py - Converts the captions text file to a CSV file.
  - generate_captions.py - Generates captions for the Pokemon dataset.
  - run.sbatch - Script for running the grid search.
- plots/
  - plot_resnet.ipynb - Plots for ResNet.
  - plot_CLIP.ipynb - Plots for CLIP.
  - text_for_plot.txt - Output of the best CLIP model during the grid search.
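For reference, the finetuning in finetuned CLIP/ pairs each Pokemon image with a generated caption. The snippet below is a minimal sketch of such a grid search, assuming the standard symmetric contrastive loss; the hyperparameter values and the `make_loader` helper are illustrative assumptions, not the repo's actual code.

```python
import itertools
import torch
import torch.nn.functional as F
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"

def finetune(lr: float, loader, epochs: int = 3) -> float:
    # jit=False gives a trainable nn.Module; .float() because the CUDA
    # checkpoint loads in fp16, which is unstable for training.
    model, _ = clip.load("ViT-B/32", device=device, jit=False)
    model.float()
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    for _ in range(epochs):
        for images, texts in loader:  # preprocessed image tensors, tokenized captions
            logits_per_image, logits_per_text = model(images.to(device), texts.to(device))
            labels = torch.arange(len(images), device=device)
            # Symmetric contrastive (InfoNCE) loss: the i-th image should
            # match the i-th caption, and vice versa.
            loss = (F.cross_entropy(logits_per_image, labels)
                    + F.cross_entropy(logits_per_text, labels)) / 2
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return loss.item()  # stand-in metric; the repo likely tracks validation accuracy

# Hypothetical grid (make_loader is an assumed helper that batches
# image/caption pairs from captions.csv):
# for lr, bs in itertools.product([1e-5, 5e-6], [32, 64]):
#     print(lr, bs, finetune(lr, make_loader(bs)))
```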
- Run the cells of the notebooks in SLIP demo/.
| Model Architecture | Accuracy |
|---|---|
| SLIP - pretrained only | 0.15 |
| SLIP - finetuned | 0.69 |
- The project report can be found at docs/Report.pdf.
[1] Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A. C.; Lo, W.-Y.; Dollár, P.; and Girshick, R. 2023. Segment Anything. arXiv:2304.02643.
[2] Radford, A.; Kim, J. W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; Krueger, G.; and Sutskever, I. 2021. Learning Transferable Visual Models From Natural Language Supervision. arXiv:2103.00020.
[3] Contrastive Language-Image Pre-training
- Arushi Arora: aa10350@nyu.edu
- Saaketh Koundinya: sg7729@nyu.edu
- Shreya Agarwal: sa6981@nyu.edu