[Code] CITE: Connecting Image and Text Embeddings

Updated on 2023.12.26

Key Features

This repository provides the official implementation of Text-guided Foundation Model Adaptation for Pathological Image Classification.

Foundation model adaptation to medical imaging analysis
Data-efficient and low-cost visual prompt tuning
Injection of medical in-domain knowledge via text
Compatibility with various foundation models

Details

The recent surge of foundation models in computer vision and natural language processing opens up opportunities to use multi-modal clinical data to train large models with strong generalizability. Yet pathological image datasets often lack biomedical text annotation and enrichment. Using biomedical text knowledge to guide data-efficient image diagnosis has therefore become a topic of substantial interest. In this paper, we propose to Connect Image and Text Embeddings (CITE) to enhance pathological image classification. CITE injects text insights gained from language models pre-trained on a broad range of biomedical texts, adapting foundation models toward pathological image understanding. Through extensive experiments on the PatchGastric stomach tumor pathological image dataset, we demonstrate that CITE achieves leading performance compared with various baselines, especially when training data is scarce. CITE offers insights into leveraging in-domain text knowledge to reinforce data-efficient pathological image classification.

An overview of CITE:

Dataset

The PatchGastric dataset includes histopathological image patches extracted from H&E stained whole slide images (WSI) of stomach adenocarcinoma endoscopic biopsy specimens. The dataset contains 9 subtypes of gastric adenocarcinoma WSIs. We choose 3 major subtypes including “well differentiated tubular adenocarcinoma”, “moderately differentiated tubular adenocarcinoma”, and “poorly differentiated adenocarcinoma” to form a 3-class grading-like classification task with 179,285 patches of size 300x300 from 693 WSIs.

To prepare the PatchGastric dataset:

Download captions.csv and patches_captions.zip from PatchGastricADC22.
Put them in data/ and unzip the file.
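
The two steps above can also be scripted. The following is a minimal sketch, assuming both files have already been downloaded into data/; the helper name unpack_patches is hypothetical and not part of this repository, and the internal layout of the archive is not assumed.

```python
import zipfile
from pathlib import Path


def unpack_patches(data_dir="data"):
    """Extract patches_captions.zip inside data_dir and return the entry count.

    Hypothetical helper: it only checks that both downloaded files exist,
    then unzips the archive in place.
    """
    data_dir = Path(data_dir)
    captions = data_dir / "captions.csv"
    archive = data_dir / "patches_captions.zip"
    for required in (captions, archive):
        if not required.is_file():
            raise FileNotFoundError(f"Expected {required}; download it first.")
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(data_dir)
        return len(zf.namelist())

# Usage, from the repository root:
#   unpack_patches("data")
```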

Get Started

Main Requirements

torch==1.13.0
mmcls==0.25.0
transformers
clip

Installation

conda create -n CITE python=3.9
conda activate CITE
conda install pytorch==1.13.0 torchvision==0.14.0 torchaudio==0.13.0 pytorch-cuda=11.7 -c pytorch -c nvidia
pip install openmim
mim install mmcls==0.25.0
pip install -r requirements.txt
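
After installation, it can help to confirm that the pinned versions from Main Requirements are the ones actually active in the environment. The sketch below is one possible check, not part of this repository; it assumes the packages expose their metadata under the distribution names torch and mmcls, and the function name find_mismatches is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the Main Requirements list.
EXPECTED = {"torch": "1.13.0", "mmcls": "0.25.0"}


def find_mismatches(expected=EXPECTED, lookup=version):
    """Return {package: (wanted, found)} for every pin that is not satisfied."""
    mismatches = {}
    for name, wanted in expected.items():
        try:
            found = lookup(name)
        except PackageNotFoundError:
            found = None  # package is not installed in this environment
        # A local build tag such as "1.13.0+cu117" still counts as a match.
        if found is None or found.split("+")[0] != wanted:
            mismatches[name] = (wanted, found)
    return mismatches

# Usage: an empty dict means both pins are satisfied.
#   print(find_mismatches())
```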

Preprocess

To follow our split of the dataset, please generate the annotation files by running:

python tools/ann.py

Or you can generate your own split following the mmcls annotation format, with one image filename and its class label per line:

filename label
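
As an illustration of that format, the sketch below writes a list of (filename, label) pairs to an annotation file. It is not the logic of tools/ann.py: the helper name write_annotation, the patch filenames, the output path, and the label indices (0, 1, 2 for the three subtypes) are all assumptions made for the example.

```python
from pathlib import Path


def write_annotation(samples, out_path):
    """Write one 'filename label' line per (filename, label) pair."""
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    # Labels are written as integer class indices.
    lines = [f"{filename} {int(label)}" for filename, label in samples]
    out_path.write_text("\n".join(lines) + "\n")
    return len(lines)

# Usage with hypothetical patch names and a hypothetical output path:
#   samples = [("patch_0001.jpg", 0), ("patch_0002.jpg", 2)]
#   write_annotation(samples, "data/my_split/train.txt")
```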

Training

The config files follow mmcls style.

PYTHONPATH=.:$PYTHONPATH mim train mmcls <config>

Testing

PYTHONPATH=.:$PYTHONPATH mim test mmcls <config> --checkpoint <checkpoint> --metrics <metrics>

🙋‍♀️ Feedback and Contact

📝 Citation

@inproceedings{zhang2023text,
  title={Text-guided Foundation Model Adaptation for Pathological Image Classification},
  author={Zhang, Yunkun and Gao, Jin and Zhou, Mu and Wang, Xiaosong and Qiao, Yu and Zhang, Shaoting and Wang, Dequan},
  booktitle={MICCAI},
  year={2023}
}

🗃️ Materials

We provide a comprehensive overview of current open-source medical language models, vision foundation models, and vision-language models, illustrating their applicability to our approach (CITE). For BERT-based language models, you can run CITE by setting model->head->text_encoder->model and model->neck->out_features in the config file according to your preferred Huggingface🤗 model.
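
As a rough illustration of the two config keys named above, the fragment below shows what such an override might look like in an mmcls-style Python config. It is a sketch under assumptions: the model ID dmis-lab/biobert-v1.1 is only an example choice, and setting out_features to the text encoder's hidden size (768 for BERT-base models) is our reading of the instruction rather than something confirmed here. All other keys of the real config are omitted.

```python
# Hypothetical override for swapping in another BERT-style text encoder.
text_model = "dmis-lab/biobert-v1.1"  # example Hugging Face model ID
text_dim = 768  # hidden size of BERT-base models; check your model's config

# Only the two keys mentioned in the text are shown; the rest of the
# model definition would come from the config file you start from.
model = dict(
    neck=dict(out_features=text_dim),
    head=dict(text_encoder=dict(model=text_model)),
)
```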

Medical Language Models

Model	Subfield	Paper	Code	Base
Meditron	Medicine	Meditron-70B: Scaling Medical Pretraining for Large Language Models	Github	LLaMA 2
RadFM	Radiology	Towards Generalist Foundation Model for Radiology	Github	LLaMA
BioMedGPT	Biomedicine	BioMedGPT: Open Multimodal Generative Pre-trained Transformer for BioMedicine	Github	LLaMA 2
Med-PaLM 2	Clinic	Towards Expert-Level Medical Question Answering with Large Language Models	Google	PaLM 2
PMC-LLaMA	Medicine	PMC-LLaMA: Towards Building Open-source Language Models for Medicine	Github	LLaMA
BenTsao (HuaTuo)	Biomedicine	HuaTuo: Tuning LLaMA Model with Chinese Medical Knowledge	Github	LLaMA
ChatDoctor	Medicine	ChatDoctor: A Medical Chat Model Fine-Tuned on a Large Language Model Meta-AI (LLaMA) Using Medical Domain Knowledge	Github	LLaMA
Clinical-T5	Clinic	Clinical-T5: Large Language Models Built Using Mimic Clinical Text	PhysioNet	T5
Med-PaLM	Clinic	Large Language Models Encode Clinical Knowledge	Google	PaLM
BioGPT	Biomedicine	BioGPT: Generative Pre-Trained Transformer for Biomedical Text Generation and Mining	Github	GPT-2
BioLinkBERT	Biomedicine	Linkbert: Pretraining Language Models with Document Links	Github	BERT
PubMedBERT	Biomedicine	Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing	Microsoft	BERT
BioBERT	Biomedicine	BioBERT: A Pre-Trained Biomedical Language Representation Model for Biomedical Text Mining	Github	BERT
BlueBERT	Biomedicine	An Empirical Study of Multi-Task Learning on BERT for Biomedical Text Mining	Github	BERT
Clinical BERT	Clinic	Publicly Available Clinical BERT Embeddings	Github	BERT
SciBERT	Biomedicine	SciBERT: A Pretrained Language Model for Scientific Text	Github	BERT

Vision Models

Model	Subfield	Paper	Code	Base
REMEDIS	Radiology	Robust and Data-Efficient Generalization of Self-Supervised Machine Learning for Diagnostic Imaging	Github	SimCLR
RETFound	Retinopathy	A Foundation Model for Generalizable Disease Detection from Retinal Images	Github	MAE
CTransPath	Pathology	Transformer-Based Unsupervised Contrastive Learning for Histopathological Image Classification	Github	-
HIPT	Pathology	Scaling Vision Transformers to Gigapixel Images via Hierarchical Self-Supervised Learning	Github	DINO
INTERN-2.5	General	InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions	Github	-
DINOv2	General	DINOv2: Learning Robust Visual Features without Supervision	Github	-
MAE	General	Masked Autoencoders are Scalable Vision Learners	Github	-
ViT (ImageNet)	General	An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale	Huggingface	-

Vision-Language Models

Model	Subfield	Paper	Code	Base
Qilin-Med-VL	Radiology	Qilin-Med-VL: Towards Chinese Large Vision-Language Model for General Healthcare	Github	LLaVA
RadFM	Radiology	Towards Generalist Foundation Model for Radiology	Github	-
KAD	Radiology	Knowledge-Enhanced Visual-Language Pre-Training on Chest Radiology Images	Github	CLIP
Med-Flamingo	Medicine	Med-Flamingo: A Multimodal Medical Few-Shot Learner	Github	Flamingo
QuiltNet	Pathology	Quilt-1M: One Million Image-Text Pairs for Histopathology	Github	CLIP
PLIP	Pathology	A Visual-Language Foundation Model for Pathology Image Analysis Using Medical Twitter	Huggingface	CLIP
MI-Zero	Pathology	Visual Language Pretrained Multiple Instance Zero-Shot Transfer for Histopathology Images	Github	CLIP
LLaVA-Med	Biomedicine	LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day	Github	LLaVA
MedVInT	Biomedicine	PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering	Github	-
PMC-CLIP	Biomedicine	PMC-CLIP: Contrastive Language-Image Pre-Training Using Biomedical Documents	Github	CLIP
BiomedCLIP	Biomedicine	Large-Scale Domain-Specific Pretraining for Biomedical Vision-Language Processing	Huggingface	CLIP
MedCLIP	Medicine	MedCLIP: Contrastive Learning from Unpaired Medical Images and Text	Github	CLIP
CheXzero	Radiology	Expert-Level Detection of Pathologies from Unannotated Chest X-ray Images via Self-Supervised Learning	Github	CLIP
PubMedCLIP	Radiology	Does CLIP Benefit Visual Question Answering in the Medical Domain as Much as it Does in the General Domain?	Github	CLIP
LLaVA	General	Visual Instruction Tuning	Github	-
Flamingo	General	Flamingo: a Visual Language Model for Few-Shot Learning	OpenFlamingo	-
CLIP	General	Learning Transferable Visual Models From Natural Language Supervision	Github	-
