Sanctuary's starred repositories
Awesome-Scientific-Language-Models
A Curated List of Language Models in Scientific Domains
SEED-Bench
(CVPR 2024) A benchmark for evaluating multimodal LLMs using multiple-choice questions.
FoE-ICLR2024
The implementation of FoE for ICLR 2024
clip_dinoiser
Official implementation of 'CLIP-DINOiser: Teaching CLIP a few DINO tricks' paper.
Recommendations-Diffusion-Text-Image
A paper collection of recent diffusion models for text-image generation tasks, e.g., visual text generation, font generation, text removal, text image super-resolution, text editing, handwriting generation, scene text recognition, and scene text detection.
pero-pretraining
Self-supervised OCR pretraining for the paper Kišš et al., "Self-Supervised Pretraining for Text Recognition".
OOTDiffusion
Official implementation of OOTDiffusion: Outfitting Fusion based Latent Diffusion for Controllable Virtual Try-on
MM-VUFM4DS
A systematic survey of multi-modal and multi-task visual understanding foundation models for driving scenarios
FG-2024-Papers
FG 2024 Papers: A comprehensive collection of research papers presented at one of the premier conferences on automatic face and gesture recognition, with linked code implementations covering facial analysis, gesture recognition, and biometrics.
Grounding-DINO-1.5-API
API for Grounding DINO 1.5: IDEA Research's Most Capable Open-World Object Detection Model Series
GroundingDINO
Official implementation of the paper "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection"
Awesome-Text-to-Image
(ෆ`꒳´ෆ) A Survey on Text-to-Image Generation/Synthesis.
Grounded-Segment-Anything
Grounded SAM: Marrying Grounding DINO with Segment Anything & Stable Diffusion & Recognize Anything - Automatically Detect, Segment and Generate Anything
agricultural_textual_classification_ChatGPT
Using ChatGPT to classify textual topics/categories.
tokenize-anything
Tokenize Anything via Prompting
RPG-DiffusionMaster
[ICML 2024] Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs (RPG)
BLINK_Benchmark
This repo contains evaluation code for the paper "BLINK: Multimodal Large Language Models Can See but Not Perceive". https://arxiv.org/abs/2404.12390
LLaMA-Factory
Unify Efficient Fine-Tuning of 100+ LLMs