Sanctuary's starred repositories
Awesome-Scientific-Language-Models
A Curated List of Language Models in Scientific Domains
SEED-Bench
(CVPR 2024) A benchmark for evaluating multimodal LLMs using multiple-choice questions.
FoE-ICLR2024
The implementation of FoE for ICLR 2024
clip_dinoiser
Official implementation of 'CLIP-DINOiser: Teaching CLIP a few DINO tricks' paper.
Recommendations-Diffusion-Text-Image
A paper collection of recent diffusion models for text-image generation tasks, e.g., visual text generation, font generation, text removal, text image super-resolution, text editing, handwriting generation, scene text recognition, and scene text detection.
pero-pretraining
Self-supervised OCR pretraining for the paper Kišš et al., "Self-Supervised Pretraining for Text Recognition".
OOTDiffusion
Official implementation of OOTDiffusion: Outfitting Fusion based Latent Diffusion for Controllable Virtual Try-on
MM-VUFM4DS
A systematic survey of multi-modal and multi-task visual understanding foundation models for driving scenarios
FG-2024-Papers
FG 2024 Papers: A comprehensive collection of research papers presented at one of the premier conferences on automatic face and gesture recognition, with linked code implementations covering facial analysis, gesture recognition, and biometrics.
Grounding-DINO-1.5-API
API for Grounding DINO 1.5: IDEA Research's Most Capable Open-World Object Detection Model Series
GroundingDINO
Official implementation of the paper "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection"
Awesome-Text-to-Image
(ෆ`꒳´ෆ) A Survey on Text-to-Image Generation/Synthesis.
Grounded-Segment-Anything
Grounded SAM: Marrying Grounding DINO with Segment Anything & Stable Diffusion & Recognize Anything - Automatically Detect, Segment and Generate Anything
agricultural_textual_classification_ChatGPT
Using ChatGPT to classify textual topics/categories.
tokenize-anything
Tokenize Anything via Prompting
RPG-DiffusionMaster
[ICML 2024] Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs (RPG)
BLINK_Benchmark
This repo contains evaluation code for the paper "BLINK: Multimodal Large Language Models Can See but Not Perceive". https://arxiv.org/abs/2404.12390
LLaMA-Factory
Unify Efficient Fine-Tuning of 100+ LLMs