liangdong-xjtu / ArXivQA-LD

WIP - Automated Question Answering for ArXiv Papers with Large Language Models

Home Page: https://huggingface.co/datasets/taesiri/arxiv_qa

List of Papers

2023

October 2023

  • Improved Baselines with Visual Instruction Tuning - [2310.03744] [QA].
  • Aligning Text-to-Image Diffusion Models with Reward Backpropagation - [2310.03739] [QA].
  • Leveraging Unpaired Data for Vision-Language Generative Models via Cycle Consistency - [2310.03734] [QA].
  • MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning - [2310.03731] [QA].
  • HeaP: Hierarchical Policies for Web Actions using LLMs - [2310.03720] [QA].
  • A Long Way to Go: Investigating Length Correlations in RLHF - [2310.03716] [QA].
  • DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines - [2310.03714] [QA].
  • Agent Instructs Large Language Models to be General Zero-Shot Reasoners - [2310.03710] [QA].
  • Drag View: Generalizable Novel View Synthesis with Unposed Imagery - [2310.03704] [QA].
  • Kandinsky: an Improved Text-to-Image Synthesis with Image Prior and Latent Diffusion - [2310.03502] [QA].
  • FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation - [2310.03214] [QA].
  • Large Language Model Cascades with Mixture of Thoughts Representations for Cost-efficient Reasoning - [2310.03094] [QA].
  • Retrieval meets Long Context Large Language Models - [2310.03025] [QA].
  • How FaR Are Large Language Models From Agents with Theory-of-Mind? - [2310.03051] [QA].
  • EcoAssistant: Using LLM Assistant More Affordably and Accurately - [2310.03046] [QA].
  • MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts - [2310.02255] [QA].
  • MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens - [2310.02239] [QA].
  • Think before you speak: Training Language Models With Pause Tokens - [2310.02226] [QA].
  • What do we learn from a large-scale study of pre-trained visual representations in sim and real environments? - [2310.02219] [QA].
  • Language Models Represent Space and Time - [2310.02207] [QA].
  • Large Language Models Cannot Self-Correct Reasoning Yet - [2310.01798] [QA].
  • Can large language models provide useful feedback on research papers? A large-scale empirical analysis - [2310.01783] [QA].
  • ImageNet-OOD: Deciphering Modern Out-of-Distribution Detection Algorithms - [2310.01755] [QA].
  • Large Language Models as Analogical Reasoners - [2310.01714] [QA].
  • ImagenHub: Standardizing the evaluation of conditional image generation models - [2310.01596] [QA].
  • SmartPlay: A Benchmark for LLMs as Intelligent Agents - [2310.01557] [QA].
  • Neutrinos from muon-rich ultra high energy electromagnetic cascades: The MUNHECA code - [2310.01510] [QA].
  • DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model - [2310.01412] [QA].
  • Conditional Diffusion Distillation - [2310.01407] [QA].
  • Representation Engineering: A Top-Down Approach to AI Transparency - [2310.01405] [QA].
  • RA-DIT: Retrieval-Augmented Dual Instruction Tuning - [2310.01352] [QA].
  • Label Supervised LLaMA Finetuning - [2310.01208] [QA].
  • Enable Language Models to Implicitly Learn Self-Improvement From Data - [2310.00898] [QA].
  • (Dynamic) Prompting might be all you need to repair Compressed LLMs - [2310.00867] [QA].
  • Analyzing and Mitigating Object Hallucination in Large Vision-Language Models - [2310.00754] [QA].
  • RoleLLM: Benchmarking, Eliciting, and Enhancing Role-Playing Abilities of Large Language Models - [2310.00746] [QA].
  • FELM: Benchmarking Factuality Evaluation of Large Language Models - [2310.00741] [QA].
  • UniAudio: An Audio Foundation Model Toward Universal Audio Generation - [2310.00704] [QA].

September 2023

  • PixArt-$α$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis - [2310.00426] [QA].
  • AutomaTikZ: Text-Guided Synthesis of Scientific Vector Graphics with TikZ - [2310.00367] [QA].
  • Efficient Streaming Language Models with Attention Sinks - [2309.17453] [QA].
  • The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision) - [2309.17421] [QA].
  • Directly Fine-Tuning Diffusion Models on Differentiable Rewards - [2309.17400] [QA].
  • GAIA-1: A Generative World Model for Autonomous Driving - [2309.17080] [QA].
  • Demystifying CLIP Data - [2309.16671] [QA].
  • RealFill: Reference-Driven Generation for Authentic Image Completion - [2309.16668] [QA].
  • DreamGaussian: Generative Gaussian Splatting for Efficient 3D Content Creation - [2309.16653] [QA].
  • ConceptGraphs: Open-Vocabulary 3D Scene Graphs for Perception and Planning - [2309.16650] [QA].
  • Deep Geometrized Cartoon Line Inbetweening - [2309.16643] [QA].
  • Qwen Technical Report - [2309.16609] [QA].
  • Vision Transformers Need Registers - [2309.16588] [QA].
  • Text-to-3D using Gaussian Splatting - [2309.16585] [QA].
  • GPT-Fathom: Benchmarking Large Language Models to Decipher the Evolutionary Path towards GPT-4 and Beyond - [2309.16583] [QA].
  • MotionLM: Multi-Agent Motion Forecasting as Language Modeling - [2309.16534] [QA].
  • CCEdit: Creative and Controllable Video Editing via Diffusion Models - [2309.16496] [QA].
  • Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation - [2309.16429] [QA].
  • AutoCLIP: Auto-tuning Zero-Shot Classifiers for Vision-Language Models - [2309.16414] [QA].
  • Dark Side Augmentation: Generating Diverse Night Examples for Metric Learning - [2309.16351] [QA].
  • Language models in molecular discovery - [2309.16235] [QA].
  • AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model - [2309.16058] [QA].
  • Effective Long-Context Scaling of Foundation Models - [2309.16039] [QA].
  • Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation - [2309.15818] [QA].
  • Emu: Enhancing Image Generation Models Using Photogenic Needles in a Haystack - [2309.15807] [QA].
  • Aperture Diffraction for Compact Snapshot Spectral Imaging - [2309.16372] [QA].
  • Borges and AI - [2310.01425] [QA].
  • Jointly Training Large Autoregressive Multimodal Models - [2309.15564] [QA].
  • Finite Scalar Quantization: VQ-VAE Made Simple - [2309.15505] [QA].
  • Graph Neural Prompting with Large Language Models - [2309.15427] [QA].
  • NeuRBF: A Neural Fields Representation with Adaptive Radial Basis Functions - [2309.15426] [QA].
  • DECO: Dense Estimation of 3D Human-Scene Contact In The Wild - [2309.15273] [QA].
  • VPA: Fully Test-Time Visual Prompt Adaptation - [2309.15251] [QA].
  • Low-rank Adaptation of Large Language Model Rescoring for Parameter-Efficient Speech Recognition - [2309.15223] [QA].
  • LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models - [2309.15103] [QA].
  • Attention Satisfies: A Constraint-Satisfaction Lens on Factual Errors of Language Models - [2309.15098] [QA].
  • VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning - [2309.15091] [QA].
  • RPEFlow: Multimodal Fusion of RGB-PointCloud-Event for Joint Optical Flow and Scene Flow Estimation - [2309.15082] [QA].
  • Large Language Model Alignment: A Survey - [2309.15025] [QA].
  • Treating Motion as Option with Output Selection for Unsupervised Video Object Segmentation - [2309.14786] [QA].
  • QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models - [2309.14717] [QA].
  • NDC-Scene: Boost Monocular 3D Semantic Scene Completion in Normalized Device Coordinates Space - [2309.14616] [QA].
  • Efficient Post-training Quantization with FP8 Formats - [2309.14592] [QA].
  • CWCL: Cross-Modal Transfer with Continuously Weighted Contrastive Loss - [2309.14580] [QA].
  • Aligning Large Multimodal Models with Factually Augmented RLHF - [2309.14525] [QA].
  • DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models - [2309.14509] [QA].
  • Extreme Parkour with Legged Robots - [2309.14341] [QA].
  • Electronic properties, correlated topology and Green's function zeros - [2309.14340] [QA].
  • DeepSpeed-VisualChat: Multi-Round Multi-Image Interleave Chat via Multi-Modal Causal Attention - [2309.14327] [QA].
  • Physics of Language Models: Part 3.2, Knowledge Manipulation - [2309.14402] [QA].
  • Small-scale proxies for large-scale Transformer training instabilities - [2309.14322] [QA].
  • Tiled Multiplane Images for Practical 3D Photography - [2309.14291] [QA].
  • Only 5% Attention Is All You Need: Efficient Long-range Document-level Neural Machine Translation - [2309.14174] [QA].
  • May I Ask a Follow-up Question? Understanding the Benefits of Conversations in Neural Network Explainability - [2309.13965] [QA].
  • VidChapters-7M: Video Chapters at Scale - [2309.13952] [QA].
  • Impact of Human-AI Interaction on User Trust and Reliance in AI-Assisted Qualitative Coding - [2309.13858] [QA].
  • Evaluating Cognitive Maps and Planning in Large Language Models with CogEval - [2309.15129] [QA].
  • Embers of Autoregression: Understanding Large Language Models Through the Problem They are Trained to Solve - [2309.13638] [QA].
  • LOGICSEG: Parsing Visual Semantics with Neural Logic Learning and Reasoning - [2309.13556] [QA].
  • MediViSTA-SAM: Zero-shot Medical Video Analysis with Spatio-temporal SAM Adaptation - [2309.13539] [QA].
  • Attention Is All You Need For Blind Room Volume Estimation - [2309.13504] [QA].
  • Learning Invariant Representations with a Nonparametric Nadaraya-Watson Head - [2309.13377] [QA].
  • MLPST: MLP is All You Need for Spatio-Temporal Prediction - [2309.13363] [QA].
  • Exploring Large Language Models' Cognitive Moral Development through Defining Issues Test - [2309.13356] [QA].
  • Enhancing Zero-Shot Chain-of-Thought Reasoning in Large Language Models through Logic - [2309.13339] [QA].
  • Calibrating LLM-Based Evaluator - [2309.13308] [QA].
  • Defending Pre-trained Language Models as Few-shot Learners against Backdoor Attacks - [2309.13256] [QA].
  • Spatial-frequency channels, shape bias, and adversarial robustness - [2309.13190] [QA].
  • E(2)-Equivariant Graph Planning for Navigation - [2309.13043] [QA].
  • MosaicFusion: Diffusion Models as Data Augmenters for Large Vocabulary Instance Segmentation - [2309.13042] [QA].
  • Robotic Offline RL from Internet Videos via Value-Function Pre-Training - [2309.13041] [QA].
  • NeRRF: 3D Reconstruction and View Synthesis for Transparent and Specular Objects with Neural Refractive-Reflective Fields - [2309.13039] [QA].
  • Privacy Assessment on Reconstructed Images: Are Existing Evaluation Metrics Faithful to Human Perception? - [2309.13038] [QA].
  • GELLO: A General, Low-Cost, and Intuitive Teleoperation Framework for Robot Manipulators - [2309.13037] [QA].
  • PyPose v0.6: The Imperative Programming Interface for Robotics - [2309.13035] [QA].
  • Memory-augmented conformer for improved end-to-end long-form ASR - [2309.13029] [QA].
  • Graph Neural Network for Stress Predictions in Stiffened Panels Under Uniform Loading - [2309.13022] [QA].
  • A Hybrid Deep Learning-based Approach for Optimal Genotype by Environment Selection - [2309.13021] [QA].
  • Dynamic ASR Pathways: An Adaptive Masking Approach Towards Efficient Pruning of A Multilingual ASR Model - [2309.13018] [QA].
  • Understanding Deep Gradient Leakage via Inversion Influence Functions - [2309.13016] [QA].
  • Efficient N:M Sparse DNN Training Using Algorithm, Architecture, and Dataflow Co-Design - [2309.13015] [QA].
  • Performance Analysis of UNet and Variants for Medical Image Segmentation - [2309.13013] [QA].
  • ReConcile: Round-Table Conference Improves Reasoning via Consensus among Diverse LLMs - [2309.13007] [QA].
  • Deep3DSketch+: Rapid 3D Modeling from Single Free-hand Sketches - [2309.13006] [QA].
  • Pursuing Counterfactual Fairness via Sequential Autoencoder Across Domains - [2309.13005] [QA].
  • Expressive variational quantum circuits provide inherent privacy in federated learning - [2309.13002] [QA].
  • Audience-specific Explanations for Machine Translation - [2309.12998] [QA].
  • Point Cloud Network: An Order of Magnitude Improvement in Linear Layer Parameter Count - [2309.12996] [QA].
  • Deep learning probability flows and entropy production rates in active matter - [2309.12991] [QA].
  • License Plate Recognition Based On Multi-Angle View Model - [2309.12972] [QA].
  • Higher-order Graph Convolutional Network with Flower-Petals Laplacians on Simplicial Complexes - [2309.12971] [QA].
  • PI-RADS v2 Compliant Automated Segmentation of Prostate Zones Using co-training Motivated Multi-task Dual-Path CNN - [2309.12970] [QA].
  • Detect Every Thing with Few Examples - [2309.12969] [QA].
  • Nested Event Extraction upon Pivot Element Recognition - [2309.12960] [QA].
  • On Data Fabrication in Collaborative Vehicular Perception: Attacks and Countermeasures - [2309.12955] [QA].
  • Background Activation Suppression for Weakly Supervised Object Localization and Semantic Segmentation - [2309.12943] [QA].
  • Trusta: Reasoning about Assurance Cases with Formal Methods and Large Language Models - [2309.12941] [QA].
  • Self-Explanation Prompting Improves Dialogue Understanding in Large Language Models - [2309.12940] [QA].
  • Frustrated with Code Quality Issues? LLMs can Help! - [2309.12938] [QA].
  • Evolving Spiking Neural Networks to Mimic PID Control for Autonomous Blimps - [2309.12937] [QA].
  • TopRoBERTa: Topology-Aware Authorship Attribution of Deepfake Texts - [2309.12934] [QA].
  • CodePlan: Repository-level Coding using LLMs and Planning - [2309.12499] [QA].
  • DualToken-ViT: Position-aware Efficient Vision Transformer with Dual Token Fusion - [2309.12424] [QA].
  • LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent - [2309.12311] [QA].
  • LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models - [2309.12307] [QA].
  • PanoVOS: Bridging Non-panoramic and Panoramic Views with Transformer for Video Segmentation - [2309.12303] [QA].
  • The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A" - [2309.12288] [QA].
  • MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models - [2309.12284] [QA].
  • Boolformer: Symbolic Regression of Logic Functions with Transformers - [2309.12207] [QA].
  • LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset - [2309.11998] [QA].
  • MEFLUT: Unsupervised 1D Lookup Tables for Multi-exposure Image Fusion - [2309.11847] [QA].
  • A Paradigm Shift in Machine Translation: Boosting Translation Performance of Large Language Models - [2309.11674] [QA].
  • BTLM-3B-8K: 7B Parameter Performance in a 3B Parameter Model - [2309.11568] [QA].
  • A Large-scale Dataset for Audio-Language Representation Learning - [2309.11500] [QA].
  • DreamLLM: Synergistic Multimodal Comprehension and Creation - [2309.11499] [QA].
  • FreeU: Free Lunch in Diffusion U-Net - [2309.11497] [QA].
  • Chain-of-Verification Reduces Hallucination in Large Language Models - [2309.11495] [QA].
  • SCREWS: A Modular Framework for Reasoning with Revisions - [2309.13075] [QA].
  • Kosmos-2.5: A Multimodal Literate Model - [2309.11419] [QA].
  • OpenChat: Advancing Open-source Language Models with Mixed-Quality Data - [2309.11235] [QA].
  • The Languini Kitchen: Enabling Language Modelling Research at Different Scales of Compute - [2309.11197] [QA].
  • AutoSynth: Learning to Generate 3D Training Data for Object Point Cloud Registration - [2309.11170] [QA].
  • Multi-grained Temporal Prototype Learning for Few-shot Video Object Segmentation - [2309.11160] [QA].
  • More complex encoder is not all you need - [2309.11139] [QA].
  • Contrastive Pseudo Learning for Open-World DeepFake Attribution - [2309.11132] [QA].
  • Dense 2D-3D Indoor Prediction with Sound via Aligned Cross-Modal Distillation - [2309.11081] [QA].
  • Weak Supervision for Label Efficient Visual Bug Detection - [2309.11077] [QA].
  • The Topology and Geometry of Neural Representations - [2309.11028] [QA].
  • Controllable Dynamic Appearance for Neural 3D Portraits - [2309.11009] [QA].
  • RMT: Retentive Networks Meet Vision Transformers - [2309.11523] [QA].
  • LMDX: Language Model-based Document Information Extraction and Localization - [2309.10952] [QA].
  • End-to-End Speech Recognition Contextualization with Large Language Models - [2309.10917] [QA].
  • SlimPajama-DC: Understanding Data Combinations for LLM Training - [2309.10818] [QA].
  • Sound Source Localization is All about Cross-Modal Alignment - [2309.10724] [QA].
  • OpenBA: An Open-sourced 15B Bilingual Asymmetric seq2seq Model Pre-trained from Scratch - [2309.10706] [QA].
  • Language Modeling Is Compression - [2309.10668] [QA].
  • NDDepth: Normal-Distance Assisted Monocular Depth Estimation - [2309.10592] [QA].
  • FoleyGen: Visually-Guided Audio Generation - [2309.10537] [QA].
  • AutoDiffusion: Training-Free Optimization of Time Steps and Architectures for Automated Diffusion Model Acceleration - [2309.10438] [QA].
  • PoSE: Efficient Context Window Extension of LLMs via Positional Skip-wise Training - [2309.10400] [QA].
  • Baichuan 2: Open Large-scale Language Models - [2309.10305] [QA].
  • 360$^\circ$ Reconstruction From a Single Image Using Space Carved Outpainting - [2309.10279] [QA].
  • Stabilizing RLHF through Advantage Model and Selective Rehearsal - [2309.10202] [QA].
  • Q-Transformer: Scalable Offline Reinforcement Learning via Autoregressive Q-Functions - [2309.10150] [QA].
  • Unified Coarse-to-Fine Alignment for Video-Text Retrieval - [2309.10091] [QA].
  • Multimodal Foundation Models: From Specialists to General-Purpose Assistants - [2309.10020] [QA].
  • MindAgent: Emergent Gaming Interaction - [2309.09971] [QA].
  • Generating and Imputing Tabular Data via Diffusion and Flow-based Gradient-Boosted Trees - [2309.09968] [QA].
  • An Empirical Study of Scaling Instruct-Tuned Large Multimodal Models - [2309.09958] [QA].
  • Robust Geometry-Preserving Depth Estimation Using Differentiable Rendering - [2309.09724] [QA].
  • CATR: Combinatorial-Dependence Audio-Queried Transformer for Audio-Visual Video Segmentation - [2309.09709] [QA].
  • Adapting Large Language Models via Reading Comprehension - [2309.09530] [QA].
  • LayoutNUWA: Revealing the Hidden Layout Expertise of Large Language Models - [2309.09506] [QA].
  • Discovering Sounding Objects by Audio Queries for Audio Visual Segmentation - [2309.09501] [QA].
  • CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages - [2309.09400] [QA].
  • Augmenting text for spoken language understanding with Large Language Models - [2309.09390] [QA].
  • Embrace Divergence for Richer Insights: A Multi-document Summarization Benchmark and a Case Study on Summarizing Diverse Information from News Articles - [2309.09369] [QA].
  • OWL: A Large Language Model for IT Operations - [2309.09298] [QA].
  • LivelySpeaker: Towards Semantic-Aware Co-Speech Gesture Generation - [2309.09294] [QA].
  • Contrastive Decoding Improves Reasoning in Large Language Models - [2309.09117] [QA].
  • Sorted LLaMA: Unlocking the Potential of Intermediate Layers of Large Language Models for Dynamic Inference Using Sorted Fine-Tuning (SoFT) - [2309.08968] [QA].
  • Struc-Bench: Are Large Language Models Really Good at Generating Complex Structured Data? - [2309.08963] [QA].
  • Monolingual or Multilingual Instruction Tuning: Which Makes a Better Alpaca - [2309.08958] [QA].
  • PDFTriage: Question Answering over Long, Structured Documents - [2309.08872] [QA].
  • S3-DST: Structured Open-Domain Dialogue Segmentation and State Tracking in the Era of LLMs - [2309.08827] [QA].
  • Stack-and-Delay: a new codebook pattern for music generation - [2309.08804] [QA].
  • Enhance audio generation controllability through representation similarity regularization - [2309.08773] [QA].
  • BANSAC: A dynamic BAyesian Network for adaptive SAmple Consensus - [2309.08690] [QA].
  • Sparse Autoencoders Find Highly Interpretable Features in Language Models - [2309.08600] [QA].
  • Robust Frame-to-Frame Camera Rotation Estimation in Crowded Scenes - [2309.08588] [QA].
  • Compositional Foundation Models for Hierarchical Planning - [2309.08587] [QA].
  • Replacing softmax with ReLU in Vision Transformers - [2309.08586] [QA].
  • Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers - [2309.08532] [QA].
  • Scaling Laws for Sparsely-Connected Foundation Models - [2309.08520] [QA].
  • Using Large Language Models for Knowledge Engineering (LLMKE): A Case Study on Wikidata - [2309.08491] [QA].
  • Deformable Neural Radiance Fields using RGB and Event Cameras - [2309.08416] [QA].
  • Cure the headache of Transformers via Collinear Constrained Attention - [2309.08646] [QA].
  • Investigating Answerability of LLMs for Long-Form Question Answering - [2309.08210] [QA].
  • LASER: LLM Agent with State-Space Exploration for Web Navigation - [2309.08172] [QA].
  • Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding - [2309.08168] [QA].
  • RADE: Reference-Assisted Dialogue Evaluation for Open-Domain Dialogue - [2309.08156] [QA].
  • Retrieval-Augmented Text-to-Audio Generation - [2309.08051] [QA].
  • Leveraging Contextual Information for Effective Entity Salience Detection - [2309.07990] [QA].
  • Viewpoint Textual Inversion: Unleashing Novel View Synthesis with Pretrained 2D Diffusion Models - [2309.07986] [QA].
  • A Data Source for Reasoning Embodied Agents - [2309.07974] [QA].
  • Language Embedded Radiance Fields for Zero-Shot Task-Oriented Grasping - [2309.07970] [QA].
  • ALWOD: Active Learning for Weakly-Supervised Object Detection - [2309.07914] [QA].
  • Disentangling Spatial and Temporal Learning for Efficient Image-to-Video Transfer Learning - [2309.07911] [QA].
  • TEMPO: Efficient Multi-View Pose Estimation, Tracking, and Forecasting - [2309.07910] [QA].
  • Generative Image Dynamics - [2309.07906] [QA].
  • Ambiguity-Aware In-Context Learning with Large Language Models - [2309.07900] [QA].
  • Agents: An Open-source Framework for Autonomous Language Agents - [2309.07870] [QA].
  • The Rise and Potential of Large Language Model Based Agents: A Survey - [2309.07864] [QA].
  • TextBind: Multi-turn Interleaved Multimodal Instruction-following - [2309.08637] [QA].
  • OmnimatteRF: Robust Omnimatte with 3D Background Modeling - [2309.07749] [QA].
  • Efficiently Robustify Pre-trained Models - [2309.07499] [QA].
  • EP2P-Loc: End-to-End 3D Point to 2D Pixel Localization for Large-Scale Visual Localization - [2309.07471] [QA].
  • Are Large Language Model-based Evaluators the Solution to Scaling Up Multilingual Evaluation? - [2309.07462] [QA].
  • Clinical Text Summarization: Adapting Large Language Models Can Outperform Human Experts - [2309.07430] [QA].
  • Flexible Visual Recognition by Evidential Modeling of Confusion and Ignorance - [2309.07403] [QA].
  • AudioSR: Versatile Audio Super-resolution at Scale - [2309.07314] [QA].
  • Pretraining on the Test Set Is All You Need - [2309.08632] [QA].
  • All you need is spin: SU(2) equivariant variational quantum circuits based on spin networks - [2309.07250] [QA].
  • Text-Guided Generation and Editing of Compositional 3D Avatars - [2309.07125] [QA].
  • RAIN: Your Language Models Can Align Themselves without Finetuning - [2309.07124] [QA].
  • Tree-Structured Shading Decomposition - [2309.07122] [QA].
  • SupFusion: Supervised LiDAR-Camera Fusion for 3D Object Detection - [2309.07084] [QA].
  • Efficient Reinforcement Learning for Jumping Monopods - [2309.07038] [QA].
  • DreamStyler: Paint by Style Inversion with Text-to-Image Diffusion Models - [2309.06933] [QA].
  • MagiCapture: High-Resolution Multi-Concept Portrait Customization - [2309.06895] [QA].
  • Keep It SimPool: Who Said Supervised Transformers Suffer from Attention Deficit? - [2309.06891] [QA].
  • Leveraging SE(3) Equivariance for Learning 3D Geometric Shape Assembly - [2309.06810] [QA].
  • Dynamic NeRFs for Soccer Scenes - [2309.06802] [QA].
  • Cognitive Mirage: A Review of Hallucinations in Large Language Models - [2309.06794] [QA].
  • MPI-Flow: Learning Realistic Optical Flow with Multiplane Images - [2309.06714] [QA].
  • VLSlice: Interactive Vision-and-Language Slice Discovery - [2309.06703] [QA].
  • Generalizable Neural Fields as Partially Observed Neural Processes - [2309.06660] [QA].
  • Statistical Rejection Sampling Improves Preference Optimization - [2309.06657] [QA].
  • A Distributed Data-Parallel PyTorch Implementation of the Distributed Shampoo Optimizer for Training Neural Networks At-Scale - [2309.06497] [QA].
  • Learning Disentangled Avatars with Hybrid 3D Representations - [2309.06441] [QA].
  • LEAP Hand: Low-Cost, Efficient, and Anthropomorphic Hand for Robot Learning - [2309.06440] [QA].
  • InstaFlow: One Step is Enough for High-Quality Diffusion-Based Text-to-Image Generation - [2309.06380] [QA].
  • Recovering from Privacy-Preserving Masking with Large Language Models - [2309.08628] [QA].
  • Modality Unifying Network for Visible-Infrared Person Re-Identification - [2309.06262] [QA].
  • Efficient Memory Management for Large Language Model Serving with PagedAttention - [2309.06180] [QA].
  • AstroLLaMA: Towards Specialized Foundation Models in Astronomy - [2309.06126] [QA].
  • Uncovering mesa-optimization algorithms in Transformers - [2309.05858] [QA].
  • Large Language Models for Compiler Optimization - [2309.07062] [QA].
  • SHIFT3D: Synthesizing Hard Inputs For Tricking 3D Detectors - [2309.05810] [QA].
  • PhotoVerse: Tuning-Free Image Customization with Text-to-Image Diffusion Models - [2309.05793] [QA].
  • Diffusion-Guided Reconstruction of Everyday Hand-Object Interaction Clips - [2309.05663] [QA].
  • Large Language Model for Science: A Study on P vs. NP - [2309.05689] [QA].
  • UniSeg: A Unified Multi-Modal LiDAR Segmentation Network and the OpenPCSeg Codebase - [2309.05573] [QA].
  • ITI-GEN: Inclusive Text-to-Image Generation - [2309.05569] [QA].
  • NExT-GPT: Any-to-Any Multimodal LLM - [2309.05519] [QA].
  • Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs - [2309.05516] [QA].
  • Textbooks Are All You Need II: phi-1.5 technical report - [2309.05463] [QA].
  • Pushing Mixture of Experts to the Limit: Extremely Parameter Efficient MoE for Instruction Tuning - [2309.05444] [QA].
  • Class-Incremental Grouping Network for Continual Audio-Visual Learning - [2309.05281] [QA].
  • Multi3DRefer: Grounding Text Description to Multiple 3D Objects - [2309.05251] [QA].
  • Does Writing with Language Models Reduce Content Diversity? - [2309.05196] [QA].
  • Towards Viewpoint Robustness in Bird's Eye View Segmentation - [2309.05192] [QA].
  • Beyond Skin Tone: A Multidimensional Measure of Apparent Skin Color - [2309.05148] [QA].
  • 3D Implicit Transporter for Temporally Consistent Keypoint Discovery - [2309.05098] [QA].
  • Multi-view Self-supervised Disentanglement for General Image Denoising - [2309.05049] [QA].
  • Mitigating Word Bias in Zero-shot Prompt-based Classifiers - [2309.04992] [QA].
  • Efficient Emotional Adaptation for Audio-Driven Talking-Head Generation - [2309.04946] [QA].
  • Effective Real Image Editing with Accelerated Iterative Diffusion Inversion - [2309.04907] [QA].
  • Leveraging Large Language Models for Exploiting ASR Uncertainty - [2309.04842] [QA].
  • Neurons in Large Language Models: Dead, N-gram, Positional - [2309.04827] [QA].
  • Towards Real-World Burst Image Super-Resolution: Benchmark and Method - [2309.04803] [QA].
  • VeRi3D: Generative Vertex-based Radiance Fields for 3D Controllable Human Image Synthesis - [2309.04800] [QA].
  • Towards Robust Model Watermark via Reducing Parametric Vulnerability - [2309.04777] [QA].
  • SeaEval for Multilingual Foundation Models: From Cross-Lingual Alignment to Cultural Reasoning - [2309.04766] [QA].
  • When to Learn What: Model-Adaptive Data Augmentation Curriculum - [2309.04747] [QA].
  • FIAT: Fusing learning paradigms with Instruction-Accelerated Tuning - [2309.04663] [QA].
  • MADLAD-400: A Multilingual And Document-Level Large Audited Dataset - [2309.04662] [QA].
  • Exploring Large Language Models for Communication Games: An Empirical Study on Werewolf - [2309.04658] [QA].
  • Dynamic Mesh-Aware Radiance Fields - [2309.04581] [QA].
  • When Less is More: Investigating Data Pruning for Pretraining LLMs at Scale - [2309.04564] [QA].
  • Examining Autoexposure for Challenging Scenes - [2309.04542] [QA].
  • Video Task Decathlon: Unifying Image and Video Tasks in Autonomous Driving - [2309.04422] [QA].
  • DeformToon3D: Deformable 3D Toonification from Neural Radiance Fields - [2309.04410] [QA].
  • Mobile V-MoEs: Scaling Down Vision Transformers via Sparse Mixture-of-Experts - [2309.04354] [QA].
  • The Power of Sound (TPoS): Audio Reactive Video Generation with Stable Diffusion - [2309.04509] [QA].
  • From Sparse to Dense: GPT-4 Summarization with Chain of Density Prompting - [2309.04269] [QA].
  • Towards Practical Capture of High-Fidelity Relightable Avatars - [2309.04247] [QA].
  • Unsupervised Object Localization with Representer Point Selection - [2309.04172] [QA].
  • NESTLE: a No-Code Tool for Statistical Analysis of Legal Corpus - [2309.04146] [QA].
  • Evaluation and Mitigation of Agnosia in Multimodal Large Language Models - [2309.04041] [QA].
  • CDFSL-V: Cross-Domain Few-Shot Learning for Videos - [2309.03989] [QA].
  • LanSER: Language-Model Supported Speech Emotion Recognition - [2309.03978] [QA].
  • ImageBind-LLM: Multi-modality Instruction Tuning - [2309.03905] [QA].
  • Tracking Anything with Decoupled Video Segmentation - [2309.03903] [QA].
  • Learning Continuous Exposure Value Representations for Single-Image HDR Reconstruction - [2309.03900] [QA].
  • The Making and Breaking of Camouflage - [2309.03899] [QA].
  • ProPainter: Improving Propagation and Transformer for Video Inpainting - [2309.03897] [QA].
  • InstructDiffusion: A Generalist Modeling Interface for Vision Tasks - [2309.03895] [QA].
  • A Function Interpretation Benchmark for Evaluating Interpretability Methods - [2309.03886] [QA].
  • DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models - [2309.03883] [QA].
  • On Large Language Models' Selection Bias in Multi-Choice Questions - [2309.03882] [QA].
  • FLM-101B: An Open LLM and How to Train It with $100K Budget - [2309.03852] [QA].
  • Panoramas from Photons - [2309.03811] [QA].
  • SimNP: Learning Self-Similarity Priors Between Neural Points - [2309.03809] [QA].
  • Phasic Content Fusing Diffusion Model with Directional Distribution Consistency for Few-Shot Model Adaption - [2309.03729] [QA].
  • Efficient Adaptive Human-Object Interaction Detection with Concept-guided Memory - [2309.03696] [QA].
  • Large-Scale Automatic Audiobook Creation - [2309.03926] [QA].
  • Evaluating ChatGPT as a Recommender System: A Rigorous Approach - [2309.03613] [QA].
  • Enhancing Sample Utilization through Sample Adaptive Augmentation in Semi-Supervised Learning - [2309.03598] [QA].
  • Text2Control3D: Controllable 3D Avatar Generation in Neural Radiance Fields using Geometry-Guided Text-to-Image Diffusion Model - [2309.03550] [QA].
  • Reuse and Diffuse: Iterative Denoising for Text-to-Video Generation - [2309.03549] [QA].
  • Temporal Collection and Distribution for Referring Video Object Segmentation - [2309.03473] [QA].
  • SyncDreamer: Generating Multiview-consistent Images from a Single-view Image - [2309.03453] [QA].
  • Large Language Models as Optimizers - [2309.03409] [QA].
  • Distribution-Aware Prompt Tuning for Vision-Language Models - [2309.03406] [QA].
  • Robotic Table Tennis: A Case Study into a High Speed Learning System - [2309.03315] [QA].
  • Matcha-TTS: A fast TTS architecture with conditional flow matching - [2309.03199] [QA].
  • Bayes' Rays: Uncertainty Quantification for Neural Radiance Fields - [2309.03185] [QA].
  • SLiMe: Segment Like Me - [2309.03179] [QA].
  • ResFields: Residual Neural Fields for Spatiotemporal Signals - [2309.03160] [QA].
  • MyoDex: A Generalizable Prior for Dexterous Manipulation - [2309.03130] [QA].
  • Dynamic Hyperbolic Attention Network for Fine Hand-object Reconstruction - [2309.02965] [QA].
  • GPT Can Solve Mathematical Problems Without a Calculator - [2309.03241] [QA].
  • Zero-Resource Hallucination Prevention for Large Language Models - [2309.02654] [QA].
  • Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning - [2309.02591] [QA].
  • Physically Grounded Vision-Language Models for Robotic Manipulation - [2309.02561] [QA].
  • A skeletonization algorithm for gradient-based optimization - [2309.02527] [QA].
  • GO-SLAM: Global Optimization for Consistent 3D Instant Reconstruction - [2309.02436] [QA].
  • Building a Winning Team: Selecting Source Model Ensembles using a Submodular Transferability Estimation Approach - [2309.02429] [QA].
  • Cognitive Architectures for Language Agents - [2309.02427] [QA].
  • EgoPCA: A New Framework for Egocentric Hand-Object Interaction Understanding - [2309.02423] [QA].
  • Doppelgangers: Learning to Disambiguate Images of Similar Structures - [2309.02420] [QA].
  • Generating Realistic Images from In-the-wild Sounds - [2309.02405] [QA].
  • Prototype-based Dataset Comparison - [2309.02401] [QA].
  • Explaining grokking through circuit efficiency - [2309.02390] [QA].
  • CIEM: Contrastive Instruction Evaluation Method for Better Instruction Tuning - [2309.02301] [QA].
  • Making Large Language Models Better Reasoners with Alignment - [2309.02144] [QA].
  • Multi-label affordance mapping from egocentric vision - [2309.02120] [QA].
  • Iterative Superquadric Recomposition of 3D Objects from Multiple Views - [2309.02102] [QA].
  • Learning Cross-Modal Affinity for Referring Video Object Segmentation Targeting Limited Samples - [2309.02041] [QA].
  • Data-Juicer: A One-Stop Data Processing System for Large Language Models - [2309.02033] [QA].
  • RawHDR: High Dynamic Range Image Reconstruction from a Single Raw Image - [2309.02020] [QA].
  • NICE: CVPR 2023 Challenge on Zero-shot Image Captioning - [2309.01961] [QA].
  • Empowering Low-Light Image Enhancer through Customized Learnable Priors - [2309.01958] [QA].
  • Towards Universal Image Embeddings: A Large-Scale Dataset and Challenge for Generic Image Representations - [2309.01858] [QA].
  • One Wide Feedforward is All You Need - [2309.01826] [QA].
  • Are Emergent Abilities in Large Language Models just In-Context Learning? - [2309.01809] [QA].
  • An Empirical Analysis for Zero-Shot Multi-Label Classification on COVID-19 CT Scans and Uncurated Reports - [2309.01740] [QA].
  • Mask-Attention-Free Transformer for 3D Instance Segmentation - [2309.01692] [QA].
  • AGG-Net: Attention Guided Gated-convolutional Network for Depth Image Completion - [2309.01624] [QA].
  • Raw Data Is All You Need: Virtual Axle Detector with Enhanced Receptive Field - [2309.01574] [QA].
  • A Blackbox Model Is All You Need to Breach Privacy: Smart Grid Forecasting Models as a Use Case - [2309.01523] [QA].
  • Unified Pre-training with Pseudo Texts for Text-To-Image Person Re-identification - [2309.01420] [QA].
  • Memory augment is All You Need for image restoration - [2309.01377] [QA].
  • EMR-MSF: Self-Supervised Recurrent Monocular Scene Flow Exploiting Ego-Motion Rigidity - [2309.01296] [QA].
  • SOAR: Scene-debiasing Open-set Action Recognition - [2309.01265] [QA].
  • Towards Generic Image Manipulation Detection with Weakly-Supervised Self-Consistency Learning - [2309.01246] [QA].
  • LoGoPrompt: Synthetic Text Images Can Be Good Visual Prompts for Vision-Language Models - [2309.01155] [QA].
  • EdaDet: Open-Vocabulary Object Detection Using Early Dense Alignment - [2309.01151] [QA].
  • Attention Where It Matters: Rethinking Visual Document Understanding with Selective Region Concentration - [2309.01131] [QA].
  • CoTDet: Affordance Knowledge Prompting for Task Driven Object Detection - [2309.01093] [QA].
  • Chinese Text Recognition with A Pre-Trained CLIP-Like Model Through Image-IDS Aligning - [2309.01083] [QA].
  • ModelScope-Agent: Building Your Customizable Agent System with Open-source Large Language Models - [2309.00986] [QA].
  • eDKM: An Efficient and Accurate Train-time Weight Clustering for Large Language Models - [2309.00964] [QA].
  • Two-in-One Depth: Bridging the Gap Between Monocular and Binocular Self-supervised Depth Estimation - [2309.00933] [QA].
  • Domain Generalization via Balancing Training Difficulty and Model Capability - [2309.00844] [QA].
  • Few shot font generation via transferring similarity guided global style and quantization local style - [2309.00827] [QA].
  • Instability of the solitary waves for the Generalized Benjamin-Bona-Mahony Equation - [2309.0791] [QA].
  • Contrastive Feature Masking Open-Vocabulary Vision Transformer - [2309.00775] [QA].
  • Learning Shared Safety Constraints from Multi-task Demonstrations - [2309.00711] [QA].
  • Searching for a Leptophilic Z' and a 3-3-1 symmetry at CLIC - [2309.0681] [QA].
  • Point-Bind & Point-LLM: Aligning Point Cloud with Multi-modality for 3D Understanding, Generation, and Instruction Following - [2309.00615] [QA].
  • CityDreamer: Compositional Generative Model of Unbounded 3D Cities - [2309.00610] [QA].
  • Rieger, Schwabe, Suess-de Vries: The Sunny Beats of Resonance - [2309.0666] [QA].
  • VideoGen: A Reference-Guided Latent Diffusion Approach for High Definition Text-to-Video Generation - [2309.00398] [QA].
  • FederatedScope-LLM: A Comprehensive Package for Fine-tuning Large Language Models in Federated Learning - [2309.00363] [QA].
  • Large Content And Behavior Models To Understand, Simulate, And Optimize Content And Behavior - [2309.00359] [QA].
  • RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback - [2309.00267] [QA].
  • A Massively Parallel Dynamic Programming for Approximate Rectangle Escape Problem - [2309.0242] [QA].
  • Object-Centric Multiple Object Tracking - [2309.00233] [QA].
  • Human-Inspired Facial Sketch Synthesis with Dynamic Adaptation - [2309.00216] [QA].
  • Pseudo-magnetic fields in square lattices - [2309.0212] [QA].
  • Empirical Modeling of Variance in Medium Frequency R-Mode Time-of-Arrival Measurements - [2309.0202] [QA].

August 2023

  • Block occurrences in the binary expansion - [2309.0142] [QA].
  • YaRN: Efficient Context Window Extension of Large Language Models - [2309.00071] [QA].
  • SoDaCam: Software-defined Cameras via Single-Photon Imaging - [2309.00066] [QA].
  • FACET: Fairness in Computer Vision Evaluation Benchmark - [2309.00035] [QA].
  • PointLLM: Empowering Large Language Models to Understand Point Clouds - [2308.16911] [QA].
  • StyleInV: A Temporal Style Modulated Inversion Network for Unconditional Video Generation - [2308.16909] [QA].
  • InterDiff: Generating 3D Human-Object Interactions with Physics-Informed Diffusion - [2308.16905] [QA].
  • Transformers as Support Vector Machines - [2308.16898] [QA].
  • EMDB: The Electromagnetic Database of Global 3D Human Pose and Shape in the Wild - [2308.16894] [QA].
  • GNFactor: Multi-Task Real Robot Learning with Generalizable Neural Feature Fields - [2308.16891] [QA].
  • TouchStone: Evaluating Vision-Language Models by Language Models - [2308.16890] [QA].
  • The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants - [2308.16884] [QA].
  • SportsSloMo: A New Benchmark and Baselines for Human-centric Video Frame Interpolation - [2308.16876] [QA].
  • Coarse-to-Fine Amodal Segmentation with Shape Prior - [2308.16825] [QA].
  • Can Programming Languages Boost Each Other via Instruction Tuning? - [2308.16824] [QA].
  • Ref-Diff: Zero-shot Referring Image Segmentation with Generative Models - [2308.16777] [QA].
  • Towards High-Fidelity Text-Guided 3D Face Generation and Manipulation Using only Images - [2308.16758] [QA].
  • Parsing is All You Need for Accurate Gait Recognition in the Wild - [2308.16739] [QA].
  • ViLTA: Enhancing Vision-Language Pre-training through Textual Augmentation - [2308.16689] [QA].
  • Any-Size-Diffusion: Toward Efficient Text-Driven Synthesis for Any-Size HD Images - [2308.16582] [QA].
  • MVDream: Multi-view Diffusion for 3D Generation - [2308.16512] [QA].
  • Recommender AI Agent: Integrating Large Language Models for Interactive Recommendations - [2308.16505] [QA].
  • PivotNet: Vectorized Pivot Learning for End-to-end HD Map Construction - [2308.16477] [QA].
  • Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models - [2308.16463] [QA].
  • Improving Lens Flare Removal with General Purpose Pipeline and Multiple Light Sources Recovery - [2308.16460] [QA].
  • BioCoder: A Benchmark for Bioinformatics Code Generation with Contextual Pragmatic Knowledge - [2308.16458] [QA].
  • Adversarial Finetuning with Latent Representation Constraint to Mitigate Accuracy-Robustness Tradeoff - [2308.16454] [QA].
  • Emergence of Segmentation with Minimalistic White-Box Transformers - [2308.16271] [QA].
  • Active Neural Mapping - [2308.16246] [QA].
  • Learning Vision-based Pursuit-Evasion Robot Policies - [2308.16185] [QA].
  • SAM-Med2D - [2308.16184] [QA].
  • MMVP: Motion-Matrix-based Video Prediction - [2308.16154] [QA].
  • LM-Infinite: Simple On-the-Fly Length Generalization for Large Language Models - [2308.16137] [QA].
  • Response: Emergent analogical reasoning in large language models - [2308.16118] [QA].
  • Learned Image Reasoning Prior Penetrates Deep Unfolding Network for Panchromatic and Multi-Spectral Image Fusion - [2308.16083] [QA].
  • RoboTAP: Tracking Arbitrary Points for Few-Shot Visual Imitation - [2308.15975] [QA].
  • WALL-E: Embodied Robotic WAiter Load Lifting with Large Language Model - [2308.15962] [QA].
  • LLaSM: Large Language and Speech Model - [2308.15930] [QA].
  • Reconstructing Groups of People with Hypergraph Relational Reasoning - [2308.15844] [QA].
  • Introducing Language Guidance in Prompt-based Continual Learning - [2308.15827] [QA].
  • WeatherBench 2: A benchmark for the next generation of data-driven global weather models - [2308.15560] [QA].
  • Canonical Factors for Hybrid Neural Fields - [2308.15461] [QA].
  • Shatter and Gather: Learning Referring Image Segmentation with Text Supervision - [2308.15512] [QA].
  • Efficient Model Personalization in Federated Learning via Client-Specific Prompt Generation - [2308.15367] [QA].
  • CLIPTrans: Transferring Visual Knowledge with Pre-trained Models for Multimodal Machine Translation - [2308.15226] [QA].
  • Evaluation and Analysis of Hallucination in Large Vision-Language Models - [2308.15126] [QA].
  • Learning to Upsample by Learning to Sample - [2308.15085] [QA].
  • Class Prior-Free Positive-Unlabeled Learning with Taylor Variational Loss for Hyperspectral Remote Sensing Imagery - [2308.15081] [QA].
  • Exploring Model Transferability through the Lens of Potential Energy - [2308.15074] [QA].
  • Pose-Free Neural Radiance Fields via Implicit Pose Regularization - [2308.15049] [QA].
  • Recursively Summarizing Enables Long-Term Dialogue Memory in Large Language Models - [2308.15022] [QA].
  • Vision Grid Transformer for Document Layout Analysis - [2308.14978] [QA].
  • LLM-Based Human-Robot Collaboration Framework for Manipulation Tasks - [2308.14972] [QA].
  • Vector Search with OpenAI Embeddings: Lucene Is All You Need - [2308.14963] [QA].
  • Read-only Prompt Optimization for Vision-Language Few-shot Learning - [2308.14960] [QA].
  • NSF: Neural Surface Fields for Human Modeling from Monocular Depth - [2308.14847] [QA].
  • CLNeRF: Continual Learning Meets NeRF - [2308.14816] [QA].
  • Efficient Discovery and Effective Evaluation of Visual Perceptual Similarity: A Benchmark and Beyond - [2308.14753] [QA].
  • AI Deception: A Survey of Examples, Risks, and Potential Solutions - [2308.14752] [QA].
  • R3D3: Dense 3D Reconstruction of Dynamic Scenes from Multiple Cameras - [2308.14713] [QA].
  • S-TREK: Sequential Translation and Rotation Equivariant Keypoints for local feature extraction - [2308.14598] [QA].
  • Referring Image Segmentation Using Text Supervision - [2308.14575] [QA].
  • LAC: Latent Action Composition for Skeleton-based Action Segmentation - [2308.14500] [QA].
  • Priority-Centric Human Motion Generation in Discrete Latent Space - [2308.14480] [QA].
  • Multi-Modal Neural Radiance Field for Monocular Dense SLAM with a Light-Weight ToF Sensor - [2308.14383] [QA].
  • ZhuJiu: A Multi-dimensional, Multi-faceted Chinese Benchmark for Large Language Models - [2308.14353] [QA].
  • DISC-MedLLM: Bridging General Large Language Models and Real-World Medical Consultation - [2308.14346] [QA].
  • Bridging Cross-task Protocol Inconsistency for Distillation in Dense Object Detection - [2308.14286] [QA].
  • HoloFusion: Towards Photo-realistic 3D Generative Modeling - [2308.14244] [QA].
  • High-Resolution Document Shadow Removal via A Large-Scale Real-World Dataset and A Frequency-Aware Shadow Erasing Net - [2308.14221] [QA].
  • Sparse Sampling Transformer with Uncertainty-Driven Ranking for Unified Removal of Raindrops and Rain Streaks - [2308.14153] [QA].
  • Unaligned 2D to 3D Translation with Conditional Vector-Quantized Code Diffusion using Transformers - [2308.14152] [QA].
  • Semi-Supervised Learning in the Few-Shot Zero-Shot Scenario - [2308.14119] [QA].
  • MedAlign: A Clinician-Generated Dataset for Instruction Following with Electronic Medical Records - [2308.14089] [QA].
  • 4D Myocardium Reconstruction with Decoupled Motion and Shape Model - [2308.14083] [QA].
  • Reconstructing Interacting Hands with Interaction Prior from Monocular Images - [2308.14082] [QA].
  • Nonrigid Object Contact Estimation With Regional Unwrapping Transformer - [2308.14074] [QA].
  • Hierarchical Contrastive Learning for Pattern-Generalizable Image Corruption Detection - [2308.14061] [QA].
  • Domain-Specificity Inducing Transformers for Source-Free Domain Adaptation - [2308.14023] [QA].
  • Calibrating Panoramic Depth Estimation for Practical Localization and Mapping - [2308.14005] [QA].
  • LDL: Line Distance Functions for Panoramic Localization - [2308.13989] [QA].
  • Prior-guided Source-free Domain Adaptation for Human Pose Estimation - [2308.13954] [QA].
  • Late Stopping: Avoiding Confidently Learning from Mislabeled Examples - [2308.13862] [QA].
  • Beyond One-to-One: Rethinking the Referring Image Segmentation - [2308.13853] [QA].
  • Point-Query Quadtree for Crowd Counting, Localization, and More - [2308.13814] [QA].
  • ORES: Open-vocabulary Responsible Visual Synthesis - [2308.13785] [QA].
  • Generalized Lightness Adaptation with Channel Selective Normalization - [2308.13783] [QA].
  • MST-compression: Compressing and Accelerating Binary Neural Networks with Minimum Spanning Tree - [2308.13735] [QA].
  • ISR-LLM: Iterative Self-Refined Large Language Model for Long-Horizon Sequential Task Planning - [2308.13724] [QA].
  • Joint Modeling of Feature, Correspondence, and a Compressed Memory for Video Object Segmentation - [2308.13505] [QA].
  • A2Q: Accumulator-Aware Quantization with Guaranteed Overflow Avoidance - [2308.13504] [QA].
  • Eventful Transformers: Leveraging Temporal Redundancy in Vision Transformers - [2308.13494] [QA].
  • Leveraging Knowledge and Reinforcement Learning for Enhanced Reliability of Language Models - [2308.13467] [QA].
  • Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models - [2308.13437] [QA].
  • Nougat: Neural Optical Understanding for Academic Documents - [2308.13418] [QA].
  • SoTaNa: The Open-Source Software Development Assistant - [2308.13416] [QA].
  • Harvard Glaucoma Detection and Progression: A Multimodal Multitask Dataset and Generalization-Reinforced Semi-Supervised Learning - [2308.13411] [QA].
  • Relighting Neural Radiance Fields with Shadow and Highlight Hints - [2308.13404] [QA].
  • Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs - [2308.13387] [QA].
  • Distribution-Aligned Diffusion for Human Mesh Recovery - [2308.13369] [QA].
  • ConSlide: Asynchronous Hierarchical Interaction Transformer with Breakup-Reorganize Rehearsal for Continual Whole Slide Image Analysis - [2308.13324] [QA].
  • SVQNet: Sparse Voxel-Adjacent Query Network for 4D Spatio-Temporal LiDAR Semantic Segmentation - [2308.13323] [QA].
  • A Game of Bundle Adjustment -- Learning Efficient Convergence - [2308.13270] [QA].
  • Integrating Boxes and Masks: A Multi-Object Framework for Unified Visual Tracking and Segmentation - [2308.13266] [QA].
  • Unpaired Multi-domain Attribute Translation of 3D Facial Shapes with a Square and Symmetric Geometric Map - [2308.13245] [QA].
  • Black-box Unsupervised Domain Adaptation with Bi-directional Atkinson-Shiffrin Memory - [2308.13236] [QA].
  • ReST: A Reconfigurable Spatial-Temporal Graph Model for Multi-Camera Multi-Object Tracking - [2308.13229] [QA].
  • MultiCapCLIP: Auto-Encoding Prompts for Zero-Shot Multilingual Visual Captioning - [2308.13218] [QA].
  • IOMatch: Simplifying Open-Set Semi-Supervised Learning with Joint Inliers and Outliers Utilization - [2308.13168] [QA].
  • Diff-Retinex: Rethinking Low-light Image Enhancement with A Generative Diffusion Model - [2308.13164] [QA].
  • SciEval: A Multi-Level Large Language Model Evaluation Benchmark for Scientific Research - [2308.13149] [QA].
  • OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models - [2308.13137] [QA].
  • MLLM-DataEngine: An Iterative Refinement Approach for MLLM - [2308.13566] [QA].
  • Preserving Modality Structure Improves Multi-Modal Learning - [2308.13077] [QA].
  • NeO 360: Neural Fields for Sparse View Synthesis of Outdoor Scenes - [2308.12967] [QA].
  • Scenimefy: Learning to Craft Anime Scene via Semi-Supervised Image-to-Image Translation - [2308.12968] [QA].
  • Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities - [2308.12966] [QA].
  • Dense Text-to-Image Generation with Attention Modulation - [2308.12964] [QA].
  • MapPrior: Bird's-Eye View Map Layout Estimation with Generative Models - [2308.12963] [QA].
  • Motion-Guided Masking for Spatiotemporal Representation Learning - [2308.12962] [QA].
  • Towards Realistic Zero-Shot Classification via Self Structural Semantic Alignment - [2308.12960] [QA].
  • Code Llama: Open Foundation Models for Code - [2308.12950] [QA].
  • Can Linguistic Knowledge Improve Multimodal Alignment in Vision-Language Pretraining? - [2308.12898] [QA].
  • Boosting Semantic Segmentation from the Perspective of Explicit Class Embeddings - [2308.12894] [QA].
  • ToonTalker: Cross-Domain Face Reenactment - [2308.12866] [QA].
  • Fast Adversarial Training with Smooth Convergence - [2308.12857] [QA].
  • On Offline Evaluation of 3D Object Detection for Autonomous Driving - [2308.12779] [QA].
  • LISTER: Neighbor Decoding for Length-Insensitive Scene Text Recognition - [2308.12774] [QA].
  • VIGC: Visual Instruction Generation and Correction - [2308.12714] [QA].
  • A Parse-Then-Place Approach for Generating Graphic Layouts from Textual Descriptions - [2308.12700] [QA].
  • PromptMRG: Diagnosis-Driven Prompts for Medical Report Generation - [2308.12604] [QA].
  • Logic-induced Diagnostic Reasoning for Semi-supervised Semantic Segmentation - [2308.12595] [QA].
  • Self-supervised Learning of Implicit Shape Representation with Dense Correspondence for Deformable Objects - [2308.12590] [QA].
  • Grounded Entity-Landmark Adaptive Pre-training for Vision-and-Language Navigation - [2308.12587] [QA].
  • Hyperbolic Audio-visual Zero-shot Learning - [2308.12558] [QA].
  • Synchronize Feature Extracting and Matching: A Single Branch Framework for 3D Object Tracking - [2308.12549] [QA].
  • CALM: A Multi-task Benchmark for Comprehensive Assessment of Language Model Bias - [2308.12539] [QA].
  • Masked Autoencoders are Efficient Class Incremental Learners - [2308.12510] [QA].
  • CGMI: Configurable General Multi-Agent Interaction Framework - [2308.12503] [QA].
  • With a Little Help from your own Past: Prototypical Memory Networks for Image Captioning - [2308.12383] [QA].
  • Vision Transformer Adapters for Generalizable Multitask Learning - [2308.12372] [QA].
  • AdVerb: Visually Guided Audio Dereverberation - [2308.12370] [QA].
  • Continual Zero-Shot Learning through Semantically Guided Generative Random Walks - [2308.12366] [QA].
  • Diffusion-based Image Translation with Label Guidance for Domain Adaptive Semantic Segmentation - [2308.12350] [QA].
  • Improving Generative Model-based Unfolding with Schrödinger Bridges - [2308.12351] [QA].
  • CHORUS: Learning Canonicalized 3D Human-Object Spatial Relations from Unbounded Synthesized Images - [2308.12288] [QA].
  • Simple is Better and Large is Not Enough: Towards Ensembling of Foundational Language Models - [2308.12272] [QA].
  • Diffusion Language Models Can Perform Many Tasks with Scaling and Instruction-Finetuning - [2308.12219] [QA].
  • SG-Former: Self-guided Transformer with Evolving Token Reallocation - [2308.12216] [QA].
  • CLIPN for Zero-Shot OOD Detection: Teaching CLIP to Say No - [2308.12213] [QA].
  • Curriculum Learning with Adam: The Devil Is in the Wrong Details - [2308.12202] [QA].
  • Sign Language Translation with Iterative Prototype - [2308.12191] [QA].
  • SILT: Shadow-aware Iterative Label Tuning for Learning to Detect Shadows from Noisy Labels - [2308.12064] [QA].
  • DR-Tune: Improving Fine-tuning of Pretrained Visual Models by Distribution Regularization with Semantic Calibration - [2308.12058] [QA].
  • Aligning Language Models with Offline Reinforcement Learning from Human Feedback - [2308.12050] [QA].
  • Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages - [2308.12038] [QA].
  • RefEgo: Referring Expression Comprehension Dataset from First-Person Perception of Ego4D - [2308.12035] [QA].
  • From Instructions to Intrinsic Human Values -- A Survey of Alignment Goals for Big Models - [2308.12014] [QA].
  • RankMixup: Ranking-Based Mixup Training for Network Calibration - [2308.11990] [QA].
  • Blending-NeRF: Text-Driven Localized Editing in Neural Radiance Fields - [2308.11974] [QA].
  • EVE: Efficient Vision-Language Pre-training with Masked Prediction and Modality-Aware MoE - [2308.11971] [QA].
  • OFVL-MS: Once for Visual Localization across Multiple Indoor Scenes - [2308.11928] [QA].
  • Recovering a Molecule's 3D Dynamics from Liquid-phase Electron Microscopy Movies - [2308.11927] [QA].
  • LFS-GAN: Lifelong Few-Shot Image Generation - [2308.11917] [QA].
  • Semantic-Aware Implicit Template Learning via Part Deformation Consistency - [2308.11916] [QA].
  • ACLS: Adaptive and Conditional Label Smoothing for Network Calibration - [2308.11911] [QA].
  • Camera-Driven Representation Learning for Unsupervised Domain Adaptive Person Re-identification - [2308.11901] [QA].
  • Does Physical Adversarial Example Really Matter to Autonomous Driving? Towards System-Level Effect of Adversarial Object Evasion Attack - [2308.11894] [QA].
  • SUMMIT: Source-Free Adaptation of Uni-Modal Models to Multi-Modal Targets - [2308.11880] [QA].
  • Semi-Supervised Learning via Weight-aware Distillation under Class Distribution Mismatch - [2308.11874] [QA].
  • Time Does Tell: Self-Supervised Time-Tuning of Dense Image Representations - [2308.11796] [QA].
  • Enhancing NeRF akin to Enhancing LLMs: Generalizable NeRF Transformer with Mixture-of-View-Experts - [2308.11793] [QA].
  • Understanding Hessian Alignment for Domain Generalization - [2308.11778] [QA].
  • Efficient Controllable Multi-Task Architectures - [2308.11744] [QA].
  • Animal3D: A Comprehensive Dataset of 3D Animal Pose and Shape - [2308.11737] [QA].
  • Efficient Benchmarking (of Language Models) - [2308.11696] [QA].
  • Delving into Motion-Aware Matching for Monocular 3D Object Tracking - [2308.11607] [QA].
  • StoryBench: A Multifaceted Benchmark for Continuous Story Visualization - [2308.11606] [QA].
  • SPANet: Frequency-balancing Token Mixer using Spectral Pooling Aggregation Modulation - [2308.11568] [QA].
  • Multi-event Video-Text Retrieval - [2308.11551] [QA].
  • TrackFlow: Multi-Object Tracking with Normalizing Flows - [2308.11513] [QA].
  • Learning from Semantic Alignment between Unpaired Multiviews for Egocentric Video Recognition - [2308.11489] [QA].
  • Learning a More Continuous Zero Level Set in Unsigned Distance Fields through Level Set Projection - [2308.11441] [QA].
  • A Survey on Large Language Model based Autonomous Agents - [2308.11432] [QA].
  • ScanNet++: A High-Fidelity Dataset of 3D Indoor Scenes - [2308.11417] [QA].
  • How Much Temporal Long-Term Context is Needed for Action Segmentation? - [2308.11358] [QA].
  • Exemplar-Free Continual Transformer with Convolutions - [2308.11357] [QA].
  • ProAgent: Building Proactive Cooperative AI with Large Language Models - [2308.11339] [QA].
  • GrowCLIP: Data-aware Automatic Model Growing for Large-scale Contrastive Language-Image Pre-training - [2308.11331] [QA].
  • CiteTracker: Correlating Image and Text for Visual Tracking - [2308.11322] [QA].
  • CNN based Cuneiform Sign Detection Learned from Annotated 3D Renderings and Mapped Photographs with Illumination Augmentation - [2308.11277] [QA].
  • HMD-NeMo: Online 3D Avatar Motion Generation From Sparse Observations - [2308.11261] [QA].
  • ROSGPT_Vision: Commanding Robots Using Only Language Models' Prompts - [2308.11236] [QA].
  • LDP-Feat: Image Features with Local Differential Privacy - [2308.11223] [QA].
  • DiffCloth: Diffusion Based Garment Synthesis and Manipulation via Structural Cross-modal Semantic Alignment - [2308.11206] [QA].
  • ViLLA: Fine-Grained Vision-Language Representation Learning from Real-World Data - [2308.11194] [QA].
  • Knowledge-Aware Prompt Tuning for Generalizable Vision-Language Models - [2308.11186] [QA].
  • MEGA: Multimodal Alignment Aggregation and Distillation For Cinematic Video Segmentation - [2308.11185] [QA].
  • ReFit: Recurrent Fitting Network for 3D Human Recovery - [2308.11184] [QA].
  • Hierarchical Point-based Active Learning for Semi-supervised Point Cloud Semantic Segmentation - [2308.11166] [QA].
  • Domain Generalization via Rationale Invariance - [2308.11158] [QA].
  • Efficient View Synthesis with Neural Radiance Distribution Field - [2308.11130] [QA].
  • LAN-HDR: Luminance-based Alignment Network for High Dynamic Range Video Reconstruction - [2308.11116] [QA].
  • CAME: Contrastive Automated Model Evaluation - [2308.11111] [QA].
  • Recursive Video Lane Detection - [2308.11106] [QA].
  • MosaiQ: Quantum Generative Adversarial Networks for Image Generation on NISQ Computers - [2308.11096] [QA].
  • Video OWL-ViT: Temporally-consistent open-world localization in video - [2308.11093] [QA].
  • Audio-Visual Class-Incremental Learning - [2308.11073] [QA].
  • TeD-SPAD: Temporal Distinctiveness for Self-supervised Privacy-preservation for video Anomaly Detection - [2308.11072] [QA].
  • Neural Amortized Inference for Nested Multi-agent Reasoning - [2308.11071] [QA].
  • MetaGCD: Learning to Continually Learn in Generalized Category Discovery - [2308.11063] [QA].
  • UnLoc: A Unified Framework for Video Localization Tasks - [2308.11062] [QA].
  • Coordinate Quantized Neural Implicit Representations for Multi-view Reconstruction - [2308.11025] [QA].
  • Spectral Graphormer: Spectral Graph-based Transformer for Egocentric Two-Hand Reconstruction using Multi-View Color Images - [2308.11015] [QA].
  • Few-Shot Physically-Aware Articulated Mesh Generation via Hierarchical Deformation - [2308.10898] [QA].
  • Can Language Models Learn to Listen? - [2308.10897] [QA].
  • EigenPlaces: Training Viewpoint Robust Models for Visual Place Recognition - [2308.10832] [QA].
  • Pixel Adaptive Deep Unfolding Transformer for Hyperspectral Image Reconstruction - [2308.10820] [QA].
  • Jumping through Local Minima: Quantization in the Loss Landscape of Vision Transformers - [2308.10814] [QA].
  • Improving Continuous Sign Language Recognition with Cross-Lingual Signs - [2308.10809] [QA].
  • MGMAE: Motion Guided Masking for Video Masked Autoencoding - [2308.10794] [QA].
  • Instruction Tuning for Large Language Models: A Survey - [2308.10792] [QA].
  • WanJuan: A Comprehensive Multimodal Dataset for Advancing English and Chinese Large Models - [2308.10755] [QA].
  • On the Adversarial Robustness of Multi-Modal Foundation Models - [2308.10741] [QA].
  • Patch Is Not All You Need - [2308.10729] [QA].
  • Vanishing Point Estimation in Uncalibrated Images with Prior Gravity Direction - [2308.10694] [QA].
  • Learning Clothing and Pose Invariant 3D Shape Representation for Long-Term Person Re-Identification - [2308.10658] [QA].
  • GaitPT: Skeletons Are All You Need For Gait Recognition - [2308.10623] [QA].
  • A step towards understanding why classification helps regression - [2308.10603] [QA].
  • Image-free Classifier Injection for Zero-Shot Classification - [2308.10599] [QA].
  • CHORD: Category-level Hand-held Object Reconstruction via Shape Deformation - [2308.10574] [QA].
  • Self-Feedback DETR for Temporal Action Detection - [2308.10570] [QA].
  • Improving Diversity in Zero-Shot GAN Adaptation with Semantic Variations - [2308.10554] [QA].
  • QD-BEV: Quantization-aware View-guided Distillation for Multi-view 3D Object Detection - [2308.10515] [QA].
  • Large Language Model as a User Simulator - [2308.11534] [QA].
  • Texture Generation on 3D Meshes with Point-UV Diffusion - [2308.10490] [QA].
  • ADNet: Lane Shape Prediction via Anchor Decomposition - [2308.10481] [QA].
  • STEERER: Resolving Scale Variations for Counting and Localization via Selective Inheritance Learning - [2308.10468] [QA].
  • Exploring Parameter-Efficient Fine-Tuning Techniques for Code Generation with Large Language Models - [2308.10462] [QA].
  • Privacy-Preserving Face Recognition Using Random Frequency Components - [2308.10461] [QA].
  • Explore and Tell: Embodied Visual Captioning in 3D Environments - [2308.10447] [QA].
  • When Prompt-based Incremental Learning Does Not Meet Strong Pretraining - [2308.10445] [QA].
  • X-VoE: Measuring eXplanatory Violation of Expectation in Physical Events - [2308.10441] [QA].
  • GPT-in-the-Loop: Adaptive Decision-Making for Multiagent Systems - [2308.10435] [QA].
  • Diffusion Model as Representation Learner - [2308.10916] [QA].
  • Simple Baselines for Interactive Video Retrieval with Questions and Answers - [2308.10402] [QA].
  • FairBench: A Four-Stage Automatic Framework for Detecting Stereotypes and Biases in Large Language Models - [2308.10397] [QA].
  • Algorithm of Thoughts: Enhancing Exploration of Ideas in Large Language Models - [2308.10379] [QA].
  • LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models - [2308.11462] [QA].
  • Strata-NeRF: Neural Radiance Fields for Stratified Scenes - [2308.10337] [QA].
  • Coordinate Transformer: Achieving Single-stage Multi-person Mesh Recovery from Videos - [2308.10334] [QA].
  • Improving Adversarial Robustness of Masked Autoencoders via Test-time Frequency-domain Prompting - [2308.10315] [QA].
  • DVGaze: Dual-View Gaze Estimation - [2308.10310] [QA].
  • Representation Disparity-aware Distillation for 3D Object Detection - [2308.10308] [QA].
  • Omnidirectional Information Gathering for Knowledge Transfer-based Audio-Visual Navigation - [2308.10306] [QA].
  • Co-Evolution of Pose and Mesh for 3D Human Body Estimation from Video - [2308.10305] [QA].
  • DomainAdaptor: A Novel Approach to Test-time Adaptation - [2308.10297] [QA].
  • DomainDrop: Suppressing Domain-Sensitive Channels for Domain Generalization - [2308.10285] [QA].
  • GPFL: Simultaneously Learning Global and Personalized Feature Information for Personalized Federated Learning - [2308.10279] [QA].
  • CharacterChat: Learning towards Conversational AI with Personalized Social Support - [2308.10278] [QA].
  • Minimalist Traffic Prediction: Linear Layer Is All You Need - [2308.10276] [QA].
  • StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data - [2308.10253] [QA].
  • GeT: Generative Target Structure Debiasing for Domain Adaptation - [2308.10205] [QA].
  • ChatEDA: A Large Language Model Powered Autonomous Agent for EDA - [2308.10204] [QA].
  • ViT-Lens: Towards Omni-modal Representations - [2308.10185] [QA].
  • Neural Interactive Keypoint Detection - [2308.10174] [QA].
  • VLN-PETL: Parameter-Efficient Transfer Learning for Vision-and-Language Navigation - [2308.10172] [QA].
  • FashionNTM: Multi-turn Fashion Image Retrieval via Cascaded Memory - [2308.10170] [QA].
  • Unilaterally Aggregated Contrastive Learning with Hierarchical Augmentation for Anomaly Detection - [2308.10155] [QA].
  • A Survey on Fairness in Large Language Models - [2308.10149] [QA].
  • ESTextSpotter: Towards Better Scene Text Spotting with Explicit Synergy in Transformer - [2308.10147] [QA].
  • OCHID-Fi: Occlusion-Robust Hand Pose Estimation in 3D via RF-Vision - [2308.10146] [QA].
  • ExpeL: LLM Agents Are Experiential Learners - [2308.10144] [QA].
  • March in Chat: Interactive Prompting for Remote Embodied Referring Expression - [2308.10141] [QA].
  • AutoReP: Automatic ReLU Replacement for Fast Private Network Inference - [2308.10134] [QA].
  • TransFace: Calibrating Transformer Training for Face Recognition from a Data-Centric Perspective - [2308.10133] [QA].
  • 3D-Aware Neural Body Fitting for Occlusion Robust 3D Human Pose Estimation - [2308.10123] [QA].
  • HollowNeRF: Pruning Hashgrid-Based NeRFs with Trainable Collision Mitigation - [2308.10122] [QA].
  • Robust Mixture-of-Expert Training for Convolutional Neural Networks - [2308.10110] [QA].
  • Root Pose Decomposition Towards Generic Non-rigid 3D Reconstruction with Monocular Videos - [2308.10089] [QA].
  • GameEval: Evaluating LLMs on Conversational Games - [2308.10032] [QA].
  • Single Image Reflection Separation via Component Synergy - [2308.10027] [QA].
  • Pseudo Flow Consistency for Self-Supervised 6D Object Pose Estimation - [2308.10016] [QA].
  • Partition-and-Debias: Agnostic Biases Mitigation via A Mixture of Biases-Specific Experts - [2308.10005] [QA].
  • ClothesNet: An Information-Rich 3D Garment Model Repository with Simulated Clothes Environment - [2308.09987] [QA].
  • FinEval: A Chinese Financial Domain Knowledge Evaluation Benchmark for Large Language Models - [2308.09975] [QA].
  • Disposable Transfer Learning for Selective Source Task Unlearning - [2308.09971] [QA].
  • Tackling Vision Language Tasks Through Learning Inner Monologues - [2308.09970] [QA].
  • Semantics Meets Temporal Correspondence: Self-supervised Object-centric Learning in Videos - [2308.09951] [QA].
  • Scene-Aware Feature Matching - [2308.09949] [QA].
  • Weakly-Supervised Action Localization by Hierarchically-structured Latent Attention Modeling - [2308.09946] [QA].
  • On the Robustness of Open-World Test-Time Training: Self-Training with Dynamic Prototype Expansion - [2308.09942] [QA].
  • Understanding Self-attention Mechanism via Dynamical System Perspective - [2308.09939] [QA].
  • BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions - [2308.09936] [QA].
  • MDCS: More Diverse Experts with Consistency Self-distillation for Long-tailed Recognition - [2308.09922] [QA].
  • VI-Net: Boosting Category-level 6D Object Pose Estimation via Learning Decoupled Rotations on the Spherical Representations - [2308.09916] [QA].
  • Scalable Video Object Segmentation with Simplified Framework - [2308.09903] [QA].
  • SwinLSTM: Improving Spatiotemporal Prediction Accuracy using Swin Transformer and LSTM - [2308.09891] [QA].
  • Calibrating Uncertainty for Semi-Supervised Crowd Counting - [2308.09887] [QA].
  • Forecast-MAE: Self-supervised Pre-training for Motion Forecasting with Masked Autoencoders - [2308.09882] [QA].
  • Skill Transformer: A Monolithic Policy for Mobile Manipulation - [2308.09873] [QA].
  • A Theory of Topological Derivatives for Inverse Rendering of Geometry - [2308.09865] [QA].
  • How susceptible are LLMs to Logical Fallacies? - [2308.09853] [QA].
  • Learning from A Single Graph is All You Need for Near-Shortest Path Routing in Wireless Networks - [2308.09829] [QA].
  • VL-PET: Vision-and-Language Parameter-Efficient Tuning via Granularity Control - [2308.09804] [QA].
  • Long-range Multimodal Pretraining for Movie Understanding - [2308.09775] [QA].
  • Smoothness Similarity Regularization for Few-Shot GAN Adaptation - [2308.09717] [QA].
  • Robust Monocular Depth Estimation under Challenging Conditions - [2308.09711] [QA].
  • Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment - [2308.09662] [QA].
  • Is context all you need? Scaling Neural Sign Language Translation to Large Domains of Discourse - [2308.09622] [QA].
  • LaRS: A Diverse Panoptic Maritime Obstacle Detection Dataset and Benchmark - [2308.09618] [QA].
  • ChatHaruhi: Reviving Anime Character in Reality via Large Language Model - [2308.09597] [QA].
  • StableVideo: Text-driven Consistency-aware Diffusion Video Editing - [2308.09592] [QA].
  • WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct - [2308.09583] [QA].
  • PUMGPT: A Large Vision-Language Model for Product Understanding - [2308.09568] [QA].
  • Normalization Is All You Need: Understanding Layer-Normalized Federated Learning under Extreme Label Shift - [2308.09565] [QA].
  • Deep Equilibrium Object Detection - [2308.09564] [QA].
  • Meta-ZSDETR: Zero-shot DETR with Meta-learning - [2308.09540] [QA].
  • Small Object Detection via Coarse-to-fine Proposal Generation and Imitation Learning - [2308.09534] [QA].
  • Leveraging Intrinsic Properties for Non-Rigid Garment Alignment - [2308.09519] [QA].
  • ResQ: Residual Quantization for Video Perception - [2308.09511] [QA].
  • Vision Relation Transformer for Unbiased Scene Graph Generation - [2308.09472] [QA].
  • Scope is all you need: Transforming LLMs for HPC Code - [2308.09440] [QA].
  • MonoNeRD: NeRF-like Representations for Monocular 3D Object Detection - [2308.09421] [QA].
  • Generalizable Decision Boundaries: Dualistic Meta-Learning for Open Set Domain Generalization - [2308.09391] [QA].
  • DReg-NeRF: Deep Registration for Neural Radiance Fields - [2308.09386] [QA].
  • Label-Free Event-based Object Recognition via Joint Learning with Image Reconstruction from Events - [2308.09383] [QA].
  • Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models - [2308.09363] [QA].
  • RLIPv2: Fast Scaling of Relational Language-Image Pre-training - [2308.09351] [QA].
  • Boosting Few-shot Action Recognition with Graph-guided Hybrid Matching - [2308.09346] [QA].
  • Audio-Visual Glance Network for Efficient Video Recognition - [2308.09322] [QA].
  • Towards Attack-tolerant Federated Learning via Critical Parameter Analysis - [2308.09318] [QA].
  • Retro-FPN: Retrospective Feature Pyramid Network for Point Cloud Semantic Segmentation - [2308.09314] [QA].
  • Lip Reading for Low-resource Languages by Learning and Combining General Speech Knowledge and Language-specific Knowledge - [2308.09311] [QA].
  • DiffDis: Empowering Generative Diffusion Model with Cross-Modal Discrimination Capability - [2308.09306] [QA].
  • Human Part-wise 3D Motion Context Learning for Sign Language Recognition - [2308.09305] [QA].
  • NAPA-VQ: Neighborhood Aware Prototype Augmentation with Vector Quantization for Continual Learning - [2308.09297] [QA].
  • Self-Calibrated Cross Attention Network for Few-Shot Segmentation - [2308.09294] [QA].
  • Diverse Cotraining Makes Strong Semi-Supervised Segmentor - [2308.09281] [QA].
  • Point Contrastive Prediction with Semantic Clustering for Self-Supervised Learning on Point Cloud Videos - [2308.09247] [QA].
  • Masked Spatio-Temporal Structure Prediction for Self-supervised Learning on Point Cloud Videos - [2308.09245] [QA].
  • SparseBEV: High-Performance Sparse 3D Object Detection from Multi-Camera Videos - [2308.09244] [QA].
  • ASAG: Building Strong One-Decoder-Layer Sparse Detectors via Adaptive Sparse Anchor Generation - [2308.09242] [QA].
  • Generalized Sum Pooling for Metric Learning - [2308.09228] [QA].
  • FedPerfix: Towards Partial Model Personalization of Vision Transformers in Federated Learning - [2308.09160] [QA].
  • The Unreasonable Effectiveness of Large Language-Vision Models for Source-free Video Domain Adaptation - [2308.09139] [QA].
  • ImGeoNet: Image-induced Geometry-aware Voxel Representation for Multi-view 3D Object Detection - [2308.09098] [QA].
  • SimFIR: A Simple Framework for Fisheye Image Rectification with Self-supervised Representation Learning - [2308.09040] [QA].
  • Reinforced Self-Training (ReST) for Language Modeling - [2308.08998] [QA].
  • Auxiliary Tasks Benefit 3D Skeleton-based Human Motion Prediction - [2308.08942] [QA].
  • Identity-Seeking Self-Supervised Representation Learning for Generalizable Person Re-identification - [2308.08887] [QA].
  • Event-Guided Procedure Planning from Instructional Videos with Text Supervision - [2308.08885] [QA].
  • Towards Semi-supervised Learning with Non-random Missing Labels - [2308.08872] [QA].
  • Spatially and Spectrally Consistent Deep Functional Maps - [2308.08871] [QA].
  • D-IF: Uncertainty-aware Human Digitization via Implicit Distribution Field - [2308.08857] [QA].
  • Realistic Full-Body Tracking from Sparse Observations via Joint-Level Modeling - [2308.08855] [QA].
  • CMB: A Comprehensive Medical Benchmark in Chinese - [2308.08833] [QA].
  • Fast Inference and Update of Probabilistic Density Estimation on Trajectory Prediction - [2308.08824] [QA].
  • MixBag: Bag-Level Data Augmentation for Learning from Label Proportions - [2308.08822] [QA].
  • Label Shift Adapter for Test-Time Adaptation under Covariate and Label Shifts - [2308.08810] [QA].
  • Long-Range Grouping Transformer for Multi-View 3D Reconstruction - [2308.08724] [QA].
  • V-FUSE: Volumetric Depth Map Fusion with Long-Range Constraints - [2308.08715] [QA].
  • Dynamic Neural Network is All You Need: Understanding the Robustness of Dynamic Mechanisms in Neural Networks - [2308.08709] [QA].
  • TeCH: Text-guided Reconstruction of Lifelike Clothed Humans - [2308.08545] [QA].
  • MeViS: A Large-scale Benchmark for Video Segmentation with Motion Expressions - [2308.08544] [QA].
  • Learning to Distill Global Representation for Sparse-View CT - [2308.08463] [QA].
  • ALIP: Adaptive Language-Image Pre-training with Synthetic Caption - [2308.08428] [QA].
  • Tem-adapter: Adapting Image-Text Pretraining for Video Question Answer - [2308.08414] [QA].
  • SIGMA: Scale-Invariant Global Sparse Shape Matching - [2308.08393] [QA].
  • Agglomerative Transformer for Human-Object Interaction Detection - [2308.08370] [QA].
  • Membrane Potential Batch Normalization for Spiking Neural Networks - [2308.08359] [QA].
  • Stable and Causal Inference for Discriminative Self-supervised Deep Visual Representations - [2308.08321] [QA].
  • Dual-Stream Diffusion Net for Text-to-Video Generation - [2308.08316] [QA].
  • SceNeRFlow: Time-Consistent Reconstruction of General Dynamic Scenes - [2308.08258] [QA].
  • MemoChat: Tuning LLMs to Use Memos for Consistent Long-Range Open-Domain Conversation - [2308.08239] [QA].
  • Inherent Redundancy in Spiking Neural Networks - [2308.08227] [QA].
  • Low-Light Image Enhancement with Illumination-Aware Gamma Correction and Complete Image Modelling Network - [2308.08220] [QA].
  • Unsupervised Domain Adaptive Detection with Network Stability Analysis - [2308.08182] [QA].
  • Learning to Generate Semantic Layouts for Higher Text-Image Correspondence in Text-to-Image Synthesis - [2308.08157] [QA].
  • AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation Framework - [2308.08155] [QA].
  • GPA-3D: Geometry-aware Prototype Alignment for Unsupervised Domain Adaptive 3D Object Detection from Point Clouds - [2308.08140] [QA].
  • OmniZoomer: Learning to Move and Zoom in on Sphere at High-Resolution - [2308.08114] [QA].
  • View Consistent Purification for Accurate Cross-View Localization - [2308.08110] [QA].
  • Separate the Wheat from the Chaff: Model Deficiency Unlearning via Parameter-Efficient Module Operation - [2308.08090] [QA].
  • DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory - [2308.08089] [QA].
  • Shortcut-V2V: Compression Framework for Video-to-Video Translation based on Temporal Redundancy Reduction - [2308.08011] [QA].
  • Teach LLMs to Personalize -- An Approach inspired by Writing Education - [2308.07968] [QA].
  • CoDeF: Content Deformation Fields for Temporally Consistent Video Processing - [2308.07926] [QA].
  • RAVEN: In-Context Learning with Retrieval Augmented Encoder-Decoder Language Models - [2308.07922] [QA].
  • Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification - [2308.07921] [QA].
  • Helping Hands: An Object-Aware Ego-Centric Video Recognition Model - [2308.07918] [QA].
  • Relightable and Animatable Neural Avatar from Sparse-View Video - [2308.07903] [QA].
  • Through the Lens of Core Competency: Survey on Evaluation of Large Language Models - [2308.07902] [QA].
  • Memory-and-Anticipation Transformer for Online Action Understanding - [2308.07893] [QA].
  • Link-Context Learning for Multimodal LLMs - [2308.07891] [QA].
  • ObjectSDF++: Improved Object-Compositional Neural Implicit Surfaces - [2308.07868] [QA].
  • StyleDiffusion: Controllable Disentangled Style Transfer via Diffusion Models - [2308.07863] [QA].
  • Robustness Over Time: Understanding Adversarial Examples' Effectiveness on Longitudinal Versions of Large Language Models - [2308.07847] [QA].
  • ImbSAM: A Closer Look at Sharpness-Aware Minimization in Class-Imbalanced Recognition - [2308.07815] [QA].
  • Learning to Identify Critical States for Reinforcement Learning from Videos - [2308.07795] [QA].
  • DiffV2S: Diffusion-based Video-to-Speech Synthesis with Vision-guided Speaker Embedding - [2308.07787] [QA].
  • Identity-Consistent Aggregation for Video Object Detection - [2308.07737] [QA].
  • UniTR: A Unified and Efficient Multi-Modal Transformer for Bird's-Eye-View Representation - [2308.07732] [QA].
  • DiffGuard: Semantic Mismatch-Guided Out-of-Distribution Detection using Pre-trained Diffusion Models - [2308.07687] [QA].
  • Boosting Multi-modal Model Performance with Adaptive Gradient Modulation - [2308.07686] [QA].
  • Attention Is Not All You Need Anymore - [2308.07661] [QA].
  • From Commit Message Generation to History-Aware Commit Message Completion - [2308.07655] [QA].
  • EQ-Net: Elastic Quantization Neural Networks - [2308.07650] [QA].
  • Prompt Switch: Efficient CLIP Adaptation for Text-Video Retrieval - [2308.07648] [QA].
  • Backpropagation Path Search On Adversarial Transferability - [2308.07625] [QA].
  • Story Visualization by Online Text Augmentation with Context Memory - [2308.07575] [QA].
  • 3DHacker: Spectrum-based Decision Boundary Generation for Hard-label 3D Point Cloud Attack - [2308.07546] [QA].
  • DREAMWALKER: Mental Planning for Continuous Vision-Language Navigation - [2308.07498] [QA].
  • Exploring the Intersection of Large Language Models and Agent-Based Modeling via Prompt Engineering - [2308.07411] [QA].
  • Text Injection for Capitalization and Turn-Taking Prediction in Speech Models - [2308.07395] [QA].
  • PARIS: Part-level Reconstruction and Motion Analysis for Articulated Objects - [2308.07391] [QA].
  • Platypus: Quick, Cheap, and Powerful Refinement of LLMs - [2308.07317] [QA].
  • Jurassic World Remake: Bringing Ancient Fossils Back to Life via Zero-Shot Long Image-to-Image Translation - [2308.07316] [QA].
  • Group Pose: A Simple Baseline for End-to-End Multi-person Pose Estimation - [2308.07313] [QA].
  • The Devil is in the Errors: Leveraging Large Language Models for Fine-grained Machine Translation Evaluation - [2308.07286] [QA].
  • Context-Aware Planning and Environment-Aware Memory for Instruction Following Embodied Agents - [2308.07241] [QA].
  • RestoreFormer++: Towards Real-World Blind Face Restoration from Undegraded Key-Value Pairs - [2308.07228] [QA].
  • Unified Data-Free Compression: Pruning and Quantization without Fine-Tuning - [2308.07209] [QA].
  • ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate - [2308.07201] [QA].
  • OctoPack: Instruction Tuning Code Large Language Models - [2308.07124] [QA].
  • CTP: Towards Vision-Language Continual Pretraining via Compatible Momentum Contrast and Topology Preservation - [2308.07146] [QA].
  • Occ$^2$Net: Robust Image Matching Based on 3D Occupancy Estimation for Occluded Regions - [2308.16160] [QA].
  • Mind your Language (Model): Fact-Checking LLMs and their Role in NLP Research and Practice - [2308.07120] [QA].
  • Large Language Models for Information Retrieval: A Survey - [2308.07107] [QA].
  • Masked Motion Predictors are Strong 3D Action Representation Learners - [2308.07092] [QA].
  • S3IM: Stochastic Structural SIMilarity and Its Unreasonable Effectiveness for Neural Fields - [2308.07032] [QA].
  • ACTIVE: Towards Highly Transferable 3D Physical Camouflage for Universal and Robust Vehicle Evasion - [2308.07009] [QA].
  • Global Features are All You Need for Image Retrieval and Reranking - [2308.06954] [QA].
  • Knowing Where to Focus: Event-aware Transformer for Video Grounding - [2308.06947] [QA].
  • CBA: Improving Online Continual Learning via Continual Bias Adaptor - [2308.06925] [QA].
  • CausalLM is not optimal for in-context learning - [2308.06912] [QA].
  • Exploring Lightweight Hierarchical Vision Transformers for Efficient Visual Tracking - [2308.06904] [QA].
  • Towards Open-Set Test-Time Adaptation Utilizing the Wisdom of Crowds in Entropy Minimization - [2308.06879] [QA].
  • SpeechX: Neural Codec Language Model as a Versatile Speech Transformer - [2308.06873] [QA].
  • RMP-Loss: Regularizing Membrane Potential Distribution for Spiking Neural Networks - [2308.06787] [QA].
  • Shrinking Class Space for Enhanced Certainty in Semi-Supervised Learning - [2308.06777] [QA].
  • Unsupervised Image Denoising in Real-World Scenarios via Self-Collaboration Parallel Generative Adversarial Branches - [2308.06776] [QA].
  • Dual Meta-Learning with Longitudinally Generalized Regularization for One-Shot Brain Tissue Segmentation Across the Human Lifespan - [2308.06774] [QA].
  • AerialVLN: Vision-and-Language Navigation for UAVs - [2308.06735] [QA].
  • IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models - [2308.06721] [QA].
  • Compositional Feature Augmentation for Unbiased Scene Graph Generation - [2308.06712] [QA].
  • Camouflaged Image Synthesis Is All You Need to Boost Camouflaged Detection - [2308.06701] [QA].
  • Isomer: Isomerous Transformer for Zero-shot Video Object Segmentation - [2308.06693] [QA].
  • Estimator Meets Equilibrium Perspective: A Rectified Straight Through Estimator for Binary Neural Networks Training - [2308.06689] [QA].
  • 3DMOTFormer: Graph Transformer for Online 3D Multi-Object Tracking - [2308.06635] [QA].
  • VisIT-Bench: A Benchmark for Vision-Language Instruction Following Inspired by Real-World Use - [2308.06595] [QA].
  • Cyclic Test-Time Adaptation on Monocular Video for 3D Human Mesh Reconstruction - [2308.06554] [QA].
  • Revisiting Vision Transformer from the View of Path Ensemble - [2308.06548] [QA].
  • SegPrompt: Boosting Open-world Segmentation via Category-level Prompt Learning - [2308.06531] [QA].
  • BEV-DG: Cross-Modal Learning under Bird's-Eye View for Domain Generalization of 3D Semantic Segmentation - [2308.06530] [QA].
  • One-bit Flip is All You Need: When Bit-flip Attack Meets Model Training - [2308.07934] [QA].
  • Tiny and Efficient Model for the Edge Detection Generalization - [2308.06468] [QA].
  • Multi-Label Knowledge Distillation - [2308.06453] [QA].
  • Detecting and Preventing Hallucinations in Large Vision Language Models - [2308.06394] [QA].
  • U-RED: Unsupervised 3D Shape Retrieval and Deformation for Partial Point Clouds - [2308.06383] [QA].
  • Enhancing Network Management Using Code Generated by Large Language Models - [2308.06261] [QA].
  • Self-Alignment with Instruction Backtranslation - [2308.06259] [QA].
  • FunnyBirds: A Synthetic Vision Dataset for a Part-Based Analysis of Explainable AI Methods - [2308.06248] [QA].
  • Exploring Predicate Visual Context in Detecting of Human-Object Interactions - [2308.06202] [QA].
  • Improving Joint Speech-Text Representations Without Alignment - [2308.06125] [QA].
  • Composable Function-preserving Expansions for Transformer Architectures - [2308.06103] [QA].
  • Out-of-Distribution Detection for Monocular Depth Estimation - [2308.06072] [QA].
  • Diverse Data Augmentation with Diffusions for Effective Test-time Prompt Tuning - [2308.06038] [QA].
  • Enhancing Generalization of Universal Adversarial Perturbation through Gradient Aggregation - [2308.06015] [QA].
  • Cyclic-Bootstrap Labeling for Weakly Supervised Object Detection - [2308.05991] [QA].
  • TrajPAC: Towards Robustness Verification of Pedestrian Trajectory Prediction Models - [2308.05985] [QA].
  • BOLAA: Benchmarking and Orchestrating LLM-augmented Autonomous Agents - [2308.05960] [QA].
  • Generalizing Event-Based Motion Deblurring in Real-World Scenarios - [2308.05932] [QA].
  • Collaborative Tracking Learning for Frame-Rate-Insensitive Multi-Object Tracking - [2308.05911] [QA].
  • PIPPA: A Partially Synthetic Conversational Dataset - [2308.05884] [QA].
  • PlankAssembly: Robust 3D Reconstruction from Three Orthographic Views with Learnt Shape Programs - [2308.05744] [QA].
  • Follow Anything: Open-set detection, tracking, and following in real-time - [2308.05737] [QA].
  • AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining - [2308.05734] [QA].
  • FrozenRecon: Pose-free 3D Scene Reconstruction with Frozen Depth Models - [2308.05733] [QA].
  • PDE-Refiner: Achieving Accurate Long Rollouts with Neural PDE Solvers - [2308.05732] [QA].
  • Hard No-Box Adversarial Attack on Skeleton-Based Human Action Recognition with Skeleton-Motion-Informed Gradient - [2308.05681] [QA].
  • 2D3D-MATR: 2D-3D Matching Transformer for Detection-free Registration between Images and Point Clouds - [2308.05667] [QA].
  • Self-Supervised Monocular Depth Estimation by Direction-aware Cumulative Convolution Network - [2308.05605] [QA].
  • Cross-Domain Product Representation Learning for Rich-Content E-Commerce - [2308.05550] [QA].
  • Look at the Neighbor: Distortion-aware Unsupervised Domain Adaptation for Panoramic Semantic Segmentation - [2308.05493] [QA].
  • LLM As DBA - [2308.05481] [QA].
  • Benchmarking Algorithmic Bias in Face Recognition: An Experimental Approach Using Synthetic Faces and Human Evaluation - [2308.05441] [QA].
  • Deep Fusion Transformer Network with Weighted Vector-Wise Keypoints Voting for Robust 6D Object Pose Estimation - [2308.05438] [QA].
  • SC3K: Self-supervised and Coherent 3D Keypoints Estimation from Rotated, Noisy, and Decimated Point Cloud Data - [2308.05410] [QA].
  • Learning Gabor Texture Features for Fine-Grained Recognition - [2308.05396] [QA].
  • Enhancing Trust in LLM-Based AI Automation Agents: New Considerations and Future Challenges - [2308.05391] [QA].
  • Interaction-aware Joint Attention Estimation Using People Attributes - [2308.05382] [QA].
  • Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment - [2308.05374] [QA].
  • Flexible Isosurface Extraction for Gradient-Based Mesh Optimization - [2308.05371] [QA].
  • Pseudo-label Alignment for Semi-supervised Instance Segmentation - [2308.05359] [QA].
  • OpenProteinSet: Training data for structural biology at scale - [2308.05326] [QA].
  • RLSAC: Reinforcement Learning enhanced Sample Consensus for End-to-End Robust Estimation - [2308.05318] [QA].
  • Alexa, play with robot: Introducing the First Alexa Prize SimBot Challenge on Embodied AI - [2308.05221] [QA].
  • LayoutLLM-T2I: Eliciting Layout Guidance from LLM for Text-to-Image Generation - [2308.05095] [QA].
  • Feature Modulation Transformer: Cross-Refinement of Global Representation via High-Frequency Prior for Image Super-Resolution - [2308.05022] [QA].
  • Robust Object Modeling for Visual Tracking - [2308.05140] [QA].
  • IDiff-Face: Synthetic-based Face Recognition through Fizzy Identity-Conditioned Diffusion Models - [2308.04995] [QA].
  • Foreground Object Search by Distilling Composite Image Feature - [2308.04990] [QA].
  • Prototypical Kernel Learning and Open-set Foreground Perception for Generalized Few-shot Semantic Segmentation - [2308.04952] [QA].
  • SelectNAdapt: Support Set Selection for Few-Shot Domain Adaptation - [2308.04946] [QA].
  • LLMeBench: A Flexible Framework for Accelerating LLMs Benchmarking - [2308.04945] [QA].
  • Cross-view Semantic Alignment for Livestreaming Product Recognition - [2308.04912] [QA].
  • MixReorg: Cross-Modal Mixed Patch Reorganization is a Good Mask Learner for Open-World Semantic Segmentation - [2308.04829] [QA].
  • WaveNeRF: Wavelet-based Generalizable Neural Radiance Fields - [2308.04826] [QA].
  • Joint-Relation Transformer for Multi-Person Motion Prediction - [2308.04808] [QA].
  • PointMBF: A Multi-scale Bidirectional Fusion Network for Unsupervised RGB-D Point Cloud Registration - [2308.04782] [QA].
  • Objects do not disappear: Video object detection by single-frame object location anticipation - [2308.04770] [QA].
  • Bird's-Eye-View Scene Graph for Vision-Language Navigation - [2308.04758] [QA].
  • JEN-1: Text-Guided Universal Music Generation with Omnidirectional Diffusion Models - [2308.04729] [QA].
  • GIFD: A Generative Gradient Inversion Method with Feature Domain Optimization - [2308.04699] [QA].
  • Score Priors Guided Deep Variational Inference for Unsupervised Real-World Single Image Denoising - [2308.04682] [QA].
  • Accelerating LLM Inference with Staged Speculative Decoding - [2308.04623] [QA].
  • Rendering Humans from Object-Occluded Monocular Videos - [2308.04622] [QA].
  • Shepherd: A Critic for Language Model Generation - [2308.04592] [QA].
  • LATR: 3D Lane Detection from Monocular Images with Transformer - [2308.04583] [QA].
  • FocalFormer3D: Focusing on Hard Instance for 3D Object Detection - [2308.04556] [QA].
  • Prune Spatio-temporal Tokens by Semantic-aware Temporal Accumulation - [2308.04549] [QA].
  • SILO Language Models: Isolating Legal Risk In a Nonparametric Datastore - [2308.04430] [QA].
  • DELFlow: Dense Efficient Learning of Scene Flow for Large-Scale Point Clouds - [2308.04383] [QA].
  • 3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment - [2308.04352] [QA].
  • A Comparative Study of Code Generation using ChatGPT 3.5 across 10 Programming Languages - [2308.04477] [QA].
  • Lossy and Lossless (L$^2$) Post-training Model Size Compression - [2308.04269] [QA].
  • FLIRT: Feedback Loop In-context Red Teaming - [2308.04265] [QA].
  • Exploring Transformers for Open-world Instance Segmentation - [2308.04206] [QA].
  • D3G: Exploring Gaussian Prior for Temporal Sentence Grounding with Glance Annotation - [2308.04197] [QA].
  • Under-Display Camera Image Restoration with Scattering Effect - [2308.04163] [QA].
  • EPCFormer: Expression Prompt Collaboration Transformer for Universal Referring Video Object Segmentation - [2308.04162] [QA].
  • Empowering Vision-Language Models to Follow Interleaved Vision-Language Instructions - [2308.04152] [QA].
  • OmniDataComposer: A Unified Data Structure for Multimodal Data Fusion and Infinite Data Generation - [2308.04126] [QA].
  • 3D Gaussian Splatting for Real-Time Radiance Field Rendering - [2308.04079] [QA].
  • Enhancing Adversarial Robustness in Low-Label Regime via Adaptively Weighted Regularization and Knowledge Distillation - [2308.04061] [QA].
  • Gentopia: A Collaborative Platform for Tool-Augmented LLMs - [2308.04030] [QA].
  • AgentSims: An Open-Source Sandbox for Large Language Model Evaluation - [2308.04026] [QA].
  • Hierarchical Visual Primitive Experts for Compositional Zero-Shot Learning - [2308.04016] [QA].
  • Continual Pre-Training of Large Language Models: How to (re)warm your model? - [2308.04014] [QA].
  • Coarse-to-Fine: Learning Compact Discriminative Representation for Single-Stage Image Retrieval - [2308.04008] [QA].
  • PARTNER: Level up the Polar Representation for LiDAR 3D Object Detection - [2308.03982] [QA].
  • Simple synthetic data reduces sycophancy in large language models - [2308.03958] [QA].
  • TIJO: Trigger Inversion with Joint Optimization for Defending Multimodal Backdoored Models - [2308.03906] [QA].
  • From Sky to the Ground: A Large-scale Benchmark and Simple Baseline Towards Real Rain Removal - [2308.03867] [QA].
  • 3D Motion Magnification: Visualizing Subtle Motions with Time Varying Radiance Fields - [2308.03757] [QA].
  • Tiny LVLM-eHub: Early Multimodal Experiments with Bard - [2308.03729] [QA].
  • Scaling may be all you need for achieving human-level object recognition capacity with human-like visual experience - [2308.03712] [QA].
  • AgentBench: Evaluating LLMs as Agents - [2308.03688] [QA].
  • Learning Concise and Descriptive Attributes for Visual Recognition - [2308.03685] [QA].
  • AvatarVerse: High-quality & Stable 3D Avatar Creation from Text and Pose - [2308.03610] [QA].
  • FeatEnHancer: Enhancing Hierarchical Features for Object Detection and Beyond Under Low-Light Vision - [2308.03594] [QA].
  • AlphaStar Unplugged: Large-Scale Offline Reinforcement Learning - [2308.03526] [QA].
  • Lighting Every Darkness in Two Pairs: A Calibration-Free Pipeline for RAW Denoising - [2308.03448] [QA].
  • TPTU: Task Planning and Tool Usage of Large Language Model-based AI Agents - [2308.03427] [QA].
  • RecycleGPT: An Autoregressive Language Model with Recyclable Module - [2308.03421] [QA].
  • GaFET: Learning Geometry-aware Facial Expression Translation from In-The-Wild Images - [2308.03413] [QA].
  • Heterogeneous Forgetting Compensation for Class-Incremental Learning - [2308.03374] [QA].
  • Dual Aggregation Transformer for Image Super-Resolution - [2308.03364] [QA].
  • Foundation Model based Open Vocabulary Task Planning and Executive System for General Purpose Service Robots - [2308.03357] [QA].
  • SciGraphQA: A Large-Scale Synthetic Multi-Turn Question-Answering Dataset for Scientific Graphs - [2308.03349] [QA].
  • Part-Aware Transformer for Generalizable Person Re-identification - [2308.03322] [QA].
  • Studying Large Language Model Generalization with Influence Functions - [2308.03296] [QA].
  • SynJax: Structured Probability Distributions for JAX - [2308.03291] [QA].
  • FLIQS: One-Shot Mixed-Precision Floating-Point and Integer Quantization Search - [2308.03290] [QA].
  • Multi-Label Self-Supervised Learning with Scene Images - [2308.03286] [QA].
  • Environment-Invariant Curriculum Relation Learning for Fine-Grained Scene Graph Generation - [2308.03282] [QA].
  • Mirror-NeRF: Learning Neural Radiance Fields for Mirrors with Whitted-Style Ray Tracing - [2308.03280] [QA].
  • UniversalNER: Targeted Distillation from Large Language Models for Open Named Entity Recognition - [2308.03279] [QA].
  • A Benchmark for Chinese-English Scene Text Image Super-resolution - [2308.03262] [QA].
  • Source-free Domain Adaptive Human Pose Estimation - [2308.03202] [QA].
  • Building Safe and Reliable AI systems for Safety Critical Tasks with Vision-Language Processing - [2308.03176] [QA].
  • CGBA: Curvature-aware Geometric Black-box Attack - [2308.03163] [QA].
  • Prototypes-oriented Transductive Few-shot Learning with Conditional Transport - [2308.03047] [QA].
  • Learning Fine-Grained Features for Pixel-wise Video Correspondences - [2308.03040] [QA].
  • Pre-Trained Large Language Models for Industrial Control - [2308.03028] [QA].
  • Focus the Discrepancy: Intra- and Inter-Correlation Learning for Image Anomaly Detection - [2308.02983] [QA].
  • An Adaptive Model Ensemble Adversarial Attack for Boosting Adversarial Transferability - [2308.02897] [QA].
  • Sketch and Text Guided Diffusion Model for Colored Point Cloud Generation - [2308.02874] [QA].
  • Learning Unified Decompositional and Compositional NeRF for Editable Novel View Synthesis - [2308.02840] [QA].
  • EduChat: A Large-Scale Language Model-based Chatbot System for Intelligent Education - [2308.02773] [QA].
  • DeDrift: Robust Similarity Search under Content Drift - [2308.02752] [QA].
  • ReCLIP: Refine Contrastive Language Image Pre-Training with Source Free Domain Adaptation - [2308.03793] [QA].
  • MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities - [2308.02490] [QA].
  • Convolutions Die Hard: Open-Vocabulary Segmentation with Single Frozen Convolutional CLIP - [2308.02487] [QA].
  • Getting the Ball Rolling: Learning a Dexterous Policy for a Biomimetic Tendon-Driven Hand with Rolling Contact Joints - [2308.02453] [QA].
  • Text2KGBench: A Benchmark for Ontology-Driven Knowledge Graph Generation from Text - [2308.02357] [QA].
  • FB-BEV: BEV Representation from Forward-Backward View Transformations - [2308.02236] [QA].
  • ESRL: Efficient Sampling-based Reinforcement Learning for Sequence Generation - [2308.02223] [QA].
  • Scaling Clinical Trial Matching Using Large Language Models: A Case Study in Oncology - [2308.02180] [QA].
  • Learning Referring Video Object Segmentation from Weak Annotation - [2308.02162] [QA].
  • Retroformer: Retrospective Large Language Agents with Policy Gradient Optimization - [2308.02151] [QA].
  • Multi-interactive Feature Learning and a Full-time Multi-modality Benchmark for Image Fusion and Segmentation - [2308.02097] [QA].
  • The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World - [2308.01907] [QA].
  • DETR Doesn't Need Multi-Scale or Locality Design - [2308.01904] [QA].
  • ConceptLab: Creative Generation using Diffusion Prior Constraints - [2308.02669] [QA].
  • ClassEval: A Manually-Crafted Benchmark for Evaluating LLMs on Class-level Code Generation - [2308.01861] [QA].
  • Scaling Relationship on Learning Mathematical Reasoning with Large Language Models - [2308.01825] [QA].
  • RegionBLIP: A Unified Multi-modal Pre-training Framework for Holistic and Regional Comprehension - [2308.02299] [QA].
  • Point2Mask: Point-supervised Panoptic Segmentation via Optimal Transport - [2308.01779] [QA].
  • Ambient Adventures: Teaching ChatGPT on Developing Complex Stories - [2308.01734] [QA].
  • LiDAR-Camera Panoptic Segmentation via Geometry-Consistent and Semantic-Aware Alignment - [2308.01686] [QA].
  • A Multidimensional Analysis of Social Biases in Vision Transformers - [2308.01948] [QA].
  • InterAct: Exploring the Potentials of ChatGPT as a Cooperative Agent - [2308.01552] [QA].
  • Get the Best of Both Worlds: Improving Accuracy and Transferability by Grassmann Class Representation - [2308.01547] [QA].
  • MusicLDM: Enhancing Novelty in Text-to-Music Generation Using Beat-Synchronous Mixup Strategies - [2308.01546] [QA].
  • Multimodal Neurons in Pretrained Text-Only Transformers - [2308.01544] [QA].
  • TDMD: A Database for Dynamic Color Mesh Subjective and Objective Quality Explorations - [2308.01499] [QA].
  • Target-point Attention Transformer: A novel trajectory predict network for end-to-end autonomous driving - [2308.01496] [QA].
  • Efficient neural supersampling on a novel gaming dataset - [2308.01483] [QA].
  • HANDAL: A Dataset of Real-World Manipulable Object Categories with Pose Annotations, Affordances, and Reconstructions - [2308.01477] [QA].
  • Training Data Protection with Compositional Diffusion Models - [2308.01937] [QA].
  • VertexSerum: Poisoning Graph Neural Networks for Link Inference - [2308.01469] [QA].
  • From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion - [2308.02560] [QA].
  • On $κ$-solutions and canonical neighborhoods in 4d Ricci flow - [2308.01448] [QA].
  • OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models - [2308.01390] [QA].
  • DeepSpeed-Chat: Easy, Fast and Affordable RLHF Training of ChatGPT-like Models at All Scales - [2308.01320] [QA].
  • Computational Long Exposure Mobile Photography - [2308.01379] [QA].
  • More Context, Less Distraction: Visual Classification by Inferring and Conditioning on Contextual Attributes - [2308.01313] [QA].
  • Revisiting DETR Pre-training for Object Detection - [2308.01300] [QA].
  • XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models - [2308.01263] [QA].
  • A Hyper-pixel-wise Contrastive Learning Augmented Segmentation Network for Old Landslide Detection Using High-Resolution Remote Sensing Images and Digital Elevation Model Data - [2308.01251] [QA].
  • Evaluating Instruction-Tuned Large Language Models on Code Comprehension and Generation - [2308.01240] [QA].
  • LSF-IDM: Automotive Intrusion Detection Model with Lightweight Attribution and Semantic Fusion - [2308.01237] [QA].
  • Grounded Image Text Matching with Mismatched Relation Reasoning - [2308.01236] [QA].
  • Geometric wakes in collimators and step transitions of arbitrary cross-sections: conformal mapping approach - [2308.01235] [QA].
  • One Tree to Rule Them All: Poly-Logarithmic Universal Steiner Tree - [2308.01199] [QA].
  • Improving Generalization in Visual Reinforcement Learning via Conflict-aware Gradient Agreement Augmentation - [2308.01194] [QA].
  • Towards Understanding the Capability of Large Language Models on Code Clone Detection: A Survey - [2308.01191] [QA].
  • Three-level Dicke quantum battery - [2308.01188] [QA].
  • Multiobjective Optimization of Non-Smooth PDE-Constrained Problems - [2308.01113] [QA].
  • Black hole thermodynamics in Horndeski theories - [2308.01082] [QA].
  • MammoDG: Generalisable Deep Learning Breaks the Limits of Cross-Domain Multi-Center Breast Cancer Screening - [2308.01057] [QA].
  • Stability Analysis for a Class of Heterogeneous Catalysis Models - [2308.01049] [QA].
  • Dynamic Token Pruning in Plain Vision Transformers for Semantic Segmentation - [2308.01045] [QA].
  • An improved infrastructure for the IceCube realtime system - [2308.01031] [QA].
  • Model-agnostic search for the quasinormal modes of gravitational wave echoes - [2308.01017] [QA].
  • Enhancing Representation Learning for Periodic Time Series with Floss: A Frequency Domain Regularization Approach - [2308.01011] [QA].
  • From Sparse to Soft Mixtures of Experts - [2308.00951] [QA].
  • Cosmological Distance Measurement of 12 Nearby Supernovae IIP with ROTSE-IIIB - [2308.00916] [QA].
  • ImageBrush: Learning Visual In-Context Instructions for Exemplar-Based Image Manipulation - [2308.00906] [QA].
  • VLUCI: Variational Learning of Unobserved Confounders for Counterfactual Inference - [2308.00904] [QA].
  • Weak localization in radiative transfer of acoustic waves in a randomly-fluctuating slab - [2308.00822] [QA].
  • Optimal design of plane elastic membranes using the convexified Föppl's model - [2308.00811] [QA].
  • Body Knowledge and Uncertainty Modeling for Monocular 3D Human Body Reconstruction - [2308.00799] [QA].
  • LISA: Reasoning Segmentation via Large Language Model - [2308.00692] [QA].
  • Tool Documentation Enables Zero-Shot Tool-Usage with Large Language Models - [2308.00675] [QA].
  • Note: Stokes-Einstein relation without hydrodynamic diameter in the TIP4P/Ice water model - [2308.00653] [QA].
  • ELFNet: Evidential Local-global Fusion for Stereo Matching - [2308.00728] [QA].
  • Detecting Cloud Presence in Satellite Images Using the RGB-based CLIP Vision-Language Model - [2308.00541] [QA].
  • Understanding URDF: A Dataset and Analysis - [2308.00514] [QA].
  • Stochastic Geometry Based Modeling and Analysis on Network NOMA in Downlink CoMP Systems - [2308.00499] [QA].
  • A many-sorted epistemic logic for chromatic hypergraphs - [2308.00477] [QA].
  • FLatten Transformer: Vision Transformer using Focused Linear Attention - [2308.00442] [QA].
  • SelfCheck: Using LLMs to Zero-Shot Check Their Own Step-by-Step Reasoning - [2308.00436] [QA].
  • DriveAdapter: Breaking the Coupling Barrier of Perception and Planning in End-to-End Autonomous Driving - [2308.00398] [QA].
  • Improving Generalization of Adversarial Training via Robust Critical Fine-Tuning - [2308.02533] [QA].
  • Deep Image Harmonization with Learnable Augmentation - [2308.00376] [QA].
  • Deep Image Harmonization with Globally Guided Feature Transformation and Relation Distillation - [2308.00356] [QA].
  • MetaGPT: Meta Programming for Multi-Agent Collaborative Framework - [2308.00352] [QA].
  • Artifact: Measuring and Mitigating Gaps in Structural Testing - [2308.00316] [QA].
  • Skills-in-Context Prompting: Unlocking Compositionality in Large Language Models - [2308.00304] [QA].
  • Online Prototype Learning for Online Continual Learning - [2308.00301] [QA].
  • CLAMS: A Cluster Ambiguity Measure for Estimating Perceptual Variability in Visual Clustering - [2308.00284] [QA].
  • Improving Pixel-based MIM by Reducing Wasted Modeling Capability - [2308.00261] [QA].
  • GOALS-JWST: Gas Dynamics and Excitation in NGC7469 revealed by NIRSpec - [2308.00209] [QA].

July 2023

  • Predicting masked tokens in stochastic locations improves masked image modeling - [2308.00566] [QA].
  • Learning to Model the World with Language - [2308.01399] [QA].
  • Discovering Adaptable Symbolic Algorithms from Scratch - [2307.16890] [QA].
  • Virtual Prompt Injection for Instruction-Tuned Large Language Models - [2307.16888] [QA].
  • Shortcut Partitions in Minor-Free Graphs: Steiner Point Removal, Distance Oracles, Tree Covers, and More - [2308.00555] [QA].
  • Revisiting the Parameter Efficiency of Adapters from the Perspective of Precision Redundancy - [2307.16867] [QA].
  • Random Sub-Samples Generation for Self-Supervised Real Image Denoising - [2307.16825] [QA].
  • ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs - [2307.16789] [QA].
  • UniVTG: Towards Unified Video-Language Temporal Grounding - [2307.16715] [QA].
  • DiffPose: SpatioTemporal Diffusion Model for Video-Based Human Pose Estimation - [2307.16687] [QA].
  • Guiding Image Captioning Models Toward More Specific Captions - [2307.16686] [QA].
  • Graph Structure from Point Clouds: Geometric Attention is All You Need - [2307.16662] [QA].
  • CDUL: CLIP-Driven Unsupervised Learning for Multi-Label Image Classification - [2307.16634] [QA].
  • FULLER: Unified Multi-modality Multi-task 3D Perception via Multi-level Gradient Calibration - [2307.16617] [QA].
  • Transferable Decoding with Visual Entities for Zero-Shot Image Captioning - [2307.16525] [QA].
  • Towards General Low-Light Raw Noise Synthesis and Modeling - [2307.16508] [QA].
  • MovieChat: From Dense Token to Sparse Memory for Long Video Understanding - [2307.16449] [QA].
  • DRAW: Defending Camera-shooted RAW against Image Manipulation - [2307.16418] [QA].
  • DDG-Net: Discriminability-Driven Graph Network for Weakly-supervised Temporal Action Localization - [2307.16415] [QA].
  • Bridging the Gap: Exploring the Capabilities of Bridge-Architectures for Complex Visual Reasoning Tasks - [2307.16395] [QA].
  • JOTR: 3D Joint Contrastive Learning with Transformers for Occluded Human Mesh Recovery - [2307.16377] [QA].
  • LP-MusicCaps: LLM-Based Pseudo Music Captioning - [2307.16372] [QA].
  • AntGPT: Can Large Language Models Help Long-term Action Anticipation from Videos? - [2307.16368] [QA].
  • Benchmarking and Analyzing Robust Point Cloud Recognition: Bag of Tricks for Defending Adversarial Examples - [2307.16361] [QA].
  • Evaluating ChatGPT and GPT-4 for Visual Programming - [2308.02522] [QA].
  • Unified Model for Image, Video, Audio and Language Tasks - [2307.16184] [QA].
  • Do LLMs Possess a Personality? Making the MBTI Test an Amazing Evaluation for Large Language Models - [2307.16180] [QA].
  • SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension - [2307.16125] [QA].
  • Separate Scene Text Detector for Unseen Scripts is Not All You Need - [2307.15991] [QA].
  • XMem++: Production-level Video Segmentation From Few Annotated Frames - [2307.15958] [QA].
  • CMDA: Cross-Modality Domain Adaptation for Nighttime Semantic Segmentation - [2307.15942] [QA].
  • What can Discriminator do? Towards Box-free Ownership Verification of Generative Adversarial Network - [2307.15860] [QA].
  • RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control - [2307.15818] [QA].
  • The Hydra Effect: Emergent Self-repair in Language Model Computations - [2307.15771] [QA].
  • MeMOTR: Long-Term Memory-Augmented Transformer for Multi-Object Tracking - [2307.15700] [QA].
  • Scaling Data Generation in Vision-and-Language Navigation - [2307.15644] [QA].
  • Robust Distortion-free Watermarks for Language Models - [2307.15593] [QA].
  • Beating Backdoor Attack at Its Own Game - [2307.15539] [QA].
  • Exploring Format Consistency for Instruction Tuning - [2307.15504] [QA].
  • FeedbackLogs: Recording and Incorporating Stakeholder Feedback into Machine Learning Pipelines - [2307.15475] [QA].
  • Is One Epoch All You Need For Multi-Fidelity Hyperparameter Optimization? - [2307.15422] [QA].
  • Uncertainty-aware Unsupervised Multi-Object Tracking - [2307.15409] [QA].
  • Supervised Homography Learning with Realistic Dataset Generation - [2307.15353] [QA].
  • Skeleton-of-Thought: Large Language Models Can Do Parallel Decoding - [2307.15337] [QA].
  • Dynamic PlenOctree for Adaptive Sampling Refinement in Explicit NeRF - [2307.15333] [QA].
  • TaskExpert: Dynamically Assembling Multi-Task Representations with Memorial Mixture-of-Experts - [2307.15324] [QA].
  • Multiple Instance Learning Framework with Masked Hard Instance Mining for Whole Slide Image Classification - [2307.15254] [QA].
  • Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback - [2307.15217] [QA].
  • PromptStyler: Prompt-driven Style Generation for Source-free Domain Generalization - [2307.15199] [QA].
  • Med-Flamingo: a Multimodal Medical Few-shot Learner - [2307.15189] [QA].
  • Seal-3D: Interactive Pixel-Level Editing for Neural Radiance Fields - [2307.15131] [QA].
  • To Adapt or Not to Adapt? Real-Time Adaptation for Semantic Segmentation - [2307.15063] [QA].
  • Distilled Feature Fields Enable Few-Shot Language-Guided Manipulation - [2308.07931] [QA].
  • Learning Depth Estimation for Transparent and Mirror Surfaces - [2307.15052] [QA].
  • Regularized Mask Tuning: Uncovering Hidden Knowledge in Pre-trained Vision-Language Models - [2307.15049] [QA].
  • Universal and Transferable Adversarial Attacks on Aligned Language Models - [2307.15043] [QA].
  • TEDi: Temporally-Entangled Diffusion for Long-Term Motion Synthesis - [2307.15042] [QA].
  • Diverse Inpainting and Editing with GAN Inversion - [2307.15033] [QA].
  • SuperCLUE: A Comprehensive Chinese Large Language Model Benchmark - [2307.15020] [QA].
  • How Good is Google Bard's Visual Understanding? An Empirical Study on Open Challenges - [2307.15016] [QA].
  • Scaling TransNormer to 175 Billion Parameters - [2307.14995] [QA].
  • S$^3$: Social-network Simulation System with Large Language Model-Empowered Agents - [2307.14984] [QA].
  • Take-A-Photo: 3D-to-2D Generative Pre-training of Point Cloud Models - [2307.14971] [QA].
  • PanGu-Coder2: Boosting Large Language Models for Code with Ranking Feedback - [2307.14936] [QA].
  • Seeing through the Brain: Image Reconstruction of Visual Perception from Human Brain Signals - [2308.02510] [QA].
  • Towards Deeply Unified Depth-aware Panoptic Segmentation with Bi-directional Guidance Learning - [2307.14786] [QA].
  • Gloss-free Sign Language Translation: Improving from Visual-Language Pretraining - [2307.14768] [QA].
  • Test Time Adaptation for Blind Image Quality Assessment - [2307.14735] [QA].
  • P2C: Self-Supervised Point Cloud Completion from Single Partial Clouds - [2307.14726] [QA].
  • Pre-training Vision Transformers with Very Limited Synthesized Images - [2307.14710] [QA].
  • Taxonomy Adaptive Cross-Domain Adaptation in Medical Imaging via Optimization Trajectory Distillation - [2307.14709] [QA].
  • 360VOT: A New Benchmark Dataset for Omnidirectional Visual Object Tracking - [2307.14630] [QA].
  • NeRF-Det: Learning Geometry-Aware Volumetric Representation for Multi-View 3D Object Detection - [2307.14620] [QA].
  • TextManiA: Enriching Visual Feature by Text-driven Manifold Augmentation - [2307.14611] [QA].
  • Clustering based Point Cloud Representation Learning for 3D Analysis - [2307.14605] [QA].
  • Scaling Up and Distilling Down: Language-Guided Robot Skill Acquisition - [2307.14535] [QA].
  • MiDaS v3.1 -- A Model Zoo for Robust Monocular Relative Depth Estimation - [2307.14460] [QA].
  • Three Bricks to Consolidate Watermarks for Large Language Models - [2308.00113] [QA].
  • MAMo: Leveraging Memory and Attention for Monocular Video Depth Estimation - [2307.14336] [QA].
  • WavJourney: Compositional Audio Creation with Large Language Models - [2307.14335] [QA].
  • Towards Generalist Biomedical AI - [2307.14334] [QA].
  • G2L: Semantically Aligned and Uniform Video Grounding via Geodesic and Game Theory - [2307.14277] [QA].
  • Large Language Models are Competitive Near Cold-start Recommenders for Language- and Item-based Preferences - [2307.14225] [QA].
  • ADAPT: Efficient Multi-Agent Trajectory Prediction with Adaptation - [2307.14187] [QA].
  • Creative Birds: Self-Supervised Single-View 3D Style Transfer - [2307.14127] [QA].
  • Leveraging Implicit Feedback from Deployment Data in Dialogue - [2307.14117] [QA].
  • Uncertainty Guided Adaptive Warping for Robust and Efficient Stereo Matching - [2307.14071] [QA].
  • Set-level Guidance Attack: Boosting Adversarial Transferability of Vision-Language Pre-training Models - [2307.14061] [QA].
  • 3D Semantic Subspace Traverser: Empowering 3D Generative Model with Shape Editing Capability - [2307.14051] [QA].
  • Controllable Guide-Space for Generalizable Face Forgery Detection - [2307.14039] [QA].
  • Adaptive Frequency Filters As Efficient Global Token Mixers - [2307.14008] [QA].
  • Tracking Anything in High Quality - [2307.13974] [QA].
  • AIDE: A Vision-Driven Multi-View, Multi-Modal, Multi-Tasking Dataset for Assistive Driving Perception - [2307.13933] [QA].
  • Spatio-Temporal Domain Awareness for Multi-Agent Collaborative Perception - [2307.13929] [QA].
  • trajdata: A Unified Interface to Multiple Human Trajectory Datasets - [2307.13924] [QA].
  • Points-to-3D: Bridging the Gap between Sparse Points and Shape-Controllable Text-to-3D Generation - [2307.13908] [QA].
  • WebArena: A Realistic Web Environment for Building Autonomous Agents - [2307.13854] [QA].
  • How to Scale Your EMA - [2307.13813] [QA].
  • E^2VPT: An Effective and Efficient Approach for Visual Prompt Tuning - [2307.13770] [QA].
  • PlaneRecTR: Unified Query Learning for 3D Plane Recovery from a Single View - [2307.13756] [QA].
  • Foundational Models Defining a New Era in Vision: A Survey and Outlook - [2307.13721] [QA].
  • Composite Diffusion | whole >= Σparts - [2307.13720] [QA].
  • ARB: Advanced Reasoning Benchmark for Large Language Models - [2307.13692] [QA].
  • RecursiveDet: End-to-End Region-based Recursive Object Detection - [2307.13619] [QA].
  • Model Calibration in Dense Classification with Adaptive Label Perturbation - [2307.13539] [QA].
  • Spectrum-guided Multi-granularity Referring Video Object Segmentation - [2307.13537] [QA].
  • Re-mine, Learn and Reason: Exploring the Cross-modal Semantic Correlations for Language-guided HOI detection - [2307.13529] [QA].
  • FacTool: Factuality Detection in Generative AI -- A Tool Augmented Framework for Multi-Task and Multi-Domain Scenarios - [2307.13528] [QA].
  • Weakly-supervised 3D Pose Transfer with Keypoints - [2307.13459] [QA].
  • Predicting Code Coverage without Execution - [2307.13383] [QA].
  • Unmasking Anomalies in Road-Scene Segmentation - [2307.13316] [QA].
  • LoraHub: Efficient Cross-Task Generalization via Dynamic LoRA Composition - [2307.13269] [QA].
  • Conditional Cross Attention Network for Multi-Space Embedding without Entanglement in Only a SINGLE Network - [2307.13254] [QA].
  • GaPro: Box-Supervised 3D Point Cloud Instance Segmentation Using Gaussian Processes as Pseudo Labelers - [2307.13251] [QA].
  • Strivec: Sparse Tri-Vector Radiance Fields - [2307.13226] [QA].
  • GraspGPT: Leveraging Semantic Knowledge from a Large Language Model for Task-Oriented Grasping - [2307.13204] [QA].
  • Contrastive Example-Based Control - [2307.13101] [QA].
  • LLM-Rec: Personalized Recommendation via Prompting Large Language Models - [2307.15780] [QA].
  • 3D-LLM: Injecting the 3D World into Large Language Models - [2307.12981] [QA].
  • A Systematic Survey of Prompt Engineering on Vision-Language Foundation Models - [2307.12980] [QA].
  • Evaluating the Ripple Effects of Knowledge Editing in Language Models - [2307.12976] [QA].
  • DFA3D: 3D Deformable Attention For 2D-to-3D Feature Lifting - [2307.12972] [QA].
  • Aligning Large Language Models with Human: A Survey - [2307.12966] [QA].
  • RLCD: Reinforcement Learning from Contrast Distillation for Language Model Alignment - [2307.12950] [QA].
  • GridMM: Grid Memory Map for Vision-and-Language Navigation - [2307.12907] [QA].
  • A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis - [2307.12856] [QA].
  • Multiscale Video Pretraining for Long-Term Activity Forecasting - [2307.12854] [QA].
  • Fast Full-frame Video Stabilization with Iterative Optimization - [2307.12774] [QA].
  • COCO-O: A Benchmark for Object Detectors under Natural Distribution Shifts - [2307.12730] [QA].
  • Persistent-Transient Duality: A Multi-mechanism Approach for Modeling Human-Object Interaction - [2307.12729] [QA].
  • MC-JEPA: A Joint-Embedding Predictive Architecture for Self-Supervised Learning of Motion and Content Features - [2307.12698] [QA].
  • PG-RCNN: Semantic Surface Point Generation for 3D Object Detection - [2307.12637] [QA].
  • CTVIS: Consistent Training for Online Video Instance Segmentation - [2307.12616] [QA].
  • Less is More: Focus Attention for Efficient DETR - [2307.12612] [QA].
  • PRIOR: Prototype Representation Joint Learning from Medical Images and Reports - [2307.12577] [QA].
  • A Good Student is Cooperative and Reliable: CNN-Transformer Collaborative Learning for Semantic Segmentation - [2307.12574] [QA].
  • Interpolating between Images with Diffusion Models - [2307.12560] [QA].
  • PUMA: Secure Inference of LLaMA-7B in Five Minutes - [2307.12533] [QA].
  • Cross Contrasting Feature Perturbation for Domain Generalization - [2307.12502] [QA].
  • TF-ICON: Diffusion-Based Training-Free Cross-Domain Image Composition - [2307.12493] [QA].
  • Rethinking Data Distillation: Do Not Overlook Calibration - [2307.12463] [QA].
  • ProtoFL: Unsupervised Federated Learning via Prototypical Distillation - [2307.12450] [QA].
  • Augmented Box Replay: Overcoming Foreground Shift for Incremental Object Detection - [2307.12427] [QA].
  • Testing Hateful Speeches against Policies - [2307.12418] [QA].
  • Learning Navigational Visual Representations with Semantic Map Supervision - [2307.12335] [QA].
  • TransHuman: A Transformer-based Human Representation for Generalizable Neural Human Rendering - [2307.12291] [QA].
  • Downstream-agnostic Adversarial Examples - [2307.12280] [QA].
  • Geometry-Aware Adaptation for Pretrained Models - [2307.12226] [QA].
  • LoLep: Single-View View Synthesis with Locally-Learned Planes and Self-Attention Occlusion Inference - [2307.12217] [QA].
  • LIST: Learning Implicitly from Spatial Transformers for Single-View 3D Reconstruction - [2307.12194] [QA].
  • Optimized Network Architectures for Large Language Model Training with Billions of Parameters - [2307.12169] [QA].
  • Hallucination Improves the Performance of Unsupervised Visual Representation Learning - [2307.12168] [QA].
  • DIP-RL: Demonstration-Inferred Preference Learning in Minecraft - [2307.12158] [QA].
  • Spatial Self-Distillation for Object Detection with Inaccurate Bounding Boxes - [2307.12101] [QA].
  • Discovering Spatio-Temporal Rationales for Video Question Answering - [2307.12058] [QA].
  • On the Effectiveness of Spectral Discriminators for Perceptual Quality Improvement - [2307.12027] [QA].
  • Learning Vision-and-Language Navigation from YouTube Videos - [2307.11984] [QA].
  • Why Is Prompt Tuning for Vision-Language Models Robust to Noisy Labels? - [2307.11978] [QA].
  • CARTIER: Cartographic lAnguage Reasoning Targeted at Instruction Execution for Robots - [2307.11865] [QA].
  • HybridAugment++: Unified Frequency Spectra Perturbations for Model Robustness - [2307.11823] [QA].
  • Enhancing CLIP with GPT-4: Harnessing Visual Descriptions as Prompts - [2307.11661] [QA].
  • OxfordTVG-HIC: Can Machine Make Humorous Captions from Images? - [2307.11636] [QA].
  • Bridging Vision and Language Encoders: Parameter-Efficient Tuning for Referring Image Segmentation - [2307.11545] [QA].
  • CopyRNeRF: Protecting the CopyRight of Neural Radiance Fields - [2307.11526] [QA].
  • CORE: Cooperative Reconstruction for Multi-Agent Perception - [2307.11514] [QA].
  • SA-BEV: Generating Semantic-Aware Bird's-Eye-View Feature for Multi-view 3D Object Detection - [2307.11477] [QA].
  • Distribution Shift Matters for Knowledge Distillation with Webly Collected Images - [2307.11469] [QA].
  • Strip-MLP: Efficient Token Interaction for Vision MLP - [2307.11458] [QA].
  • Prompting Large Language Models with Speech Recognition Abilities - [2307.11795] [QA].
  • FaceCLIPNeRF: Text-driven 3D Face Manipulation using Deformable Neural Radiance Fields - [2307.11418] [QA].
  • Deep Directly-Trained Spiking Neural Networks for Object Detection - [2307.11411] [QA].
  • Subject-Diffusion:Open Domain Personalized Text-to-Image Generation without Test-time Fine-tuning - [2307.11410] [QA].
  • Latent-OFER: Detect, Mask, and Reconstruct with Latent Vectors for Occluded Facial Expression Recognition - [2307.11404] [QA].
  • CLR: Channel-wise Lightweight Reprogramming for Continual Learning - [2307.11386] [QA].
  • What can a Single Attention Layer Learn? A Study Through the Random Features Lens - [2307.11353] [QA].
  • Tuning Pre-trained Model via Moment Probing - [2307.11342] [QA].
  • Tri-MipRF: Tri-Mip Representation for Efficient Anti-Aliasing Neural Radiance Fields - [2307.11335] [QA].
  • DPM-OT: A New Diffusion Probabilistic Model Based on Optimal Transport - [2307.11308] [QA].
  • PourIt!: Weakly-supervised Liquid Perception from a Single Image for Visual Closed-Loop Robotic Pouring - [2307.11299] [QA].
  • MAS: Towards Resource-Efficient Federated Multiple-Task Learning - [2307.11285] [QA].
  • Brain2Music: Reconstructing Music from Human Brain Activity - [2307.11078] [QA].
  • AlignDet: Aligning Pre-training and Fine-tuning in Object Detection - [2307.11077] [QA].
  • Cascade-DETR: Delving into High-Quality Universal Object Detection - [2307.11035] [QA].
  • General Image-to-Image Translation with One-Shot Image Guidance - [2307.14352] [QA].
  • Metric3D: Towards Zero-shot Metric 3D Prediction from A Single Image - [2307.10984] [QA].
  • Improving Online Lane Graph Extraction by Object-Lane Clustering - [2307.10947] [QA].
  • Proxy Anchor-based Unsupervised Learning for Continuous Generalized Category Discovery - [2307.10943] [QA].
  • PASTA: Pretrained Action-State Transformer Agents - [2307.10936] [QA].
  • FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets - [2307.10928] [QA].
  • Diffusion Sampling with Momentum for Mitigating Divergence Artifacts - [2307.11118] [QA].
  • The Role of Entropy and Reconstruction in Multi-View Self-Supervised Learning - [2307.10907] [QA].
  • BlendFace: Re-designing Identity Encoders for Face-Swapping - [2307.10854] [QA].
  • BoxDiff: Text-to-Image Synthesis with Training-Free Box-Constrained Diffusion - [2307.10816] [QA].
  • Meta-Transformer: A Unified Framework for Multimodal Learning - [2307.10802] [QA].
  • HyperReenact: One-Shot Reenactment via Jointly Learning to Refine and Retarget Faces - [2307.10797] [QA].
  • See More and Know More: Zero-shot Point Cloud Segmentation via Multi-modal Visual Data - [2307.10782] [QA].
  • Urban Radiance Field Representation with Deformable Neural Mesh Primitives - [2307.10776] [QA].
  • Kick Back & Relax: Learning to Reconstruct the World by Watching SlowTV - [2307.10713] [QA].
  • Lighting up NeRF via Unsupervised Decomposition and Enhancement - [2307.10664] [QA].
  • SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models - [2307.10635] [QA].
  • Physics-Driven Turbulence Image Restoration with Stochastic Refinement - [2307.10603] [QA].
  • Flatness-Aware Minimization for Domain Generalization - [2307.11108] [QA].
  • Instruction-following Evaluation through Verbalizer Manipulation - [2307.10558] [QA].
  • EMQ: Evolving Training-free Proxies for Automated Mixed Precision Quantization - [2307.10554] [QA].
  • TokenFlow: Consistent Diffusion Features for Consistent Video Editing - [2307.10373] [QA].
  • DNA-Rendering: A Diverse Neural Actor Repository for High-Fidelity Human-centric Rendering - [2307.10173] [QA].
  • DialogStudio: Towards Richest and Most Diverse Unified Dataset Collection for Conversational AI - [2307.10172] [QA].
  • Challenges and Applications of Large Language Models - [2307.10169] [QA].
  • LLMs as Workers in Human-Computational Algorithms? Replicating Crowdsourcing Pipelines with LLMs - [2307.10168] [QA].
  • Improving Multimodal Datasets with Image Captioning - [2307.10350] [QA].
  • FABRIC: Personalizing Diffusion Models with Iterative Feedback - [2307.10159] [QA].
  • Android in the Wild: A Large-Scale Dataset for Android Device Control - [2307.10088] [QA].
  • Unsupervised Accuracy Estimation of Deep Visual Models using Domain-Adaptive Adversarial Perturbation without Source Samples - [2307.10062] [QA].
  • MODA: Mapping-Once Audio-driven Portrait Animation with Dual Attentions - [2307.10008] [QA].
  • Hierarchical Spatio-Temporal Representation Learning for Gait Recognition - [2307.09856] [QA].
  • What do neural networks learn in image classification? A frequency shortcut perspective - [2307.09829] [QA].
  • Density-invariant Features for Distant Point Cloud Registration - [2307.09788] [QA].
  • Text2Layer: Layered Image Generation using Latent Diffusion Model - [2307.09781] [QA].
  • Towards Building More Robust Models with Frequency Bias - [2307.09763] [QA].
  • Generative Prompt Model for Weakly Supervised Object Localization - [2307.09756] [QA].
  • Space Engage: Collaborative Space Supervision for Contrastive-based Semi-Supervised Semantic Segmentation - [2307.09755] [QA].
  • CPCM: Contextual Point Cloud Modeling for Weakly-supervised Point Cloud Semantic Segmentation - [2307.10316] [QA].
  • AesPA-Net: Aesthetic Pattern-Aware Style Transfer Networks - [2307.09724] [QA].
  • Towards Saner Deep Image Registration - [2307.09696] [QA].
  • GlobalMapper: Arbitrary-Shaped Urban Layout Generation - [2307.09693] [QA].
  • Towards A Unified Agent with Foundation Models - [2307.09668] [QA].
  • Object-aware Gaze Target Detection - [2307.09662] [QA].
  • Promoting Exploration in Memory-Augmented Adam using Critical Momenta - [2307.09638] [QA].
  • Conditional 360-degree Image Synthesis for Immersive Indoor Scene Decoration - [2307.09621] [QA].
  • ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning - [2307.09474] [QA].
  • Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla - [2307.09458] [QA].
  • OnlineRefer: A Simple Online Baseline for Referring Video Object Segmentation - [2307.09356] [QA].
  • Efficient Region-Aware Neural Radiance Fields for High-Fidelity Talking Portrait Synthesis - [2307.09323] [QA].
  • Biomaker CA: a Biome Maker project using Cellular Automata - [2307.09320] [QA].
  • EigenTrajectory: Low-Rank Descriptors for Multi-Modal Trajectory Forecasting - [2307.09306] [QA].
  • Llama 2: Open Foundation and Fine-Tuned Chat Models - [2307.09288] [QA].
  • Distilling Coarse-to-Fine Semantic Matching Knowledge for Weakly Supervised 3D Visual Grounding - [2307.09267] [QA].
  • Augmenting CLIP with Improved Visio-Linguistic Reasoning - [2307.09233] [QA].
  • NU-MCC: Multiview Compressive Coding with Neighborhood Decoder and Repulsive UDF - [2307.09112] [QA].
  • LA-Net: Landmark-Aware Learning for Reliable Facial Expression Recognition under Label Noise - [2307.09023] [QA].
  • How is ChatGPT's behavior changing over time? - [2307.09009] [QA].
  • Ord2Seq: Regarding Ordinal Regression as Label Sequence Prediction - [2307.09004] [QA].
  • Towards Authentic Face Restoration with Iterative Diffusion Models and Beyond - [2307.08996] [QA].
  • Local or Global: Selective Knowledge Assimilation for Federated Learning with Limited Labels - [2307.08809] [QA].
  • Similarity Min-Max: Zero-Shot Day-Night Domain Adaptation - [2307.08779] [QA].
  • GEAR: Augmenting Language Models with Generalizable and Efficient Tool Resolution - [2307.08775] [QA].
  • Diffusion Models Beat GANs on Image Classification - [2307.08702] [QA].
  • AlpaGasus: Training A Better Alpaca with Fewer Data - [2307.08701] [QA].
  • Neural Video Depth Stabilizer - [2307.08695] [QA].
  • TableGPT: Towards Unifying Tables, Nature Language and Commands into One GPT - [2307.08674] [QA].
  • Retentive Network: A Successor to Transformer for Large Language Models - [2307.08621] [QA].
  • BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs - [2307.08581] [QA].
  • Scale-Aware Modulation Meet Transformer - [2307.08579] [QA].
  • Does Visual Pretraining Help End-to-End Reasoning? - [2307.08506] [QA].
  • BUS: Efficient and Effective Vision-language Pre-training with Bottom-Up Patch Summarization - [2307.08504] [QA].
  • Cumulative Spatial Knowledge Distillation for Vision Transformers - [2307.08500] [QA].
  • Differentiable Transportation Pruning - [2307.08483] [QA].
  • SkeletonMAE: Graph-based Masked Autoencoder for Skeleton Sequence Pre-training - [2307.08476] [QA].
  • Not All Steps are Created Equal: Selective Diffusion Distillation for Image Manipulation - [2307.08448] [QA].
  • DOT: A Distillation-Oriented Trainer - [2307.08436] [QA].
  • On the application of Large Language Models for language teaching and assessment technology - [2307.08393] [QA].
  • Dynamic Snake Convolution based on Topological Geometric Constraints for Tubular Structure Segmentation - [2307.08388] [QA].
  • Self-supervised Monocular Depth Estimation: Let's Talk About The Weather - [2307.08357] [QA].
  • ShiftNAS: Improving One-shot NAS via Probability Shift - [2307.08300] [QA].
  • Random Boxes Are Open-world Object Detectors - [2307.08249] [QA].
  • Towards Self-Assembling Artificial Neural Networks through Neural Developmental Programs - [2307.08197] [QA].
  • Measuring Faithfulness in Chain-of-Thought Reasoning - [2307.13702] [QA].
  • Question Decomposition Improves the Faithfulness of Model-Generated Reasoning - [2307.11768] [QA].
  • Feedback is All You Need: Real-World Reinforcement Learning with Approximate Physics-Based Models - [2307.08168] [QA].
  • Planting a SEED of Vision in Large Language Model - [2307.08041] [QA].
  • Multi-Object Discovery by Low-Dimensional Object Motion - [2307.08027] [QA].
  • Householder Projector for Unsupervised Latent Semantics Discovery - [2307.08012] [QA].
  • Towards Viewpoint-Invariant Visual Recognition via Adversarial Training - [2307.10235] [QA].
  • Language Conditioned Traffic Generation - [2307.07947] [QA].
  • Revisiting Domain-Adaptive 3D Object Detection by Reliable, Diverse and Class-balanced Pseudo-Labeling - [2307.07944] [QA].
  • CVSformer: Cross-View Synthesis Transformer for Semantic Scene Completion - [2307.07938] [QA].
  • Communicative Agents for Software Development - [2307.07924] [QA].
  • Is Imitation All You Need? Generalized Decision-Making with Dual-Phase Training - [2307.07909] [QA].
  • Handwritten and Printed Text Segmentation: A Signature Case Study - [2307.07887] [QA].
  • Unified Adversarial Patch for Cross-modal Attacks in the Physical World - [2307.07859] [QA].
  • Adaptive Nonlinear Latent Transformation for Conditional Face Editing - [2307.07790] [QA].
  • Bidirectionally Deformable Motion Modulation For Video-based Human Pose Transfer - [2307.07754] [QA].
  • INVE: Interactive Neural Video Editing - [2307.07663] [QA].
  • RFLA: A Stealthy Reflected Light Adversarial Attack in the Physical World - [2307.07653] [QA].
  • CoTracker: It is Better to Track Together - [2307.07635] [QA].
  • NIFTY: Neural Object Interaction Fields for Guided Human Motion Synthesis - [2307.07511] [QA].
  • DreamTeacher: Pretraining Image Backbones with Deep Generative Models - [2307.07487] [QA].
  • Multimodal Distillation for Egocentric Action Recognition - [2307.07483] [QA].
  • Improving Zero-Shot Generalization for CLIP with Synthesized Prompts - [2307.07397] [QA].
  • Mitigating Adversarial Vulnerability through Causal Parameter Estimation by Adversarial Double Machine Learning - [2307.07250] [QA].
  • FreeCOS: Self-Supervised Learning from Fractals and Unlabeled Images for Curvilinear Object Segmentation - [2307.07245] [QA].
  • Mega-TTS 2: Zero-Shot Text-to-Speech with Arbitrary Length Speech Prompts - [2307.07218] [QA].
  • Multimodal Motion Conditioned Diffusion Model for Skeleton-based Video Anomaly Detection - [2307.07205] [QA].
  • Learning to Retrieve In-Context Examples for Large Language Models - [2307.07164] [QA].
  • Bootstrapping Vision-Language Learning with Decoupled Language Pre-training - [2307.07063] [QA].
  • DIALGEN: Collaborative Human-LM Generated Dialogues for Improved Understanding of Human-Human Conversations - [2307.07047] [QA].
  • HyperDreamBooth: HyperNetworks for Fast Personalization of Text-to-Image Models - [2307.06949] [QA].
  • In-context Autoencoder for Context Compression in a Large Language Model - [2307.06945] [QA].
  • InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation - [2307.06942] [QA].
  • Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation - [2307.06940] [QA].
  • mBLIP: Efficient Bootstrapping of Multilingual Vision-LLMs - [2307.06930] [QA].
  • Domain-Agnostic Tuning-Encoder for Fast Personalization of Text-To-Image Models - [2307.06925] [QA].
  • Generating Benchmarks for Factuality Evaluation of Language Models - [2307.06908] [QA].
  • Copy Is All You Need - [2307.06962] [QA].
  • Assessing the Ability of ChatGPT to Screen Articles for Systematic Reviews - [2307.06464] [QA].
  • Distilling Large Language Models for Biomedical Knowledge Extraction: A Case Study on Adverse Drug Events - [2307.06439] [QA].
  • T2I-CompBench: A Comprehensive Benchmark for Open-world Compositional Text-to-image Generation - [2307.06350] [QA].
  • Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution - [2307.06304] [QA].
  • Instruction Mining: High-Quality Instruction Data Selection for Large Language Models - [2307.06290] [QA].
  • MMBench: Is Your Multi-modal Model an All-around Player? - [2307.06281] [QA].
  • SayPlan: Grounding Large Language Models using 3D Scene Graphs for Scalable Task Planning - [2307.06135] [QA].
  • VELMA: Verbalization Embodiment of LLM Agents for Vision and Language Navigation in Street View - [2307.06082] [QA].
  • PolyLM: An Open Source Polyglot Large Language Model - [2307.06018] [QA].
  • VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models - [2307.05973] [QA].
  • Giving Robots a Hand: Learning Generalizable Manipulation with Eye-in-Hand Human Video Demonstrations - [2307.05959] [QA].
  • GLA-GCN: Global-local Adaptive Graph Convolutional Network for 3D Human Pose Estimation from Monocular Video - [2307.05853] [QA].
  • Towards Robust and Efficient Continual Language Learning - [2307.05741] [QA].
  • Stack More Layers Differently: High-Rank Training Through Low-Rank Updates - [2307.05695] [QA].
  • Differentiable Blocks World: Qualitative 3D Decomposition by Rendering Primitives - [2307.05473] [QA].
  • Self-consistency for open-ended generations - [2307.06857] [QA].
  • EgoVLPv2: Egocentric Video-Language Pre-training with Fusion in the Backbone - [2307.05463] [QA].
  • Efficient 3D Articulated Human Generation with Layered Surface Volumes - [2307.05462] [QA].
  • Empowering Cross-lingual Behavioral Testing of NLP Models with Typological Features - [2307.05454] [QA].
  • Self-Supervised Learning with Lie Symmetries for Partial Differential Equations - [2307.05432] [QA].
  • Unleashing Cognitive Synergy in Large Language Models: A Task-Solving Agent through Multi-Persona Self-Collaboration - [2307.05300] [QA].
  • Generative Pretraining in Multimodality - [2307.05222] [QA].
  • DNAGPT: A Generalized Pre-trained Tool for Versatile DNA Sequence Analysis Tasks - [2307.05628] [QA].
  • Test-Time Training on Video Streams - [2307.05014] [QA].
  • Monotone deep Boltzmann machines - [2307.04990] [QA].
  • Secrets of RLHF in Large Language Models Part I: PPO - [2307.04964] [QA].
  • Semantic-SAM: Segment and Recognize Anything at Any Granularity - [2307.04767] [QA].
  • SITTA: A Semantic Image-Text Alignment for Image Captioning - [2307.05591] [QA].
  • Shelving, Stacking, Hanging: Relational Pose Diffusion for Multi-modal Rearrangement - [2307.04751] [QA].
  • RoCo: Dialectic Multi-Robot Collaboration with Large Language Models - [2307.04738] [QA].
  • AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning - [2307.04725] [QA].
  • Large Language Models as General Pattern Machines - [2307.04721] [QA].
  • International Institutions for Advanced AI - [2307.04699] [QA].
  • VampNet: Music Generation via Masked Acoustic Token Modeling - [2307.04686] [QA].
  • AnyTeleop: A General Vision-Based Dexterous Robot Arm-Hand Teleoperation System - [2307.04577] [QA].
  • Improving Factuality of Abstractive Summarization via Contrastive Reward Learning - [2307.04507] [QA].
  • RLTF: Reinforcement Learning from Unit Test Feedback - [2307.04349] [QA].
  • Convex Decomposition of Indoor Scenes - [2307.04246] [QA].
  • Parametric Depth Based Feature Representation Learning for Object Detection and Segmentation in Bird's Eye View - [2307.04106] [QA].
  • SVIT: Scaling up Visual Instruction Tuning - [2307.04087] [QA].
  • Toward Interactive Dictation - [2307.04008] [QA].
  • On decoder-only architecture for speech-to-text and large language model integration - [2307.03917] [QA].
  • Large Language Models for Supply Chain Optimization - [2307.03875] [QA].
  • Sketch-A-Shape: Zero-Shot Sketch-to-3D Shape Generation - [2307.03869] [QA].
  • AutoDecoding Latent 3D Diffusion Models - [2307.05445] [QA].
  • Equivariant Single View Pose Prediction Via Induced and Restricted Representations - [2307.03704] [QA].
  • Decomposing the Generalization Gap in Imitation Learning for Visual Robotic Manipulation - [2307.03659] [QA].
  • GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest - [2307.03601] [QA].
  • One Step of Gradient Descent is Provably the Optimal In-Context Learner with One Layer of Linear Self-Attention - [2307.03576] [QA].
  • Discovering Hierarchical Achievements in Reinforcement Learning via Contrastive Learning - [2307.03486] [QA].
  • Solvent: A Framework for Protein Folding - [2307.04603] [QA].
  • Goal-Conditioned Predictive Coding as an Implicit Planner for Offline Reinforcement Learning - [2307.03406] [QA].
  • Teaching Arithmetic to Small Transformers - [2307.03381] [QA].
  • BiPhone: Modeling Inter Language Phonetic Influences in Text - [2307.03322] [QA].
  • Whisper-AT: Noise-Robust Automatic Speech Recognizers are Also Strong General Audio Event Taggers - [2307.03183] [QA].
  • Lost in the Middle: How Language Models Use Long Contexts - [2307.03172] [QA].
  • Focused Transformer: Contrastive Training for Context Scaling - [2307.03170] [QA].
  • VideoGLUE: Video General Understanding Evaluation of Foundation Models - [2307.03166] [QA].
  • Distilling Large Vision-Language Model with Out-of-Distribution Generalizability - [2307.03135] [QA].
  • Frontier AI Regulation: Managing Emerging Risks to Public Safety - [2307.03718] [QA].
  • A Survey on Evaluation of Large Language Models - [2307.03109] [QA].
  • Improving Retrieval-Augmented Large Language Models via Data Importance Learning - [2307.03027] [QA].
  • Style Over Substance: Evaluation Biases for Large Language Models - [2307.03025] [QA].
  • Contrast Is All You Need - [2307.02882] [QA].
  • What Should Data Science Education Do with Large Language Models? - [2307.02792] [QA].
  • Training Models to Generate, Recognize, and Reframe Unhelpful Thoughts - [2307.02768] [QA].
  • Wireless Multi-Agent Generative AI: From Connected Intelligence to Collective Intelligence - [2307.02757] [QA].
  • SkipDecode: Autoregressive Skip Decoding with Batching and Caching for Efficient LLM Inference - [2307.02628] [QA].
  • LongNet: Scaling Transformers to 1,000,000,000 Tokens - [2307.02486] [QA].
  • Building Cooperative Embodied Agents Modularly with Large Language Models - [2307.02485] [QA].
  • Elastic Decision Transformer - [2307.02484] [QA].
  • Jailbroken: How Does LLM Safety Training Fail? - [2307.02483] [QA].
  • Reasoning or Reciting? Exploring the Capabilities and Limitations of Language Models Through Counterfactual Tasks - [2307.02477] [QA].
  • What Matters in Training a GPT4-Style Language Model with Multimodal Inputs? - [2307.02469] [QA].
  • Using Rewrite Strategies for Efficient Functional Automatic Differentiation - [2307.02447] [QA].
  • DragonDiffusion: Enabling Drag-style Manipulation on Diffusion Models - [2307.02421] [QA].
  • MSViT: Dynamic Mixed-Scale Tokenization for Vision Transformers - [2307.02321] [QA].
  • Rethinking Multiple Instance Learning for Whole Slide Image Classification: A Good Instance Classifier is All You Need - [2307.02249] [QA].
  • Open-Source Large Language Models Outperform Crowd Workers and Approach ChatGPT in Text-Annotation Tasks - [2307.02179] [QA].
  • Becoming self-instruct: introducing early stopping criteria for minimal instruct tuning - [2307.03692] [QA].
  • Flacuna: Unleashing the Problem Solving Power of Vicuna using FLAN Fine-Tuning - [2307.02053] [QA].
  • SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis - [2307.01952] [QA].
  • Physics-based Motion Retargeting from Sparse Inputs - [2307.01938] [QA].
  • Robots That Ask For Help: Uncertainty Alignment for Large Language Model Planners - [2307.01928] [QA].
  • Crossway Diffusion: Improving Diffusion-based Visuomotor Policy via Self-supervised Learning - [2307.01849] [QA].
  • Embodied Task Planning with Large Language Models - [2307.01848] [QA].
  • Collaborative Score Distillation for Consistent Visual Synthesis - [2307.04787] [QA].
  • DiT-3D: Exploring Plain Diffusion Transformers for 3D Shape Generation - [2307.01831] [QA].
  • Pretraining is All You Need: A Multi-Atlas Enhanced Transformer Framework for Autism Spectrum Disorder Classification - [2307.01759] [QA].
  • Synthetic is all you need: removing the auxiliary data assumption for membership inference attacks against synthetic data - [2307.01701] [QA].
  • mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding - [2307.02499] [QA].
  • ChildPlay: A New Benchmark for Understanding Children's Gaze Behaviour - [2307.01630] [QA].
  • On Hofstadter's G-sequence - [2307.1471] [QA].
  • Hybrid two-level MCMC for Bayesian Inverse Problems - [2307.1463] [QA].
  • Practical Collaborative Perception: A Framework for Asynchronous and Multi-Agent 3D Object Detection - [2307.1462] [QA].
  • Multi-Task Learning Improves Performance In Deep Argument Mining Models - [2307.1401] [QA].
  • EIGER IV: The cool 10$^4$K circumgalactic environment of high-$z$ galaxies reveals remarkably efficient IGM enrichment - [2307.1273] [QA].
  • Real-time Monocular Full-body Capture in World Space via Sequential Proxy-to-Motion Learning - [2307.01200] [QA].
  • Segment Anything Meets Point Tracking - [2307.01197] [QA].
  • Variational integrals on Hessian spaces: partial regularity for critical points - [2307.1191] [QA].
  • Characterisation of three-body loss in ${}^{166}$Er and optimised production of large Bose-Einstein condensates - [2307.1245] [QA].
  • Improving Language Plasticity via Pretraining with Active Forgetting - [2307.01163] [QA].
  • SCITUNE: Aligning Large Language Models with Scientific Multimodal Instructions - [2307.01139] [QA].
  • MVDiffusion: Enabling Holistic Multi-view Image Generation with Correspondence-Aware Diffusion - [2307.01097] [QA].
  • Scalable quantum neural networks by few quantum resources - [2307.1017] [QA].
  • Visual Instruction Tuning with Polite Flamingo - [2307.01003] [QA].
  • NOMA-Assisted Grant-Free Transmission: How to Design Pre-Configured SNR Levels? - [2307.0990] [QA].
  • Motion-X: A Large-scale 3D Expressive Whole-body Human Motion Dataset - [2307.00818] [QA].
  • SketchMetaFace: A Learning-based Sketching Interface for High-fidelity 3D Character Face Modeling - [2307.00804] [QA].
  • EmoGen: Eliminating Subjective Bias in Emotional Music Generation - [2307.01229] [QA].
  • JourneyDB: A Benchmark for Generative Image Understanding - [2307.00716] [QA].
  • LEDITS: Real Image Editing with DDPM Inversion and Semantic Guidance - [2307.00522] [QA].
  • Almost sure bounds for a weighted Steinhaus random multiplicative function - [2307.0499] [QA].
  • One Copy Is All You Need: Resource-Efficient Streaming of Medical Imaging Data at Scale - [2307.00438] [QA].
  • ProbVLM: Probabilistic Adapter for Frozen Vision-Language Models - [2307.00398] [QA].
  • DoReMi: Grounding Language Model by Detecting and Recovering from Plan-Execution Misalignment - [2307.00329] [QA].
  • Personality Traits in Large Language Models - [2307.00184] [QA].

June 2023

  • Meta-training with Demonstration Retrieval for Efficient Few-shot Learning - [2307.00119] [QA].
  • Goal Representations for Instruction Following: A Semi-Supervised Language Interface to Control - [2307.00117] [QA].
  • Hardwiring ViT Patch Selectivity into CNNs using Patch Mixing - [2306.17848] [QA].
  • Magic123: One Image to High-Quality 3D Object Generation Using Both 2D and 3D Diffusion Priors - [2306.17843] [QA].
  • SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen LLMs - [2306.17842] [QA].
  • Statler: State-Maintaining Language Models for Embodied Reasoning - [2306.17840] [QA].
  • DisCo: Disentangled Control for Referring Human Dance Generation in Real World - [2307.00040] [QA].
  • Stay on topic with Classifier-Free Guidance - [2306.17806] [QA].
  • Topologically Attributed Graphs for Shape Discrimination - [2306.17805] [QA].
  • The Shaped Transformer: Attention Models in the Infinite Depth-and-Width Limit - [2306.17759] [QA].
  • Large Language Models are Effective Text Rankers with Pairwise Ranking Prompting - [2306.17563] [QA].
  • Preference Ranking Optimization for Human Alignment - [2306.17492] [QA].
  • ReMaX: Relaxing for Better Training on Efficient Panoptic Segmentation - [2306.17319] [QA].
  • Towards Zero-Shot Scale-Aware Monocular Depth Estimation - [2306.17253] [QA].
  • Generative AI for Programming Education: Benchmarking ChatGPT, GPT-4, and Human Tutors - [2306.17156] [QA].
  • Generate Anything Anywhere in Any Scene - [2306.17154] [QA].
  • Michelangelo: Conditional 3D Shape Generation based on Shape-Image-Text Aligned Latent Representation - [2306.17115] [QA].
  • LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding - [2306.17107] [QA].
  • End-to-end Autonomous Driving: Challenges and Frontiers - [2306.16927] [QA].
  • BEDLAM: A Synthetic Dataset of Bodies Exhibiting Detailed Lifelike Animated Motion - [2306.16940] [QA].
  • DreamDiffusion: Generating High-Quality Images from Brain EEG Signals - [2306.16934] [QA].
  • One-2-3-45: Any Single Image to 3D Mesh in 45 Seconds without Per-Shape Optimization - [2306.16928] [QA].
  • NeuralFuse: Learning to Improve the Accuracy of Access-Limited Neural Network Inference in Low-Voltage Regimes - [2306.16869] [QA].
  • ArrayBot: Reinforcement Learning for Generalizable Distributed Manipulation through Touch - [2306.16857] [QA].
  • Benchmarking Large Language Model Capabilities for Conditional Generation - [2306.16793] [QA].
  • Dynamic-Resolution Model Learning for Object Pile Manipulation - [2306.16700] [QA].
  • KITE: Keypoint-Conditioned Policies for Semantic Manipulation - [2306.16605] [QA].
  • An Efficient Sparse Inference Software Accelerator for Transformer-based Language Models on CPUs - [2306.16601] [QA].
  • LLM Calibration and Automatic Hallucination Detection via Pareto Optimal Self-supervision - [2306.16564] [QA].
  • Towards Language Models That Can See: Computer Vision Through the LENS of Natural Language - [2306.16410] [QA].
  • On the Exploitability of Instruction Tuning - [2306.17194] [QA].
  • Towards Measuring the Representation of Subjective Global Opinions in Language Models - [2306.16388] [QA].
  • Inferring the Goals of Communicating Agents from Actions and Instructions - [2306.16207] [QA].
  • SVNR: Spatially-variant Noise Removal with Denoising Diffusion - [2306.16052] [QA].
  • Positive Label Is All You Need for Multi-Label Classification - [2306.16016] [QA].
  • Accelerating Transducers through Adjacent Token Merging - [2306.16009] [QA].
  • Confidence Ranking for CTR Prediction - [2307.1206] [QA].
  • Subclass-balancing Contrastive Learning for Long-tailed Recognition - [2306.15925] [QA].
  • Large Language Model as Attributed Training Data Generator: A Tale of Diversity and Bias - [2306.15895] [QA].
  • HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution - [2306.15794] [QA].
  • REFLECT: Summarizing Robot Experiences for Failure Explanation and Correction - [2306.15724] [QA].
  • PoseDiffusion: Solving Pose Estimation via Diffusion-aided Bundle Adjustment - [2306.15667] [QA].
  • CLIPA-v2: Scaling CLIP Training with 81.1% Zero-shot ImageNet Accuracy within a $10,000 Budget; An Extra $4,000 Unlocks 81.8% Accuracy - [2306.15658] [QA].
  • Asynchronous Algorithmic Alignment with Cocycles - [2306.15632] [QA].
  • LeanDojo: Theorem Proving with Retrieval-Augmented Language Models - [2306.15626] [QA].
  • Extending Context Window of Large Language Models via Positional Interpolation - [2306.15595] [QA].
  • Explainable Multimodal Emotion Reasoning - [2306.15401] [QA].
  • Length Generalization in Arithmetic Transformers - [2306.15400] [QA].
  • 3D-Speaker: A Large-Scale Multi-Device, Multi-Distance, and Multi-Dialect Corpus for Speech Representation Disentanglement - [2306.15354] [QA].
  • MindDial: Belief Dynamics Tracking with Theory-of-Mind Modeling for Situated Neural Dialogue Generation - [2306.15253] [QA].
  • Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic - [2306.15195] [QA].
  • MIMIC: Masked Image Modeling with Image Correspondences - [2306.15128] [QA].
  • Understanding In-Context Learning via Supportive Pretraining Data - [2306.15091] [QA].
  • RVT: Robotic View Transformer for 3D Object Manipulation - [2306.14896] [QA].
  • Supervised Pretraining Can Learn In-Context Reinforcement Learning - [2306.14892] [QA].
  • Restart Sampling for Improving Generative Processes - [2306.14878] [QA].
  • Are aligned neural networks adversarially aligned? - [2306.15447] [QA].
  • ViNT: A Foundation Model for Visual Navigation - [2306.14846] [QA].
  • Kosmos-2: Grounding Multimodal Large Language Models to the World - [2306.14824] [QA].
  • MotionGPT: Human Motion as a Foreign Language - [2306.14795] [QA].
  • SugarCrepe: Fixing Hackable Benchmarks for Vision-Language Compositionality - [2306.14610] [QA].
  • Aligning Large Multi-Modal Model with Robust Instruction Tuning - [2306.14565] [QA].
  • A-STAR: Test-time Attention Segregation and Retention for Text-to-image Synthesis - [2306.14544] [QA].
  • CEIL: Generalized Contextual Imitation Learning - [2306.14534] [QA].
  • ParameterNet: Parameters Are All You Need for Large-scale Visual Pretraining of Mobile Networks - [2306.14525] [QA].
  • RoboCook: Long-Horizon Elasto-Plastic Object Manipulation with Diverse Tools - [2306.14447] [QA].
  • DragDiffusion: Harnessing Diffusion Models for Interactive Point-based Image Editing - [2306.14435] [QA].
  • Faster Segment Anything: Towards Lightweight SAM for Mobile Applications - [2306.14289] [QA].
  • BiFF: Bi-level Future Fusion with Polyline-based Coordinate for Interactive Trajectory Prediction - [2306.14161] [QA].
  • DomainStudio: Fine-Tuning Diffusion Models for Domain-Driven Image Generation using Limited Data - [2306.14153] [QA].
  • Language models are weak learners - [2306.14101] [QA].
  • SEEDS: Emulation of Weather Forecast Ensembles with Diffusion Models - [2306.14066] [QA].
  • DesCo: Learning Object Recognition with Rich Language Descriptions - [2306.14060] [QA].
  • H$_2$O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models - [2306.14048] [QA].
  • Thinking Like an Annotator: Generation of Dataset Labeling Instructions - [2306.14035] [QA].
  • Cross-Validation Is All You Need: A Statistical Approach To Label Noise Estimation - [2306.13990] [QA].
  • Beyond Scale: the Diversity Coefficient as a Data Quality Metric Demonstrates LLMs are Pre-trained on Formally Diverse Data - [2306.13840] [QA].
  • LLM-Assisted Content Analysis: Using Large Language Models to Support Deductive Coding - [2306.14924] [QA].
  • Swin-Free: Achieving Better Cross-Window Attention and Efficiency with Size-varying Window - [2306.13776] [QA].
  • Zero-shot spatial layout conditioning for text-to-image diffusion models - [2306.13754] [QA].
  • Bring Your Own Data! Self-Supervised Evaluation for Large Language Models - [2306.13651] [QA].
  • GKD: Generalized Knowledge Distillation for Auto-regressive Sequence Models - [2306.13649] [QA].
  • OpenMask3D: Open-Vocabulary 3D Instance Segmentation - [2306.13631] [QA].
  • System-Level Natural Language Feedback - [2306.13588] [QA].
  • Scaling MLPs: A Tale of Inductive Bias - [2306.13575] [QA].
  • A Survey on Multimodal Large Language Models - [2306.13549] [QA].
  • DreamEditor: Text-Driven 3D Scene Editing with Neural Fields - [2306.13455] [QA].
  • Long-range Language Modeling with Self-retrieval - [2306.13421] [QA].
  • MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models - [2306.13394] [QA].
  • Evading Forensic Classifiers with Attribute-Conditioned Adversarial Faces - [2306.13091] [QA].
  • Continuous Layout Editing of Single Images with Diffusion Models - [2306.13078] [QA].
  • Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing - [2306.12929] [QA].
  • AudioPaLM: A Large Language Model That Can Speak and Listen - [2306.12925] [QA].
  • Learning from Visual Observation via Offline Pretrained State-to-Go Transformer - [2306.12860] [QA].
  • Blended-NeRF: Zero-Shot Object Generation and Blending in Existing Neural Radiance Fields - [2306.12760] [QA].
  • SoftGPT: Learn Goal-oriented Soft Object Manipulation Skills by Generative Pre-trained Heterogeneous Graph Transformer - [2306.12677] [QA].
  • From Word Models to World Models: Translating from Natural Language to the Probabilistic Language of Thought - [2306.12672] [QA].
  • Towards Regulatable AI Systems: Technical Gaps and Policy Opportunities - [2306.12609] [QA].
  • Local 3D Editing via 3D Distillation of CLIP Knowledge - [2306.12570] [QA].
  • FFCV: Accelerating Training by Removing Data Bottlenecks - [2306.12517] [QA].
  • Deep Language Networks: Joint Prompt Training of Stacked LLMs using Variational Inference - [2306.12509] [QA].
  • DreamTime: An Improved Optimization Strategy for Text-to-3D Content Creation - [2306.12422] [QA].
  • OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents - [2306.16527] [QA].
  • Fast Segment Anything - [2306.12156] [QA].
  • Mass-Producing Failures of Multimodal Systems with Language Models - [2306.12105] [QA].
  • HSR-Diff: Hyperspectral Image Super-Resolution via Conditional Diffusion Models - [2307.12085] [QA].
  • EquiformerV2: Improved Equivariant Transformer for Scaling to Higher-Degree Representations - [2306.12059] [QA].
  • Training Transformers with 4-bit Integers - [2306.11987] [QA].
  • Opportunities and Risks of LLMs for Scalable Deliberation with Polis - [2306.11932] [QA].
  • Randomized Quantization is All You Need for Differential Privacy in Federated Learning - [2306.11913] [QA].
  • SPRINT: Scalable Policy Pre-Training via Language Instruction Relabeling - [2306.11886] [QA].
  • Diffusion with Forward Models: Solving Stochastic Inverse Problems Without Direct Supervision - [2306.11719] [QA].
  • RoboCat: A Self-Improving Foundation Agent for Robotic Manipulation - [2306.11706] [QA].
  • Textbooks Are All You Need - [2306.11644] [QA].
  • Improving Image Captioning Descriptiveness by Ranking and LLM-based Fusion - [2306.11593] [QA].
  • HomeRobot: Open-Vocabulary Mobile Manipulation - [2306.11565] [QA].
  • Improving visual image reconstruction from human brain activity using latent diffusion models via multiple decoded inputs - [2306.11536] [QA].
  • RM-PRT: Realistic Robotic Manipulation Simulator and Benchmark with Progressive Reasoning Tasks - [2306.11335] [QA].
  • Dynamic Perceiver for Efficient Visual Recognition - [2306.11248] [QA].
  • Quilt-1M: One Million Image-Text Pairs for Histopathology - [2306.11207] [QA].
  • Large Language Models are Fixated by Red Herrings: Exploring Creative Problem Solving and Einstellung Effect using the Only Connect Wall Dataset - [2306.11167] [QA].
  • FSAR: Federated Skeleton-based Action Recognition with Adaptive Topology Structure and Knowledge Distillation - [2306.11046] [QA].
  • RepoFusion: Training Code Models to Understand Your Repository - [2306.10998] [QA].
  • BayLing: Bridging Cross-lingual Alignment and Instruction Following through Interactive Translation for Large Language Models - [2306.10968] [QA].
  • MotionGPT: Finetuned LLMs are General-Purpose Motion Generators - [2306.10900] [QA].
  • 3D VR Sketch Guided 3D Shape Prototyping and Exploration - [2306.10830] [QA].
  • Multitrack Music Transcription with a Time-Frequency Perceiver - [2306.10785] [QA].
  • Guiding Language Models of Code with Global Context using Monitors - [2306.10763] [QA].
  • UniMC: A Unified Framework for Long-Term Memory Conversation via Relevance Representation Learning - [2306.10543] [QA].
  • Point-Cloud Completion with Pretrained Text-to-image Diffusion Models - [2306.10533] [QA].
  • CLARA: Classifying and Disambiguating User Commands for Reliable Interactive Robotic Agents - [2306.10376] [QA].
  • GLIMMER: generalized late-interaction memory reranker - [2306.10231] [QA].
  • ZeRO++: Extremely Efficient Collective Communication for Giant Model Training - [2306.10209] [QA].
  • Meta-Personalizing Vision-Language Models to Find Named Instances in Video - [2306.10169] [QA].
  • MagicBrush: A Manually Annotated Dataset for Instruction-Guided Image Editing - [2306.10012] [QA].
  • CLIP2Protect: Protecting Facial Privacy using Text-Guided Makeup via Adversarial Latent Search - [2306.10008] [QA].
  • Robot Learning with Sensorimotor Pre-training - [2306.10007] [QA].
  • Investigating Prompting Techniques for Zero- and Few-Shot Visual Question Answering - [2306.09996] [QA].
  • Evaluating Superhuman Models with Consistency Checks - [2306.09983] [QA].
  • LabelBench: A Comprehensive Framework for Benchmarking Label-Efficient Learning - [2306.09910] [QA].
  • Demystifying GPT Self-Repair for Code Generation - [2306.09896] [QA].
  • AvatarBooth: High-Quality and Customizable 3D Human Avatar Generation - [2306.09864] [QA].
  • Full Parameter Fine-tuning for Large Language Models with Limited Resources - [2306.09782] [QA].
  • Gradient is All You Need? - [2306.09778] [QA].
  • Scaling Open-Vocabulary Object Detection - [2306.09683] [QA].
  • OCTScenes: A Versatile Real-World Dataset of Tabletop Scenes for Object-Centric Learning - [2306.09682] [QA].
  • CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained Language-Vision Models - [2306.09635] [QA].
  • CAJun: Continuous Adaptive Jumping using a Learned Centroidal Controller - [2306.09557] [QA].
  • Block-State Transformer - [2306.09539] [QA].
  • Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language Models - [2306.11732] [QA].
  • Inverse Scaling: When Bigger Isn't Better - [2306.09479] [QA].
  • Explore, Establish, Exploit: Red Teaming Language Models from Scratch - [2306.09442] [QA].
  • Seeing the World through Your Eyes - [2306.09348] [QA].
  • UrbanIR: Large-Scale Urban Scene Inverse Rendering from a Single Video - [2306.09349] [QA].
  • Rosetta Neurons: Mining the Common Units in a Model Zoo - [2306.09346] [QA].
  • Evaluating Data Attribution for Text-to-Image Models - [2306.09345] [QA].
  • Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis - [2306.09341] [QA].
  • DreamHuman: Animatable 3D Avatars from Text - [2306.09329] [QA].
  • Language-Guided Music Recommendation for Video via Prompt Analogies - [2306.09327] [QA].
  • Neural Relighting with Subsurface Scattering by Learning the Radiance Transfer Gradient - [2306.09322] [QA].
  • Diffusion Models for Zero-Shot Open-Vocabulary Segmentation - [2306.09316] [QA].
  • Can Language Models Teach Weaker Agents? Teacher Explanations Improve Students via Theory of Mind - [2306.09299] [QA].
  • KoLA: Carefully Benchmarking World Knowledge of Large Language Models - [2306.09296] [QA].
  • A9 Intersection Dataset: All You Need for Urban 3D Camera-LiDAR Roadside Perception - [2306.09266] [QA].
  • LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models - [2306.09265] [QA].
  • Encyclopedic VQA: Visual questions about detailed properties of fine-grained categories - [2306.09224] [QA].
  • CMMLU: Measuring massive multitask language understanding in Chinese - [2306.09212] [QA].
  • NAVI: Category-Agnostic Image Collections with High-Quality 3D Shape and Pose Annotations - [2306.09109] [QA].
  • Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration - [2306.09093] [QA].
  • Behavioral Cloning via Search in Embedded Demonstration Dataset - [2306.09082] [QA].
  • Re-Benchmarking Pool-Based Active Learning for Binary Classification - [2306.08954] [QA].
  • LOVM: Language-Only Vision Model Selection - [2306.08893] [QA].
  • EPIC Fields: Marrying 3D Geometry and Video Understanding - [2306.08731] [QA].
  • VidEdit: Zero-Shot and Spatially Aware Text-Driven Video Editing - [2306.08707] [QA].
  • Toward Grounded Social Reasoning - [2306.08651] [QA].
  • Language to Rewards for Robotic Skill Synthesis - [2306.08647] [QA].
  • Towards AGI in Computer Vision: Lessons Learned from GPT and Large Language Models - [2306.08641] [QA].
  • AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn - [2306.08640] [QA].
  • TAPIR: Tracking Any Point with per-frame Initialization and temporal Refinement - [2306.08637] [QA].
  • Anticipatory Music Transformer - [2306.08620] [QA].
  • WizardCoder: Empowering Code Large Language Models with Evol-Instruct - [2306.08568] [QA].
  • Knowledge Distillation of Large Language Models - [2306.08543] [QA].
  • TryOnDiffusion: A Tale of Two UNets - [2306.08276] [QA].
  • Contrastive Loss is All You Need to Recover Analogies as Parallel Lines - [2306.08221] [QA].
  • Agile Catching with Whole-Body MPC and Blackbox Policy Learning - [2306.08205] [QA].
  • h2oGPT: Democratizing Large Language Models - [2306.08161] [QA].
  • Large-scale Language Model Rescoring on Long-form Data - [2306.08133] [QA].
  • AVIS: Autonomous Visual Information Seeking with Large Language Models - [2306.08129] [QA].
  • DORSal: Diffusion for Object-centric Representations of Scenes $\textit{et al.}$ - [2306.08068] [QA].
  • Tune As You Scale: Hyperparameter Optimization For Compute Efficient Training - [2306.08055] [QA].
  • Efficient 3D Semantic Segmentation with Superpoint Transformer - [2306.08045] [QA].
  • Neural Scene Chronology - [2306.07970] [QA].
  • GeneCIS: A Benchmark for General Conditional Image Similarity - [2306.07969] [QA].
  • arXiVeri: Automatic table verification with GPT - [2306.07968] [QA].
  • One-for-All: Generalized LoRA for Parameter-Efficient Fine-tuning - [2306.07967] [QA].
  • Hidden Biases of End-to-End Driving Models - [2306.07957] [QA].
  • Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation - [2306.07954] [QA].
  • Questioning the Survey Responses of Large Language Models - [2306.07951] [QA].
  • Image Captioners Are Scalable Vision Learners Too - [2306.07915] [QA].
  • WebGLM: Towards An Efficient Web-Enhanced Question Answering System with Human Preferences - [2306.07906] [QA].
  • Viewset Diffusion: (0-)Image-Conditioned 3D Generative Models from 2D Data - [2306.07881] [QA].
  • Area is all you need: repeatable elements make stronger adversarial attacks - [2306.07768] [QA].
  • E2E-LOAD: End-to-End Long-form Online Action Detection - [2306.07703] [QA].
  • SayTap: Language to Quadrupedal Locomotion - [2306.07580] [QA].
  • Galactic: Scaling End-to-End Reinforcement Learning for Rearrangement at 100k Steps-Per-Second - [2306.07552] [QA].
  • TART: A plug-and-play Transformer module for task-agnostic reasoning - [2306.07536] [QA].
  • Require Process Control? LSTMc is all you need! - [2306.07510] [QA].
  • AniFaceDrawing: Anime Portrait Exploration during Your Sketching - [2306.07476] [QA].
  • 3D molecule generation by denoising voxel grids - [2306.07473] [QA].
  • Instant Multi-View Head Capture through Learnable Registration - [2306.07437] [QA].
  • Controlling Text-to-Image Diffusion by Orthogonal Finetuning - [2306.07280] [QA].
  • Scalable 3D Captioning with Pretrained Models - [2306.07279] [QA].
  • Retrieval-Enhanced Contrastive Vision-Text Models - [2306.07196] [QA].
  • Benchmarking Neural Network Training Algorithms - [2306.07179] [QA].
  • Augmenting Language Models with Long-Term Memory - [2306.07174] [QA].
  • Transformers learn through gradual rank increase - [2306.07042] [QA].
  • Small Temperature is All You Need for Differentiable Architecture Search - [2306.06855] [QA].
  • Weakly supervised information extraction from inscrutable handwritten document images - [2306.06823] [QA].
  • Attention, Compilation, and Solver-based Symbolic Analysis are All You Need - [2306.06755] [QA].
  • LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark - [2306.06687] [QA].
  • Face0: Instantaneously Conditioning a Text-to-Image Model on a Face - [2306.06638] [QA].
  • RestGPT: Connecting Large Language Models with Real-World RESTful APIs - [2306.06624] [QA].
  • High-Fidelity Audio Compression with Improved RVQGAN - [2306.06546] [QA].
  • Learning Image-Adaptive Codebooks for Class-Agnostic Image Restoration - [2306.06513] [QA].
  • Aladdin: Zero-Shot Hallucination of Stylized 3D Assets from Abstract Scene Descriptions - [2306.06212] [QA].
  • FasterViT: Fast Vision Transformers with Hierarchical Attention - [2306.06189] [QA].
  • Value function estimation using conditional diffusion models for control - [2306.07290] [QA].
  • Realistic Saliency Guided Image Enhancement - [2306.06092] [QA].
  • Mind2Web: Towards a Generalist Agent for the Web - [2306.06070] [QA].
  • GANeRF: Leveraging Discriminators to Optimize Neural Radiance Fields - [2306.06044] [QA].
  • DetZero: Rethinking Offboard 3D Object Detection with Long-term Sequential Point Clouds - [2306.06023] [QA].
  • S$^{3}$: Increasing GPU Utilization during Generative Inference for Higher Throughput - [2306.06000] [QA].
  • GPT-Calls: Enhancing Call Segmentation and Tagging by Generating Synthetic Conversations via Large Language Models - [2306.07941] [QA].
  • Evaluating the Social Impact of Generative AI Systems in Systems and Society - [2306.05949] [QA].
  • Can Large Language Models Infer Causation from Correlation? - [2306.05836] [QA].
  • Xiezhi: An Ever-Updating Benchmark for Holistic Domain Knowledge Evaluation - [2306.05783] [QA].
  • Embodied Executable Policy Learning with Language-based Scene Summarization - [2306.05696] [QA].
  • Judging LLM-as-a-judge with MT-Bench and Chatbot Arena - [2306.05685] [QA].
  • On the Importance of Feature Decorrelation for Unsupervised Representation Learning in Reinforcement Learning - [2306.05637] [QA].
  • Speech-to-Text Adapter and Speech-to-Entity Retriever Augmented LLMs for Speech Understanding - [2306.07944] [QA].
  • BOOT: Data-free Distillation of Denoising Diffusion Models with Bootstrapping - [2306.05544] [QA].
  • Multi-Modal Classifiers for Open-Vocabulary Object Detection - [2306.05493] [QA].
  • Grounded Text-to-Image Synthesis with Attention Refocusing - [2306.05427] [QA].
  • Background Prompting for Improved Object Depth - [2306.05428] [QA].
  • MIMIC-IT: Multi-Modal In-Context Instruction Tuning - [2306.05425] [QA].
  • Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models - [2306.05424] [QA].
  • Tracking Everything Everywhere All at Once - [2306.05422] [QA].
  • Scaling Spherical CNNs - [2306.05420] [QA].
  • R-MAE: Regions Meet Masked Autoencoders - [2306.05411] [QA].
  • LU-NeRF: Scene and Pose Estimation by Synchronizing Local Unposed NeRFs - [2306.05410] [QA].
  • Matting Anything - [2306.05399] [QA].
  • Modular Visual Question Answering via Code Generation - [2306.05392] [QA].
  • Unsupervised Compositional Concepts Discovery with Text-to-Image Generative Models - [2306.05357] [QA].
  • Simple and Controllable Music Generation - [2306.05284] [QA].
  • M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models - [2306.05179] [QA].
  • SyncDiffusion: Coherent Montage via Synchronized Joint Diffusions - [2306.05178] [QA].
  • PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization - [2306.05087] [QA].
  • ScaleDet: A Scalable Multi-Dataset Object Detector - [2306.04849] [QA].
  • Mixture-of-Supernets: Improving Weight-Sharing Supernet Training with Architecture-Routed Mixture-of-Experts - [2306.04845] [QA].
  • Optimizing ViViT Training: Time and Memory Reduction for Action Recognition - [2306.04822] [QA].
  • INSTRUCTEVAL: Towards Holistic Evaluation of Instruction-Tuned Large Language Models - [2306.04757] [QA].
  • How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources - [2306.04751] [QA].
  • Improving Open Language Models by Learning from Organic Interactions - [2306.04707] [QA].
  • On the Reliability of Watermarks for Large Language Models - [2306.04634] [QA].
  • Designing a Better Asymmetric VQGAN for StableDiffusion - [2306.04632] [QA].
  • ARTIC3D: Learning Robust Articulated 3D Shapes from Noisy Web Image Collections - [2306.04619] [QA].
  • PromptBench: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts - [2306.04528] [QA].
  • Improving neural network representations using human similarity judgments - [2306.04507] [QA].
  • Rewarded soups: towards Pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards - [2306.04488] [QA].
  • M$^3$IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning - [2306.04387] [QA].
  • Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks - [2306.04362] [QA].
  • MobileNMT: Enabling Translation in 15MB and 30ms - [2306.04235] [QA].
  • Benchmarking Foundation Models with Language-Model-as-an-Examiner - [2306.04181] [QA].
  • Increasing Diversity While Maintaining Accuracy: Text Data Generation with Large Language Models and Human Interventions - [2306.04140] [QA].
  • Text-only Domain Adaptation using Unified Speech-Text Representation in Transducer - [2306.04076] [QA].
  • Transferable Adversarial Robustness for Categorical Data via Universal Robust Embeddings - [2306.04064] [QA].
  • LLMZip: Lossless Text Compression using Large Language Models - [2306.04050] [QA].
  • Certified Reasoning with Language Models - [2306.04031] [QA].
  • Triggering Multi-Hop Reasoning for Question Answering in Language Models using Soft Prompts and Random Walks - [2306.04009] [QA].
  • ATT3D: Amortized Text-to-3D Object Synthesis - [2306.07349] [QA].
  • ChatDB: Augmenting LLMs with Databases as Their Symbolic Memory - [2306.03901] [QA].
  • Emergent Correspondence from Image Diffusion - [2306.03881] [QA].
  • Deductive Verification of Chain-of-Thought Reasoning - [2306.03872] [QA].
  • LEACE: Perfect linear concept erasure in closed form - [2306.03819] [QA].
  • Learning to Ground Instructional Articles in Videos through Narrations - [2306.03802] [QA].
  • Enabling Intelligent Interactions between an Agent and an LLM: A Reinforcement Learning Approach - [2306.03604] [QA].
  • On Pitfalls of Test-Time Adaptation - [2306.03536] [QA].
  • Recognize Anything: A Strong Image Tagging Model - [2306.03514] [QA].
  • Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias - [2306.03509] [QA].
  • Ada-TTA: Towards Adaptive High-Quality Text-to-Talking Avatar Synthesis - [2306.03504] [QA].
  • A Grasp Pose is All You Need: Learning Multi-fingered Grasping with Deep Reinforcement Learning from Vision and Touch - [2306.03484] [QA].
  • Natural Language Commanding via Program Synthesis - [2306.03460] [QA].
  • Large Language Models of Code Fail at Completing Code with Potential Bugs - [2306.03438] [QA].
  • GaitGCI: Generative Counterfactual Intervention for Gait Recognition - [2306.03428] [QA].
  • DVIS: Decoupled Video Instance Segmentation Framework - [2306.03413] [QA].
  • Vid2Act: Activate Offline Videos for Visual RL - [2306.03360] [QA].
  • Stabilizing Contrastive RL: Techniques for Offline Goal Reaching - [2306.03346] [QA].
  • Multi-Agent Collaboration: Harnessing the Power of Intelligent LLM Agents - [2306.03314] [QA].
  • A Static Evaluation of Code Completion by Large Language Models - [2306.03203] [QA].
  • Neuralangelo: High-Fidelity Neural Surface Reconstruction - [2306.03092] [QA].
  • MotionDiffuser: Controllable Multi-Agent Motion Prediction using Diffusion - [2306.03083] [QA].
  • InstructZero: Efficient Instruction Optimization for Black-Box Large Language Models - [2306.03082] [QA].
  • HeadSculpt: Crafting 3D Head Avatars with Text - [2306.03038] [QA].
  • PokemonChat: Auditing ChatGPT for Pokémon Universe Knowledge - [2306.03024] [QA].
  • BeyondPixels: A Comprehensive Review of the Evolution of Neural Radiance Fields - [2306.03000] [QA].
  • PolyVoice: Language Models for Speech to Speech Translation - [2306.02982] [QA].
  • Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding - [2306.02858] [QA].
  • Scene as Occupancy - [2306.02851] [QA].
  • Orca: Progressive Learning from Complex Explanation Traces of GPT-4 - [2306.02707] [QA].
  • LLM-Blender: Ensembling Large Language Models with Pairwise Ranking and Generative Fusion - [2306.02561] [QA].
  • RecAgent: A Novel Simulation Paradigm for Recommender Systems - [2306.02552] [QA].
  • PLANNER: Generating Diversified Paragraph via Latent Language Diffusion Model - [2306.02531] [QA].
  • A Technical Report for Polyglot-Ko: Open-Source Large-Scale Korean Language Models - [2306.02254] [QA].
  • SAM3D: Zero-Shot 3D Object Detection via Segment Anything Model - [2306.02245] [QA].
  • Benchmarking Robustness of Adaptation Methods on Pre-trained Vision-Language Models - [2306.02080] [QA].
  • Prompting Is All You Need: Automated Android Bug Replay with Large Language Models - [2306.01987] [QA].
  • AI Transparency in the Age of LLMs: A Human-Centered Research Roadmap - [2306.01941] [QA].
  • RITA: Group Attention is All You Need for Timeseries Analytics - [2306.01926] [QA].
  • The Surprising Effectiveness of Diffusion Models for Optical Flow and Monocular Depth Estimation - [2306.01923] [QA].
  • VisualGPTScore: Visio-Linguistic Reasoning with Multimodal Generative Pre-Training Scores - [2306.01879] [QA].
  • Probabilistic Adaptation of Text-to-Video Models - [2306.01872] [QA].
  • Binary and Ternary Natural Language Generation - [2306.01841] [QA].
  • DaTaSeg: Taming a Universal Multi-Dataset Multi-Task Segmentation Model - [2306.01736] [QA].
  • Evaluating Language Models for Mathematics through Interactions - [2306.01694] [QA].
  • Fine-Grained Human Feedback Gives Better Rewards for Language Model Training - [2306.01693] [QA].
  • Harnessing large-language models to generate private synthetic text - [2306.01684] [QA].
  • STUDY: Socially Aware Temporally Causal Decoder Recommender Systems - [2306.07946] [QA].
  • Segment Anything in High Quality - [2306.01567] [QA].
  • Bi-LRFusion: Bi-Directional LiDAR-Radar Fusion for 3D Dynamic Object Detection - [2306.01438] [QA].
  • An Empirical Study on Challenging Math Problem Solving with GPT-4 - [2306.01337] [QA].
  • LoCoOp: Few-Shot Out-of-Distribution Detection via Prompt Learning - [2306.01293] [QA].
  • Responsible Task Automation: Empowering Large Language Models as Responsible Task Automators - [2306.01242] [QA].
  • Faster Causal Attention Over Large Sequences Through Sparse Flash Attention - [2306.01160] [QA].
  • The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only - [2306.01116] [QA].
  • Reimagining Retrieval Augmented Language Models for Answering Queries - [2306.01061] [QA].
  • Diffusion Self-Guidance for Controllable Image Generation - [2306.00986] [QA].
  • StyleDrop: Text-to-Image Generation in Any Style - [2306.00983] [QA].
  • StableRep: Synthetic Images from Text-to-Image Models Make Strong Visual Representation Learners - [2306.00984] [QA].
  • SnapFusion: Text-to-Image Diffusion Model on Mobile Devices within Two Seconds - [2306.00980] [QA].
  • AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration - [2306.00978] [QA].
  • ViCo: Detail-Preserving Visual Condition for Personalized Text-to-Image Generation - [2306.00971] [QA].
  • The Hidden Language of Diffusion Models - [2306.00966] [QA].
  • Cocktail: Mixing Multi-Modality Controls for Text-Conditional Image Generation - [2306.00964] [QA].
  • The ObjectFolder Benchmark: Multisensory Learning with Neural and Real Objects - [2306.00956] [QA].
  • Make-Your-Video: Customized Video Generation Using Textual and Structural Guidance - [2306.00943] [QA].
  • STEVE-1: A Generative Model for Text-to-Behavior in Minecraft - [2306.00937] [QA].
  • Inserting Anybody in Diffusion Models via Celeb Basis - [2306.00926] [QA].
  • T2IAT: Measuring Valence and Stereotypical Biases in Text-to-Image Generation - [2306.00905] [QA].
  • LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day - [2306.00890] [QA].
  • Birth of a Transformer: A Memory Viewpoint - [2306.00802] [QA].
  • Microstructure quality control of steels using deep learning - [2306.0797] [QA].
  • GPT4Image: Can Large Pre-trained Models Help Vision Models on Perception Tasks? - [2306.00693] [QA].
  • Wuerstchen: Efficient Pretraining of Text-to-Image Models - [2306.00637] [QA].
  • ReviewerGPT? An Exploratory Study on Using Large Language Models for Paper Reviewing - [2306.00622] [QA].
  • Exploring Open-Vocabulary Semantic Segmentation without Human Labels - [2306.00450] [QA].
  • Example-based Motion Synthesis via Generative Motion Matching - [2306.00378] [QA].
  • Thought Cloning: Learning to Think while Acting by Imitating Human Thinking - [2306.00323] [QA].
  • Rethinking Model Evaluation as Narrowing the Socio-Technical Gap - [2306.03100] [QA].

May 2023

  • From Pixels to UI Actions: Learning to Follow Instructions via Graphical User Interfaces - [2306.00245] [QA].
  • Bytes Are All You Need: Transformers Operating Directly On File Bytes - [2306.00238] [QA].
  • SafeDiffuser: Safe Planning with Diffusion Probabilistic Models - [2306.00148] [QA].
  • MuseCoco: Generating Symbolic Music from Text - [2306.00110] [QA].
  • MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training - [2306.00107] [QA].
  • Humans in 4D: Reconstructing and Tracking Humans with Transformers - [2305.20091] [QA].
  • Improving CLIP Training with Language Rewrites - [2305.20088] [QA].
  • Too Large; Data Reduction for Vision-Language Pre-Training - [2305.20087] [QA].
  • Understanding and Mitigating Copying in Diffusion Models - [2305.20086] [QA].
  • Control4D: Dynamic Portrait Editing by Learning 4D GAN from 2D Diffusion-based Editor - [2305.20082] [QA].
  • Efficient Diffusion Policies for Offline Reinforcement Learning - [2305.20081] [QA].
  • Tree-Ring Watermarks: Fingerprints for Diffusion Images that are Invisible and Robust - [2305.20030] [QA].
  • Monotonic Location Attention for Length Generalization - [2305.20019] [QA].
  • Human or Not? A Gamified Approach to the Turing Test - [2305.20010] [QA].
  • Deliberate then Generate: Enhanced Prompting Framework for Text Generation - [2305.19835] [QA].
  • Dense and Aligned Captions (DAC) Promote Compositional Reasoning in VL Models - [2305.19595] [QA].
  • Neural Kernel Surface Reconstruction - [2305.19590] [QA].
  • CodeTF: One-stop Transformer Library for State-of-the-art Code LLM - [2306.00029] [QA].
  • PlaSma: Making Small Language Models Better Procedural Knowledge Models for (Counterfactual) Planning - [2305.19472] [QA].
  • The Impact of Positional Encoding on Length Generalization in Transformers - [2305.19466] [QA].
  • Bigger, Better, Faster: Human-level Atari with human-level efficiency - [2305.19452] [QA].
  • Blockwise Parallel Transformer for Large Context Models - [2305.19370] [QA].
  • AlteredAvatar: Stylizing Dynamic 3D Avatars with Fast Style Adaptation - [2305.19245] [QA].
  • Grammar Prompting for Domain-Specific Language Generation with Large Language Models - [2305.19234] [QA].
  • LANCE: Stress-testing Visual Models by Generating Language-guided Counterfactual Images - [2305.19164] [QA].
  • Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate - [2305.19118] [QA].
  • Nested Diffusion Processes for Anytime Image Generation - [2305.19066] [QA].
  • Rank-adaptive spectral pruning of convolutional layers during training - [2305.19059] [QA].
  • StyleAvatar3D: Leveraging Image-Text Diffusion Models for High-Fidelity 3D Avatar Generation - [2305.19012] [QA].
  • Independent Component Alignment for Multi-Task Learning - [2305.19000] [QA].
  • LibriTTS-R: A Restored Multi-Speaker Text-to-Speech Corpus - [2305.18802] [QA].
  • HiFA: High-fidelity Text-to-3D with Advanced Diffusion Guidance - [2305.18766] [QA].
  • VSTAR: A Video-grounded Dialogue Dataset for Situated Semantic Understanding with Scene and Topic Transitions - [2305.18756] [QA].
  • GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction - [2305.18752] [QA].
  • Real-World Image Variation by Aligning Diffusion Inversion Chain - [2305.18729] [QA].
  • Faith and Fate: Limits of Transformers on Compositionality - [2305.18654] [QA].
  • Controllable Text-to-Image Generation with GPT-4 - [2305.18583] [QA].
  • PaLI-X: On Scaling up a Multilingual Vision and Language Model - [2305.18565] [QA].
  • Brainformers: Trading Simplicity for Efficiency - [2306.00008] [QA].
  • RAPHAEL: Text-to-Image Generation via Large Mixture of Diffusion Paths - [2305.18295] [QA].
  • Mix-of-Show: Decentralized Low-Rank Adaptation for Multi-Concept Customization of Diffusion Models - [2305.18292] [QA].
  • Direct Preference Optimization: Your Language Model is Secretly a Reward Model - [2305.18290] [QA].
  • Photoswap: Personalized Subject Swapping in Images - [2305.18286] [QA].
  • Contextual Object Detection with Multimodal Large Language Models - [2305.18279] [QA].
  • Reconstructing the Mind's Eye: fMRI-to-Image with Contrastive Learning and Diffusion Priors - [2305.18274] [QA].
  • Gen-L-Video: Multi-Text to Long Video Generation via Temporal Co-Denoising - [2305.18264] [QA].
  • GlyphControl: Glyph Conditional Control for Visual Text Generation - [2305.18259] [QA].
  • TaleCrafter: Interactive Story Visualization with Multiple Characters - [2305.18247] [QA].
  • Marked Personas: Using Natural Language Prompts to Measure Stereotypes in Language Models - [2305.18189] [QA].
  • Code Prompting: a Neural Symbolic Method for Complex Reasoning in Large Language Models - [2305.18507] [QA].
  • Pre-training Contextualized World Models with In-the-wild Videos for Reinforcement Learning - [2305.18499] [QA].
  • BigTranslate: Augmenting Large Language Models with Multilingual Translation Capability over 100 Languages - [2305.18098] [QA].
  • Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation - [2305.18474] [QA].
  • DiffRate : Differentiable Compression Rate for Efficient Vision Transformers - [2305.17997] [QA].
  • Efficient Storage of Fine-Tuned Models via Low-Rank Approximation of Weight Residuals - [2305.18425] [QA].
  • Geometric Algebra Transformers - [2305.18415] [QA].
  • KAFA: Rethinking Image Ad Understanding with Knowledge-Augmented Feature Adaptation of Vision-Language Models - [2305.18373] [QA].
  • Data Minimization at Inference Time - [2305.17593] [QA].
  • Scalable Transformer for PDE Surrogate Modeling - [2305.17560] [QA].
  • The Curse of Recursion: Training on Generated Data Makes Models Forget - [2305.17493] [QA].
  • What can Large Language Models do in chemistry? A comprehensive benchmark on eight tasks - [2305.18365] [QA].
  • SwiftSage: A Generative Agent with Fast and Slow Thinking for Complex Interactive Tasks - [2305.17390] [QA].
  • MPCHAT: Towards Multimodal Persona-Grounded Conversation - [2305.17388] [QA].
  • Augmenting Large Language Model Translators via Translation Memories - [2305.17367] [QA].
  • DNA-GPT: Divergent N-Gram Analysis for Training-Free Detection of GPT-Generated Text - [2305.17359] [QA].
  • Fine-Tuning Language Models with Just Forward Passes - [2305.17333] [QA].
  • Beyond Positive Scaling: How Negation Impacts Scaling Trends of Language Models - [2305.17311] [QA].
  • Chain-of-Thought Hub: A Continuous Effort to Measure Large Language Models' Reasoning Performance - [2305.17306] [QA].
  • SQL-PaLM: Improved Large Language Model Adaptation for Text-to-SQL - [2306.00739] [QA].
  • Generating Images with Multimodal Language Models - [2305.17216] [QA].
  • Large Language Models as Tool Makers - [2305.17126] [QA].
  • Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time - [2305.17118] [QA].
  • High-Fidelity Image Compression with Score-based Generative Models - [2305.18231] [QA].
  • ControlVideo: Adding Conditional Control for One Shot Text-to-Video Editing - [2305.17098] [QA].
  • Mindstorms in Natural Language-Based Societies of Mind - [2305.17066] [QA].
  • SOC: Semantic-Assisted Object Cluster for Referring Video Object Segmentation - [2305.17011] [QA].
  • Let the Flows Tell: Solving Graph Combinatorial Optimization Problems with GFlowNets - [2305.17010] [QA].
  • Three Towers: Flexible Contrastive Learning with Pretrained Image Models - [2305.16999] [QA].
  • Inverse Dynamics Pretraining Learns Good Representations for Multitask Imitation - [2305.16985] [QA].
  • Training Socially Aligned Language Models in Simulated Human Society - [2305.16960] [QA].
  • MixCE: Training Autoregressive Language Models by Mixing Forward and Reverse Cross-Entropies - [2305.16958] [QA].
  • On Evaluating Adversarial Robustness of Large Vision-Language Models - [2305.16934] [QA].
  • MultiTool-CoT: GPT-3 Can Use Multiple External Tools with Chain of Thought Prompting - [2305.16896] [QA].
  • Playing repeated games with Large Language Models - [2305.16867] [QA].
  • Randomized Positional Encodings Boost Length Generalization of Transformers - [2305.16843] [QA].
  • Selective Mixup Helps with Distribution Shifts, But Not (Only) because of Mixup - [2305.16817] [QA].
  • Do GPTs Produce Less Literal Translations? - [2305.16806] [QA].
  • Multimodal Recommendation Dialog with Subjective Preference: A New Challenge and Benchmark - [2305.18212] [QA].
  • A Closer Look at In-Context Learning under Distribution Shifts - [2305.16704] [QA].
  • AdaPlanner: Adaptive Planning from Feedback with Language Models - [2305.16653] [QA].
  • Impossible Distillation: from Low-Quality Model to High-Quality Dataset & Model for Summarization and Paraphrasing - [2305.16635] [QA].
  • Beyond Chain-of-Thought, Effective Graph-of-Thought Reasoning in Large Language Models - [2305.16582] [QA].
  • On the Tool Manipulation Capability of Open-source Large Language Models - [2305.16504] [QA].
  • ZeroAvatar: Zero-shot 3D Avatar Generation from a Single Image - [2305.16411] [QA].
  • Ghost in the Minecraft: Generally Capable Agents for Open-World Environments via Large Language Models with Text-based Knowledge and Memory - [2305.17144] [QA].
  • Break-A-Scene: Extracting Multiple Concepts from a Single Image - [2305.16311] [QA].
  • Landmark Attention: Random-Access Infinite Context Length for Transformers - [2305.16300] [QA].
  • Voyager: An Open-Ended Embodied Agent with Large Language Models - [2305.16291] [QA].
  • DPOK: Reinforcement Learning for Fine-tuning Text-to-Image Diffusion Models - [2305.16381] [QA].
  • ProlificDreamer: High-Fidelity and Diverse Text-to-3D Generation with Variational Score Distillation - [2305.16213] [QA].
  • Scan and Snap: Understanding Training Dynamics and Token Composition in 1-layer Transformer - [2305.16380] [QA].
  • ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst - [2305.16103] [QA].
  • Role-Play with Large Language Models - [2305.16367] [QA].
  • On Architectural Compression of Text-to-Image Diffusion Models - [2305.15798] [QA].
  • Custom-Edit: Text-Guided Image Editing with Customized Diffusion Models - [2305.15779] [QA].
  • On the Planning Abilities of Large Language Models -- A Critical Investigation - [2305.15771] [QA].
  • Efficient Neural Music Generation - [2305.15719] [QA].
  • The False Promise of Imitating Proprietary LLMs - [2305.15717] [QA].
  • PandaGPT: One Model To Instruction-Follow Them All - [2305.16355] [QA].
  • Manifold Diffusion Fields - [2305.15586] [QA].
  • Unsupervised Semantic Correspondence Using Stable Diffusion - [2305.15581] [QA].
  • Lexinvariant Language Models - [2305.16349] [QA].
  • SPRING: GPT-4 Out-performs RL Algorithms by Studying Papers and Reasoning - [2305.15486] [QA].
  • LayoutGPT: Compositional Visual Planning and Generation with Large Language Models - [2305.15393] [QA].
  • Learning high-level visual representations from a child's perspective without strong inductive biases - [2305.15372] [QA].
  • Gorilla: Large Language Model Connected with Massive APIs - [2305.15334] [QA].
  • Visual Programming for Text-to-Image Generation and Evaluation - [2305.15328] [QA].
  • Enhancing Retrieval-Augmented Large Language Models with Iterative Retrieval-Generation Synergy - [2305.15294] [QA].
  • ViTMatte: Boosting Image Matting with Pretrained Plain Vision Transformers - [2305.15272] [QA].
  • Revisiting Parallel Context Windows: A Frustratingly Simple Alternative and Chain-of-Thought Deterioration - [2305.15262] [QA].
  • Adaptive Policy Learning to Additional Tasks - [2305.15193] [QA].
  • Policy Learning based on Deep Koopman Representation - [2305.15188] [QA].
  • Semantic-Enhanced Differentiable Search Index Inspired by Learning Strategies - [2305.15115] [QA].
  • Dynamic Masking Rate Schedules for MLM Pretraining - [2305.15096] [QA].
  • Is GPT-4 a Good Data Analyst? - [2305.15038] [QA].
  • Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models - [2305.15023] [QA].
  • EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought - [2305.15021] [QA].
  • Reasoning with Language Model is Planning with World Model - [2305.14992] [QA].
  • IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models - [2305.14985] [QA].
  • Benchmarking Arabic AI with Large Language Models - [2305.14982] [QA].
  • Assessment of the Reliablity of a Model's Decision by Generalizing Attribution to the Wavelet Domain - [2305.14979] [QA].
  • Discriminator-Guided Multi-step Reasoning with Language Models - [2305.14934] [QA].
  • Leveraging GPT-4 for Automatic Translation Post-Editing - [2305.14878] [QA].
  • PaCE: Unified Multi-modal Dialogue Pre-training with Progressive and Compositional Experts - [2305.14839] [QA].
  • Adapting Language Models to Compress Contexts - [2305.14788] [QA].
  • Instructions as Backdoors: Backdoor Vulnerabilities of Instruction Tuning for Large Language Models - [2305.14710] [QA].
  • ExpertPrompting: Instructing Large Language Models to be Distinguished Experts - [2305.14688] [QA].
  • Barkour: Benchmarking Animal-level Agility with Quadruped Robots - [2305.14654] [QA].
  • Enabling Large Language Models to Generate Text with Citations - [2305.14627] [QA].
  • Think Before You Act: Decision Transformers with Internal Working Memory - [2305.16338] [QA].
  • Attentiveness to Answer Choices Doesn't Always Entail High QA Accuracy - [2305.14596] [QA].
  • PEARL: Prompting Large Language Models to Plan and Execute Actions Over Long Documents - [2305.14564] [QA].
  • LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond - [2305.14540] [QA].
  • Self-Polish: Enhance Reasoning in Large Language Models via Problem Refinement - [2305.14497] [QA].
  • Video Prediction Models as Rewards for Reinforcement Learning - [2305.14343] [QA].
  • Automatic Model Selection with Large Language Models for Reasoning - [2305.14333] [QA].
  • Improving Factuality and Reasoning in Language Models through Multiagent Debate - [2305.14325] [QA].
  • ChatCoT: Tool-Augmented Chain-of-Thought Reasoning on Chat-based Large Language Models - [2305.14323] [QA].
  • RET-LLM: Towards a General Read-Write Memory for Large Language Models - [2305.14322] [QA].
  • CREATOR: Disentangling Abstract and Concrete Reasonings of Large Language Models through Tool Creation - [2305.14318] [QA].
  • QLoRA: Efficient Finetuning of Quantized LLMs - [2305.14314] [QA].
  • On Learning to Summarize with Large Language Models as References - [2305.14239] [QA].
  • REC-MV: REconstructing 3D Dynamic Cloth from Monocular Videos - [2305.14236] [QA].
  • Enhancing Chat Language Models by Scaling High-quality Instructional Conversations - [2305.14233] [QA].
  • Goat: Fine-tuned LLaMA Outperforms GPT-4 on Arithmetic Tasks - [2305.14201] [QA].
  • DetGPT: Detect What You Need via Reasoning - [2305.14167] [QA].
  • Let's Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction - [2305.13903] [QA].
  • PaD: Program-aided Distillation Specializes Large Models in Reasoning - [2305.13888] [QA].
  • OlaGPT: Empowering LLMs With Human-like Problem-Solving Abilities - [2305.16334] [QA].
  • Control-A-Video: Controllable Text-to-Video Generation with Diffusion Models - [2305.13840] [QA].
  • Can Large Language Models Infer and Disagree Like Humans? - [2305.13788] [QA].
  • Perception Test: A Diagnostic Benchmark for Multimodal Video Models - [2305.13786] [QA].
  • Images in Language Space: Exploring the Suitability of Large Language Models for Vision & Language Tasks - [2305.13782] [QA].
  • Aligning Large Language Models through Synthetic Feedback - [2305.13735] [QA].
  • Text Is All You Need: Learning Language Representations for Sequential Recommendation - [2305.13731] [QA].
  • Prompting and Evaluating Large Language Models for Proactive Dialogues: Clarification, Target-guided, and Non-collaboration - [2305.13626] [QA].
  • Transformer-based Vulnerability Detection in Code at EditTime: Zero-shot, Few-shot, or Fine-tuning? - [2306.01754] [QA].
  • Enhancing Detail Preservation for Customized Text-to-Image Generation: A Regularization-Free Approach - [2305.13579] [QA].
  • How Language Model Hallucinations Can Snowball - [2305.13534] [QA].
  • RecurrentGPT: Interactive Generation of (Arbitrarily) Long Text - [2305.13304] [QA].
  • Training Diffusion Models with Reinforcement Learning - [2305.13301] [QA].
  • Interactive Natural Language Processing - [2305.13246] [QA].
  • LLMs for Knowledge Graph Construction and Reasoning: Recent Capabilities and Future Opportunities - [2305.13168] [QA].
  • ControlVideo: Training-free Controllable Text-to-Video Generation - [2305.13077] [QA].
  • Making Language Models Better Tool Learners with Execution Feedback - [2305.13068] [QA].
  • AudioToken: Adaptation of Text-Conditioned Diffusion Models for Audio-to-Image Generation - [2305.13050] [QA].
  • RWKV: Reinventing RNNs for the Transformer Era - [2305.13048] [QA].
  • Textually Pretrained Speech Language Models - [2305.13009] [QA].
  • Boosting Long-tailed Object Detection via Step-wise Learning on Smooth-tail Data - [2305.12833] [QA].
  • Keeping Up with the Language Models: Robustness-Bias Interplay in NLI Data and Models - [2305.12620] [QA].
  • GMD: Controllable Human Motion Synthesis via Guided Diffusion Models - [2305.12577] [QA].
  • Conditional Generative Modeling is All You Need for Marked Temporal Point Processes - [2305.12569] [QA].
  • Augmenting Autotelic Agents with Large Language Models - [2305.12487] [QA].
  • Advancing Referring Expression Segmentation Beyond Single Image - [2305.12452] [QA].
  • CodeCompose: A Large-Scale Industrial Deployment of AI-assisted Code Authoring - [2305.12050] [QA].
  • OPT-R: Exploring the Role of Explanations in Finetuning and Prompting for Reasoning Skills of Large Language Models - [2305.12001] [QA].
  • Exploring the Viability of Synthetic Query Generation for Relevance Prediction - [2305.11944] [QA].
  • XTREME-UP: A User-Centric Scarce-Data Benchmark for Under-Represented Languages - [2305.11938] [QA].
  • Chupa: Carving 3D Clothed Humans from Skinned Shape Priors using 2D Diffusion Probabilistic Models - [2305.11870] [QA].
  • Scaling laws for language encoding models in fMRI - [2305.11863] [QA].
  • Multimodal Web Navigation with Instruction-Finetuned Foundation Models - [2305.11854] [QA].
  • Any-to-Any Generation via Composable Diffusion - [2305.11846] [QA].
  • How Does Generative Retrieval Scale to Millions of Passages? - [2305.11841] [QA].
  • SeeGULL: A Stereotype Benchmark with Broad Geo-Cultural Coverage Leveraging Generative Models - [2305.11840] [QA].
  • Comparing Software Developers with ChatGPT: An Empirical Investigation - [2305.11837] [QA].
  • Pengi: An Audio Language Model for Audio Tasks - [2305.11834] [QA].
  • Cross-Lingual Supervision improves Large Language Models Pre-training - [2305.11778] [QA].
  • Neural Foundations of Mental Simulation: Future Prediction of Latent Representations on Dynamic Scenes - [2305.11772] [QA].
  • Controlling the Extraction of Memorized Data from Large Language Models via Prompt-Tuning - [2305.11759] [QA].
  • CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing - [2305.11738] [QA].
  • QUEST: A Retrieval Dataset of Entity-Seeking Queries with Implicit Set Operations - [2305.11694] [QA].
  • Learning Global-aware Kernel for Image Harmonization - [2305.11676] [QA].
  • Cinematic Mindscapes: High-quality Video Reconstruction from Brain Activity - [2305.11675] [QA].
  • Introspective Tips: Large Language Model for In-Context Decision Making - [2305.11598] [QA].
  • Text2NeRF: Text-Driven 3D Scene Generation with Neural Radiance Fields - [2305.11588] [QA].
  • ToolkenGPT: Augmenting Frozen Language Models with Massive Tools via Tool Embeddings - [2305.11554] [QA].
  • Empower Large Language Model to Perform Better on Industrial Domain-Specific Question Answering - [2305.11541] [QA].
  • RCOT: Detecting and Rectifying Factual Inconsistency in Reasoning by Reversing Chain-of-Thought - [2305.11499] [QA].
  • Enhancing Personalized Dialogue Generation with Contrastive Latent Variables: Combining Sparse and Dense Persona - [2305.11482] [QA].
  • Towards Human-AI Collaborative Urban Science Research Enabled by Pre-trained Large Language Models - [2305.11418] [QA].
  • Visualizing Linguistic Diversity of Text Datasets Synthesized by Large Language Models - [2305.11364] [QA].
  • RoomDreamer: Text-Driven 3D Indoor Scene Synthesis with Coherent Geometry and Texture - [2305.11337] [QA].
  • Counterfactuals for Design: A Model-Agnostic Method For Design Recommendations - [2305.11308] [QA].
  • Towards Collaborative Plan Acquisition through Theory of Mind Modeling in Situated Dialogue - [2305.11271] [QA].
  • Instruct2Act: Mapping Multi-modality Instructions to Robotic Actions with Large Language Model - [2305.11176] [QA].
  • VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks - [2305.11175] [QA].
  • Going Denser with Open-Vocabulary Part Segmentation - [2305.11173] [QA].
  • TrueTeacher: Learning Factual Consistency Evaluation with Large Language Models - [2305.11171] [QA].
  • Evidence of Meaning in Language Models Trained on Programs - [2305.11169] [QA].
  • TOME: A Two-stage Approach for Model-based Retrieval - [2305.11161] [QA].
  • LIMA: Less Is More for Alignment - [2305.11206] [QA].
  • UniControl: A Unified Diffusion Model for Controllable Visual Generation In the Wild - [2305.11147] [QA].
  • SimOAP: Improve Coherence and Consistency in Persona-based Dialogue Generation via Over-sampling and Post-evaluation - [2305.11130] [QA].
  • mLongT5: A Multilingual and Efficient Text-To-Text Transformer for Longer Sequences - [2305.11129] [QA].
  • LLMScore: Unveiling the Power of Large Language Models in Text-to-Image Synthesis Evaluation - [2305.11116] [QA].
  • PDP: Parameter-free Differentiable Pruning is All You Need - [2305.11203] [QA].
  • DrugChat: Towards Enabling ChatGPT-Like Capabilities on Drug Molecule Graphs - [2309.03907] [QA].
  • Inspecting the Geographical Representativeness of Images from Text-to-Image Models - [2305.11080] [QA].
  • SDC-UDA: Volumetric Unsupervised Domain Adaptation Framework for Slice-Direction Continuous Cross-Modality Medical Image Segmentation - [2305.11012] [QA].
  • SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities - [2305.11000] [QA].
  • Drag Your GAN: Interactive Point-based Manipulation on the Generative Image Manifold - [2305.10973] [QA].
  • An Android Robot Head as Embodied Conversational Agent - [2305.10945] [QA].
  • A Generalist Dynamics Model for Control - [2305.10912] [QA].
  • VideoFactory: Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation - [2305.10874] [QA].
  • TextDiffuser: Diffusion Models as Text Painters - [2305.10855] [QA].
  • 3D Registration with Maximal Cliques - [2305.10854] [QA].
  • LDM3D: Latent Diffusion Model for 3D - [2305.10853] [QA].
  • GETMusic: Generating Any Music Tracks with a Unified Representation and Diffusion Framework - [2305.10841] [QA].
  • Listen, Think, and Understand - [2305.10790] [QA].
  • OpenShape: Scaling Up 3D Shape Representation Towards Open-World Understanding - [2305.10764] [QA].
  • CLAPSpeech: Learning Prosody from Text Context with Contrastive Language-Audio Pre-training - [2305.10763] [QA].
  • Boost Vision Transformer with GPU-Friendly Sparsity and Quantization - [2305.10727] [QA].
  • Discriminative Diffusion Models as Few-shot Vision and Language Learners - [2305.10722] [QA].
  • Zero-Day Backdoor Attack against Text-to-Image Diffusion Models via Personalization - [2305.10701] [QA].
  • MolXPT: Wrapping Molecules with Text for Generative Pre-training - [2305.10688] [QA].
  • Language Models Meet World Models: Embodied Experiences Enhance Language Models - [2305.10626] [QA].
  • Tree of Thoughts: Deliberate Problem Solving with Large Language Models - [2305.10601] [QA].
  • Instruction Tuned Models are Quick Learners - [2306.05539] [QA].
  • IMAD: IMage-Augmented multi-modal Dialogue - [2305.10512] [QA].
  • FastComposer: Tuning-Free Multi-Subject Image Generation with Localized Attention - [2305.10431] [QA].
  • Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models - [2305.10474] [QA].
  • DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining - [2305.10429] [QA].
  • SLiC-HF: Sequence Likelihood Calibration with Human Feedback - [2305.10425] [QA].
  • PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering - [2305.10415] [QA].
  • PaLM 2 Technical Report - [2305.10403] [QA].
  • What You See is What You Read? Improving Text-Image Alignment Evaluation - [2305.10400] [QA].
  • Elaborative Simplification as Implicit Questions Under Discussion - [2305.10387] [QA].
  • Evaluating Object Hallucination in Large Vision-Language Models - [2305.10355] [QA].
  • CostFormer: Cost Transformer for Cost Aggregation in Multi-view Stereo - [2305.10320] [QA].
  • Searching for Needles in a Haystack: On the Role of Incidental Bilingualism in PaLM's Translation Capability - [2305.10266] [QA].
  • MemoryBank: Enhancing Large Language Models with Long-Term Memory - [2305.10250] [QA].
  • Knowledge-enhanced Mixed-initiative Dialogue System for Emotional Support Conversations - [2305.10172] [QA].
  • Improving Language Model Negotiation with Self-Play and In-Context Learning from AI Feedback - [2305.10142] [QA].
  • Transfer Learning for Fine-grained Classification Using Semi-supervised Learning and Visual Transformers - [2305.10018] [QA].
  • DinoSR: Self-Distillation and Online Clustering for Self-supervised Speech Representation Learning - [2305.10005] [QA].
  • Dual Semantic Knowledge Composed Multimodal Dialog Systems - [2305.09990] [QA].
  • Smart Word Suggestions for Writing Assistance - [2305.09975] [QA].
  • Towards Generalist Robots: A Promising Paradigm via Generative Simulation - [2305.10455] [QA].
  • Explaining black box text modules in natural language with language models - [2305.09863] [QA].
  • CoEdIT: Text Editing by Task-Specific Instruction Tuning - [2305.09857] [QA].
  • ConvXAI: Delivering Heterogeneous AI Explanations via Conversations to Support Human-AI Scientific Writing - [2305.09770] [QA].
  • Application-Agnostic Language Modeling for On-Device ASR - [2305.09764] [QA].
  • NerfBridge: Bringing Real-time, Online Neural Radiance Field Training to Robotics - [2305.09761] [QA].
  • A Video Is Worth 4096 Tokens: Verbalize Story Videos To Understand Them In Zero Shot - [2305.09758] [QA].
  • Understanding 3D Object Interaction from a Single Image - [2305.09664] [QA].
  • Make-An-Animation: Large-Scale Text-conditional 3D Human Motion Generation - [2305.09662] [QA].
  • FitMe: Deep Photorealistic 3D Morphable Model Avatars - [2305.09641] [QA].
  • SoundStorm: Efficient Parallel Audio Generation - [2305.09636] [QA].
  • Towards Expert-Level Medical Question Answering with Large Language Models - [2305.09617] [QA].
  • Large Language Models are Built-in Autoregressive Search Engines - [2305.09612] [QA].
  • Cooperation Is All You Need - [2305.10449] [QA].
  • AR-Diffusion: Auto-Regressive Diffusion Model for Text Generation - [2305.09515] [QA].
  • Online Continual Learning Without the Storage Constraint - [2305.09253] [QA].
  • Dual-Alignment Pre-training for Cross-lingual Sentence Embedding - [2305.09148] [QA].
  • Pre-Training to Learn in Context - [2305.09137] [QA].
  • SuSana Distancia is all you need: Enforcing class separability in metric learning via two novel distance-based loss functions for few-shot image classification - [2305.09062] [QA].
  • MV-Map: Offboard HD-Map Generation with Multi-view Consistency - [2305.08851] [QA].
  • Make-A-Protagonist: Generic Video Editing with An Ensemble of Experts - [2305.08850] [QA].
  • Small Models are Valuable Plug-ins for Large Language Models - [2305.08848] [QA].
  • RL4F: Generating Natural Language Feedback with Reinforcement Learning for Repairing Model Outputs - [2305.08844] [QA].
  • Straightening Out the Straight-Through Estimator: Overcoming Optimization Challenges in Vector Quantized Networks - [2305.08842] [QA].
  • Attacking Perceptual Similarity Metrics - [2305.08840] [QA].
  • AutoRecon: Automated 3D Object Discovery and Reconstruction - [2305.08810] [QA].
  • Interpretability at Scale: Identifying Causal Mechanisms in Alpaca - [2305.08809] [QA].
  • A Reproducible Extraction of Training Images from Diffusion Models - [2305.08694] [QA].
  • Natural Language Decomposition and Interpretation of Complex Utterances - [2305.08677] [QA].
  • DarkBERT: A Language Model for the Dark Side of the Internet - [2305.08596] [QA].
  • Common Diffusion Noise Schedules and Sample Steps are Flawed - [2305.08891] [QA].
  • TESS: Text-to-Text Self-Conditioned Simplex Diffusion - [2305.08379] [QA].
  • Symbol tuning improves in-context learning in language models - [2305.08298] [QA].
  • ULIP-2: Towards Scalable Multimodal Pre-training for 3D Understanding - [2305.08275] [QA].
  • A Cognitive Stimulation Dialogue System with Multi-source Knowledge Fusion for Elders with Cognitive Impairment - [2305.08200] [QA].
  • GPT-Sentinel: Distinguishing Human and ChatGPT Generated Content - [2305.07969] [QA].
  • Leveraging Large Language Models in Conversational Recommender Systems - [2305.07961] [QA].
  • CodeT5+: Open Code Large Language Models for Code Understanding and Generation - [2305.07922] [QA].
  • Improving Small Language Models on PubMedQA via Generative Data Augmentation - [2305.07804] [QA].
  • ACCENT: An Automatic Event Commonsense Evaluation Metric for Open-Domain Dialogue Systems - [2305.07797] [QA].
  • TinyStories: How Small Can Language Models Be and Still Speak Coherent English? - [2305.07759] [QA].
  • In Search of Verifiability: Explanations Rarely Enable Complementary Performance in AI-Advised Decision Making - [2305.07722] [QA].
  • What are the Desired Characteristics of Calibration Sets? Identifying Correlates on Long Form Scientific Summarization - [2305.07615] [QA].
  • Is ChatGPT Fair for Recommendation? Evaluating Fairness in Large Language Model Recommendation - [2305.07609] [QA].
  • Measuring Progress in Fine-grained Vision-and-Language Understanding - [2305.07558] [QA].
  • BlendFields: Few-Shot Example-Driven Facial Modeling - [2305.07514] [QA].
  • ArtGPT-4: Artistic Vision-Language Understanding with Adapter-enhanced MiniGPT-4 - [2305.07490] [QA].
  • Surfacing Biases in Large Language Models using Contrastive Input Decoding - [2305.07378] [QA].
  • Better speech synthesis through scaling - [2305.07243] [QA].
  • MMG-Ego4D: Multi-Modal Generalization in Egocentric Action Recognition - [2305.07214] [QA].
  • MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers - [2305.07185] [QA].
  • Masked Audio Text Encoders are Effective Multi-Modal Rescorers - [2305.07677] [QA].
  • Towards best practices in AGI safety and governance: A survey of expert opinion - [2305.07153] [QA].
  • EfficientViT: Memory Efficient Vision Transformer with Cascaded Group Attention - [2305.07027] [QA].
  • Simple Token-Level Confidence Improves Caption Correctness - [2305.07021] [QA].
  • An Inverse Scaling Law for CLIP Training - [2305.07017] [QA].
  • Exploiting Diffusion Prior for Real-World Image Super-Resolution - [2305.07015] [QA].
  • Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers - [2305.07011] [QA].
  • Learning the Visualness of Text Using Large Vision-Language Models - [2305.10434] [QA].
  • Not All Languages Are Created Equal in LLMs: Improving Multilingual Capability by Cross-Lingual-Thought Prompting - [2305.07004] [QA].
  • Universal Source Separation with Weakly Labelled Data - [2305.07447] [QA].
  • CoMoSpeech: One-Step Speech and Singing Voice Synthesis via Consistency Model - [2305.06908] [QA].
  • A Category-theoretical Meta-analysis of Definitions of Disentanglement - [2305.06886] [QA].
  • Optimizing Memory Mapping Using Deep Reinforcement Learning - [2305.07440] [QA].
  • Distracting Downpour: Adversarial Weather Attacks for Motion Estimation - [2305.06716] [QA].
  • V2Meow: Meowing to the Visual Beat via Music Generation - [2305.06594] [QA].
  • Chain-of-Dictionary Prompting Elicits Translation in Large Language Models - [2305.06575] [QA].
  • How to Index Item IDs for Recommendation Foundation Models - [2305.06569] [QA].
  • Segment and Track Anything - [2305.06558] [QA].
  • Domain Incremental Lifelong Learning in an Open World - [2305.06555] [QA].
  • InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning - [2305.06500] [QA].
  • Do LLMs Understand User Preferences? Evaluating LLMs On User Rating Prediction - [2305.06474] [QA].
  • Perpetual Humanoid Control for Real-time Simulated Avatars - [2305.06456] [QA].
  • Bot or Human? Detecting ChatGPT Imposters with A Single Question - [2305.06424] [QA].
  • LACoS-BLOOM: Low-rank Adaptation with Contrastive objective on 8 bits Siamese-BLOOM - [2305.06404] [QA].
  • HumanRF: High-Fidelity Neural Radiance Fields for Humans in Motion - [2305.06356] [QA].
  • VideoChat: Chat-Centric Video Understanding - [2305.06355] [QA].
  • Reconstructing Animatable Categories from Videos - [2305.06351] [QA].
  • Alternating Gradient Descent and Mixture-of-Experts for Integrated Multimodal Perception - [2305.06324] [QA].
  • Summarizing, Simplifying, and Synthesizing Medical Evidence Using GPT-3 (with Varying Success) - [2305.06299] [QA].
  • Generative AI meets 3D: A Survey on Text-to-3D in AIGC Era - [2305.06131] [QA].
  • The Compositional Structure of Bayesian Inference - [2305.06112] [QA].
  • Relightify: Relightable 3D Faces from a Single Image via Diffusion Models - [2305.06077] [QA].
  • GPT Models Meet Robotic Applications: Co-Speech Gesturing Chat System - [2306.01741] [QA].
  • Privacy-Preserving Recommender Systems with Synthetic Query Generation using Differentially Private Large Language Models - [2305.05973] [QA].
  • Fast Distributed Inference Serving for Large Language Models - [2305.05920] [QA].
  • SHS-Net: Learning Signed Hyper Surfaces for Oriented Normal Estimation of Point Clouds - [2305.05873] [QA].
  • Are ChatGPT and GPT-4 General-Purpose Solvers for Financial Text Analytics? An Examination on Several Typical Tasks - [2305.05862] [QA].
  • Sketching the Future (STF): Applying Conditional Control Techniques to Text-to-Video Models - [2305.05845] [QA].
  • DexArt: Benchmarking Generalizable Dexterous Manipulation with Articulated Objects - [2305.05706] [QA].
  • InternGPT: Solving Vision-Centric Tasks by Interacting with ChatGPT Beyond Language - [2305.05662] [QA].
  • TidyBot: Personalized Robot Assistance with Large Language Models - [2305.05658] [QA].
  • Towards Building the Federated GPT: Federated Instruction Tuning - [2305.05644] [QA].
  • AudioSlots: A slot-centric generative model for audio separation - [2305.05591] [QA].
  • Recursions Are All You Need: Towards Efficient Deep Unfolding Networks - [2305.05505] [QA].
  • WikiWeb2M: A Page-Level Multimodal Wikipedia Dataset - [2305.05432] [QA].
  • Large Language Model Programs - [2305.05364] [QA].
  • Dialogue Planning via Brownian Bridge Stochastic Process for Goal-directed Proactive Dialogue - [2305.05290] [QA].
  • Distilling Script Knowledge from Large Language Models for Constrained Language Planning - [2305.05252] [QA].
  • SUR-adapter: Enhancing Text-to-Image Pre-trained Diffusion Models with Large Language Models - [2305.05189] [QA].
  • FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance - [2305.05176] [QA].
  • Knowledge-enhanced Agents for Interactive Text Games - [2305.05091] [QA].
  • Multi-Task End-to-End Training Improves Conversational Recommendation - [2305.06218] [QA].
  • Recommender Systems with Generative Retrieval - [2305.05065] [QA].
  • NerfAcc: Efficient Sampling Accelerates NeRFs - [2305.04966] [QA].
  • A Drop of Ink Makes a Million Think: The Spread of False Information in Large Language Models - [2305.04812] [QA].
  • MultiModal-GPT: A Vision and Language Model for Dialogue with Humans - [2305.04790] [QA].
  • AvatarReX: Real-time Expressive Full-body Avatars - [2305.04789] [QA].
  • Controllable Light Diffusion for Portraits - [2305.04745] [QA].
  • Code Execution with Pre-trained Language Models - [2305.05383] [QA].
  • LMPT: Prompt Tuning with Class-Specific Embedding Loss for Long-tailed Multi-Label Visual Recognition - [2305.04536] [QA].
  • Video Object Segmentation in Panoptic Wild Scenes - [2305.04470] [QA].
  • Locally Attentional SDF Diffusion for Controllable 3D Shape Generation - [2305.04461] [QA].
  • Prompt Tuning Inversion for Text-Driven Image Editing Using Diffusion Models - [2305.04441] [QA].
  • A Variational Perspective on Solving Inverse Problems with Diffusion Models - [2305.04391] [QA].
  • Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting - [2305.04388] [QA].
  • Unified Demonstration Retriever for In-Context Learning - [2305.04320] [QA].
  • Multi-Space Neural Radiance Fields - [2305.04268] [QA].
  • Vcc: Scaling Transformers to 128K Tokens or More by Prioritizing Important Tokens - [2305.04241] [QA].
  • Text-to-Image Diffusion Models can be Easily Backdoored through Multimodal Data Poisoning - [2305.04175] [QA].
  • X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages - [2305.04160] [QA].
  • Exploring Human-Like Translation Strategy with Large Language Models - [2305.04118] [QA].
  • Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models - [2305.04091] [QA].
  • Pre-training Language Model as a Multi-perspective Course Learner - [2305.03981] [QA].
  • Residual Prompt Tuning: Improving Prompt Tuning with Residual Reparameterization - [2305.03937] [QA].
  • Otter: A Multi-Modal Model with In-Context Instruction Tuning - [2305.03726] [QA].
  • Avatar Fingerprinting for Authorized Use of Synthetic Talking-Head Videos - [2305.03713] [QA].
  • LMEye: An Interactive Perception Network for Large Language Models - [2305.03701] [QA].
  • Vera: A General-Purpose Plausibility Estimation Model for Commonsense Statements - [2305.03695] [QA].
  • Mining bias-target Alignment from Voronoi Cells - [2305.03691] [QA].
  • COLA: A Benchmark for Compositional Text-to-image Retrieval - [2305.03689] [QA].
  • A Suite of Generative Tasks for Multi-Level Multimodal Webpage Understanding - [2305.03668] [QA].
  • Query Expansion by Prompting Large Language Models - [2305.03653] [QA].
  • T-SciQ: Teaching Multimodal Chain-of-Thought Reasoning via Large Language Model Signals for Science Question Answering - [2305.03453] [QA].
  • TransESC: Smoothing Emotional Support Conversation via Turn-Level State Transition - [2305.03296] [QA].
  • Composite Motion Learning with Task Control - [2305.03286] [QA].
  • Verify-and-Edit: A Knowledge-Enhanced Chain-of-Thought Framework - [2305.03268] [QA].
  • AttentionViz: A Global View of Transformer Attention - [2305.03210] [QA].
  • Can LLM Already Serve as A Database Interface? A BIg Bench for Large-Scale Database Grounded Text-to-SQLs - [2305.03111] [QA].
  • ZipIt! Merging Models from Different Tasks without Training - [2305.03053] [QA].
  • Tracking through Containers and Occluders in the Wild - [2305.03052] [QA].
  • Controllable Visual-Tactile Synthesis - [2305.03051] [QA].
  • NeuralEditor: Editing Neural Radiance Fields via Manipulating Point Clouds - [2305.03049] [QA].
  • Personalize Segment Anything Model with One Shot - [2305.03048] [QA].
  • Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision - [2305.03047] [QA].
  • Single-Shot Implicit Morphable Faces with Consistent Texture Parameterization - [2305.03043] [QA].
  • TUVF: Learning Generalizable Texture UV Radiance Fields - [2305.03040] [QA].
  • NeRSemble: Multi-view Radiance Field Reconstruction of Human Heads - [2305.03027] [QA].
  • Diffusion Explainer: Visual Explanation for Text-to-image Stable Diffusion - [2305.03509] [QA].
  • Masked Trajectory Models for Prediction, Representation, and Control - [2305.02968] [QA].
  • BranchNorm: Robustly Scaling Extremely Deep Transformers - [2305.02790] [QA].
  • A Survey on Proactive Dialogue Systems: Problems, Methods, and Prospects - [2305.02750] [QA].
  • Real-Time Neural Appearance Models - [2305.02678] [QA].
  • Caption Anything: Interactive Image Description with Diverse Multimodal Controls - [2305.02677] [QA].
  • Learning Language-Specific Layers for Multilingual Machine Translation - [2305.02665] [QA].
  • Semantically Structured Image Compression via Irregular Group-Based Decoupling - [2305.02586] [QA].
  • Should ChatGPT and Bard Share Revenue with Their Data Providers? A New Business Model for the AI Era - [2305.02555] [QA].
  • FormNetV2: Multimodal Graph Contrastive Learning for Form Document Information Extraction - [2305.02549] [QA].
  • AutoML-GPT: Automatic Machine Learning with GPT - [2305.02499] [QA].
  • ChatGPT-steered Editing Instructor for Customization of Abstractive Summarization - [2305.02483] [QA].
  • Shap-E: Generating Conditional 3D Implicit Functions - [2305.02463] [QA].
  • Cheaply Evaluating Inference Efficiency Metrics for Autoregressive Transformer APIs - [2305.02440] [QA].
  • Plan, Eliminate, and Track -- Language Models are Good Teachers for Embodied Agents - [2305.02412] [QA].
  • Generating Synthetic Documents for Cross-Encoder Re-Rankers: A Comparative Study of ChatGPT and Human Experts - [2305.02320] [QA].
  • Visual Chain of Thought: Bridging Logical Gaps with Multimodal Infillings - [2305.02317] [QA].
  • Uncovering ChatGPT's Capabilities in Recommender Systems - [2305.02182] [QA].
  • Zero-Shot Listwise Document Reranking with a Large Language Model - [2305.02156] [QA].
  • Multimodal Procedural Planning via Dual Text-Image Prompting - [2305.01795] [QA].
  • Automated Code generation for Information Technology Tasks in YAML through Large Language Models - [2305.02783] [QA].
  • Stars Are All You Need: A Distantly Supervised Pyramid Network for Document-Level End-to-End Sentiment Analysis - [2305.01710] [QA].
  • TMR: Text-to-Motion Retrieval Using Contrastive 3D Human Motion Synthesis - [2305.00976] [QA].
  • Unlimiformer: Long-Range Transformers with Unlimited Length Input - [2305.01625] [QA].
  • Transfer Visual Prompt Generator across LLMs - [2305.01278] [QA].
  • The Role of Summarization in Generative Agents: A Preliminary Perspective - [2305.01253] [QA].
  • ArK: Augmented Reality with Knowledge Interactive Emergent Ability - [2305.00970] [QA].
  • Bridging the Gap: A Survey on Integrating (Human) Feedback for Natural Language Generation - [2305.00955] [QA].
  • Hypernuclear event detection in the nuclear emulsion with Monte Carlo simulation and machine learning - [2305.0884] [QA].
  • Learning to Reason and Memorize with Self-Notes - [2305.00833] [QA].
  • Bidirectional Copy-Paste for Semi-Supervised Medical Image Segmentation - [2305.00673] [QA].

April 2023

  • TALLRec: An Effective and Efficient Tuning Framework to Align Large Language Model with Recommendation - [2305.00447] [QA].
  • LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model - [2304.15010] [QA].
  • Topic-oriented Adversarial Attacks against Black-box Neural Ranking Models - [2304.14867] [QA].
  • A Unified Generative Retriever for Knowledge-Intensive Language Tasks via Prompt Learning - [2304.14856] [QA].
  • IMP: Iterative Matching and Pose Estimation with Adaptive Pooling - [2304.14837] [QA].
  • Multivariate Representation Learning for Information Retrieval - [2304.14522] [QA].
  • Framing the News: From Human Perception to Large Language Model Inferences - [2304.14456] [QA].
  • ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System - [2304.14407] [QA].
  • Large Language Models are Strong Zero-Shot Retriever - [2304.14233] [QA].
  • mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality - [2304.14178] [QA].
  • Categorification of Group Equivariant Neural Networks - [2304.14144] [QA].
  • ChatLog: Recording and Analyzing ChatGPT Across Time - [2304.14106] [QA].
  • Learning Human-Human Interactions in Images from Weak Textual Supervision - [2304.14104] [QA].
  • Is a prompt and a few samples all you need? Using GPT-4 for data augmentation in low-resource classification tasks - [2304.13861] [QA].
  • Multi-Party Chat: Conversational Agents in Group Settings with Humans and Models - [2304.13835] [QA].
  • Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond - [2304.13712] [QA].
  • Multimodal Grounding for Embodied AI via Augmented Reality Headsets for Natural Language Driven Task Planning - [2304.13676] [QA].
  • Unleashing Infinite-Length Input Capacity for Large-scale Language Models with Self-Controlled Memory System - [2304.13343] [QA].
  • EverLight: Indoor-Outdoor Editable HDR Lighting Estimation - [2304.13207] [QA].
  • SAFE: Machine Unlearning With Shard Graphs - [2304.13169] [QA].
  • Generative Relevance Feedback with Large Language Models - [2304.13157] [QA].
  • Answering Questions by Meta-Reasoning over Multiple Chains of Thought - [2304.13007] [QA].
  • Patch-based 3D Natural Scene Generation from a Single Example - [2304.12670] [QA].
  • Bayesian Optimization Meets Self-Distillation - [2304.12666] [QA].
  • Proto-Value Networks: Scaling Representation Learning with Auxiliary Tasks - [2304.12567] [QA].
  • GlyphDiffusion: Text Generation as Image Generation - [2304.12519] [QA].
  • On the Challenges of Using Black-Box APIs for Toxicity Evaluation in Research - [2304.12397] [QA].
  • Beyond the Pixel: a Photometrically Calibrated HDR Dataset for Luminance and Color Prediction - [2304.12372] [QA].
  • WizardLM: Empowering Large Language Models to Follow Complex Instructions - [2304.12244] [QA].
  • Track Anything: Segment Anything Meets Videos - [2304.11968] [QA].
  • ChatLLM Network: More brains, More intelligence - [2304.12998] [QA].
  • Universal Domain Adaptation via Compressive Attention Matching - [2304.11862] [QA].
  • Enhancing Fine-Tuning Based Backdoor Defense with Sharpness-Aware Minimization - [2304.11823] [QA].
  • Score-Based Diffusion Models as Principled Priors for Inverse Imaging - [2304.11751] [QA].
  • SketchXAI: A First Look at Explainability for Human Sketches - [2304.11744] [QA].
  • Walking Your LiDOG: A Journey Through Multiple Domains for LiDAR Semantic Segmentation - [2304.11705] [QA].
  • SATIN: A Multi-Task Metadataset for Classifying Satellite Imagery using Vision-Language Models - [2304.11619] [QA].
  • Speed Is All You Need: On-Device Acceleration of Large Diffusion Models via GPU-Aware Optimizations - [2304.11267] [QA].
  • Emergent and Predictable Memorization in Large Language Models - [2304.11158] [QA].
  • ChatABL: Abductive Learning via Natural Language Interaction with ChatGPT - [2304.11107] [QA].
  • Can GPT-4 Perform Neural Architecture Search? - [2304.10970] [QA].
  • Auditing and Generating Synthetic Data with Controllable Trust Trade-offs - [2304.10819] [QA].
  • Long-Term Photometric Consistent Novel View Synthesis with Diffusion Models - [2304.10700] [QA].
  • HM-ViT: Hetero-modal Vehicle-to-Vehicle Cooperative perception with vision transformer - [2304.10628] [QA].
  • Learning in Imperfect Environment: Multi-Label Classification with Long-Tailed Distribution and Partial Labels - [2304.10539] [QA].
  • MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models - [2304.10592] [QA].
  • Generalizing Neural Human Fitting to Unseen Poses With Articulated SE(3) Equivariance - [2304.10528] [QA].
  • Phoenix: Democratizing ChatGPT across Languages - [2304.10453] [QA].
  • SINC: Spatial Composition of 3D Human Motions for Simultaneous Action Generation - [2304.10417] [QA].
  • SCoDA: Domain Adaptive Shape Completion for Real Scans - [2304.10179] [QA].
  • Learning Bottleneck Concepts in Image Classification - [2304.10131] [QA].
  • Recognizability Embedding Enhancement for Very Low-Resolution Face Recognition and Quality Estimation - [2304.10066] [QA].
  • MARS: Model-agnostic Biased Object Removal without Additional Supervision for Weakly-Supervised Semantic Segmentation - [2304.09913] [QA].
  • Evaluating Verifiability in Generative Search Engines - [2304.09848] [QA].
  • Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models - [2304.09842] [QA].
  • MetaBEV: Solving Sensor Failures for BEV Detection and Map Segmentation - [2304.09801] [QA].
  • DarSwin: Distortion Aware Radial Swin Transformer - [2304.09691] [QA].
  • Is ChatGPT Good at Search? Investigating Large Language Models as Re-Ranking Agent - [2304.09542] [QA].
  • Network Pruning Spaces - [2304.09453] [QA].
  • ASM: Adaptive Skinning Model for High-Quality 3D Face Modeling - [2304.09423] [QA].
  • To Compress or Not to Compress- Self-Supervised Learning and Information Theory: A Review - [2304.09355] [QA].
  • Fast Neural Scene Flow - [2304.09121] [QA].
  • Think Before You Act: Unified Policy for Interleaving Language Reasoning with Actions - [2304.11063] [QA].
  • In ChatGPT We Trust? Measuring and Characterizing the Reliability of ChatGPT - [2304.08979] [QA].
  • SurfelNeRF: Neural Surfel Radiance Fields for Online Photorealistic Reconstruction of Indoor Scenes - [2304.08971] [QA].
  • Looking Through the Glass: Neural Surface Reconstruction Against High Specular Reflections - [2304.08706] [QA].
  • An Evaluation on Large Language Model Outputs: Discourse and Memorization - [2304.08637] [QA].
  • Visual Instruction Tuning - [2304.08485] [QA].
  • Towards Robust Prompts on Vision-Language Models - [2304.08479] [QA].
  • Learning to Compress Prompts with Gist Tokens - [2304.08467] [QA].
  • Efficient Video Action Detection with Token Dropout and Context Refinement - [2304.08451] [QA].
  • Tool Learning with Foundation Models - [2304.08354] [QA].
  • Magnitude of arithmetic scalar and matrix categories - [2304.08334] [QA].
  • Chain of Thought Prompt Tuning in Vision Language Models - [2304.07919] [QA].
  • Towards Better Instruction Following Language Models for Chinese: Investigating the Impact of Training Data and Evaluation - [2304.07854] [QA].
  • EGformer: Equirectangular Geometry-biased Transformer for 360 Depth Estimation - [2304.07803] [QA].
  • Self-collaboration Code Generation via ChatGPT - [2304.07590] [QA].
  • Tractable Control for Autoregressive Language Generation - [2304.07438] [QA].
  • DINOv2: Learning Robust Visual Features without Supervision - [2304.07193] [QA].
  • M2T: Masking Transformers Twice for Faster Decoding - [2304.07313] [QA].
  • Delta Denoising Score - [2304.07090] [QA].
  • DCFace: Synthetic Face Generation with Dual Condition Diffusion Model - [2304.07060] [QA].
  • DeePoint: Visual Pointing Recognition and Direction Estimation - [2304.06977] [QA].
  • Multimodal C4: An Open, Billion-scale Corpus of Images Interleaved with Text - [2304.06939] [QA].
  • Unified Out-Of-Distribution Detection: A Model-Specific Perspective - [2304.06813] [QA].
  • RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment - [2304.06767] [QA].
  • Expressive Text-to-Image Generation with Rich Text - [2304.06720] [QA].
  • Single-Stage Diffusion NeRF: A Unified Approach to 3D Generation and Reconstruction - [2304.06714] [QA].
  • What does CLIP know about a red circle? Visual prompt engineering for VLMs - [2304.06712] [QA].
  • DynaMITe: Dynamic Query Bootstrapping for Multi-object Interactive Segmentation Transformer - [2304.06668] [QA].
  • DiffFit: Unlocking Transferability of Large Diffusion Models via Simple Parameter-Efficient Fine-Tuning - [2304.06648] [QA].
  • Are LLMs All You Need for Task-Oriented Dialogue? - [2304.06556] [QA].
  • Perspectives on Large Language Models for Relevance Judgment - [2304.09161] [QA].
  • Multi-Mode Online Knowledge Distillation for Self-Supervised Visual Representation Learning - [2304.06461] [QA].
  • AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models - [2304.06364] [QA].
  • NeRFVS: Neural Radiance Fields for Free View Synthesis via Geometry Scaffolds - [2304.06287] [QA].
  • Language Instructed Reinforcement Learning for Human-AI Coordination - [2304.07297] [QA].
  • Asymmetrically-powered Neural Image Compression with Shallow Decoders - [2304.06244] [QA].
  • [CLS] Token is All You Need for Zero-Shot Semantic Segmentation - [2304.06212] [QA].
  • Probabilistic Human Mesh Recovery in 3D Scenes from Egocentric Views - [2304.06024] [QA].
  • VidStyleODE: Disentangled Video Editing via StyleGAN and NeuralODEs - [2304.06020] [QA].
  • Can Large Language Models Transform Computational Social Science? - [2305.03514] [QA].
  • Hard Patches Mining for Masked Image Modeling - [2304.05919] [QA].
  • Representation Learning with Multi-Step Inverse Kinematics: An Efficient and Optimal Approach to Rich-Observation RL - [2304.05889] [QA].
  • Are Local Features All You Need for Cross-Domain Visual Place Recognition? - [2304.05887] [QA].
  • Mesh2Tex: Generating Mesh Textures from Image Queries - [2304.05868] [QA].
  • Factorized Inverse Path Tracing for Efficient and Accurate Material-Lighting Estimation - [2304.05669] [QA].
  • Instance-Aware Domain Generalization for Face Anti-Spoofing - [2304.05640] [QA].
  • ChatGPT is all you need to decolonize sub-Saharan Vocational Education - [2304.13728] [QA].
  • ChemCrow: Augmenting large-language models with chemistry tools - [2304.05376] [QA].
  • Toxicity in ChatGPT: Analyzing Persona-assigned Language Models - [2304.05335] [QA].
  • OccFormer: Dual-path Transformer for Vision-based 3D Semantic Occupancy Prediction - [2304.05316] [QA].
  • SportsMOT: A Large Multi-Object Tracking Dataset in Multiple Sports Scenes - [2304.05170] [QA].
  • Teaching Large Language Models to Self-Debug - [2304.05128] [QA].
  • StageInteractor: Query-based Object Detector with Cross-stage Interaction - [2304.04978] [QA].
  • Gradient-based Uncertainty Attribution for Explainable Bayesian Deep Learning - [2304.04824] [QA].
  • A Cheaper and Better Diffusion Language Model with Soft-Masked Noise - [2304.04746] [QA].
  • Ambiguous Medical Image Segmentation using Diffusion Models - [2304.04745] [QA].
  • Detection Transformer with Stable Matching - [2304.04742] [QA].
  • Prompt Pre-Training with Twenty-Thousand Classes for Open-Vocabulary Visual Recognition - [2304.04704] [QA].
  • Improved Test-Time Adaptation for Domain Generalization - [2304.04494] [QA].
  • Instance Neural Radiance Field - [2304.04395] [QA].
  • Graph-ToolFormer: To Empower LLMs with Graph Reasoning Ability via Prompt Augmented by ChatGPT - [2304.11116] [QA].
  • OpenAGI: When LLM Meets Domain Experts - [2304.04370] [QA].
  • Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions - [2304.04227] [QA].
  • Shape-Erased Feature Learning for Visible-Infrared Person Re-Identification - [2304.04205] [QA].
  • Token Boosting for Robust Self-Supervised Visual Transformer Pre-training - [2304.04175] [QA].
  • Hi Sheldon! Creating Deep Personalized Characters from TV Shows - [2304.11093] [QA].
  • Decoder-Only or Encoder-Decoder? Interpreting Language Model as a Regularized Encoder-Decoder - [2304.04052] [QA].
  • ChatGPT Empowered Long-Step Robot Control in Various Environments: A Case Application - [2304.03893] [QA].
  • Harnessing the Spatial-Temporal Attention of Diffusion Models for High-Fidelity Text-to-Image Synthesis - [2304.03869] [QA].
  • Why think step by step? Reasoning emerges from the locality of experience - [2304.03843] [QA].
  • Meta-causal Learning for Single Domain Generalization - [2304.03709] [QA].
  • Model-Agnostic Gender Debiased Image Captioning - [2304.03693] [QA].
  • Attention: Marginal Probability is All You Need? - [2304.04556] [QA].
  • Sheaf Neural Networks for Graph-based Recommender Systems - [2304.09097] [QA].
  • RED-PSM: Regularization by Denoising of Partially Separable Models for Dynamic Imaging - [2304.03483] [QA].
  • Generative Agents: Interactive Simulacra of Human Behavior - [2304.03442] [QA].
  • TopNet: Transformer-based Object Placement Network for Image Compositing - [2304.03372] [QA].
  • SegGPT: Segmenting Everything In Context - [2304.03284] [QA].
  • Visual Dependency Transformers: Dependency Tree Emerges from Reversed Attention - [2304.03282] [QA].
  • Retention Is All You Need - [2304.03103] [QA].
  • MULLER: Multilayer Laplacian Resizer for Vision - [2304.02859] [QA].
  • Learning Neural Eigenfunctions for Unsupervised Semantic Segmentation - [2304.02841] [QA].
  • Segment Anything - [2304.02643] [QA].
  • ENTL: Embodied Navigation Trajectory Learner - [2304.02639] [QA].
  • HNeRV: A Hybrid Neural Representation for Videos - [2304.02633] [QA].
  • Dynamic Point Fields - [2304.02626] [QA].
  • Generative Novel View Synthesis with 3D-Aware Diffusion Models - [2304.02602] [QA].
  • Detecting and Grounding Multi-Modal Media Manipulation - [2304.02556] [QA].
  • TM2D: Bimodality Driven 3D Dance Generation via Music-Text Integration - [2304.02419] [QA].
  • Effective control of two-dimensional Rayleigh--Bénard convection: invariant multi-agent reinforcement learning is all you need - [2304.02370] [QA].
  • SMPConv: Self-moving Point Representations for Continuous Convolution - [2304.02330] [QA].
  • Few-shot Semantic Image Synthesis with Class Affinity Transfer - [2304.02321] [QA].
  • How to choose your best allies for a transferable attack? - [2304.02312] [QA].
  • ERRA: An Embodied Representation and Reasoning Architecture for Long-horizon Language-conditioned Manipulation Tasks - [2304.02251] [QA].
  • GINA-3D: Learning to Generate Implicit Neural Assets in the Wild - [2304.02163] [QA].
  • FREDOM: Fairness Domain Adaptation Approach to Semantic Scene Understanding - [2304.02135] [QA].
  • Multimodal Garment Designer: Human-Centric Latent Diffusion Models for Fashion Image Editing - [2304.02051] [QA].
  • GlueStick: Robust Image Matching by Sticking Points and Lines Together - [2304.02008] [QA].
  • MonoHuman: Animatable Human Neural Field from Monocular Video - [2304.02001] [QA].
  • LLM-Adapters: An Adapter Family for Parameter-Efficient Fine-Tuning of Large Language Models - [2304.01933] [QA].
  • Trace and Pace: Controllable Pedestrian Animation via Guided Trajectory Diffusion - [2304.01893] [QA].
  • Learning to Name Classes for Vision and Language Models - [2304.01830] [QA].
  • Toward Verifiable and Reproducible Human Evaluation for Text-to-Image Generation - [2304.01816] [QA].
  • Bridging the Gap between Model Explanations in Partially Annotated Multi-label Classification - [2304.01804] [QA].
  • Towards Open-Vocabulary Video Instance Segmentation - [2304.01715] [QA].
  • HyperCUT: Video Sequence from a Single Blurry Image using Unsupervised Ordering - [2304.01686] [QA].
  • On the Stability-Plasticity Dilemma of Class-Incremental Learning - [2304.01663] [QA].
  • Cross-Domain Image Captioning with Discriminative Finetuning - [2304.01662] [QA].
  • IterativePFN: True Iterative Point Cloud Filtering - [2304.01529] [QA].
  • Robust Outlier Rejection for 3D Registration with Variational Bayes - [2304.01514] [QA].
  • Defending Against Patch-based Backdoor Attacks on Self-Supervised Learning - [2304.01482] [QA].
  • Hierarchical Supervision and Shuffle Data Augmentation for 3D Semi-Supervised Object Detection - [2304.01464] [QA].
  • Learning Personalized High Quality Volumetric Head Avatars from Monocular RGB Videos - [2304.01436] [QA].
  • VNE: An Effective Method for Improving Deep Representation by Manipulating Eigenvalue Distribution - [2304.01434] [QA].
  • Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling - [2304.01373] [QA].
  • Monocular 3D Object Detection with Bounding Box Denoising in 3D by Perceiver - [2304.01289] [QA].
  • Long-Tailed Visual Recognition via Self-Heterogeneous Integration with Knowledge Excavation - [2304.01279] [QA].
  • Asymptotic expansions for the maximum likelihood estimation errors of the rotating parameter of the gravitational wave from core-collapse supernovae - [2304.1267] [QA].
  • Neural Volumetric Memory for Visual Locomotion Control - [2304.01201] [QA].
  • Baize: An Open-Source Chat Model with Parameter-Efficient Tuning on Self-Chat Data - [2304.01196] [QA].
  • Not All Features Matter: Enhancing Few-shot CLIP with Adaptive Prior Refinement - [2304.01195] [QA].
  • Burstormer: Burst Image Restoration and Enhancement Transformer - [2304.01194] [QA].
  • Navigating to Objects Specified by Images - [2304.01192] [QA].
  • Generative Multiplane Neural Radiance for 3D-Aware Image Generation - [2304.01172] [QA].
  • Generative Diffusion Prior for Unified Image Restoration and Enhancement - [2304.01247] [QA].
  • ReMoDiffuse: Retrieval-Augmented Motion Diffusion Model - [2304.01116] [QA].
  • DivClust: Controlling Diversity in Deep Clustering - [2304.01042] [QA].
  • Temporal Enhanced Training of Multi-view 3D Object Detector via Historical Object Prediction - [2304.00967] [QA].
  • Astroformer: More Data Might not be all you need for Classification - [2304.05350] [QA].
  • Few-shot Fine-tuning is All You Need for Source-free Domain Adaptation - [2304.00792] [QA].
  • Multi-Modal Representation Learning with Text-Driven Soft Masks - [2304.00719] [QA].
  • 3D Semantic Segmentation in the Wild: Learning Generalized Models for Adverse-Condition Point Clouds - [2304.00690] [QA].
  • Metrological detection of multipartite entanglement through dynamical symmetries - [2304.0564] [QA].
  • UniDexGrasp++: Improving Dexterous Grasping Policy Learning via Geometry-aware Curriculum and Iterative Generalist-Specialist Learning - [2304.00464] [QA].
  • Re-IQA: Unsupervised Learning for Image Quality Assessment in the Wild - [2304.00451] [QA].
  • When Crowd Meets Persona: Creating a Large-Scale Open-Domain Persona Dialogue Corpus - [2304.00350] [QA].
  • Devil is in the Queries: Advancing Mask Transformers for Real-world Medical Image Segmentation and Out-of-Distribution Localization - [2304.00212] [QA].

March 2023

  • Learning the Distribution of Errors in Stereo Matching for Joint Disparity and Uncertainty Estimation - [2304.00152] [QA].
  • On stochastic MPC formulations with closed-loop guarantees: Analysis and a unifying framework - [2304.0069] [QA].
  • Weakly-Supervised Text-driven Contrastive Learning for Facial Behavior Understanding - [2304.00058] [QA].
  • LivePose: Online 3D Reconstruction from Monocular Video with Dynamic Camera Poses - [2304.00054] [QA].
  • Accelerating exploration and representation learning with offline pre-training - [2304.00046] [QA].
  • Choose Your Weapon: Survival Strategies for Depressed AI Academics - [2304.06035] [QA].
  • A Survey of Large Language Models - [2303.18223] [QA].
  • Assessing Language Model Deployment with Risk Cards - [2303.18190] [QA].
  • Towards Nonlinear-Motion-Aware and Occlusion-Robust Rolling Shutter Correction - [2303.18125] [QA].
  • VDN-NeRF: Resolving Shape-Radiance Ambiguity via View-Dependence Normalization - [2303.17968] [QA].
  • Diffusion Action Segmentation - [2303.17959] [QA].
  • 3D-aware Image Generation using 2D Diffusion Models - [2303.17905] [QA].
  • Shepherding Slots to Objects: Towards Stable and Robust Object-Centric Learning - [2303.17842] [QA].
  • Learning Procedure-aware Video Representation from Instructional Videos and Their Narrations - [2303.17839] [QA].
  • Neural Microfacet Fields for Inverse Rendering - [2303.17806] [QA].
  • CrossLoc3D: Aerial-Ground Cross-Source 3D Place Recognition - [2303.17778] [QA].
  • CAMEL: Communicative Agents for "Mind" Exploration of Large Scale Language Model Society - [2303.17760] [QA].
  • Optimal Input Gain: All You Need to Supercharge a Feed-Forward Neural Network - [2303.17732] [QA].
  • S-VolSDF: Sparse Multi-View Stereo Regularization of Neural Implicit Surfaces - [2303.17712] [QA].
  • Self-Refine: Iterative Refinement with Self-Feedback - [2303.17651] [QA].
  • SparseViT: Revisiting Activation Sparsity for Efficient High-Resolution Vision Transformer - [2303.17605] [QA].
  • TiDy-PSFs: Computational Imaging with Time-Averaged Dynamic Point-Spread-Functions - [2303.17583] [QA].
  • HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face - [2303.17580] [QA].
  • Iterative Prompt Learning for Unsupervised Backlit Image Enhancement - [2303.17569] [QA].
  • Whose Opinions Do Language Models Reflect? - [2303.17548] [QA].
  • Language Models can Solve Computer Tasks - [2303.17491] [QA].
  • All You Need Is Sex for Diversity - [2303.17441] [QA].
  • WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research - [2303.17395] [QA].
  • Social Biases through the Text-to-Image Generation Lens - [2304.06034] [QA].
  • Mixed Autoencoder for Self-supervised Visual Representation Learning - [2303.17152] [QA].
  • NeILF++: Inter-Reflectable Light Fields for Geometry and Material Estimation - [2303.17147] [QA].
  • ImageNet-E: Benchmarking Neural Network Robustness via Attribute Editing - [2303.17096] [QA].
  • AutoAD: Movie Description in Context - [2303.16899] [QA].
  • ViewRefer: Grasp the Multi-view Knowledge for 3D Visual Grounding with GPT and Prototype Guidance - [2303.16894] [QA].
  • Adaptive Superpixel for Active Learning in Semantic Segmentation - [2303.16817] [QA].
  • TTA-COPE: Test-Time Adaptation for Category-Level Object Pose Estimation - [2303.16730] [QA].
  • G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment - [2303.16634] [QA].
  • Adaptive Spot-Guided Transformer for Consistent Local Feature Matching - [2303.16624] [QA].
  • Personalised Language Modelling of Screen Characters Using Rich Metadata Annotations - [2303.16618] [QA].
  • Plan4MC: Skill Reinforcement Learning and Planning for Open-World Minecraft Tasks - [2303.16563] [QA].
  • Fair Federated Medical Image Segmentation via Client Contribution Estimation - [2303.16520] [QA].
  • Multi-View Azimuth Stereo via Tangent Space Consistency - [2303.16447] [QA].
  • TaskMatrix.AI: Completing Tasks by Connecting Foundation Models with Millions of APIs - [2303.16434] [QA].
  • ChatGPT is a Knowledgeable but Inexperienced Solver: An Investigation of Commonsense Problem in Large Language Models - [2303.16421] [QA].
  • Are Data-driven Explanations Robust against Out-of-distribution Data? - [2303.16390] [QA].
  • Communication-Efficient Vertical Federated Learning with Limited Overlapping Samples - [2303.16270] [QA].
  • Your Diffusion Model is Secretly a Zero-Shot Classifier - [2303.16203] [QA].
  • ASIC: Aligning Sparse in-the-wild Image Collections - [2303.16201] [QA].
  • LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention - [2303.16199] [QA].
