RoboMamba

The repository for the paper `RoboMamba: Multimodal State Space Model for Efficient Robot Reasoning and Manipulation`.

Our main contributions are:

We integrate a vision encoder with the efficient Mamba language model to build the end-to-end RoboMamba, which possesses visual common sense and robot-related reasoning abilities.
To equip RoboMamba with action pose prediction abilities, we explore an efficient fine-tuning strategy using a simple policy head. We find that once RoboMamba achieves sufficient reasoning capabilities, it can acquire pose prediction skills at minimal cost.
In extensive experiments, RoboMamba excels at reasoning on general and robotic evaluation benchmarks and shows impressive pose prediction results in both simulation and real-world experiments.
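Since the code is not yet released, the following is only a rough sketch of what "a simple policy head" on top of a frozen model could look like. It is not the authors' implementation. The class names (`LinearLayer`, `PolicyHead`), the `frozen_backbone` mean-pooling stand-in, the feature and hidden sizes, and the split of the output into a 3D position and a unit direction are all assumptions made for illustration. It uses only the Python standard library so that it runs as is.

```python
import random


class LinearLayer:
    """Dense layer y = W x + b on plain Python lists (illustrative only)."""

    def __init__(self, in_dim, out_dim, rng):
        scale = 1.0 / in_dim ** 0.5
        self.weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
                       for _ in range(out_dim)]
        self.bias = [0.0] * out_dim

    def __call__(self, x):
        return [sum(w * v for w, v in zip(row, x)) + b
                for row, b in zip(self.weight, self.bias)]

    def num_params(self):
        return len(self.weight) * len(self.weight[0]) + len(self.bias)


class PolicyHead:
    """Hypothetical small MLP mapping a pooled feature vector to a pose."""

    def __init__(self, feat_dim, hidden_dim, seed=0):
        rng = random.Random(seed)
        self.fc1 = LinearLayer(feat_dim, hidden_dim, rng)
        self.fc_pos = LinearLayer(hidden_dim, 3, rng)  # assumed: 3D position
        self.fc_dir = LinearLayer(hidden_dim, 3, rng)  # assumed: 3D direction

    def __call__(self, features):
        hidden = [max(0.0, h) for h in self.fc1(features)]  # ReLU
        position = self.fc_pos(hidden)
        direction = self.fc_dir(hidden)
        # Normalize the direction; fall back to 1.0 if the vector is all zeros.
        norm = sum(d * d for d in direction) ** 0.5 or 1.0
        return position, [d / norm for d in direction]

    def num_params(self):
        return (self.fc1.num_params() + self.fc_pos.num_params()
                + self.fc_dir.num_params())


def frozen_backbone(token_features):
    """Stand-in for the frozen model: mean-pool the token features (assumption)."""
    n = len(token_features)
    return [sum(col) / n for col in zip(*token_features)]


# Toy input: 4 tokens with 8 features each. Only the head would be trained.
tokens = [[0.1 * (i + j) for j in range(8)] for i in range(4)]
head = PolicyHead(feat_dim=8, hidden_dim=16)
position, direction = head(frozen_backbone(tokens))
print(position, direction, head.num_params())
```

The point of the sketch is the division of labor: the backbone is treated as a fixed feature extractor, and the only trainable parameters sit in the small head, which is why the added cost of learning pose prediction can stay low.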

Table 1: Comparison of general reasoning abilities with previous MLLMs on several benchmarks.

Method	LLM Size	Res.	OKVQA	VQAv2	GQA	VizWiz	OCR-VQA	POPE	MME	MMB	MM-Vet
BLIP-2	7B	224	45.9	-	41.0	19.6	40.6	85.3	1293.8	-	22.4
InstructBLIP	7B	224	-	-	49.5	33.4	44.8	-	-	36	26.2
LLaMA-AdapterV2	7B	336	49.6	70.7	45.1	39.8	-	-	1328.4	-	-
MiniGPT-v2	7B	448	57.8	-	60.1	53.6	-	-	-	-	-
Qwen-VL	7B	448	58.6	79.5	59.3	35.2	75.7	-	-	38.2	-
LLaVA1.5	7B	336	-	78.5	62.0	50.0	-	85.9	1510.7	64.3	30.5
SPHINX	7B	224	62.1	78.1	62.6	39.9	66.0	80.7	1476.1	66.9	36.0
LLaVA-Phi	2.7B	336	-	71.4	-	35.9	-	85.0	1335.1	59.8	28.9
MobileVLM	2.7B	336	-	-	59.0	-	-	84.9	1288.9	59.6	-
TinyLLaVA	2.7B	336	-	77.7	61.0	-	-	86.3	1437.3	68.3	31.7
RoboMamba(Ours)	2.7B	224	63.1	80.3	62.4	55.0	62.5	85.3	1314.8	64.2	28.6
RoboMamba(Ours)	2.7B	384	62.4	79.1	64.4	55.0	66.7	86.9	1354.2	65.7	29.7

Table 2: Comparison of the success rates between RoboMamba and baselines across various training (seen) and test (unseen) categories.

Code is coming soon!
