Video Panels for Long Video Understanding

Plug-and-play, model-agnostic, parameter-free visual prompting

Introduction

Recent Video-Language Models (VLMs) achieve promising results on long-video understanding, but their performance still lags behind that achieved on tasks involving images or short videos. This has led to great interest in improving the long-context modeling of VLMs by introducing novel modules and additional complexity. In this paper, we take a different approach: rather than fine-tuning VLMs with the limited data available, we attempt to maximize the performance of existing models. To this end, we propose a novel visual prompting strategy specifically designed for long-video understanding. By combining multiple frames as panels into one image, we effectively trade off spatial details for temporal resolution. Our approach is training-free, parameter-free, and model-agnostic, and can be seamlessly integrated into existing VLMs. Extensive experiments on five established benchmarks across a wide range of model architectures, sizes, and context windows confirm the consistency of our approach. On the TimeScope (Long) dataset, which has the longest videos, video question answering accuracy improves by up to 19.4%. Overall, our method raises the bar for long-video understanding models.
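The mechanism described above, tiling several sampled frames into one "panel" image so that each image the model consumes covers multiple time steps, can be sketched in a few lines. The following is a minimal, hypothetical illustration: the function names, the 2x2 grid, and the naive stride-based downscaling are our own assumptions, not the authors' implementation.

```python
import numpy as np

def make_panel(frames, grid=2):
    """Tile grid*grid consecutive frames (H x W x 3 uint8 arrays) into a
    single H x W panel. Each frame is downscaled by striding, trading
    spatial detail for temporal resolution as described above.
    (Naive subsampling stands in for proper resizing; an assumption.)"""
    assert len(frames) == grid * grid
    cells = [f[::grid, ::grid] for f in frames]
    rows = [np.concatenate(cells[r * grid:(r + 1) * grid], axis=1)
            for r in range(grid)]
    return np.concatenate(rows, axis=0)

def panels_from_video(frames, grid=2):
    """Group a sampled frame sequence into consecutive panels, read
    left-to-right, top-to-bottom; a trailing remainder is dropped."""
    k = grid * grid
    return [make_panel(frames[i:i + k], grid)
            for i in range(0, len(frames) - k + 1, k)]
```

With a 2x2 grid, for example, a model limited to 32 input images can cover 128 sampled time steps, each at a quarter of the original spatial resolution.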

Leaderboard

Performance evaluation across multiple benchmarks demonstrating the benefits of our approach. Best results within each model pair are in bold.

Avg. duration (s): VMME Medium 516 | VMME Long 2467 | TimeScope Short 2586 | TimeScope Long 27600


All scores are accuracies in %.

| # | Model | Frames | VMME Medium | VMME Long | VMME Overall | TimeScope Short | TimeScope Long | MLVU | MF2 | VNBench | Avg. |
|---|-------|--------|-------------|-----------|--------------|-----------------|----------------|------|-----|---------|------|
| | **Small-context VLMs** | | | | | | | | | | |
| 1 | Video-LLaVA 7B | 8 | 36.6 | 32.6 | 37.1 | 24.4 | **17.6** | **45.7** | **50.4** | 27.8 | 33.8 |
| 2 | + ours | 8 | **37.9** | **34.2** | **38.7** | **25.6** | 17.1 | **45.7** | 50.2 | **32.0** | **34.8** (+1.0) |
| 3 | VideoChat2-HD | 16 | 26.4 | 24.8 | 25.2 | 21.2 | **19.8** | 49.2 | **50.0** | 27.9 | 32.2 |
| 4 | + ours | 16 | **26.7** | **25.1** | **25.4** | **21.3** | **19.8** | **49.8** | **50.0** | **28.5** | **32.5** (+0.3) |
| | **Medium-context VLMs** | | | | | | | | | | |
| 5 | LLaVA-OV 0.5B | 32 | 40.9 | **37.0** | 43.8 | 49.4 | 25.6 | **44.0** | 50.1 | 39.8 | 42.1 |
| 6 | + ours | 32 | **42.6** | 36.6 | **44.3** | **56.9** | **30.0** | 43.1 | **50.2** | **41.0** | **44.3** (+1.2) |
| 7 | LLaVA-OV 7B | 32 | **56.7** | 48.8 | 58.5 | 58.7 | 30.2 | 62.9 | 51.5 | 54.8 | 52.8 |
| 8 | + ours | 32 | 56.2 | **50.2** | **58.9** | **69.5** | **33.8** | **65.3** | **52.1** | **57.7** | **56.2** (+3.4) |
| 9 | LLaVA-OV 72B | 32 | 62.9 | 57.6 | 66.0 | 59.1 | **33.8** | 21.6 | 56.6 | 59.4 | 49.4 |
| 10 | + ours | 32 | **66.4** | **59.3** | **67.7** | **70.0** | 32.4 | **23.8** | **58.5** | **62.6** | **52.5** (+3.1) |
| 11 | Qwen-2.5VL 7B | 32 | 61.6 | 51.2 | 61.9 | 52.8 | 28.7 | 60.1 | 52.6 | 55.6 | 51.9 |
| 12 | + ours | 32 | **64.0** | **54.7** | **63.9** | **60.8** | **30.0** | **64.9** | **53.8** | **58.5** | **55.3** (+3.4) |
| 13 | LLaVA-Video 7B | 64 | **62.3** | 53.6 | 64.3 | 64.8 | 34.7 | **66.2** | 52.8 | - | 56.6 |
| 14 | + ours | 64 | 62.2 | **54.0** | **64.4** | **79.2** | **39.3** | 66.0 | **54.4** | - | **60.7** (+4.1) |
| 15 | LLaVA-Video 72B | 64 | 67.8 | 61.2 | 69.8 | 65.4 | **30.9** | 52.9 | 58.2 | - | 55.4 |
| 16 | + ours | 64 | **68.6** | **61.4** | **70.1** | **75.7** | **30.9** | **54.4** | **59.9** | - | **58.2** (+2.8) |
| | **Long-context VLMs** | | | | | | | | | | |
| 17 | Qwen-2VL 7B | 180 | 62.7 | 51.0 | 62.4 | 66.1 | 23.8 | 65.7 | 54.3 | - | 54.5 |
| 18 | + ours | 180 | **62.9** | **52.7** | **63.0** | **71.2** | **26.7** | **65.8** | **56.7** | - | **56.7** (+2.2) |
| 19 | Qwen-2.5VL 7B | 180 | **67.6** | 54.8 | 66.0 | 73.9 | **37.6** | 66.7 | 54.2 | - | 59.7 |
| 20 | + ours | 180 | 66.9 | **56.1** | **66.3** | **79.1** | 35.6 | **66.8** | **54.8** | - | **60.5** (+0.8) |
| 21 | VideoLLaMA 3 7B | 180 | **64.6** | 54.1 | **65.3** | 80.2 | 39.1 | **47.3** | **58.9** | - | 58.2 |
| 22 | + ours | 180 | 63.7 | **55.1** | **65.3** | **87.2** | **46.7** | 47.1 | 58.3 | - | **60.9** (+2.7) |

* Since the videos in VNBench are very short, we evaluate it only on VLMs with a context of up to 32 frames.
* Our method improves the baseline results in nearly all cases and, for every VLM, on average.


    @inproceedings{doorenbos2025video,
      title={Video Panels for Long Video Understanding},
      author={Doorenbos, Lars and Spurio, Federico and Gall, Juergen},
      booktitle={Computer Vision and Pattern Recognition (CVPR)},
      year={2026}
    }