LVBench

Introduction

Recent progress in multimodal large language models has markedly enhanced the understanding of short videos (typically under one minute), and several evaluation datasets have emerged accordingly. However, these advancements fall short of meeting the demands of real-world applications such as embodied intelligence for long-term decision-making, in-depth movie reviews and discussions, and live sports commentary, all of which require comprehension of long videos spanning several hours. To address this gap, we introduce LVBench, a benchmark specifically designed for long video understanding. Our dataset comprises publicly sourced videos, including TV series, sports broadcasts, and everyday surveillance footage, and encompasses a diverse set of tasks aimed at long video comprehension and information extraction. By leveraging a combination of manual annotations and model-assisted techniques, we have created a robust video understanding question-answer dataset. LVBench is designed to challenge multimodal models to demonstrate long-term memory and extended comprehension capabilities. Our extensive evaluations of various baseline models reveal that current multimodal large language models still underperform on these demanding long video understanding tasks. Through LVBench, we aim to spur the development of more advanced models capable of tackling the complexities of long video comprehension.

Leaderboard

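Accuracy scores on LVBench. The columns ER, EU, KIR, TG, Rea, and Sum report accuracy on entity recognition, event understanding, key information retrieval, temporal grounding, reasoning, and summarization questions, respectively.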

#	Model	Frames	LLM Params	Date	Overall (%)	ER (%)	EU (%)	KIR (%)	TG (%)	Rea (%)	Sum (%)
	mPLUG-Owl3 Alibaba	64	7B	2024-11-23	43.5	46	41.6	42.4	41.1	47.5	40.4
	Qwen2-VL-72B Alibaba	48	72B	2024-09-20	41.3	38.0	41.1	38.3	41.4	46.5	46.6
	TimeMarker Meituan	≤128	8B	2024-10-29	41.3	42.8	39.1	34.9	38.7	38.2	48.8
	InternVL2-40B Shanghai AI Lab	16	34B	2024-08-30	39.6	37.4	39.7	43.4	31.4	42.5	41.4
	GLM-4V-Plus Zhipu AI	30	-	2024-08-30	38.3	39.9	35.8	34.8	37.7	40	32.8
	Gemini 1.5 Pro Google	3600	-	2024-06-11	33.1	32.1	30.9	39.3	31.8	27	32.8
	LLaVA-NeXT-Video-DPO (34B) Bytedance & NTU S-Lab	32	34B	2024-06-11	32.2	30.1	31.2	34.1	31.4	35	27.6
	Oryx-34B Tsinghua University & Tencent & NTU	64	34B	2024-09-30	30.4	27.4	29.2	32.1	29.1	34	39.7
	GPT-4o (2024-05-13)* OpenAI	348	-	2024-08-30	30.8	33.0	27.4	34.5	25.0	27.5	24.1
	CogVLM2-Video Zhipu AI	24	8B	2024-08-30	28.1	28.3	27.1	31.0	25.5	25.5	38.9
	GPT-4o OpenAI	10	-	2024-06-11	27	26.5	23.7	28.3	21.4	28	32.8
	PLLaVA 34B Bytedance & NTU	16	34B	2024-06-11	26.1	25.0	24.9	26.2	21.4	30.0	25.9
	LWM UC Berkeley	>3600	7B	2024-06-11	25.5	24.7	24.8	26.5	28.6	30.5	22.4
	LLaMA-VID CUHK & SmartMore	>10800	13B	2024-06-11	23.9	25.4	21.7	23.4	26.4	26.5	17.2
	MovieChat Zhejiang University	>10000	7B	2024-06-11	22.5	21.3	23.1	25.9	22.3	24.0	17.2
	TimeChat Peking University & Huawei	>96	7B	2024-06-11	22.3	21.9	21.7	25.9	22.7	25.0	24.1

Green dates mark newly added or updated models.

* All the frames are resized to 512x512 resolution to fit within GPT-4o’s max context length.
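
The Frames column gives the number of frames each model receives per video. How each model selects those frames is not stated here; as a minimal sketch, assuming simple uniform sampling, the indices could be chosen as follows. The function name `uniform_frame_indices` and the midpoint-of-segment strategy are illustrative assumptions, not the procedure used for any entry in the leaderboard.

```python
def uniform_frame_indices(total_frames: int, num_samples: int) -> list[int]:
    """Pick num_samples frame indices spread evenly over a video.

    Hypothetical helper: assumes uniform sampling, which may differ from
    what a given model on the leaderboard actually does.
    """
    if total_frames <= 0 or num_samples <= 0:
        return []
    if num_samples >= total_frames:
        # Fewer frames than requested: keep every frame.
        return list(range(total_frames))
    # Split the video into num_samples equal segments and take each midpoint.
    step = total_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


if __name__ == "__main__":
    # A 100-frame clip sampled at 4 frames.
    print(uniform_frame_indices(100, 4))  # [12, 37, 62, 87]
```

With a budget as small as 10 to 64 frames on an hour-long video, consecutive sampled frames are minutes apart, which may help explain why short clues are hard to catch.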

Statistics

(Left) Video categories. Our dataset contains 6 major categories and 21 subcategories.
(Right) Performance radar chart of different models on LVBench.

Benchmark Comparison

Comparison of different datasets. Open-domain indicates whether the videos come from diverse sources. Multi-type indicates whether the questions span more than two categories.

Model vs Human

LVBench evaluation results across different video categories.

The impact of different video durations.

The impact of different clue durations.
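
The breakdowns above (by video category and by duration) amount to grouping question-level results and computing accuracy within each group. The following is a minimal sketch of that idea; the record fields (`category`, `video_seconds`, `prediction`, `answer`), the category labels, and the bucket edges are hypothetical and do not reflect the actual LVBench annotation format or evaluation script.

```python
from collections import defaultdict


def accuracy_by(records, key):
    """Return {group: accuracy in percent} for question-level records.

    `key` maps a record to its group label (e.g. category or duration bucket).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        group = key(rec)
        total[group] += 1
        correct[group] += int(rec["prediction"] == rec["answer"])
    return {g: 100.0 * correct[g] / total[g] for g in total}


def duration_bucket(rec, edges=(1800, 3600)):
    """Assign a record to a duration bucket (edges in seconds, hypothetical)."""
    seconds = rec["video_seconds"]
    if seconds < edges[0]:
        return "<30 min"
    if seconds < edges[1]:
        return "30-60 min"
    return ">=60 min"


if __name__ == "__main__":
    # Toy records with made-up values, for illustration only.
    records = [
        {"category": "sports", "video_seconds": 2400, "prediction": "A", "answer": "A"},
        {"category": "sports", "video_seconds": 5400, "prediction": "B", "answer": "C"},
        {"category": "tv", "video_seconds": 4000, "prediction": "D", "answer": "D"},
        {"category": "tv", "video_seconds": 1500, "prediction": "A", "answer": "A"},
    ]
    print(accuracy_by(records, key=lambda r: r["category"]))
    print(accuracy_by(records, key=duration_bucket))
```

Passing the grouping rule as a function keeps the tallying code identical across breakdowns; only the `key` changes.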

Citation

@misc{wang2024lvbench,
  title={LVBench: An Extreme Long Video Understanding Benchmark},
  author={Weihan Wang and Zehai He and Wenyi Hong and Yean Cheng and Xiaohan Zhang and Ji Qi and Shiyu Huang and Bin Xu and Yuxiao Dong and Ming Ding and Jie Tang},
  year={2024},
  eprint={2406.08035},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
