Introduction

Recent progress in multimodal large language models has markedly enhanced the understanding of short videos (typically under one minute), and several evaluation datasets have emerged accordingly. However, these advancements fall short of meeting the demands of real-world applications such as embodied intelligence for long-term decision-making, in-depth movie reviews and discussions, and live sports commentary, all of which require comprehension of long videos spanning several hours. To address this gap, we introduce LVBench, a benchmark specifically designed for long video understanding. Our dataset comprises publicly sourced videos, including TV series, sports broadcasts, and everyday surveillance footage, and encompasses a diverse set of tasks aimed at long video comprehension and information extraction. By leveraging a combination of manual annotations and model-assisted techniques, we have created a robust video understanding question-answer dataset. LVBench is designed to challenge multimodal models to demonstrate long-term memory and extended comprehension capabilities. Our extensive evaluations of various baseline models reveal that current multimodal large language models still underperform on these demanding long video understanding tasks. Through LVBench, we aim to spur the development of more advanced models capable of tackling the complexities of long video comprehension.

Leaderboard

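Accuracy scores on LVBench. Task abbreviations: ER = entity recognition, EU = event understanding, KIR = key information retrieval, TG = temporal grounding, Rea = reasoning, Sum = summarization.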

| Model | Organization | Frames | LLM Params | Date | Overall (%) | ER (%) | EU (%) | KIR (%) | TG (%) | Rea (%) | Sum (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| mPLUG-Owl3 | Alibaba | 64 | 7B | 2024-11-23 | 43.5 | 46 | 41.6 | 42.4 | 41.1 | 47.5 | 40.4 |
| Qwen2-VL-72B | Alibaba | 48 | 72B | 2024-09-20 | 41.3 | 38.0 | 41.1 | 38.3 | 41.4 | 46.5 | 46.6 |
| TimeMarker | Meituan | ≤128 | 8B | 2024-10-29 | 41.3 | 42.8 | 39.1 | 34.9 | 38.7 | 38.2 | 48.8 |
| InternVL2-40B | Shanghai AI Lab | 16 | 34B | 2024-08-30 | 39.6 | 37.4 | 39.7 | 43.4 | 31.4 | 42.5 | 41.4 |
| GLM-4V-Plus | Zhipu AI | 30 | - | 2024-08-30 | 38.3 | 39.9 | 35.8 | 34.8 | 37.7 | 40 | 32.8 |
| Gemini 1.5 Pro | Google | 3600 | - | 2024-06-11 | 33.1 | 32.1 | 30.9 | 39.3 | 31.8 | 27 | 32.8 |
| LLaVA-NeXT-Video-DPO (34B) | Bytedance & NTU S-Lab | 32 | 34B | 2024-06-11 | 32.2 | 30.1 | 31.2 | 34.1 | 31.4 | 35 | 27.6 |
| Oryx-34B | Tsinghua University & Tencent & NTU | 64 | 34B | 2024-09-30 | 30.4 | 27.4 | 29.2 | 32.1 | 29.1 | 34 | 39.7 |
| GPT-4o(2024-05-13)* | OpenAI | 348 | - | 2024-08-30 | 30.8 | 33.0 | 27.4 | 34.5 | 25.0 | 27.5 | 24.1 |
| CogVLM2-Video | Zhipu AI | 24 | 8B | 2024-08-30 | 28.1 | 28.3 | 27.1 | 31.0 | 25.5 | 25.5 | 38.9 |
| GPT-4o | OpenAI | 10 | - | 2024-06-11 | 27 | 26.5 | 23.7 | 28.3 | 21.4 | 28 | 32.8 |
| PLLaVA 34B | Bytedance & NTU | 16 | 34B | 2024-06-11 | 26.1 | 25.0 | 24.9 | 26.2 | 21.4 | 30.0 | 25.9 |
| LWM | UC Berkeley | >3600 | 7B | 2024-06-11 | 25.5 | 24.7 | 24.8 | 26.5 | 28.6 | 30.5 | 22.4 |
| LLaMA-VID | CUHK & SmartMore | >10800 | 13B | 2024-06-11 | 23.9 | 25.4 | 21.7 | 23.4 | 26.4 | 26.5 | 17.2 |
| MovieChat | Zhejiang University | >10000 | 7B | 2024-06-11 | 22.5 | 21.3 | 23.1 | 25.9 | 22.3 | 24.0 | 17.2 |
| TimeChat | Peking University & Huawei | >96 | 7B | 2024-06-11 | 22.3 | 21.9 | 21.7 | 25.9 | 22.7 | 25.0 | 24.1 |

Dates shown in green mark newly added or updated models.

* All the frames are resized to 512x512 resolution to fit within GPT-4o’s max context length.
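
The per-task columns of the leaderboard are easier to compare with a small script. The following is a minimal sketch, not part of the LVBench tooling: it copies only the top four leaderboard rows by hand, and the names `TASKS`, `SCORES`, and `best_per_task` are hypothetical helpers introduced for illustration. It reports which of those models scores highest on each task.

```python
# Task columns, in the same order as the leaderboard.
TASKS = ("ER", "EU", "KIR", "TG", "Rea", "Sum")

# Top four leaderboard rows (accuracy in %), copied by hand from the table.
# Illustrative subset only; extend with further rows as needed.
SCORES = {
    "mPLUG-Owl3":    (46.0, 41.6, 42.4, 41.1, 47.5, 40.4),
    "Qwen2-VL-72B":  (38.0, 41.1, 38.3, 41.4, 46.5, 46.6),
    "TimeMarker":    (42.8, 39.1, 34.9, 38.7, 38.2, 48.8),
    "InternVL2-40B": (37.4, 39.7, 43.4, 31.4, 42.5, 41.4),
}


def best_per_task(scores):
    """Map each task to the (model, accuracy) pair with the highest accuracy."""
    best = {}
    for i, task in enumerate(TASKS):
        # max() over the dict iterates model names; compare by the i-th score.
        model = max(scores, key=lambda m: scores[m][i])
        best[task] = (model, scores[model][i])
    return best


if __name__ == "__main__":
    for task, (model, acc) in best_per_task(SCORES).items():
        print(f"{task:>4}: {model} ({acc:.1f}%)")
```

Among these four rows, no single model leads every task: the highest-ranked model overall is not the best on KIR, TG, or Sum.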

LVBench

Example

Statistics

(Left) Video categories. Our dataset contains 6 major categories and 21 subcategories.
(Right) Performance radar chart of different models on LVBench.

Benchmark Comparison

Comparison of different datasets. "Open-domain" indicates whether the videos come from diverse sources. "Multi-type" indicates whether the questions span more than two categories.

Experimental Results

Answer Distribution

Distribution of answers generated by different models.
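
An answer distribution of this kind can be tallied from a model's raw predictions. The sketch below is an illustration only: it assumes multiple-choice questions with four options labeled A to D, and the function name `answer_distribution` and the sample predictions are hypothetical, not taken from the LVBench evaluation code.

```python
from collections import Counter


def answer_distribution(predictions, options=("A", "B", "C", "D")):
    """Return the fraction of predictions that fall on each option.

    Predictions are normalized (whitespace stripped, upper-cased). Answers
    outside `options` still count toward the total, so the fractions may
    sum to less than 1 when a model produces invalid answers.
    """
    total = len(predictions)
    if total == 0:
        return {option: 0.0 for option in options}
    counts = Counter(p.strip().upper() for p in predictions)
    return {option: counts.get(option, 0) / total for option in options}


if __name__ == "__main__":
    # Hypothetical predictions from one model on four questions.
    sample = ["A", "b", "B ", "D"]
    for option, share in answer_distribution(sample).items():
        print(f"{option}: {share:.2f}")
```

A distribution that leans heavily toward one option would suggest a model is guessing by position instead of reading the video, which is the kind of bias such a plot makes visible.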

Model vs Human

LVBench evaluation results across different video categories.

Citation

@misc{wang2024lvbench,
      title={LVBench: An Extreme Long Video Understanding Benchmark},
      author={Weihan Wang and Zehai He and Wenyi Hong and Yean Cheng and Xiaohan Zhang and Ji Qi and Shiyu Huang and Bin Xu and Yuxiao Dong and Ming Ding and Jie Tang},
      year={2024},
      eprint={2406.08035},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}