A Quick Look at VideoLLaMA3

๐Ÿš€ VideoLLaMA 3: A State-of-the-Art Multimodal Video Understanding Model

๐Ÿ” ๊ฐœ์š”

๐Ÿ“„ Paper: https://arxiv.org/abs/2501.13106
๐Ÿ› ๏ธ GitHub: https://github.com/DAMO-NLP-SG/VideoLLaMA3

VideoLLaMA 3 is a recent multimodal foundation model for image and video understanding.
It applies a vision-centric training paradigm and a framework design that accounts for temporal structure, delivering strong performance.

๐ŸŽฏ Key Features

๐Ÿ”ฅ Vision-Centric Training Paradigm

๊ธฐ์กด์˜ ๋น„๋””์˜ค-ํ…์ŠคํŠธ ๋ฐ์ดํ„ฐ์…‹์€ ํ’ˆ์งˆ์ด ๋‚ฎ๊ฑฐ๋‚˜ ๋ถ€์กฑํ•œ ๊ฒฝ์šฐ๊ฐ€ ๋งŽ์Œ.
์ด๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ๊ณ ํ’ˆ์งˆ ์ด๋ฏธ์ง€-ํ…์ŠคํŠธ ๋ฐ์ดํ„ฐ ์ค‘์‹ฌ์˜ ํ•™์Šต์„ ์ ์šฉํ•จ.

๐Ÿ“Œ Four-Stage Training Process (a rough sketch of the recipe follows below)
1๏ธโƒฃ Vision Encoder Adaptation
2๏ธโƒฃ Vision-Language Alignment
3๏ธโƒฃ Multi-task Fine-tuning
4๏ธโƒฃ Video-centric Fine-tuning


๐ŸŽฌ ํ˜์‹ ์ ์ธ ๋น„๋””์˜ค ์ฒ˜๋ฆฌ ๊ธฐ์ˆ 

1๏ธโƒฃ Any-resolution Vision Tokenization (AVT)

โœ” Handles inputs at a wide range of resolutions
โœ” Preserves high-resolution detail in video data (see the token-count sketch below)
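
The core idea behind AVT is that the number of vision tokens scales with the input resolution rather than being fixed by a square crop. Below is a minimal sketch of that token-count arithmetic, assuming a ViT-style patch size of 14; the patch size and helper function are illustrative assumptions, not the exact implementation.

import math

def avt_token_count(width: int, height: int, patch_size: int = 14) -> int:
    """Illustrative only: token count grows with resolution instead of a fixed grid."""
    return math.ceil(width / patch_size) * math.ceil(height / patch_size)

# A 1280x720 frame keeps far more tokens (and detail) than a fixed 336x336 crop.
print(avt_token_count(1280, 720))  # 4784 tokens (92 x 52 patches)
print(avt_token_count(336, 336))   # 576 tokens (24 x 24 patches)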

2๏ธโƒฃ Differential Frame Pruner (DiffFP)

โœ” Removes redundant frames to cut computation
โœ” Keeps only the informative content for efficient video processing (see the pruning sketch below)
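
DiffFP works by dropping video content that barely changes between adjacent frames. Below is a minimal sketch of the idea, assuming a mean absolute pixel difference against the last kept frame with an arbitrary threshold; the actual pruner operates at the patch/token level with its own distance measure and threshold.

import numpy as np

def prune_redundant_frames(frames: np.ndarray, threshold: float = 0.05) -> list[int]:
    """Return indices of frames that differ enough from the last kept frame.
    frames: (T, H, W, C) array scaled to [0, 1]. Sketch only, not DiffFP itself."""
    kept = [0]  # always keep the first frame
    for t in range(1, len(frames)):
        diff = np.abs(frames[t] - frames[kept[-1]]).mean()  # mean absolute difference
        if diff > threshold:
            kept.append(t)
    return kept

# Near-identical frames get pruned; random frames like these are all kept.
frames = np.random.rand(16, 224, 224, 3).astype(np.float32)
print(prune_redundant_frames(frames))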

3๏ธโƒฃ ๊ณ ํ’ˆ์งˆ ๋ฐ์ดํ„ฐ์…‹ ํ™œ์šฉ

  • VL3-Syn7M ๋ฐ์ดํ„ฐ์…‹ ๊ตฌ์ถ• (7๋ฐฑ๋งŒ ๊ฐœ์˜ ๊ณ ํ’ˆ์งˆ ์ด๋ฏธ์ง€-ํ…์ŠคํŠธ ์Œ)
  • OCR ๋ฐ์ดํ„ฐ, ์ฐจํŠธ ๋ถ„์„ ๋ฐ์ดํ„ฐ, ์ˆ˜ํ•™์  ์‹œ๊ฐ์  ๋ฌธ์ œ ํ•ด๊ฒฐ ๋ฐ์ดํ„ฐ ํฌํ•จ

4๏ธโƒฃ ๋Œ€๊ทœ๋ชจ ์‚ฌ์ „ ํ•™์Šต

  • OpenAI, Meta ๋“ฑ์˜ ์ตœ์‹  ์—ฐ๊ตฌ ๋ฐ˜์˜ํ•œ Qwen2.5 LLM ๋ชจ๋ธ ๊ธฐ๋ฐ˜
  • ์‚ฌ์ „ ํ›ˆ๋ จ๋œ SigLIP ๋น„์ „ ์ธ์ฝ”๋” ๊ฐœ์„ 


๐Ÿ“Š Performance Evaluation

๐Ÿ–ผ๏ธ Image Understanding Performance

๋ชจ๋ธChartQADocVQAMathVistaMMMU-ProRealWorldQA
VideoLLaMA 3 (7B)86.394.967.133.672.7
Qwen2-VL 7B83.094.558.231.470.1
LLaVA-OneVision80.087.563.224.166.3

๐ŸŽฌ Video Understanding Performance

๋ชจ๋ธVideoMMEPerceptionTestMLVUTempCompassNextQA
VideoLLaMA 3 (7B)66.272.873.068.184.5
InternVL2.5 8B64.268.969.068.385.0
Qwen2-VL 7B63.362.369.867.981.2

โœ… Achieves SOTA results on most of these benchmarks!


๐Ÿ› ๏ธ ์„ค์น˜ ๋ฐ ์‚ฌ์šฉ๋ฒ•

๐Ÿ“Œ Basic Environment Setup

pip install torch==2.4.0 torchvision==0.19.0 --extra-index-url https://download.pytorch.org/whl/cu118
pip install flash-attn --no-build-isolation
pip install transformers==4.46.3 accelerate==1.0.1
pip install decord ffmpeg-python imageio opencv-python
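
After installation, a quick sanity check (an optional snippet, not part of the repo) confirms the pinned packages import cleanly and that CUDA is visible, which flash_attention_2 requires:

import torch, transformers, flash_attn

# Versions should match the pins above; CUDA must be available for FlashAttention-2.
print("torch:", torch.__version__, "| cuda available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)
print("flash-attn:", flash_attn.__version__)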

๐Ÿ“Œ ๋ชจ๋ธ ๋‹ค์šด๋กœ๋“œ

git clone https://github.com/DAMO-NLP-SG/VideoLLaMA3
cd VideoLLaMA3
pip install -r requirements.txt

๐Ÿ“Œ Inference Code Example

import torch
from transformers import AutoModelForCausalLM, AutoProcessor

device = "cuda:0"
model_path = "DAMO-NLP-SG/VideoLLaMA3-7B"

# Load the 7B checkpoint in bfloat16 with FlashAttention-2 enabled.
model = AutoModelForCausalLM.from_pretrained(
    model_path, trust_remote_code=True, device_map={"": device},
    torch_dtype=torch.bfloat16, attn_implementation="flash_attention_2",
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

# Multimodal conversation: the video is sampled at 1 fps, up to 180 frames.
conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {
        "role": "user",
        "content": [
            {"type": "video", "video": {"video_path": "./assets/cat_and_chicken.mp4", "fps": 1, "max_frames": 180}},
            {"type": "text", "text": "What is the cat doing?"}
        ]
    },
]

# Tokenize the conversation, move tensors to the GPU, and generate a response.
inputs = processor(conversation=conversation, add_system_prompt=True, add_generation_prompt=True, return_tensors="pt")
inputs = {k: v.to(device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}
output_ids = model.generate(**inputs, max_new_tokens=1024)
response = processor.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(response)

๐Ÿ” ํ™œ์šฉ ์‚ฌ๋ก€

๐Ÿ–ผ๏ธ ์ฐจํŠธ ๋ถ„์„ (Chart Understanding)

๐Ÿ“Œ Question: Is this stock worth holding?
๐Ÿ“Œ VideoLLaMA 3's answer: "This stock appears highly volatile, with a high investment risk."
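
For single-image use cases like this, the same model and processor loaded in the inference example can be driven with image content instead of video. Here is a hedged sketch, assuming the image entry mirrors the nested format of the video example; the key names and the chart path are placeholders and may differ from the actual API.

# Assumes `model`, `processor`, and `device` from the inference example above.
conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {
        "role": "user",
        "content": [
            # Placeholder path; the nested "image"/"image_path" keys mirror the video
            # example and are an assumption about the processor's image input format.
            {"type": "image", "image": {"image_path": "./assets/stock_chart.png"}},
            {"type": "text", "text": "Is this stock worth holding?"},
        ],
    },
]

inputs = processor(conversation=conversation, add_system_prompt=True, add_generation_prompt=True, return_tensors="pt")
inputs = {k: v.to(device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}
output_ids = model.generate(**inputs, max_new_tokens=512)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0].strip())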

๐Ÿ“„ OCR and Document Understanding

๐Ÿ“Œ Question: Please summarize the contents of this document.
๐Ÿ“Œ VideoLLaMA 3's answer: "The main points from the document are…"

๐ŸŽฌ Video Captioning

๐Ÿ“Œ Question: Please describe what happens in this video.
๐Ÿ“Œ VideoLLaMA 3's answer: "The video opens with a scene of a spacecraft in orbit…"


๐Ÿš€ Conclusion

VideoLLaMA 3 delivers some of the strongest results among current multimodal AI models, and it is particularly strong at video and image understanding.

โœ” Applies a vision-centric training paradigm
โœ” Achieves SOTA results (top scores on many recent benchmarks)
โœ” Supports a wide range of applications: video captioning, OCR, chart analysis, document understanding, and more

This post is licensed under CC BY 4.0 by the author.