
VLABench: A Large-Scale Benchmark for Language-Conditioned Robotics Manipulation with Long-Horizon Reasoning Tasks

📄 Paper Information

“VLABench: A Large-Scale Benchmark for Language-Conditioned Robotics Manipulation with Long-Horizon Reasoning Tasks” (ICCV 2025)

1. Overview

VLABench is a benchmark for language-conditioned manipulation (LCM), the largest of its kind to date, designed to evaluate Vision-Language-Action (VLA) models and VLM-based robot manipulation workflows. Existing task suites such as RLBench, CALVIN, and LIBERO are useful, but they do not adequately measure what realistic human–robot interaction demands: multi-step reasoning, commonsense judgment, complex natural language understanding, scene diversity, and generalization to new object categories.

To overcome these limitations, VLABench is organized as follows.

  • 100 categories of manipulation tasks
  • 60 Primitive Tasks + 40 Composite Tasks
  • More than 2,000 3D objects and a variety of indoor environments (scenes)

2. Why Is a New Benchmark Needed?

First, let's look at why a new benchmark is needed.

① Existing benchmarks rely on template-based instructions → they fail to reflect the complexity of natural language

Most existing tasks are as simple as the following.

  • “Pick up the red block”
  • “Open the drawer”
  • “Put the apple on the plate”

Language like this

  • fails to reflect situational context, emotion, or implicit requests the way real human utterances do, and
  • does not test a language model's deeper semantic understanding at all.

VLABench instructions, by contrast, contain many implicit or indirect expressions, such as the following.

  • “I just got back from an hour at the gym and I'm so thirsty. Something cold to drink…”
  • “I'm going to do my Python homework in a bit, so please get the desk ready.”
  • “Put the national flower of the Netherlands in the vase.”

Carrying out these instructions requires the knowledge and understanding listed below, none of which existing benchmarks address.

  • World knowledge (national flower of the Netherlands = tulip)
  • Understanding of emotion and situation (“I'm thirsty” → a drink)
  • Task decomposition (“get the desk ready” → tidy up + open the laptop)
  • Object-action mapping

② Existing benchmarks center on single skills → composite skill combination and multi-step reasoning are under-evaluated

Most existing tasks consist of just one of the following skills.

  • Grasping (Grasp)
  • Moving (Place)
  • Pressing a button (Press)

However, many real human instructions require composite skills executed over several steps.

Example: “Make me a latte”

  1. Grasp the cup
  2. Locate the coffee machine
  3. Press the button to brew
  4. Pour milk from the milk container
  5. Place the cup in an appropriate location

In other words, such an instruction requires long-horizon planning, subtask decomposition, and sensor-based physical manipulation all at once. Existing suites such as RLBench and LIBERO lack the structure needed to measure this kind of compound reasoning. VLABench addresses the limitation by including Composite Tasks that cannot be solved without multi-step reasoning.
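The five latte steps above can be written down as an ordered plan of subtasks, which makes the contrast with single-skill tasks concrete. This is only an illustrative sketch: the `Subtask` type, the skill labels, and the helper functions are hypothetical names introduced here, not part of VLABench.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subtask:
    skill: str   # hypothetical primitive skill label
    target: str  # object the skill acts on


# Hypothetical decomposition of "Make me a latte", mirroring the five steps above.
LATTE_PLAN = [
    Subtask("grasp", "cup"),
    Subtask("locate", "coffee machine"),
    Subtask("press", "brew button"),
    Subtask("pour", "milk"),
    Subtask("place", "cup"),
]


def distinct_skills(plan):
    """Return the skills a plan uses, in order of first use."""
    seen = []
    for step in plan:
        if step.skill not in seen:
            seen.append(step.skill)
    return seen


def is_single_skill(plan):
    """True when a plan needs only one kind of skill."""
    return len(distinct_skills(plan)) == 1
```

Under these assumptions, `is_single_skill([Subtask("grasp", "red block")])` is true, while the latte plan chains five different skills whose order matters.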

③ No tasks that require common sense and world knowledge

Example: “I'm thirsty, so bring me a cold drink.”

Carrying out this instruction requires the following.

  • The commonsense chain “after exercise → thirst → a cold drink”
  • The contextual knowledge that the drink is most likely in the refrigerator
  • The skill of picking up a drink bottle
  • The visual ability to tell a cup apart from a drink

Existing benchmarks do not require common sense. VLABench is the first benchmark to explicitly build it into the task requirements as a core element.

④ Limits of existing generalization evaluation → VLABench proposes Category-Level Unseen Generalization

Existing benchmarks focus on instance-level generalization, which evaluates only variations within the same category.

  • Train: red apple
  • Test: green apple

In real environments, however, a robot encounters objects from entirely new categories.

  • Train: apple, banana
  • Test: lemon, kiwi, strawberry

To evaluate generalization more realistically, VLABench therefore adopts a category-based split like the following.

  • Seen categories: apple, banana, pear
  • Unseen categories: lemon, kiwi, mango

This is not just a matter of generalizing visual information; it requires the following abilities.

  • Understanding of linguistic meaning (semantic grounding)
  • Commonsense reasoning
  • Understanding of per-category physical properties and affordances
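The seen/unseen split described above can be sketched in a few lines. The object list and the `split_by_category` helper are hypothetical and reuse only the fruit names from the example; the benchmark's actual asset metadata may be organized differently.

```python
def split_by_category(objects, seen_categories):
    """Partition (name, category) pairs into train (seen) and test (unseen) sets."""
    seen = set(seen_categories)
    train = [obj for obj in objects if obj[1] in seen]
    test = [obj for obj in objects if obj[1] not in seen]
    return train, test


# Hypothetical object list built from the fruit names used in the text.
objects = [
    ("red apple", "apple"),
    ("green apple", "apple"),
    ("banana", "banana"),
    ("pear", "pear"),
    ("lemon", "lemon"),
    ("kiwi", "kiwi"),
    ("mango", "mango"),
]

train, test = split_by_category(objects, ["apple", "banana", "pear"])
train_categories = {category for _, category in train}
test_categories = {category for _, category in test}
```

The key property is that `train_categories` and `test_categories` share no element. An instance-level split would instead put "red apple" in training and "green apple" in testing, both within the same category.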

⑤ Limited visual diversity in existing environments → VLABench provides large-scale, highly diverse objects and scenes

Existing task suite environments have the following limitations.

  • A limited number of objects with monotonous shapes
  • Little diversity in meshes and textures
  • Insufficient scene randomization
  • Low difficulty because distractor objects are absent

This setup makes it easy for a model to overfit to specific objects and scenes.

To address these problems, VLABench provides the following environment design.

  • A library of more than 2,000 high-quality 3D objects
  • Varied texture, geometry, background, and lighting settings
  • Viewpoint diversity through multiple cameras
  • Unstructured placement of distractor objects

This produces more realistic and complex environments, making it possible to evaluate generalizable vision, language, and action intelligence.
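As a rough illustration of scene randomization with distractors, the sketch below samples one scene configuration. The dictionary layout, the object and texture names, and the coordinate range are all assumptions made for this example; they do not describe VLABench's actual scene-generation API.

```python
import random


def sample_scene(target, object_pool, textures, lightings, n_distractors, rng):
    """Sample one randomized scene configuration (hypothetical structure)."""
    # Distractors are drawn from every object except the target.
    candidates = [name for name in object_pool if name != target]
    distractors = rng.sample(candidates, n_distractors)
    return {
        "target": target,
        "distractors": distractors,
        "texture": rng.choice(textures),
        "lighting": rng.choice(lightings),
        # Unstructured placement: each object gets an independent random (x, y).
        "positions": {
            name: (rng.uniform(-0.5, 0.5), rng.uniform(-0.5, 0.5))
            for name in [target, *distractors]
        },
    }


rng = random.Random(0)  # fixed seed so the sketch is reproducible
scene = sample_scene(
    target="cup",
    object_pool=["cup", "plate", "banana", "book", "vase"],
    textures=["wood", "marble"],
    lightings=["warm", "cool"],
    n_distractors=3,
    rng=rng,
)
```

Each call yields a different combination of distractors, texture, lighting, and layout, which is the property that discourages overfitting to one fixed scene.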

Table 1: Comparison of Popular Benchmarks in Robot Learning

3. Structure of VLABench

VLABench aims at human-level embodied general intelligence (Embodied AGI) grounded in language, knowledge, and action, and its task suite consists of 100 manipulation tasks in total. Every task is given as a natural-language instruction in a simulated environment, and the model must integrate language understanding, visual perception, and action planning.

✔ 3.1 Primitive Tasks (60 total)

Primitive tasks are a set of basic tasks designed to directly evaluate a single core capability (capability primitive). Each task is highly interpretable and is built so that the cause of a failure in a specific sub-capability can be clearly identified.

The main capabilities that primitive tasks evaluate are as follows.

| Capability | Description |
| --- | --- |
| Mesh Recognition | Distinguishing differences in geometry even when the texture is identical |
| Texture Recognition | Identifying objects by visual texture even when the geometry is identical |
| Spatial Understanding | Understanding positional relations, orientation, and relative placement |
| Semantic Understanding | Mapping linguistic meaning to objects |
| Physical Reasoning | Understanding physical properties such as weight, buoyancy, stackability, and stability |
| Common Sense | Basic commonsense judgments grounded in everyday human life |

Examples:

  • “Pick up the metallic cup, not the paper one.”
  • “Place the banana above the plate, not under the desk.”

Purpose: quantitative, diagnostic evaluation of isolated (component-wise) skills.

✔ 3.2 Composite Tasks (40 total)

Composite tasks evaluate higher-order capabilities such as understanding human intent, situational reasoning, and long-horizon planning. They require multiple primitive skills to be combined naturally and are designed around scenarios that demand deep reasoning.

Composite tasks have the following characteristics.

  • The linguistic context and the implicit intent must be understood
  • A multi-step action sequence, not a single action, must be executed
  • Conditional execution based on the situation and environment is required
  • Targets and tools that are not stated explicitly must be inferred and selected

Examples:

  • “A friend is coming over soon, so tidy the table, put the vase in the center, and fill a clean cup with water.”
  • “I just got back from working out and I'm really hot, so bring me a cold drink.”

In short, composite tasks jointly evaluate the following abilities.

  • High-level natural-language reasoning
  • Task decomposition
  • Deciding the order of execution (Planning)
  • Setting intermediate goals (Subgoal inference)
  • Selective action strategies (Adaptive reasoning)

Purpose: a practical evaluation of human-level embodied reasoning and planning.

4. Evaluation Method (Benchmark Protocol)

4.1 Groups of Systems Under Evaluation

VLABench is not limited to a single kind of model. To compare different embodied AI architectures, it defines the following three groups.

| Group | Included models | Evaluation purpose |
| --- | --- | --- |
| VLA | OpenVLA, RDT, etc. | End-to-end capability |
| VLM-based Workflow | VoxPoser, CoPa, etc. | Planning quality of a modular pipeline |
| Pure VLM | GPT-4o, Qwen2-VL, etc. | Reasoning capability when the model cannot act directly in the environment |

That is, it evaluates not only executable robot models but also language-based planning and reasoning ability on its own.

4.2 Generalization Evaluation Design

VLABench goes beyond conventional instance-level generalization and makes Semantic Category-level Generalization a core evaluation target.

| Existing benchmarks | VLABench |
| --- | --- |
| Generalization across differences in color or size | Entirely new categories appear |
| Apple (Train) → Green Apple (Test) | Apple, Banana (Train) → Lemon, Kiwi (Test) |
| Vision-based domain generalization | Vision + Language + World Knowledge |

This setup evaluates Semantic Transfer and requires the following.

  • Inferring aesthetic and material features by analogy
  • Understanding the hierarchy of linguistic categories (structural taxonomy)
  • Planning actions based on similar affordances

In other words, it evaluates

“whether the model can plan actions for something it has never seen, by semantically inferring that it belongs to a known group.”

4.3 A New Metric – Progress Score (PS)

Existing metrics based on binary success/failure are insufficient for evaluating long-horizon planning and compound reasoning tasks. To address this, VLABench introduces the Progress Score (PS), a continuous partial-credit metric.

Components considered by PS

| Component | Description |
| --- | --- |
| Completion Accuracy | Accuracy of the target object and the receptacle |
| Action Progress | How far through the overall plan the agent got |
| Subgoal Achievement | Checks whether each intermediate step succeeded |
| Error Severity | Weights failures differently depending on where and how they occur |

Conceptual formula

PS is computed as the following weighted combination.

$$\mathrm{PS} = \alpha \cdot (\text{Object Accuracy}) + (1 - \alpha) \cdot (\text{Action Progress})$$

where

  • α is a weight parameter set according to the nature of the task (e.g., tunable between 0.5 and 0.7)
  • Object Accuracy is the accuracy of target-object grounding
  • Action Progress is the fraction of sub-steps covered
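The weighted combination can be sketched as a small function. This is only a direct transcription of the conceptual formula in this post, not the benchmark's reference implementation; the function name, the default `alpha`, and the range check are assumptions.

```python
def progress_score(object_accuracy, action_progress, alpha=0.5):
    """PS = alpha * object_accuracy + (1 - alpha) * action_progress.

    All three arguments are assumed to lie in [0, 1].
    """
    for name, value in (
        ("object_accuracy", object_accuracy),
        ("action_progress", action_progress),
        ("alpha", alpha),
    ):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return alpha * object_accuracy + (1.0 - alpha) * action_progress
```

For instance, an episode that grounds the right object (`object_accuracy=1.0`) but completes only half of the sub-steps (`action_progress=0.5`) gets `progress_score(1.0, 0.5, alpha=0.5) == 0.75`, whereas a binary metric would record a plain failure.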

Example

| Step | Action | PS increase |
| --- | --- | --- |
| Step 1 | Recognize the correct object | +0.2 |
| Step 2 | Grasp correctly | +0.2 |
| Step 3 | Transport and relocate | +0.2 |
| Step 4 | Place precisely in the receptacle | +0.4 |

Even when a task is not completed exactly, the score quantitatively reflects how close the attempt came to the correct outcome.
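The step table above can be turned into a tiny partial-credit calculator. The step weights are copied from the table; the assumption that steps are completed strictly in order, and the names `STEP_WEIGHTS` and `cumulative_progress`, are introduced here for illustration only.

```python
# Step weights copied from the example table; they sum to 1.0.
STEP_WEIGHTS = [
    ("recognize the correct object", 0.2),
    ("grasp correctly", 0.2),
    ("transport and relocate", 0.2),
    ("place precisely in the receptacle", 0.4),
]


def cumulative_progress(steps_completed):
    """Sum the weights of the first `steps_completed` steps (assumed sequential)."""
    if not 0 <= steps_completed <= len(STEP_WEIGHTS):
        raise ValueError("steps_completed out of range")
    total = sum(weight for _, weight in STEP_WEIGHTS[:steps_completed])
    # Round away floating-point noise from adding 0.2 repeatedly.
    return round(total, 10)
```

An episode that recognizes, grasps, and carries the object but fails the final placement gets `cumulative_progress(3)`, which is 0.6, instead of the 0 a success/failure metric would assign.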

This metric can be used directly for comparing long-horizon learning curves, studying partial reinforcement feedback, and analyzing skill transfer.

This post is licensed under CC BY 4.0 by the author.