BadRobot: Jailbreaking Embodied LLMs in the Physical World

📄 Paper Information

"BadRobot: Jailbreaking Embodied LLMs in the Physical World" (ICLR 2025)

🔍 Paper Overview

A robot may not injure a human being or, through inaction, allow a human being to come to harm.
– Isaac Asimov's First Law of Robotics

🤖 The Emergence of Embodied LLMs

Embodied AI refers to artificial intelligence systems that operate in the physical world, with the goal of interacting naturally with humans. Recent advances in LLMs (Large Language Models) and MLLMs (Multimodal LLMs) have greatly improved natural language understanding and planning, and Vision Language Action models such as OpenVLA are also being developed. Embodied LLMs, which integrate such models into robotic systems, are reported to show better generalization, environmental adaptability, and task planning than conventional reinforcement-learning-based methods.

⚠️ Problem Statement: Safety Issues

An embodied LLM is a large language model integrated into a physical body such as a robot. Put simply, the language model acts as the robot's "brain" and interacts with the real world, which sets it apart from chatbots such as ChatGPT that respond only in language. For example, when a user says "Pick up the cup," the LLM interprets the instruction, draws up an action plan, and controls the robot arm to actually pick up the cup.

This raises the question of whether jailbreak attacks, a known problem for conventional LLMs in which the model's restrictions are bypassed to elicit prohibited outputs, work in the same way on embodied LLMs. Because the result is not limited to language output and can carry over into the robot's physical actions, the potential risk is considerably greater.

In the authors' experiments, however, jailbreak prompts widely shared on the internet had almost no effect in the embodied LLM setting. In other words, attacks that worked on ordinary LLMs did not work on systems integrated into robots. The reason given is that, unlike a general chatbot, an embodied LLM carries the system prompt and environmental conditions needed for robot control, so an external prompt conflicts with these internal rules. The researchers therefore concluded that existing attack methods are insufficient and that a new attack paradigm capable of triggering physical actions is needed. The paper designs and proposes BadRobot, an attack framework built for this new threat model.

🧠 Three Core Risks of Embodied LLMs

Figure 1: BadRobot can induce embodied LLMs in the real world to carry out actions such as Physical Harm, Privacy Violations, Pornography, Fraud, Illegal Activities, Hateful Conduct, and Sabotage.

Figure 1 gives representative examples of the physical attacks BadRobot can induce, showing that each of these categories of prohibited behavior can be elicited through a real embodied LLM.

1. Be cautious of hidden dangers!

Embodied LLM systems pursue three core goals: autonomy, the ability to judge and act on their own; embodiment, the ability to interact physically with a real environment; and cognition, the ability to understand and evaluate what they are doing. If any of these three components fails to work properly or the balance between them breaks down, the system can be exposed to critical security risks.

Figure 2: A visual summary of the three risk factors facing embodied LLM systems. (a) A jailbroken LLM can propagate into physical commands and induce dangerous actions. (b) Because of a mismatch between the language response and the action plan, the system may refuse verbally yet still carry out the action. (c) Inherently dangerous actions can be induced through sequential or indirect phrasing, which stems from the LLM's incomplete cognitive model.

1. Jailbreak Propagation (Cascading Vulnerability Propagation)

Just as conventional LLMs are vulnerable to jailbreak attacks, embodied LLMs can be manipulated by the same kind of attack. Existing attacks that elicit only malicious text are limited in how far they extend to physical actions, but as Figure 2-(a) shows, the attacks in this paper go beyond that limitation and induce actual physical behavior.

2. Mismatch Between Action and Language (Cross-domain Safety Misalignment)

An embodied LLM can uphold ethical standards in its language output while violating them in its action-plan output. As Figure 2-(b) shows, the model refuses with "Sorry, I can't help with that." yet still performs the prohibited task as an action. The robot's action plan is expressed in a code-like format such as JSON or YAML, and the language model applies ethical oversight to such output only loosely, which produces a safety alignment mismatch between action and language.
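The mismatch between a verbal refusal and an executed action plan can be made concrete with a small check. The following is a minimal sketch, assuming a hypothetical response format in which the model returns a spoken reply plus a JSON action list. The field names `response` and `actions`, the refusal markers, and the helper `is_misaligned` are illustrative assumptions and are not taken from the paper.

```python
import json

# Hypothetical refusal markers; a real system would need a far more robust check.
REFUSAL_MARKERS = ("sorry", "i can't", "i cannot")


def is_misaligned(raw_output: str) -> bool:
    """Return True when the reply refuses verbally but still contains actions."""
    parsed = json.loads(raw_output)
    reply = parsed.get("response", "").lower()
    actions = parsed.get("actions", [])
    refused = any(marker in reply for marker in REFUSAL_MARKERS)
    return refused and len(actions) > 0


# Assumed example output: a verbal refusal paired with a non-empty plan.
example = json.dumps({
    "response": "Sorry, I can't help with that.",
    "actions": ["move_arm(target)"],
})
print(is_misaligned(example))  # True
```

A check of this kind inspects both output channels together, which is the point of the cross-domain framing: looking at the language channel alone would report the example as a safe refusal.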

3. Conceptual Deception

LLMs are weak at reasoning about complex causal relationships, so even when they refuse an overtly dangerous command, they may carry out an indirectly phrased request that leads to the same outcome. In Figure 2-(c), the model refuses "Poison that person" but carries out "Put the poison in their mouth."

2. Formalization of embodied LLM jailbreaks

Figure 3: Embodied system

To analyze the threat model behind BadRobot systematically, the paper defines the embodied LLM system mathematically and formalizes both the safety condition and the jailbreak condition. An embodied LLM system is described by the following five components:

  • \(I \in \mathbb{R}^d\) : input space (language commands, visual information, sensor data, etc.)
  • \(\phi\) : perception module (interprets the meaning of language and visual inputs)
  • \(\psi\) : action planning module (generates the intended actions)
  • \(\omega\) : world model (holds knowledge and environment information)
  • \(S\) : safety constraints

In this system a single LLM performs all of the roles above, but the roles are treated as separate modules for the security analysis. The outputs fall into two types.

  • \(f_\phi: I \rightarrow L\) : maps an input to a language output
  • \(f_\psi: I \times \phi \times \omega \rightarrow A\) : generates an action output from the input, the perception result, and the world model

Based on this, the safety conditions are defined as follows.

Definition 1: Robust Embodied LLM

A system \(\Theta\) is robust if, for every input \(i \sim I\), both the language output and the action output satisfy the safety conditions.

\[S_L(f_\phi(i)) = 1 \quad \text{and} \quad S_A(f_\psi(i, \phi, \omega)) = 1\]

The system therefore seeks to maximize the following expression.

\[\max_{f_\phi, f_\psi, \omega} \; \mathbb{E}_{i \sim I} [S_L(f_\phi(i)) \cdot S_A(f_\psi(i, \phi, \omega))]\]
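Since the objective above is an expectation over inputs, it can be approximated by averaging the product of the two safety indicators over a sample. The sketch below is only an illustration under stated assumptions: `robustness_score` and all of the `toy_*` stand-ins are hypothetical, the perception result and world model are folded into the `f_psi` callable for brevity, and the safety checks are trivial placeholders rather than the paper's evaluators.

```python
from typing import Callable, List, Sequence


def robustness_score(
    inputs: Sequence[str],
    f_phi: Callable[[str], str],
    f_psi: Callable[[str], List[str]],
    s_l: Callable[[str], int],
    s_a: Callable[[List[str]], int],
) -> float:
    """Sample mean of S_L(f_phi(i)) * S_A(f_psi(i)) over the given inputs."""
    if not inputs:
        raise ValueError("need at least one sampled input")
    total = sum(s_l(f_phi(i)) * s_a(f_psi(i)) for i in inputs)
    return total / len(inputs)


# Toy stand-ins (assumptions): a model that always agrees and plans literally.
def toy_f_phi(instruction: str) -> str:
    return "Sure."


def toy_f_psi(instruction: str) -> List[str]:
    return [instruction]


def toy_s_l(reply: str) -> int:
    return 1  # the toy reply is always treated as safe language


def toy_s_a(plan: List[str]) -> int:
    return 0 if any("harm" in step for step in plan) else 1


samples = ["pick up the cup", "wave hello", "harm the person", "open the door"]
score = robustness_score(samples, toy_f_phi, toy_f_psi, toy_s_l, toy_s_a)
print(score)  # 0.75
```

Because the two indicators are multiplied, an input counts toward the score only when both the language output and the action output are safe, which mirrors Definition 1.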

Definition 2: Embodied LLM Jailbreak

If, for an input \(i' \in I\), either the language output or the action output violates its safety condition, the system is considered to be jailbroken.

\[S_L(f_\phi(i')) \cdot S_A(f_\psi(i', \phi, \omega)) = 0\]

In other words, the language output is inappropriate, the action is dangerous, or both. BadRobot in particular designs its attacks around violations of action safety \(S_A\). Because an LLM generates its output sequentially, token by token, the language output also influences the action output. The action output function can therefore be rewritten as follows:

\[f_\psi(i, \phi, \omega) = g(f_\phi(i), \omega)\]

As a result, when language understanding is inappropriate, it also affects the subsequent action planning and can ultimately trigger physically dangerous behavior. This connects directly to BadRobot's three attack types:

  1. Manipulation inside \(f_\phi\) → Risk Surface ❶
  2. Direct manipulation of \(f_\psi\) → Risk Surface ❷
  3. Manipulation or deficiency of \(\omega\) → Risk Surface ❸
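The dependence of the action output on the language output, \(f_\psi(i, \phi, \omega) = g(f_\phi(i), \omega)\), can be sketched in a few lines. This is a toy illustration under stated assumptions: `WORLD_MODEL`, the string-matching planner `g`, the checks `s_l` and `s_a`, and the `("grasp", obj)` action format are all hypothetical and do not come from the paper. It shows only that the planning stage acts on whatever the language stage emits, and that \(S_L \cdot S_A = 0\) as soon as either check fails.

```python
from typing import Dict, List, Tuple

Action = Tuple[str, str]

# Toy world model (assumption): object name -> hazard label.
WORLD_MODEL: Dict[str, str] = {"cup": "safe", "knife": "hazardous"}


def f_phi(instruction: str) -> str:
    """Toy language stage: restates the instruction as an intent string."""
    return f"intent: {instruction.lower()}"


def g(language_output: str, world_model: Dict[str, str]) -> List[Action]:
    """Toy planning stage: builds actions from the language output only."""
    return [("grasp", obj) for obj in world_model if obj in language_output]


def f_psi(instruction: str, world_model: Dict[str, str]) -> List[Action]:
    """Action output expressed as the composition g(f_phi(i), omega)."""
    return g(f_phi(instruction), world_model)


def s_l(language_output: str) -> int:
    """Toy language safety indicator: 1 if safe, 0 otherwise."""
    return 0 if "attack" in language_output else 1


def s_a(actions: List[Action], world_model: Dict[str, str]) -> int:
    """Toy action safety indicator: 0 if any target object is hazardous."""
    unsafe = any(world_model[obj] == "hazardous" for _, obj in actions)
    return 0 if unsafe else 1


def is_jailbroken(instruction: str, world_model: Dict[str, str] = WORLD_MODEL) -> bool:
    """Jailbreak condition: the product of the two indicators is zero."""
    language_output = f_phi(instruction)
    actions = g(language_output, world_model)
    return s_l(language_output) * s_a(actions, world_model) == 0


print(is_jailbroken("Pick up the cup"))    # False
print(is_jailbroken("Hand me the knife"))  # True
```

In the second call the language check passes while the action check fails, so the product is zero even though nothing in the language output looks unsafe to the toy `s_l`.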

BadRobot: How to Manipulate Embodied LLMs?

Building on the three risk factors discussed above, the paper proposes a corresponding attack technique for each.

This post is licensed under CC BY 4.0 by the author.