INSAIT Tech Series: Prof. Zico Kolter - AI Safety and Robustness: Recent Advances
Abstract
To prevent undesirable outputs, most large language models (LLMs) have built-in “guardrails” that enforce policies specified by their developers, for example, that an LLM should not produce output deemed harmful. Unfortunately, adversarial attacks on such models have made it possible to circumvent these safeguards, allowing bad actors to manipulate LLMs for unintended purposes. Historically, such adversarial attacks have been extremely hard to prevent. However, in this talk I will highlight several recent advances that have substantially improved the practical robustness of LLMs. This work has culminated in a recent competition in which attackers, after a month of attempts, were unable to break an LLM we had deployed. I’ll also cover the current state of the field and its challenges, and discuss the future of safe AI systems.