Alignment Faking in Large Language Models

Samuel Albanie

54 years ago

7,696 views

A summary of the work "Alignment Faking in Large Language Models" by Greenblatt et al. (2024).

Links
- Paper: https://arxiv.org/abs/2412.14093
- Code: https://github.com/redwoodresearch/alignment_faking_public/tree/master
- Blog post: https://www.anthropic.com/research/alignment-faking (the post links to an interview with the paper's co-authors, which is the source of the quote describing Claude 3 Opus's character).
- External reviews: https://assets.anthropic.com/m/24c8d0a3a7d0a1f1/original/Alignment-Faking-in-Large-Language-Models-reviews.pdf