Testing an LM system for dangerous capabilities is crucial for assessing its risks.

Summary of best practices

Best practices for labs evaluating LM systems for dangerous capabilities:

Publish results
Publish questions/tasks/methodology (unless publishing them would be dangerous, e.g. for CBRN evals; in that case, publish only a small subset and offer to share more information with other labs, government, and relevant auditors)
Do good elicitation and publish details (or at least demonstrate that your elicitation is good):
- General finetuning (for "instruction following, tool use, and general agency" and maybe capabilities in the relevant area)
- Helpful-only; no inference-time mitigations
- Scaffolding, prompting, chain of thought
  - The lab should mention some details so that observers can understand how powerful and optimized the scaffolding is. Open-sourcing scaffolding or sharing techniques is supererogatory. If the lab does not share its scaffolding, it should show that the scaffolding is effective by running the same model with the [...]

---

Outline:

(00:11) Summary of best practices

(04:26) How labs are doing

(06:40) Appendix: misc notes on best practices

(06:49) Appendix: misc notes on particular evals

(12:08) Appendix: reading list

The original text contained 2 footnotes which were omitted from this narration.

---

First published:
September 23rd, 2024

Source:
https://www.lesswrong.com/posts/x2yFrppX7RGz59LZF/model-evals-for-dangerous-capabilities

---

“Model evals for dangerous capabilities” by Zach Stein-Perlman

LessWrong (30+ Karma)