nger model? We test this using a range of pretrained language models in the GPT-4 family on natural language processing (NLP), chess, and reward modeling tasks. We find that when we naively fine-tune strong pretrained models on labels generated by a weak model, they consistently perform better than their weak supervisors, a phenomenon we call weak-to-strong generalization. However, we are still far from recovering the full capabilities of strong models with naive fine-tuning alone, suggesting that techniques like RLHF may scale poorly to superhuman models without further work.

We find that simple methods can often significantly improve weak-to-strong generalization: for example, when fine-tuning GPT-4 with a GPT-2-level supervisor and an auxiliary confidence loss, we can recover close to GPT-3.5-level performance on NLP tasks. Our results suggest that it is feasible to make empirical progress today on a fundamental challenge of aligning superhuman models.
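For listeners who want a concrete picture of what an auxiliary confidence loss might look like, here is a minimal sketch for a binary classification task. It is an illustration under stated assumptions, not the authors' implementation: the function names, the mixing weight `alpha`, and the hardening `threshold` are all hypothetical and do not appear in this description. The idea sketched is to blend the usual loss against the weak supervisor's label with a term that rewards the strong model for standing by its own confident prediction.

```python
import math


def binary_cross_entropy(p_pred, target, eps=1e-12):
    """Cross-entropy between a predicted probability and a target in [0, 1]."""
    p = min(max(p_pred, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def confidence_weighted_loss(strong_prob, weak_label, alpha=0.5, threshold=0.5):
    """Blend weak supervision with the strong model's own hardened prediction.

    strong_prob: strong model's predicted probability of the positive class.
    weak_label:  label (or soft label) produced by the weak supervisor.
    alpha:       assumed mixing weight; 0 recovers naive fine-tuning.
    threshold:   assumed cutoff used to harden the strong model's prediction.
    """
    # Harden the strong model's own prediction into a 0/1 pseudo-label.
    hardened = 1.0 if strong_prob > threshold else 0.0
    weak_term = binary_cross_entropy(strong_prob, weak_label)
    self_term = binary_cross_entropy(strong_prob, hardened)
    return (1.0 - alpha) * weak_term + alpha * self_term


if __name__ == "__main__":
    # The strong model is confident in "positive" but the weak label says "negative".
    print(confidence_weighted_loss(0.9, 0.0, alpha=0.0))  # naive fine-tuning loss
    print(confidence_weighted_loss(0.9, 0.0, alpha=0.5))  # with the auxiliary term
```

In this sketch, the auxiliary term only changes the loss when the strong model disagrees with its weak supervisor; when the two agree, both terms coincide and the result equals the naive loss.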

Source:
https://arxiv.org/pdf/2312.09390.pdf

Narrated for AI Safety Fundamentals by Perrin Walker

A podcast by BlueDot Impact.

Learn more on the AI Safety Fundamentals website.
