to make personalized predictions. The core innovation is an online co-adaptation loop in which the user-summarization model and the reward model are trained simultaneously, which significantly improves reward model accuracy, particularly for heterogeneous preferences and new users. Empirical results show that PLUS is more robust and achieves zero-shot personalization of state-of-the-art proprietary models such as GPT-4, with a 72% win rate against unpersonalized responses. Because user preferences are represented as human-readable text summaries, the framework is also more transparent and interpretable.
