12:["$","$L21",null,{"data":{"isPreview":true,"seq":7374979,"episode":{"Id":"32276dc4af6d5f981f8ba4a4309de79e4670a195487770e9597781cce61e67aa","Seq":7374979,"PodId":"c2d6b50707f47c5b2af65a35314bc77065b579cc615d7f559bf53717cbc4938f","PodSeq":24594,"Title":"Multi-Turn Reinforcement Learning from Human Preference Feedback","PodName":"Best AI papers explained","Description":"

This academic paper introduces Multi-turn Preference Optimization (MTPO), an approach to Reinforcement Learning from Human Feedback (RLHF) for Large Language Models (LLMs). Unlike existing RLHF methods, which evaluate single conversational turns, MTPO targets multi-turn interactions in which feedback is given on entire conversations, so that it captures long-term goals and planning. The paper proves that MTPO converges to a Nash equilibrium in a multi-turn preference-based RL problem. Experiments in a new "Education Dialogue" environment show that MTPO and its variant, MTPO-τ, outperform single-turn baselines and traditional multi-turn RLHF in aligning LLMs with human preferences, even though they rely on a preference signal that is weaker than explicit rewards.

\n","Url":"https://podcasters.spotify.com/pod/show/ehwkang/episodes/Multi-Turn-Reinforcement-Learning-from-Human-Preference-Feedback-e35br0o","Link":"https://anchor.fm/s/1026675f8/podcast/play/105294296/https%3A%2F%2Fd3ctxlq1ktw2nl.cloudfront.net%2Fstaging%2F2025-6-10%2F403646810-44100-2-2248a2c8efee9.m4a","LinkType":"m4a","PublishTime":"$D2025-07-10T01:37:51.000Z","Img":"https://d3t3ozftmdmh3i.cloudfront.net/staging/podcast_uploaded_nologo/43252366/43252366-1744500070152-e62b760188d8.jpg","EpImg":"https://d3t3ozftmdmh3i.cloudfront.net/staging/podcast_uploaded_nologo/43252366/43252366-1744500070152-e62b760188d8.jpg","Duration":"00:17:06","Language":null,"SampleDuration":null,"IsVBR":false,"Transcribed":false,"Indexed":1,"Deleted":false,"RedirectSeq":null,"Source":null,"Size":null},"prevAndNext":{"prevSeq":7374978,"nextSeq":7374980},"states":{"state":"not-login","extra":{"summary":"Best AI papers explained - Multi-Turn Reinforcement Learning from Human Preference Feedback","previewContent":{"summary":"Best AI papers explained - Multi-Turn Reinforcement Learning from Human Preference Feedback","chapters":[],"keywords":[],"highlights":[],"transcripts":[]}}}}}]