Iterative Data Smoothing: Mitigating Reward Overfitting and Overoptimization in RLHF | Best AI papers explained | Podwise