Neel Nanda discusses how interpretability can aid in training and aligning AI models, focusing both on improving training data and on influencing training without altering the data directly. He references the CAFT paper and explores methods such as steering vectors and ablation for controlling model behavior, particularly for preventing models from learning undesirable concepts. Nanda also touches on using training data attribution to filter out problematic data points, especially in reinforcement learning, and considers using probes or LLMs to assess and filter transcripts for RL training, weighing the trade-offs among cost, trustworthiness, and effectiveness.
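To make the ablation idea concrete, here is a minimal PyTorch sketch, not the implementation from the CAFT paper, of projecting a concept direction out of one layer's activations during fine-tuning so that gradient updates cannot rely on it. The toy model, the chosen layer, and the `concept_direction` vector are all illustrative assumptions; in practice the direction might come from a difference-of-means steering vector or a trained probe.

```python
# Minimal sketch (assumed setup, not the CAFT implementation): ablate a
# hypothetical "concept direction" from a hidden layer during fine-tuning.

import torch
import torch.nn as nn

torch.manual_seed(0)

hidden_dim = 64

# Toy stand-in for a network with an intermediate layer we intervene on.
model = nn.Sequential(
    nn.Linear(32, hidden_dim),
    nn.ReLU(),
    nn.Linear(hidden_dim, hidden_dim),  # layer whose activations we ablate
    nn.ReLU(),
    nn.Linear(hidden_dim, 2),
)

# Hypothetical unit vector for the undesirable concept (illustrative only).
concept_direction = torch.randn(hidden_dim)
concept_direction = concept_direction / concept_direction.norm()

def ablate_concept(module, inputs, output):
    """Project the concept direction out of the layer's output activations."""
    coeff = output @ concept_direction            # (batch,) projection coefficients
    return output - coeff.unsqueeze(-1) * concept_direction

# Register the intervention; it now applies on every forward pass,
# including the ones used to compute fine-tuning gradients.
hook = model[2].register_forward_hook(ablate_concept)

# Ordinary fine-tuning loop on toy data; the ablation shapes what gets learned.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
x, y = torch.randn(128, 32), torch.randint(0, 2, (128,))

for step in range(50):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()

hook.remove()  # the intervention can be dropped once fine-tuning is done
```

The same hook mechanism could, in principle, add a steering vector instead of projecting one out; the choice between steering and ablation is one of the trade-offs the discussion touches on.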