Abstract:
Mathematical reasoning has become a central target of LLM post-training, where data quality depends not only on clean problem–solution pairs but also on reliable, verifiable training signals. This talk reviews the evolution of mathematical post-training data: from human-curated solutions and synthetic reasoning traces to answer-level verification, rejection sampling, and process reward models.
We then discuss recent work on reinforcement learning with verifiable rewards, where the key data unit shifts from a static problem–solution pair to a query–verifier pair. In this setting, high-quality data should be correct, learnable, sufficiently challenging, diverse, and automatically verifiable. The talk concludes with open challenges, including verifier reliability, process-level errors, difficulty selection, synthetic-data noise, and benchmark contamination.
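As a rough illustration of the query–verifier idea (not taken from the talk), the sketch below pairs a question with an automatic answer checker and uses it for rejection sampling. All names here (`QueryVerifierPair`, `exact_answer_verifier`, `rejection_sample`, `pass_rate`) are hypothetical, and the verifier is assumed to be a simple exact-match check on rational answers; real verifiers are typically more involved.

```python
from dataclasses import dataclass
from fractions import Fraction
from typing import Callable, List


@dataclass
class QueryVerifierPair:
    """Hypothetical data unit: a question plus a function that checks answers."""
    query: str
    verify: Callable[[str], bool]


def exact_answer_verifier(reference: str) -> Callable[[str], bool]:
    """Build an answer-level verifier that compares rational values (assumption)."""
    target = Fraction(reference)

    def verify(candidate: str) -> bool:
        try:
            return Fraction(candidate.strip()) == target
        except (ValueError, ZeroDivisionError):
            # Unparseable or ill-formed answers are treated as incorrect.
            return False

    return verify


def rejection_sample(pair: QueryVerifierPair, candidates: List[str]) -> List[str]:
    """Keep only the sampled answers that the verifier accepts."""
    return [c for c in candidates if pair.verify(c)]


def pass_rate(pair: QueryVerifierPair, candidates: List[str]) -> float:
    """Fraction of accepted samples; one crude proxy for problem difficulty."""
    if not candidates:
        return 0.0
    return sum(pair.verify(c) for c in candidates) / len(candidates)


pair = QueryVerifierPair("What is 1/3 + 1/6?", exact_answer_verifier("1/2"))
candidates = ["1/2", "0.5", "2/4", "1/3", "not sure"]  # stand-ins for model samples
kept = rejection_sample(pair, candidates)
print(kept, pass_rate(pair, candidates))
```

The pass rate over sampled answers is one simple way to think about difficulty selection: queries that are always or never solved carry little training signal under this kind of reward.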
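About the forum: This online forum is organized by the machine learning lab of Professor Zhihua Zhang (张志华) and is held every two weeks, except on public holidays. Each session invites a PhD student to give a systematic, in-depth introduction to a frontier topic. Topics include, but are not limited to, machine learning, high-dimensional statistics, operations research and optimization, and theoretical computer science.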