
Adaptive Layerwise Perturbation: Unifying Off-Policy Corrections for LLM RL. Preprint.
Reinforce-Ada: An Adaptive Sampling Framework for Reinforce-Style LLM Training. Preprint.
Beyond Correctness: Harmonizing Process and Outcome Rewards through RL Training. Preprint.
Self-Rewarding Correction for Mathematical Reasoning. Preprint.
Online Iterative Reinforcement Learning from Human Feedback with General Preference Model. NeurIPS 2024.
Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-Constraint. ICML 2024.
Research Intern, Amazon Customer Service Team (2025.05 - present).
Topic: LLM reasoning and agentic RL
Hosts: Zhou Yu, Ziji Zhang, Anurag Beniwal
Ph.D. in Computer Science, University of Illinois Urbana-Champaign (2024.08 - present).
Advisor: Prof. Tong Zhang.
Visiting Scholar, University of California, Los Angeles (2023.08 - 2023.12).
Host: Prof. Quanquan Gu.
MPhil in IIP (AI), The Hong Kong University of Science and Technology (2021.09 - 2024.08).
Advisor: Prof. Tong Zhang.
B.S. in Statistics, University of Science and Technology of China