Chenlu Ye

Ph.D. student
University of Illinois Urbana-Champaign, Computer Science

[Curriculum Vitae]      [Google Scholar]

I am a second-year Ph.D. student in computer science at UIUC, where I am fortunate to be advised by Prof. Tong Zhang. Prior to this, I obtained a master's degree in IIP (AI) from The Hong Kong University of Science and Technology and received a B.S. in Statistics from the University of Science and Technology of China in 2021. Additionally, I was a visiting scholar at the AGI LAB @ UCLA from August to December 2023, working with Prof. Quanquan Gu.

Research Interests

My research interests lie at the intersection of reinforcement learning for LLM post-training and decision-making problems, with a particular emphasis on reasoning in LLM post-training and multi-turn agentic RL.

If you are interested in discussing or collaborating, please feel free to contact me via email: chenluy3 AT illinois DOT edu.

Publications and Preprints

(*) denotes alphabetical order or equal contribution.

  1. Reinforcement Learning for Reasoning and Post-Training

    1. Reinforce-Ada: An Adaptive Sampling Framework for Reinforce-Style LLM Training
      Wei Xiong*, Chenlu Ye*, Baohao Liao*, Hanze Dong*, Xinxing Xu, Christof Monz, Jiang Bian, Nan Jiang, Tong Zhang, Preprint.
      Developed an adaptive-sampling framework that dynamically allocates the inference budget across prompts for online RL post-training to avoid signal elimination and increase signal diversity.

    2. Beyond Correctness: Harmonizing Process and Outcome Rewards through RL Training
      Chenlu Ye, Zhou Yu, Ziji Zhang, Hao Chen, Narayanan Sadagopan, Jing Huang, Tong Zhang, Anurag Beniwal, Preprint.
      Proposed PRocess cOnsistency Filtering (PROF) to robustly integrate noisy Process Reward Models (PRMs) with Outcome Reward Models (ORMs) in RL training through data-consistency filtering and balancing of the correct-incorrect ratio, which not only increases final outcome accuracy but also shapes the intermediate reasoning steps and improves process reasoning quality.

    3. Self-rewarding correction for mathematical reasoning
      Wei Xiong*, Hanning Zhang*, Chenlu Ye*, Lichang Chen, Nan Jiang, Tong Zhang, Preprint.
      Proposed a self-rewarding correction framework to enhance the policy model's ability to perform self-verification and correction for mathematical reasoning.

    4. Online iterative reinforcement learning from human feedback with general preference model
      Chenlu Ye*, Wei Xiong*, Yuheng Zhang*, Hanze Dong*, Nan Jiang, Tong Zhang, NeurIPS 2024.
      We study preference learning under a general preference model, without assuming the Bradley–Terry model, and propose sample-efficient algorithms for both online and offline settings, validating their efficiency theoretically and empirically.

    5. Iterative preference learning from human feedback: Bridging theory and practice for RLHF under KL-constraint
      Wei Xiong*, Hanze Dong*, Chenlu Ye*, Han Zhong, Nan Jiang, Tong Zhang, ICML 2024.
      We formulate the real-world RLHF process as a reverse-KL regularized contextual bandit and study its theoretical properties, proposing statistically efficient algorithms with finite-sample guarantees (the regularized objective is sketched in the note below). We also connect our theoretical findings with practical algorithms (e.g., DPO and RSO), offering new tools and insights for the design of alignment algorithms.
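
      For context, the reverse-KL regularized objective studied in this line of work typically takes the following form; the notation here is illustrative rather than the paper's exact convention (d_0 denotes a prompt distribution, r a reward model, \pi_0 a reference policy, and \beta > 0 a regularization coefficient):
      \max_{\pi} \ \mathbb{E}_{x \sim d_0,\ a \sim \pi(\cdot \mid x)}\left[ r(x, a) \right] - \beta\, \mathbb{E}_{x \sim d_0}\left[ \mathrm{KL}\left( \pi(\cdot \mid x) \,\|\, \pi_0(\cdot \mid x) \right) \right].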

  2. Theory of Decision-Making Problems

    1. Logarithmic Regret for Online KL-Regularized Reinforcement Learning
      Heyang Zhao*, Chenlu Ye*, Wei Xiong, Quanquan Gu, Tong Zhang, ICML 2025.
      We show that KL-regularized reinforcement learning with online exploration enjoys a logarithmic regret bound.

    2. Sharp Analysis for KL-Regularized Contextual Bandits and RLHF
      Heyang Zhao, Chenlu Ye, Wei Xiong, Quanquan Gu, Tong Zhang, NeurIPS 2025.
      We prove sharp sample-complexity bounds for KL-regularized contextual bandits and reinforcement learning from human feedback.

    3. Catoni Contextual Bandits are Robust to Heavy-tailed Rewards
      Chenlu Ye*, Yujia Jin, Alekh Agarwal, Tong Zhang, ICML 2025 (Spotlight).
      We build online contextual bandit algorithms based on Catoni's estimator from robust statistics under general function approximation and show that the regret depends only logarithmically on the reward range, for both known and unknown reward variances (Catoni's classical influence function is recalled in the note at the end of this list).

    4. Towards robust model-based reinforcement learning against adversarial corruption
      Chenlu Ye*, Jiafan He*, Quanquan Gu, Tong Zhang, ICML 2024.
      An analysis of uncertainty-aware algorithms in the model-based framework under adversarial corruption and general function approximation.

    5. Corruption-Robust Offline Reinforcement Learning with General Function Approximation
      Chenlu Ye*, Rui Yang*, Quanquan Gu, Tong Zhang, NeurIPS 2023.
      An application of the uncertainty-weighting technique to offline reinforcement learning under adversarial corruption and general function approximation. Moreover, we implement the uncertainty-weighting algorithm under various data-corruption scenarios, where it outperforms state-of-the-art baselines.

    6. Corruption-Robust Algorithms with Uncertainty Weighting for Nonlinear Contextual Bandits and Markov Decision Processes
      Chenlu Ye, Wei Xiong, Quanquan Gu, Tong Zhang, ICML 2023.
      An application of uncertainty-weighted regression under adversarial corruption and general function approximation: a new weight design and new techniques for controlling the sum of weighted bonuses (a counterpart of the elliptical potential lemma).

    7. Provably Efficient High-Dimensional Bandit Learning with Batched Feedbacks
      Jianqing Fan*, Zhaoran Wang*, Zhuoran Yang*, Chenlu Ye* (Alphabetical), Preprint.
      A batching framework for high-dimensional multi-armed bandit problems, with simulations on both synthetic and real-world data.
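
      For reference, Catoni-type estimation (used in item 3 above) replaces the empirical mean with an M-estimator built on a truncated influence function. A classical choice, recalled here only as a sketch of the general idea (the paper's exact construction and weighting may differ), is \psi(x) = \log(1 + x + x^2/2) for x \ge 0 and \psi(x) = -\log(1 - x + x^2/2) for x < 0, with the estimate \hat{\theta} defined as a root of \sum_i \psi(\alpha (X_i - \hat{\theta})) = 0 for a scale parameter \alpha > 0.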

  3. Active Learning

    1. Optimal Sample Selection Through Uncertainty Estimation and Its Application in Deep Learning
      Yong Lin*, Chen Liu*, Chenlu Ye*, Qing Lian, Yuan Yao, Tong Zhang, JMLR.
      A theoretically optimal and computationally efficient sample-selection approach that can be effectively applied to deep learning and is robust to misspecification (by down-weighting highly uncertain samples).