Johannes Ackermann

I am on the job market for postdoc or industry positions starting Fall 2026 or Spring 2027; please reach out if you have an opening!

I am a final-year PhD student at the University of Tokyo, working on Reinforcement Learning under the supervision of Masashi Sugiyama, and a part-time researcher at RIKEN AIP.

I previously interned at Sakana AI and at Preferred Networks. Prior to starting my PhD, I worked on applied ML for Optical Communication at Huawei, obtained a B.Sc. and M.Sc. in Electrical Engineering and Information Technology from the Technical University of Munich, and wrote my Master’s Thesis at ETH Zurich’s Disco Group.

I am particularly interested in the nature of “tasks” in RL, here defined as the combination of a transition function and a reward function:

  • Task Specification: In LLM post-training, tasks are often specified indirectly via reward models or via other LLMs acting as judges. We showed that reward models learned from human preferences (RLHF) need off-policy corrections [COLM1]. I am also particularly excited about our recent preprint [arXiv1], which shows that gradient regularization effectively prevents reward hacking in both RLHF and LLM-as-a-Judge settings, i.e., that it lets us cope with poorly specified rewards/tasks! I also (co-)investigated ways to aggregate rewards beyond the simple discounted sum, such as the range, min, max, or variance [RLC3]; a toy sketch of such aggregations follows this list.

  • Changing Tasks: How can we deal with tasks that change during dataset collection in Offline RL [RLC1], or with dynamics shifts during deployment [RLC2]?

  • Structure of Tasks: In Multi-Task RL, all tasks are usually treated as equally (dis)similar. I investigated how to identify and exploit relations between tasks by learning continuous task spaces [Thesis, Chapter 3] and task clusterings [ECML-PKDD1].
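As a toy illustration of the reward-aggregation idea mentioned above (a minimal sketch of the general concept, not the method of [RLC3]): the discounted sum is only one way to collapse a reward sequence into a scalar objective, and other statistics can be swapped in. All names below are illustrative.

```python
import numpy as np

def aggregate_rewards(rewards, how="discounted_sum", gamma=0.99):
    """Collapse a reward sequence into a scalar objective.

    Toy illustration only: the standard RL objective uses the discounted
    sum, but other statistics (min, max, range, variance, ...) induce
    different task specifications from the same reward signal.
    """
    r = np.asarray(rewards, dtype=float)
    if how == "discounted_sum":
        return float(np.sum(gamma ** np.arange(len(r)) * r))
    if how == "min":        # worst step along the trajectory
        return float(r.min())
    if how == "max":        # best single step
        return float(r.max())
    if how == "range":      # spread between best and worst step
        return float(r.max() - r.min())
    if how == "variance":   # volatility of the reward signal
        return float(r.var())
    raise ValueError(f"unknown aggregation: {how}")

# The same trajectory scores very differently under each objective.
traj = [1.0, 0.0, 2.0, -1.0]
for how in ["discounted_sum", "min", "max", "range", "variance"]:
    print(how, aggregate_rewards(traj, how))
```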

I’m always happy to chat about research, so feel free to reach out by e-mail or socials!

news

Jul 08, 2025 New preprint showing that Gradient Regularization prevents Reward Hacking in RLHF/RLVR
Jul 08, 2025 Off-Policy Corrected Reward Modeling for RLHF has been accepted at COLM 2025 🎉
May 10, 2025 Two papers accepted at RLC 2025 🎉

selected publications

  1. Gradient Regularization Prevents Reward Hacking in Reinforcement Learning from Human Feedback and Verifiable Rewards
    Johannes Ackermann, Michael Noukhovitch, Takashi Ishida, and Masashi Sugiyama
    Feb 2026
  2. Off-Policy Corrected Reward Modeling for Reinforcement Learning from Human Feedback
    Johannes Ackermann, Takashi Ishida, and Masashi Sugiyama
    In Conference on Language Modeling (COLM) 2025, Oct 2025
  3. Recursive Reward Aggregation
    Yuting Tang, Yivan Zhang, Johannes Ackermann, Yu-Jie Zhang, Soichiro Nishimori, and Masashi Sugiyama
    In Reinforcement Learning Conference (RLC) 2025, Aug 2025
  4. Offline Reinforcement Learning with Domain-Unlabeled Data
    Soichiro Nishimori, Xin-Qiang Cai, Johannes Ackermann, and Masashi Sugiyama
    In Reinforcement Learning Conference (RLC) 2025, Aug 2025
  5. Offline Reinforcement Learning from Datasets with Structured Non-Stationarity
    Johannes Ackermann, Takayuki Osa, and Masashi Sugiyama
    In Reinforcement Learning Conference (RLC) 2024, Aug 2024
  6. High-Resolution Image Editing via Multi-Stage Blended Diffusion
    Johannes Ackermann and Minjun Li
    In NeurIPS Machine Learning for Creativity and Design Workshop 2022, Dec 2022
  7. Unsupervised Task Clustering for Multi-Task Reinforcement Learning
    Johannes Ackermann, Oliver Richter, and Roger Wattenhofer
    In ECML-PKDD 2021, Sep 2021
  8. Reducing Overestimation Bias in Multi-Agent Domains Using Double Centralized Critics
    Johannes Ackermann, Volker Gabler, Takayuki Osa, and Masashi Sugiyama
    In Deep Reinforcement Learning Workshop at NeurIPS, Dec 2019