Open weights are a significant step forward for machine learning and artificial intelligence, but they are only the beginning. To advance open research in these fields, we also need open training frameworks that do more than execute jobs. These frameworks must make the training process visible, understandable, and modifiable, so that researchers, engineers, and practitioners can build new algorithms without fighting opaque systems.

This need inspired the creation of FeynRL (pronounced 'FineRL'), a framework for Reinforcement Learning (RL) post-training of Large Language Models (LLMs), Vision Language Models (VLMs), and agents. RL is already complex, and applying it to LLMs and VLMs makes it more so. Rollout engines, reward computation, distributed training, weight synchronization, credit assignment, and long-horizon behaviors all introduce small implementation details, and a mistake in any one of them can silently derail training.

FeynRL is built on a simple core principle: algorithms should stay separate from the systems that run them, so that researchers and practitioners can follow the complete training loop from start to finish without a large time investment. The framework makes every stage explicit: data loading, rollout generation, reward computation, loss construction, optimization, and evaluation.
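To make the idea of separating algorithm from system concrete, here is a minimal toy sketch in plain Python. It is not FeynRL's API: every name in it (`load_data`, `generate_rollout`, `compute_reward`, `reinforce_grad`, `train`, `evaluate`) is hypothetical, the "policy" is just a table of logits over two candidate answers per prompt, and the algorithm is a basic REINFORCE update with a fixed baseline, chosen only because it is short. The point is that each stage of the loop is a separate, readable function.

```python
import math
import random
from typing import Callable, Dict, List, Tuple

# Hypothetical toy task: each prompt has two candidate answers, one correct.
CANDIDATES = {"2+2=": ["5", "4"], "3+3=": ["6", "7"]}
ANSWERS = {"2+2=": "4", "3+3=": "6"}

Policy = Dict[str, List[float]]  # prompt -> logits over its candidates
GradFn = Callable[[List[float], int, float, float], List[float]]

def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# --- "System" side: data loading, rollout generation, reward computation ---
def load_data() -> List[str]:
    return list(CANDIDATES)

def generate_rollout(policy: Policy, prompt: str, rng: random.Random) -> Tuple[int, List[float]]:
    probs = softmax(policy[prompt])
    action = rng.choices(range(len(probs)), weights=probs)[0]
    return action, probs

def compute_reward(prompt: str, action: int) -> float:
    return 1.0 if CANDIDATES[prompt][action] == ANSWERS[prompt] else 0.0

# --- "Algorithm" side: loss construction (here, its gradient w.r.t. logits) ---
def reinforce_grad(probs: List[float], action: int, reward: float, baseline: float) -> List[float]:
    # loss = -(reward - baseline) * log pi(action)
    # d loss / d logit_i = -(reward - baseline) * (1[i == action] - p_i)
    adv = reward - baseline
    return [-adv * ((1.0 if i == action else 0.0) - p) for i, p in enumerate(probs)]

# --- Training loop: wires the stages together, knows nothing about the algorithm ---
def train(grad_fn: GradFn, steps: int = 50, lr: float = 0.5, seed: int = 0) -> Policy:
    rng = random.Random(seed)
    prompts = load_data()
    policy: Policy = {p: [0.0, 0.0] for p in prompts}
    for _ in range(steps):
        for prompt in prompts:
            action, probs = generate_rollout(policy, prompt, rng)
            reward = compute_reward(prompt, action)
            grads = grad_fn(probs, action, reward, 0.5)
            # Optimization: plain gradient descent on the logits.
            policy[prompt] = [w - lr * g for w, g in zip(policy[prompt], grads)]
    return policy

# --- Evaluation: greedy accuracy over all prompts ---
def evaluate(policy: Policy) -> float:
    correct = 0
    for prompt, logits in policy.items():
        best = max(range(len(logits)), key=logits.__getitem__)
        correct += int(compute_reward(prompt, best) == 1.0)
    return correct / len(policy)

if __name__ == "__main__":
    print("greedy accuracy:", evaluate(train(reinforce_grad)))
```

In this sketch, swapping `reinforce_grad` for another function with the same signature changes the algorithm without touching the rollout, reward, or evaluation code. A real post-training stack replaces the logit table with a model and the rollout function with an inference engine, but the division of responsibilities can stay the same.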

The goal is to make it easier to develop new algorithms, training recipes, reward designs, rollout strategies, and optimization methods without navigating convoluted, hidden systems. FeynRL currently includes examples for Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and RL-style post-training for both LLM and VLM setups, and it supports configurations ranging from a single GPU to multiple GPUs and clusters.

Feedback is welcome. The community is encouraged to share which parts of RL post-training infrastructure remain too opaque, too hard to debug, or too difficult to modify.