Components of RLHF systems
A typical RLHF system for LLMs consists of three main components, sketched in code after the list:
- Base language model: The pre-trained LLM to be fine-tuned
- Reward model: A model trained on human preferences to provide feedback
- Policy optimization: The process of updating the base model using the reward signal
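To make the division of labor concrete, here is a minimal, illustrative sketch of how the three components interact in one loop iteration. The names (`BaseLM`, `RewardModel`, `policy_optimization_step`) are placeholders for exposition, not the API of any real RLHF library, and the bodies are stubs rather than a working implementation.

```python
from typing import List

class BaseLM:
    """The pre-trained policy to be fine-tuned."""
    def generate(self, prompt: str) -> str:
        # Placeholder: a real model would sample a continuation token by token.
        return prompt + " <response>"

class RewardModel:
    """Scores a response according to learned human preferences."""
    def score(self, prompt: str, response: str) -> float:
        # Placeholder scalar reward; a real reward model is itself a trained
        # LLM with a scalar output head.
        return float(len(response)) / 100.0

def policy_optimization_step(policy: BaseLM, rm: RewardModel,
                             prompts: List[str]) -> float:
    """One RLHF loop iteration: sample responses, score them, and (in a
    real system) update the policy with an RL algorithm such as PPO."""
    rewards = [rm.score(p, policy.generate(p)) for p in prompts]
    # A real implementation would compute a PPO loss from these rewards
    # and backpropagate through the policy here.
    return sum(rewards) / len(rewards)

mean_reward = policy_optimization_step(BaseLM(), RewardModel(), ["Explain RLHF."])
print(f"mean reward: {mean_reward:.2f}")
```

The key structural point the sketch captures is that the reward model replaces a human rater in the loop: the policy proposes, the reward model scores, and the optimizer updates the policy toward higher-scoring behavior.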
The base language model is the starting point: a general-purpose LLM that has already undergone extensive pre-training on large corpora with self-supervised objectives such as next-token prediction. At this stage the model can generate coherent text and shows broad linguistic competence, but it is not yet aligned with human preferences, task-specific objectives, or the context-dependent behavior expected in real-world deployment. This pre-trained model is the substrate on which subsequent tuning is performed; its architecture, training regime, and scaling are well documented in the literature...
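For reference, the next-token prediction objective mentioned above reduces to a cross-entropy loss over shifted token sequences. Below is a toy sketch assuming PyTorch; the embedding-plus-linear "model" is a stand-in for a real transformer, and the random token batch stands in for tokenized training text.

```python
import torch
import torch.nn.functional as F

vocab_size, d_model = 100, 32
embed = torch.nn.Embedding(vocab_size, d_model)   # toy stand-in for a transformer
head = torch.nn.Linear(d_model, vocab_size)       # projects back to vocabulary logits

tokens = torch.randint(0, vocab_size, (1, 16))    # (batch, seq_len) of token ids
inputs, targets = tokens[:, :-1], tokens[:, 1:]   # predict each token from its prefix

logits = head(embed(inputs))                      # (batch, seq_len - 1, vocab_size)
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
print(f"next-token loss: {loss.item():.3f}")
```

Minimizing this loss over a large corpus is what produces the broad linguistic competence described above; RLHF then builds on the resulting checkpoint rather than training from scratch.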