Step 3: Reinforcement Learning from Human Feedback (RLHF)

At the beginning of this year (2025), we were surprised by the release of DeepSeek-R1, which rivals OpenAI’s o1 in reasoning ability. As many people know, LLMs generate text autoregressively: given the context so far, they predict a probable next token (roughly, a word) from their vocabulary, append it to the context, and repeat.
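To make the autoregressive loop concrete, here is a minimal sketch in Python. It is not how a real LLM is implemented: the probability table `TOY_NEXT_TOKEN_PROBS` and the function `greedy_decode` are hypothetical names invented for illustration, the numbers are made up, and the toy model only looks at the previous token, whereas a real LLM conditions on the whole context through a neural network. The sketch also assumes greedy decoding (always take the single most probable token); real systems often sample instead.

```python
# Hypothetical toy "language model": maps the previous token to a
# probability distribution over the next token. All values are made up.
TOY_NEXT_TOKEN_PROBS = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"answer": 0.7, "question": 0.3},
    "answer": {"is": 0.9, "was": 0.1},
    "is": {"4": 0.5, "5": 0.3, "</s>": 0.2},
    "4": {"</s>": 1.0},
    "5": {"</s>": 1.0},
}


def greedy_decode(probs, start="<s>", end="</s>", max_new_tokens=10):
    """Generate tokens one at a time, always taking the most probable one."""
    tokens = [start]
    for _ in range(max_new_tokens):
        dist = probs.get(tokens[-1])
        if not dist:
            # The toy table has no entry for this token, so stop.
            break
        # Greedy choice: the token with the highest probability.
        next_token = max(dist, key=dist.get)
        if next_token == end:
            break
        # Append the prediction so it becomes context for the next step.
        tokens.append(next_token)
    return tokens[1:]  # drop the start marker


print(" ".join(greedy_decode(TOY_NEXT_TOKEN_PROBS)))
```

Notice that the toy model emits "4" only because the table gives it the highest probability after "is", not because anything was computed.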

However, when an LLM is given a math problem, picking the most probable next word is not the same as working through the problem: an answer produced without actual reasoning ability can sound plausible and still be wrong. This is where Reinforcement Learning comes in.

In this step, we’ll walk through the fundamentals of Reinforcement Learning.

