(Purely human written, then translated by ChatGPT.)

I have always had some difficulties with communication and expression. There are many things in my mind that I cannot quite put into words, which is why I have always resisted writing. Whenever I write, the result often feels messy, scattered, and lacking in logic. But perhaps this is not entirely my fault. Human language itself was designed for the majority, and not everyone is neurotypical. Even if my writing is chaotic, even if many things remain inexpressible, some words will eventually click with some people.

Late at night, half asleep and half awake, I suddenly realized how to solve a problem I had failed to solve during an interview the day before. I had not had this kind of Eureka moment for a long time. When I was an undergraduate, this used to happen often: after an exam, at some completely unexpected moment, I would suddenly understand how to solve a problem I had missed, and I would feel an immense sense of joy. But later, as I entered research and began my PhD, the objectives and rewards became increasingly vague. Little by little, the joy that learning once brought me disappeared. For the past four or five years, I have hardly had the time to sit down and learn quietly. I also forgot how to design rewards for myself and optimize against them.

Compared with research, exams in school had a much clearer and more immediate reward structure. So that most students could achieve reasonably good grades, the evaluation set was often not too different from the training set. I did not even need RL (reinforcement learning). Simple SFT (supervised fine-tuning) on ground-truth answers was already enough for me to get a good score.

Research, however, is much trickier. I do not know what the true reward function is. I can only design a proxy reward using heuristics, optimize it, and then gradually adjust the reward function itself. This creates two problems. The first is reward hacking: I may achieve good-looking metrics, while the actual performance in the real world remains poor. Yet I cannot simply avoid optimizing the reward. Optimization is almost a human instinct. Every problem can be turned into an optimization problem; the only differences are the objective and the constraints. The second problem is that the true reward I receive is extremely noisy, so I do not know whether the reward function I defined is actually a good one.
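
The reward-hacking problem can be made concrete with a toy sketch. Everything in it is a hypothetical illustration rather than a model of any real research metric: a fixed budget of effort is split between real work and gaming the metric, the true reward counts only real work, and the proxy happens to pay more for gaming (the `gaming_weight` of 2.0 is an arbitrary assumption).

```python
def true_reward(real_effort: float) -> float:
    """True (unobserved) reward: only real work counts."""
    return real_effort


def proxy_reward(real_effort: float, gaming_weight: float = 2.0) -> float:
    """Proxy metric: credits real work, but credits metric-gaming even more.

    The remaining effort (1 - real_effort) is assumed to go into gaming the metric.
    """
    gaming_effort = 1.0 - real_effort
    return real_effort + gaming_weight * gaming_effort


# Candidate ways to split one unit of effort: 0.0, 0.1, ..., 1.0 spent on real work.
candidates = [i / 10 for i in range(11)]

best_for_proxy = max(candidates, key=proxy_reward)
best_for_truth = max(candidates, key=true_reward)

print(f"proxy-optimal split puts {best_for_proxy:.1f} into real work")
print(f"truth-optimal split puts {best_for_truth:.1f} into real work")
```

In this toy setup, the split that maximizes the proxy puts no effort into real work, so the metric looks best exactly where the true reward is worst.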

Recently, while preparing for interviews, I developed some new intuitions about the training pipeline of LLMs. Since I had not taken exams for years, I had forgotten many of the fundamentals, or only retained a vague impression of them. So when I tried to solve problems directly and realized that I could not do them at all, I felt extremely frustrated. Even after spending a lot of time, I still could not solve them. This felt like doing RLVR (reinforcement learning with verifiable rewards) directly on a very weak base model: when almost every attempt fails, there is hardly any reward to learn from.

At that point, SFT became necessary. I needed to first look at the correct answers, review the relevant knowledge, and memorize or imitate the right solutions. However, merely looking at the correct answers was not enough to help me achieve a high score. In particular, because my time was limited, my number of SFT steps was also limited. Far from overfitting, I could not even memorize the correct answers.

Then came RL: putting the answers away and trying to write code and solve the problems by myself. Only then did I discover many problems that never appeared when I was simply “copying from the answer.” This is very similar to LLM training. With SFT alone, we only learn how to predict the next token given the correct previous tokens. But if we need to solve the problem from scratch, there are far too many possible trajectories, and far too many places where things can go wrong. During inference, when there is no correct answer to refer to at every step, even a small mistake somewhere in the middle can cause the final result to be wrong. But with RL, we are forced to explore different paths. Sometimes we make new discoveries, and through this process, we gradually learn to generalize.
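
To see how quickly small mistakes compound over a long solution, here is a minimal sketch. It rests on a simplifying assumption made only for illustration: every step is correct independently with the same probability, and a single wrong step makes the final answer wrong. The function name and the 99% figure are hypothetical.

```python
def full_solution_success_rate(per_step_accuracy: float, num_steps: int) -> float:
    """Probability that all `num_steps` steps are correct.

    Assumes each step is independently correct with probability
    `per_step_accuracy` (an illustrative simplification).
    """
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must be between 0 and 1")
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    return per_step_accuracy ** num_steps


for steps in (1, 10, 50, 100):
    rate = full_solution_success_rate(0.99, steps)
    print(f"{steps:>3} steps at 99% per-step accuracy: {rate:.1%} fully correct")
```

Under these assumptions, 99% accuracy per step leaves only about a 37% chance of getting a 100-step solution entirely right. Practicing from scratch is what exposes those failure points.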

These reflections have given me more passion for the work I am currently doing. To be honest, I have never felt a strong attachment to computer science as a discipline. Compared with discrete things, I have always preferred continuous ones. Although I am pursuing a PhD in CS, what I actually chose was ML and AI. Compared with computers, I am more interested in studying things related to human beings, such as psychology and philosophy. These are things I can “experiment” with directly in life. Although I have never studied them systematically, life itself is the best classroom.

The emergence of LLMs has closed the gap between “what I want to do” and “what I am doing.” As a person, I feel that one eternal mission is to figure out one's own life. Curiosity about life is such a beautiful thing. No matter what I do, I hope it is centered around human beings. When I find traces of life in my work, I feel that I am not merely working; I am exploring life itself.

Many things are abstractions of life. Gradient descent, constrained optimization, and RL all reflect life in their own ways. These ideas attract me with a unique kind of beauty.