What Will Be the Next Major Breakthrough for Large Language Models?

In recent years, large models built on the Transformer architecture have made groundbreaking advances in language understanding, code generation, and mathematical reasoning, gradually revealing the early outlines of “general cognitive ability.” The younger generation (companies such as OpenAI and Anthropic) firmly believes that scaling laws plus reinforcement learning are the key to AGI, while senior scholars (such as Yann LeCun and Hinton) tend to argue that we still need more refined, interpretable world models to lift AI to the next level. The latter view may sound reasonable, but the substantial breakthroughs have come almost entirely from the former camp. The core reason is that those holding the latter view are often far from the front lines of AGI research and lack the top-tier teams needed to realize their vision; their notion of a “complex task” is less demanding than the tasks the former camp actually handles, and they lack viable paths for turning their ideas into working systems. As a result, we currently have no second option. Returning to the first path: from the perspective of next-token prediction, language models are in principle complete. So what is the biggest flaw of current large models, and is there a way to overcome it?

Let us set aside any specific test question and consider the following three representative constructive problems:

  1. Mathematics: Is it possible to reconstruct group theory without the work of Galois or Abel?
  2. Physics: Based on the physical observations and mathematical tools available before 1900, is it possible to derive quantum mechanics or relativity?
  3. Computation: Can modern computer architecture and programming systems be designed solely based on physical and engineering principles?

Undoubtedly, these three problems exceed the capabilities of current large models. The training paradigm of today’s large models means their strength lies in “structural interpolation,” whereas new human knowledge is generated by discovering innovative paths in previously unseen combination spaces. This is the most difficult ability in the development of human science, and the breakthrough that, if achieved in large models, could truly advance civilization. We believe that AutoResearch, the realization and evaluation mechanism for autonomous scientific construction, is the next key milestone.

Some may ask: since large models can already solve IMO-level math problems, why do they still fall short here? In tasks such as mathematics and code, the goal is usually an “interpolation” problem rather than true “extrapolation.” Such problems align naturally with the modeling assumptions of large models, so progress is predictable and manageable: even for problems as hard as the IMO’s, as long as the probability of sampling the correct continuation at each step is not too low, the overall search space remains tractable (a rough calculation after the list below makes this concrete). For example, on this year’s IMO problems, Gemini and GPT found the correct approach within 10 iterations for each of the first five problems, a breadth of search that human contestants may not be able to match. Truly insightful geniuses, however, do not think by exhaustive search and sampling. In principle, a pre-trained model can only best fit the known data distribution, while post-training (RLHF and the various preference-optimization methods) introduces preference modeling and state-space search, providing initial task-adaptation capability. But real scientific construction paths are extremely sparse in the training corpus: the training distribution may contain the necessary fragments of knowledge, yet almost no complete combinations. Whatever optimization strategy is used, the system must first be able to sample the key paths at all. To push past the creative boundary, we therefore still need:

  • Task construction: propose theoretical construction tasks with high structural sparsity but verifiable outcomes.
  • Sampling strategy: build a controllable sample→validate→reward loop.
  • Preference resetting: change sampling preferences through post-training to reinforce low-frequency, high-value combinations.
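To make the sparsity argument above concrete, here is a back-of-the-envelope calculation with purely illustrative numbers of our own choosing. Suppose a construction requires $k$ sequential steps and the model samples the correct continuation of each step with probability $p$. A single rollout then succeeds with probability $p^k$, and with $n$ independent rollouts,

$$P(\text{at least one success}) = 1 - \left(1 - p^k\right)^{n}.$$

For an IMO-style problem with, say, $p = 0.5$ and $k = 10$, we have $p^k \approx 10^{-3}$, so on the order of a thousand rollouts suffice. For a sparse scientific construction with $p = 0.01$ over the same $k = 10$ steps, $p^k = 10^{-20}$: no feasible number of rollouts helps, and the sampling distribution itself must change.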

This means that, during post-training, we must reset the model’s sampling preferences and frequency distributions so that it is more inclined to explore low-frequency but high-value combinations. On this basis we can build a scalable post-training framework that systematically enhances the creativity of large models. Pre-training can, at best, learn the joint distribution of existing text; for combinations never seen in the corpus, we still rely on sampling and reinforcement-learning mechanisms to search and optimize. And even though reinforcement learning greatly strengthens state-space search and gives the model more exploratory power on creative tasks, the intrinsic definition of “research value” and “research significance” remains the key challenge: even humans hold no consistent judgment about what constitutes valuable work. Some works’ value, however, has already been validated by history, and that is our best data label. A minimal toy sketch of the sample→validate→reward loop follows.
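The sketch below is a deliberately tiny stand-in, assuming a toy combination space and a categorical “policy” in place of a real model; `propose`, `verify`, and `reinforce` are hypothetical names, and the reward step is reduced to reweighting a distribution. It illustrates the shape of the loop, not an actual post-training recipe.

```python
import random
from collections import Counter

# Toy "combination space": the policy samples 2-step paths, and only one
# rare combination counts as a verified construction. All names here are
# illustrative stand-ins.
TARGET = ("d", "c")  # a low-frequency, high-value combination

# Initial preferences heavily favor the common moves, mimicking a
# pre-trained model's bias toward the corpus distribution.
weights = {"a": 10.0, "b": 10.0, "c": 1.0, "d": 0.5}

def propose(k: int) -> tuple:
    """Sample a k-step path from the current preference distribution."""
    moves, w = zip(*weights.items())
    return tuple(random.choices(moves, weights=w, k=k))

def verify(path: tuple) -> bool:
    """Stand-in for an outcome verifier (proof checker, experiment, ...)."""
    return path == TARGET

def reinforce(path: tuple, lr: float = 0.5) -> None:
    """Reward step: shift preference mass toward moves on a verified path."""
    for move in path:
        weights[move] += lr

hits = Counter()
for step in range(20_000):
    path = propose(k=2)
    if verify(path):      # validation gates the reward signal
        reinforce(path)   # preference resetting toward the rare combination
        hits["verified"] += 1

print(f"verified constructions: {hits['verified']}")
print(f"final preferences: {weights}")
```

Even in this toy, the qualitative behavior matches the argument: early hits are rare because the initial preferences suppress the target moves, and each verified hit makes subsequent exploration of that combination more likely.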

We can advance related research from the following aspects:

  • Systematically curate the history of human scientific research, with objective comparisons across research programs, into paradigms suitable for large-scale training and learning.
  • Build controlled sample spaces and knowledge-masking experiments so that evaluation can measure genuine innovation rather than mere interpolation (a sketch follows this list).
  • Construct a theoretical reconstruction benchmark to test the large model’s ability to simulate and reconstruct key scientific paths.
  • Promote the systematic training of creative sample generation mechanisms, so that models transition from generating plausible text to exploring new combinations with potential scientific value.
  • Provide theoretical support and technical pathways for the “endogenous scientific mechanism module” in AGI architectures.
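As one concrete reading of the knowledge-masking idea, the sketch below assumes a setup of our own invention: hide every document that mentions a target theory, then score how many of its key claims a model can reconstruct from what remains. `MaskedBenchmark`, `generate`, and `check` are hypothetical interfaces, not any existing library’s API.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class MaskedBenchmark:
    masked_terms: list[str]   # e.g. ["Galois", "solvable group"]
    target_claims: list[str]  # propositions an external verifier can check

def mask_corpus(corpus: Iterable[str], terms: list[str]) -> list[str]:
    """Drop every document that mentions any masked term."""
    lowered = [t.lower() for t in terms]
    return [doc for doc in corpus
            if not any(t in doc.lower() for t in lowered)]

def evaluate(bench: MaskedBenchmark,
             corpus: list[str],
             generate: Callable[[list[str], str], str],
             check: Callable[[str, str], bool]) -> float:
    """Fraction of target claims reconstructed from the masked corpus."""
    visible = mask_corpus(corpus, bench.masked_terms)
    ok = sum(check(generate(visible, claim), claim)
             for claim in bench.target_claims)
    return ok / len(bench.target_claims)

if __name__ == "__main__":
    # Trivial smoke test with stub "model" and verifier.
    corpus = ["permutations compose associatively", "Galois theory says ..."]
    bench = MaskedBenchmark(masked_terms=["Galois"],
                            target_claims=["A5 is simple"])
    print(evaluate(bench, corpus,
                   generate=lambda docs, claim: claim,   # echo stub
                   check=lambda answer, claim: answer == claim))
```

The point of the design is that the verifier, not textual overlap with the corpus, decides success, so a high score cannot be achieved by interpolation over masked documents.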

To realize such a path, top-tier researchers with good scientific taste and top AI engineers must collaborate closely to build this system systematically. Simple workflows and agent research merely continue empirical fitting on top of the current interpolation strategies; to put it more radically, none of the existing agent explorations amounts to an essential breakthrough. As noted above, the core difficulty agents face is identical to the one large models face: if the current process can be achieved through reinforcement learning, then improving the large model itself is enough; if it cannot, the problem lies in sampling, and no combination of workflows and specialization will break through it. It follows that optimizing workflows through agents cannot break the bottleneck. The effectiveness of agents therefore rests on engineering the information flow, not on agents enhancing intelligence, and agent research will not contribute substantially to advancing the intelligence of large models themselves.

So, will current AI for Science achieve substantial scientific breakthroughs? To put it bluntly, the only truly valuable result AI for Science has produced so far is AlphaFold, which at its core performs local sampling of the folding process together with energy-function optimization (a toy illustration of this pattern follows). Given the data richness and mechanistic consistency of protein-folding research itself, the problem AI had to solve reduced to sampling and optimization, which is why progress was possible. For applications in other fields, we must first ask why AI is suited to the task; if discovering the new knowledge or mechanism is itself too complex, the task is not suitable for AI. Hence, for AI for Science as a whole, advancing the entire field is equivalent to improving AI’s cognitive capabilities and has no inherent connection to any particular science; for a specific research problem, the criteria above should apply. In summary, current AI for Science is AI applied to local optimization and will not bring essential progress from 0 to 1. True breakthroughs in AI for Science still depend on breakthroughs in the science of AI itself; leaning on the “for” will only yield quantitative change.
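To show what “sampling plus energy optimization” means in the simplest possible terms, here is a generic Metropolis sampler over a made-up one-dimensional energy landscape. It is a textbook toy of our own choosing, not AlphaFold’s method, and `energy` is an arbitrary illustrative function.

```python
import math
import random

def energy(x: float) -> float:
    """Rugged illustrative energy with a global minimum near x = 2."""
    return (x - 2.0) ** 2 + 0.5 * math.sin(8.0 * x)

def metropolis(steps: int = 50_000, temp: float = 0.3) -> float:
    x, best = 0.0, 0.0
    for _ in range(steps):
        cand = x + random.gauss(0.0, 0.2)   # local sampling move
        d_e = energy(cand) - energy(x)
        # Metropolis rule: always accept downhill, sometimes uphill.
        if d_e < 0 or random.random() < math.exp(-d_e / temp):
            x = cand
        if energy(x) < energy(best):
            best = x
    return best

if __name__ == "__main__":
    x_star = metropolis()
    print(f"approximate minimizer: {x_star:.3f}  energy: {energy(x_star):.3f}")
```

Tasks with this shape, where a trustworthy objective exists and progress is a matter of searching it efficiently, are exactly the ones the paragraph above argues AI is suited to.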

Thus, we boldly assert here that AutoResearch is the stepping stone to AGI. Once this step is truly taken, artificial intelligence will have set out on its path of self-evolution, and that moment will mark the beginning of a new chapter in human civilization.

Shi Gu (顾实)
Tenured Associate Professor of Computer Science
