Brain and Intelligence Lab

What Will Be the Next Major Breakthrough for Large Language Models?

Shi Gu, Published on July 29, 2025


In recent years, large models based on the Transformer architecture have made groundbreaking advances in language understanding, code generation, and mathematical reasoning, gradually revealing the early outlines of "general cognitive ability." The younger generation (companies such as OpenAI and Anthropic) firmly believes that Scaling Law + Reinforcement Learning is the key to AGI, while senior scholars (such as Yann LeCun and Hinton) hold that we still need more refined, interpretable world models to lift AI to the next level. The latter view may sound reasonable, but the substantial breakthroughs have come almost entirely from the former. The core reason is that those holding the latter view are usually far from the front lines of AGI research and lack the top teams needed to realize their vision; their notion of "complex tasks" is less demanding than the tasks the former group actually handles, and they lack viable paths for turning their ideas into workable rules. As a result, we currently have no second option.

Returning to the first path: viewed purely as next-token prediction, language modeling is already complete. So what is the biggest flaw in current large models, and is there a way to overcome it?

Let us set aside any specific test question and consider the following three representative constructive problems:

  1. Mathematics: Is it possible to reconstruct group theory without the theories of Galois or Abel?
  2. Physics: Based on the physical observations and mathematical tools available before 1900, is it possible to derive quantum mechanics or relativity?
  3. Computation: Can modern computer architecture and programming systems be designed solely based on physical and engineering principles?

Undoubtedly, these three problems exceed the capabilities of current large models. The way today's large models are trained makes their strength "structural interpolation," whereas new human knowledge is generated by discovering innovative paths in previously unseen combinatorial spaces. This is the most challenging ability in the development of human science, and the breakthrough that, if achieved in large models, could truly advance civilization. We believe that AutoResearch, a mechanism for realizing and evaluating autonomous scientific construction, is the next key milestone.

Some may ask: since large models can already solve IMO-level math problems, where do their limits actually lie? In tasks like mathematics and code, the goal is usually an "interpolation" problem rather than true "extrapolation." Such problems align naturally with the modeling assumptions of large models, so progress is predictable and manageable: even for problems as hard as the IMO's, as long as the probability of sampling a correct step is not too low, the overall search space stays tractable (the back-of-envelope comparison after the list below makes this concrete). For example, on this year's IMO, Gemini and GPT found the correct approach to the first five problems within 10 iterations, a breadth of search that human contestants can hardly match. Truly insightful geniuses, however, do not think by exhaustive search and sampling.

In principle, a pre-trained model can at best fit the known data distribution, while post-training (RLHF and the various preference-optimization methods) introduces preference modeling and state-space search, providing initial task-adaptation abilities. Real scientific construction paths, however, are extremely sparse in the training corpus: the training distribution may contain the necessary fragments of knowledge, but almost never their complete combination. Whatever optimization strategy is used, its premise is that the system can sample the key paths at all. To break through the creative boundary, then, we still need:

  • Task construction: propose theory-construction tasks that are structurally sparse yet have verifiable outcomes.
  • Sampling strategy: build a controllable sample → validation → reward loop (sketched in code below).
  • Preference resetting: reshape sampling preferences through post-training to reinforce low-frequency, high-value combinations.
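
To see why the per-step sampling probability is decisive, here is a back-of-envelope comparison (all numbers are illustrative assumptions, not measurements of any model). When intermediate steps can be verified, failed steps can be retried independently, so expected cost grows roughly linearly with path length; when only the finished construction can be checked, the full path must be hit in one go, and the cost grows exponentially.

```python
# Illustrative arithmetic only: p_step and n_steps are assumed numbers.

def expected_samples_stepwise(p_step: float, n_steps: int) -> float:
    """Each intermediate step can be verified, so failed steps are simply
    retried: expected cost grows linearly with path length (n / p)."""
    return n_steps / p_step

def expected_samples_end_to_end(p_step: float, n_steps: int) -> float:
    """Only the finished construction can be verified, so a complete path
    must be sampled in one go: expected cost grows exponentially (1 / p**n)."""
    return 1.0 / (p_step ** n_steps)

# An IMO-style proof: moderate per-step hit rate, checkable intermediate steps.
print(expected_samples_stepwise(0.3, 10))     # ~33 attempts
# A sparse scientific construction: tiny per-step hit rate, no step-wise oracle.
print(expected_samples_end_to_end(0.01, 10))  # 1e20 attempts
```

On these assumed numbers, the interpolation regime costs tens of samples while the sparse-construction regime is astronomically out of reach; that gap is exactly what the three requirements above target.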

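The sample → validation → reward loop can be summarized in Python. Every name here (`propose`, `verify`, `likelihood`, `update_policy`) is a hypothetical placeholder: the post specifies no concrete interface, so this is a sketch of the control flow under those assumptions, not an implementation.

```python
# Minimal sketch of the sample -> validation -> reward loop.
# All method names are hypothetical placeholders.

def autoresearch_loop(model, task, n_rounds: int, n_samples: int):
    for _ in range(n_rounds):
        candidates = [model.propose(task) for _ in range(n_samples)]
        # Preference resetting: a candidate earns reward only if it passes
        # the verifier, and rarer candidates (lower likelihood under the
        # current policy) earn more, reinforcing low-frequency,
        # high-value combinations.
        rewarded = [
            (cand, 1.0 / max(model.likelihood(cand), 1e-9))
            for cand in candidates
            if task.verify(cand)
        ]
        model.update_policy(rewarded)  # shift sampling preferences
    return model
```
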
We can advance this research along the following fronts:

  • Systematically distill human research practice, and objective comparisons between research programs, into paradigms usable for large-scale training and learning.
  • Build controlled sample spaces and knowledge-masking experiments so that evaluation can target genuine innovation rather than mere interpolation (a minimal sketch follows this list).
  • Construct a theory-reconstruction benchmark to test a model's ability to retrace and rebuild key scientific paths.
  • Promote systematic training of creative sample-generation mechanisms, so that models move from producing plausible text to exploring new combinations of potential scientific value.
  • Provide theoretical support and technical pathways for the "endogenous scientific mechanism module" in AGI architectures.
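
As one concrete reading of the knowledge-masking item above: strip every training document that mentions a target theory (Galois theory is used here purely as an example), then test whether a model trained on the masked corpus can reconstruct the theory from its prerequisites alone. The sketch below is a minimal illustration; `TARGET_TERMS`, the token-overlap scoring, and the `model.generate` call are all assumptions, and a serious benchmark would need a formal verifier or expert grading.

```python
# Minimal knowledge-masking sketch. TARGET_TERMS, the overlap metric,
# and model.generate are illustrative assumptions, not a real protocol.

TARGET_TERMS = {"galois", "solvable group", "field extension"}

def mask_corpus(corpus):
    """Yield only documents that never mention the masked concept."""
    for doc in corpus:
        text = doc.lower()
        if not any(term in text for term in TARGET_TERMS):
            yield doc

def token_overlap(attempt: str, reference: str) -> float:
    """Crude proxy score: fraction of reference tokens that reappear."""
    got = set(attempt.lower().split())
    want = set(reference.lower().split())
    return len(got & want) / max(len(want), 1)

def reconstruction_score(model, prerequisites: str, reference: str) -> float:
    """Prompt a masked-corpus model with prerequisites only, and grade
    how much of the reference construction it recovers."""
    attempt = model.generate(f"Given: {prerequisites}\nDerive the theory.")
    return token_overlap(attempt, reference)
```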

To travel this path, top-tier researchers with good scientific taste and top AI engineers must collaborate closely to build the system end to end. Simple workflows and agent research merely continue empirical fitting on top of today's interpolation strategies; to put it more bluntly, none of the existing agent explorations constitutes an essential breakthrough. As argued above, agents face exactly the same core difficulty as large models: if a process can be learned through reinforcement learning, then improving the base model is enough; if it cannot, the problem lies in sampling, and no combination of workflows and specialization will crack it. It follows that optimizing workflows through agents cannot break the bottleneck. The value of agents therefore lies in how they route information, not in adding intelligence; agent research on its own will not substantially advance the intelligence of the underlying models.

We therefore make a bold claim: AutoResearch is the stepping stone to AGI. When this barrier is truly broken, artificial intelligence will have set out on a path of self-evolution, and that moment will mark the opening of a new chapter in human civilization.

