March 17, 2026 Multi-Modal Understanding / Training / Model Architecture

Research notes about agi

A detailed reading note on Kimi K2.5, native multimodal training, and parallel agent orchestration.

Kimi K2.5 Deep Outline

The most important claim in the Kimi K2.5 report is not simply that it is “a stronger model.” It is trying to validate a broader direction: native multimodal joint training plus parallel agent orchestration is a more promising path than “language-first backbone + late vision add-on + sequential tool calling.”

1. What problem is this report actually trying to solve?

  • If the future of large models is about doing work, not just answering questions, then longer chains of thought and more tool calls are not enough.
  • Two bottlenecks show up quickly:
  • Multimodal ability is often treated as an attachment rather than a native capability.
  • Most agent systems still run in a largely sequential manner, which increases latency, context clutter, and fragility as tasks become more complex.
  • Kimi K2.5 proposes two answers:
  • In training, optimize text and vision together from pre-training through reinforcement learning.
  • In execution, teach the system to decompose tasks, spawn sub-agents, and run work in parallel.

2. The two central claims of the report

2.1 Multimodality should be trained natively, not patched in later

  • The report pushes back against a common assumption: first build a strong language model, then inject a large amount of vision data late in training.
  • Their conclusion is the opposite of what many people expect: with a fixed token budget, earlier fusion with a lower vision ratio works better.
  • That implies multimodal ability is not just a late-stage extension. It depends on early shared representation learning between text and vision.
  • The deeper point is that text and vision should not be viewed as two separate systems stitched together. They should be treated as different input channels feeding one reasoning system.

2.2 Agent performance should scale through orchestration, not just longer sequential reasoning

  • The report argues that most current agent systems are still dominated by sequential tool execution.
  • Sequential execution is acceptable for short tasks, but much weaker for broad search, multi-source verification, long document processing, and real software workflows.
  • Kimi K2.5 introduces Agent Swarm, where a trainable orchestrator decides:
  • whether to parallelize,
  • how to split the task,
  • how many sub-agents to create,
  • and how to aggregate their outputs.
  • This is not presented as a hand-written workflow. It is framed as a learned policy optimized with reinforcement learning.

3. The three most important multimodal ideas

3.1 Early fusion with a lower vision ratio

  • One of the most interesting experimental results is that a higher vision ratio does not automatically produce better multimodal ability.
  • Under a fixed total token budget, early fusion + lower vision ratio beats later and more aggressive vision injection.
  • This suggests that the key variable is not simply “more vision data,” but how early and how consistently the model learns shared multimodal structure.

3.2 Zero-Vision SFT

  • This is one of the strongest methodological ideas in the report.
  • Kimi proposes a post-training stage called zero-vision SFT, where they use only text SFT data to activate visual reasoning and visual tool use.
  • Their claim is especially notable because they also report that manually designed visual trajectories actually hurt generalization.
  • If this finding holds broadly, it has an important implication:
  • once joint pre-training is strong enough, the model may not need large amounts of expensive, brittle, manually curated visual SFT data.
  • In this view, text supervision can serve as an activator for visual capability rather than a separate training track.

3.3 Joint multimodal RL

  • Kimi then applies reinforcement learning jointly across text and vision tasks.
  • A useful design choice here is that they organize RL by capability domains rather than by modality.
  • In other words, they do not create one expert track for text and another for vision. They group training around abilities such as knowledge, reasoning, coding, and agent behavior.
  • This increases the chance that a gain in one modality transfers into the same underlying capability in another modality.
  • The strongest evidence they present is that visual RL improves text benchmarks:
  • MMLU-Pro: 84.7 -> 86.4
  • GPQA-Diamond: 84.3 -> 86.4
  • LongBench v2: 56.7 -> 58.9
  • That is a strong signal that visual training is not merely a cost center for language performance. It may improve calibration on structure-heavy tasks such as counting, OCR-like extraction, and grounded reasoning.

4. What is actually smart about Agent Swarm?

4.1 It is not “many agents”; it is a trainable coordinator

  • The architecture is simple in concept:
  • one trainable orchestrator,
  • many frozen sub-agents.
  • The essential capability is not the number of agents. It is whether the orchestrator can learn the right decomposition and scheduling policy.
  • This is an important distinction because many multi-agent systems are just parallel copies of the same model without a real coordination mechanism.

4.2 Why the sub-agents are frozen

  • The report explicitly avoids end-to-end joint optimization of all agents.
  • Their reason is practical and convincing:
  • credit assignment ambiguity,
  • and training instability.
  • In a multi-agent environment, final rewards are sparse and noisy. A successful outcome does not mean every sub-agent behaved well, and a failed outcome does not mean every sub-agent behaved badly.
  • So they freeze the sub-agents and treat their outputs as part of the environment, while only training the orchestrator.
  • This is a strong engineering decision. It prioritizes stable convergence over conceptual purity.

4.3 The PARL reward design is worth paying attention to

  • Their parallel-agent RL setup uses three reward components:
  • r_perf: task-level outcome quality,
  • r_parallel: encouragement to actually explore parallel instantiation,
  • r_finish: whether assigned sub-tasks are completed effectively.
  • This design targets two specific failure modes:
  • serial collapse: the system defaults back to a single-agent sequential pattern,
  • spurious parallelism: the system creates many sub-agents just to inflate parallel metrics without useful decomposition.
  • The auxiliary rewards are annealed away over time, which keeps the final objective tied to actual task success rather than to proxy behavior.

4.4 “Critical steps” is one of the most important ideas in the whole paper

  • Instead of measuring cost by total steps, the report defines critical steps, which approximate the length of the critical path in a parallel computation graph.
  • This is a better fit for real user experience because users care about end-to-end waiting time, not about how much total internal work the system performed.
  • The implication is important:
  • useless parallelism does not help,
  • and the real goal is to shorten the longest branch.
  • That is a much more mature target than simply maximizing concurrency.

5. A deeper view of context management

  • The report makes a useful distinction between two ways to handle long-horizon tasks.
  • Traditional methods rely on context truncation:
  • summarize history,
  • hide tool results,
  • or discard parts of the trace when the context gets too large.
  • Agent Swarm instead behaves more like context sharding.
  • Long tasks are broken into semantically isolated sub-tasks.
  • Each sub-agent works with its own bounded local context.
  • Only task-relevant outputs are returned to the orchestrator.
  • This matters because it reduces contamination of the global context and preserves structure better than reactive compression.
  • The broader insight is that context management does not always have to mean compressing history. It can also mean designing the system so unnecessary history never enters the shared global context in the first place.

6. Important training and systems details

  • Kimi K2.5 is built on top of Kimi K2, a very large MoE foundation model.
  • The report says K2.5 uses roughly 15T mixed text and visual tokens in joint pre-training.
  • Its visual encoder is MoonViT-3D, which supports native-resolution image processing and a unified image-video representation space.
  • A lightweight temporal compression design lets it process videos up to roughly 4x longer within the same context window.
  • The model supports a 256k context window.
  • On the systems side, they introduce DEP (Decoupled Encoder Process), which separates parts of the vision encoder workflow from the main backbone pipeline to reduce multimodal load imbalance.
  • Another practically important method is Toggle:
  • it alternates between budget-constrained reasoning and unconstrained test-time scaling during RL.
  • The reported effect is a 25% to 30% reduction in output tokens with little performance loss.
  • This is a strong sign that they are not only optimizing for capability ceilings, but also for usable inference efficiency.

7. How to read the results without overstating them

7.1 Where Kimi K2.5 looks genuinely strong

  • It performs very strongly on math and reasoning tasks, including AIME 2025: 96.1.
  • On tool-enabled HLE-Full, it reaches 50.2, which is one of the most meaningful numbers in the report.
  • It is also strong on software engineering, including SWE-Bench Verified: 76.8 and LiveCodeBench v6: 85.0.
  • Agent search is a clear highlight:
  • BrowseComp: 60.6
  • BrowseComp with context management: 74.9
  • BrowseComp with Agent Swarm: 78.4
  • WideSearch improves from 72.7 to 79.0 under Agent Swarm, with reported wall-clock savings of 3x to 4.5x.
  • OCR, document understanding, and video understanding are also strong, including OCRBench: 92.3 and VideoMMMU: 86.6.

7.2 Where caution is still needed

  • The model is not uniformly the best on every benchmark in the table.
  • On some tasks, GPT-5.2, Gemini 3 Pro, or Claude Opus 4.5 still lead.
  • That means the report is most convincing as evidence for a systems direction, not as proof that Kimi K2.5 dominates every frontier capability.
  • The strongest takeaway is not “it wins everything.”
  • The stronger takeaway is “this training-and-execution design appears to work.”

8. The real conceptual upgrades from this report

  • Multimodal ability should not be interpreted as “adding eyes to an LLM.” It should be treated as shared representation learning across multiple input channels.
  • High-quality joint pre-training may matter more than large volumes of handcrafted visual SFT.
  • Visual capability is not necessarily a trade-off against language ability. It may actively improve language reasoning quality.
  • The scaling limit of agents is not just longer chains of thought. It is better decomposition, better scheduling, and better control of the critical path.
  • The most advanced form of context management may be architectural separation rather than ever more aggressive compression.
  • The value of a multi-agent system is not that it looks sophisticated. The value is that it makes hard tasks more executable under real latency and context constraints.

9. Product and strategy implications

  • If you are building multimodal models, think about unified training from pre-training onward rather than treating vision as an adapter layer.
  • If you are building agent products, do not assume sequential execution should remain the default for all hard tasks.
  • Agent Swarm-like designs are especially suitable when tasks involve:
  • wide information search,
  • relatively independent sub-problems,
  • multi-source verification,
  • long document reading,
  • large file or artifact handling,
  • or very long-form generation.
  • If you want a trainable multi-agent architecture, the report offers a practical template:
  • keep the workers stable,
  • train the orchestrator,
  • and optimize for coordination first.
  • If you care about context management, do not only think in terms of summarization and truncation. Think in terms of structural task partitioning.
  • If you care about inference efficiency, optimize not only token count but also critical-path latency.

10. Limits and reservations

  • Some reported scores rely on internal re-evaluation, so full external replication still matters.
  • Agent Swarm is most compelling on tasks that naturally benefit from parallel decomposition. That does not mean every task should become multi-agent.
  • The in-house swarm benchmark is useful, but like all internal benchmarks, it is more favorable to the author’s design choices than a fully neutral external benchmark would be.
  • The report is persuasive on systems design, but actual deployment will still face hard constraints in cost, orchestration complexity, and stability.
  • Open-sourcing the post-trained checkpoint is valuable, but reproducing the full data quality and training process remains difficult.

11. One-sentence conclusion

The real importance of Kimi K2.5 is not a few benchmark wins. It is that the report makes a broader case that the next serious leap in model capability may come from native multimodality, trainable parallel orchestration, and optimization for critical-path execution rather than from single-model reasoning alone.

Discussion

Configure Giscus in src/data/site.ts with your repository and category IDs to enable comments.