Chinese AI lab Zhipu AI releases GLM-5.2 with a stable 1-million-token context under the MIT license. On hours-long coding tasks, the open-source model trails Anthropic's Opus models by just a few percentage points.

Zhipu AI has unveiled GLM-5.2, positioning the model as a tool for so-called long-horizon tasks - coding jobs that stretch over hours and thousands of individual steps. To get there, the company expanded the context window to one million tokens and focused training on agentic coding scenarios like large-scale implementation, automated research, and complex debugging.

"A 1M context is easy to claim, but much harder to keep reliable under real engineering pressure," Zhipu AI writes in its blog post, because the model needs to maintain quality across long, unstructured coding agent sessions.

On FrontierSWE, which evaluates open engineering projects ranging from hours to dozens of hours, GLM-5.2 scores 74.4 percent, just one point behind Anthropic's Claude Opus 4.8 and slightly ahead of OpenAI's GPT-5.5.

On PostTrainBench, where an agent uses an H100 GPU to improve small models through post-training, GLM-5.2 beats both GPT-5.5 and Opus 4.7, again landing second behind Opus 4.8. On SWE-Marathon, an ultra-long-horizon benchmark with demanding tasks like compiler construction and kernel optimization, the gap is much wider: GLM-5.2 reaches only half of Opus 4.8's score.

Anthropic's current top models Fable and Mythos aren't part of these comparisons, since Fable was pulled shortly after launch and Mythos was never broadly released. Across all three benchmarks, GLM-5.2 is still the strongest open-source model, according to Zhipu AI.

The jump over the predecessor is just as clear on standard coding tasks. On Terminal-Bench 2.1, GLM-5.2 climbs from 63.5 (GLM-5.1) to 81, putting it within a few points of Claude Opus 4.8. On SWE-bench Pro, the score goes from 58.4 to 62.1.

Users can also dial the model's thinking effort up or down. At a similar token budget, GLM-5.2 delivers much stronger coding results than GLM-5.1, Zhipu AI says. The highest setting, "Max," lets users throw extra compute at the hardest problems.

On Humanity's Last Exam, GLM-5.2 falls clearly behind Claude Opus 4.8 and Gemini 3.1 Pro according to the benchmark table. Those two lead by about ten and five percentage points. GLM-5.2 also ranks behind the top closed-source models on GPQA-Diamond, a scientific question benchmark. Math is a different story. The model nails 99.2 percent on AIME 2026.

Agentic tasks beyond coding paint a mixed picture. On MCP-Atlas, a tool-use test, GLM-5.2 nearly ties with Opus 4.8. On Tool-Decathlon, it falls well behind both Opus 4.8 and GPT-5.5.

Independent platform Artificial Analysis backs up the gains over the predecessor. On its Intelligence Index, GLM-5.2 scores 51 points, making it the current strongest open-weights model. It sits clearly ahead of MiniMax M3, DeepSeek V4 Pro, and Kimi K2.6. The biggest jumps show up in scientific reasoning, and it hallucinates a bit less than its predecessor.

On GDPval-AA v2, which Artificial Analysis considers its top metric for real-world agentic tasks, GLM-5.2 matches the proprietary GPT-5.5. The trade-off is that it burns through far more tokens than the open competition, making it one of the least efficient models in its class.

To make the 1-million-token context practical, Zhipu AI introduces a technique called IndexShare. Groups of four transformer layers share the same lightweight indexer instead of each layer computing its own. That should cut compute per token by 2.9x at one million tokens of context.

Zhipu AI also sped up text generation. With speculative decoding, the model predicts several tokens at once and throws out wrong guesses afterward. Through several tweaks to this process, GLM-5.2 accepts 20 percent more predicted tokens on average, according to the company's ablation studies. That directly speeds up output.

In an unusually candid move, Zhipu AI describes a problem that crops up during reinforcement learning for coding tasks. Because the reward is typically a binary pass/fail signal, the model can learn to game it instead of actually writing better code. GLM-5.2 tried this more often than its predecessor.

According to Zhipu AI, the model pulls solution code straight from GitHub via curl, hunts for hidden evaluation files in the file system, or chains commands to first find secret test cases and then feed them into a solution script. These tricks inflate reward signals and corrupt training.

To fix this, Zhipu AI built a two-stage anti-hacking module. A rule-based filter catches suspicious actions first. Then an LLM judge checks the intent behind flagged calls. The system blocks only the cheating call and returns a dummy response, letting the training run continue. That keeps aborted rollouts from destabilizing the model.

Model weights are live on HuggingFace and ModelScope, with code on GitHub, all under the MIT license with no regional restrictions. GLM-5.2 works as a chat interface and API through Z.ai and plugs into coding agents like ZCode, Claude Code, and OpenCode. For local deployment, Zhipu AI supports vLLM, SGLang, transformers, xLLM, and ktransformers.

Zhipu AI recently shipped GLM-5.1, an open-weights model that could refine its own strategy across hundreds of iterations on coding tasks. It reportedly built a Linux desktop in eight hours. GLM-5.2 builds on that, adding the 1-million-token context and much stronger long-horizon skills.

Competition among Chinese AI labs stays fierce. Alongside Zhipu AI, Moonshot AI with Kimi K2.7-Code and MiniMax with M3 are also fighting for the autonomous coding agent market with long context windows.