Google Gemini 3.1 Pro Scores 77% on ARC-AGI-2. Here Is What That Means.
Google's latest model introduces three-tier adjustable reasoning that lets enterprises balance accuracy against cost on every API call.
A Benchmark Score Worth Understanding
Google DeepMind has released Gemini 3.1 Pro with a headline number: 77.1% on ARC-AGI-2, a benchmark designed to measure genuine reasoning ability rather than pattern matching. For context, GPT-4 scored below 5% on the original ARC-AGI at that model's launch, and the benchmark has gotten harder since then, not easier.
More significant than the score itself is the architectural feature that produced it. Gemini 3.1 Pro ships with three-tier adjustable reasoning, letting developers control how much compute the model applies to each request. Google calls the lightest tier "Deep Think Mini," a mode optimized for speed and cost on straightforward queries while preserving the option to engage full reasoning on complex tasks.
Why Adjustable Reasoning Matters for Enterprise
Every enterprise AI deployment faces the same tension: accuracy costs money. Running maximum reasoning on every API call produces the best outputs but generates unsustainable inference bills at scale. Running minimal reasoning cuts costs but introduces errors on tasks that require genuine analysis.
Gemini 3.1 Pro addresses this with a dial rather than a switch. The three tiers map to distinct use cases. The lightest tier handles classification, extraction, and simple summarization at high throughput and low cost. The middle tier manages tasks like document comparison, code review, and structured analysis. The full reasoning tier tackles multi-step problems, strategic planning, and novel situations where the model needs to work through ambiguity.
For FinOps teams managing AI budgets, this is the first major model to offer granular cost control at the API level. Instead of choosing between a cheap model that fails on hard tasks and an expensive model that wastes compute on easy ones, enterprises can route requests dynamically based on complexity.
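As a sketch of what such dynamic routing could look like, here is a minimal Python router. The tier names, the `requires_planning` flag, and the keyword heuristic are all illustrative assumptions for this article, not part of any published Gemini API; a production system would replace the keyword rules with a trained complexity classifier.

```python
# Hypothetical sketch: route each request to a reasoning tier based on a
# crude complexity heuristic. Tier names and rules are assumptions, not
# Google's published API.
from enum import Enum

class ReasoningTier(Enum):
    MINI = "deep_think_mini"   # classification, extraction, summarization
    STANDARD = "standard"      # document comparison, code review
    FULL = "full_reasoning"    # multi-step, ambiguous problems

def route(prompt: str, requires_planning: bool = False) -> ReasoningTier:
    """Pick a tier from coarse signals.

    A real router would use a trained classifier; keyword matching is
    only here to make the control flow concrete.
    """
    if requires_planning:
        return ReasoningTier.FULL
    analytic_markers = ("why", "compare", "trade-off", "review")
    if any(marker in prompt.lower() for marker in analytic_markers):
        return ReasoningTier.STANDARD
    return ReasoningTier.MINI
```

The point of the sketch is the shape of the decision, not the heuristic: the router runs before the model call, so the expensive tier is only engaged when the classifier says it is needed.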
The FinOps Advantage
Google has not published exact pricing tiers for Gemini 3.1 Pro's reasoning levels, but early-access partners report cost reductions of 40-60% on mixed workloads compared to running full reasoning uniformly. The savings come from the distribution of real-world enterprise queries: most requests are straightforward, and only 10-15% require deep analysis.
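To see how a 40-60% reduction could fall out of that distribution, here is the blended-cost arithmetic under assumed relative prices, with full reasoning normalized to 1.0. Both the per-tier rates and the traffic shares below are illustrative guesses, since Google has not published pricing:

```python
# Illustrative arithmetic only: per-tier prices are assumptions relative
# to full reasoning = 1.0, and the traffic mix is a hypothetical
# enterprise workload, not measured data.
TIER_PRICE = {"mini": 0.3, "standard": 0.6, "full": 1.0}
TRAFFIC = {"mini": 0.70, "standard": 0.18, "full": 0.12}  # share of queries

# Average cost per query when requests are routed by tier.
blended = sum(TIER_PRICE[t] * TRAFFIC[t] for t in TIER_PRICE)

# Savings versus running full reasoning (price 1.0) on every query.
savings = 1.0 - blended
```

With these assumed numbers the blended cost comes to about 0.44 per query, a saving of roughly 56% versus uniform full reasoning, which sits inside the reported 40-60% range.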
This pricing architecture creates a competitive advantage that goes beyond raw capability. Anthropic's Claude models and OpenAI's GPT series currently offer single-tier reasoning. To match Gemini's cost efficiency, they would need to ship equivalent adjustable reasoning features or accept margin compression on enterprise contracts.
Deep Think Mini and the Speed Tradeoff
The lightest reasoning tier, Deep Think Mini, targets latency-sensitive applications. Google reports sub-200-millisecond response times for simple queries, making it viable for real-time applications like customer-facing chatbots, search augmentation, and inline code completion.
The tradeoff is explicit: Deep Think Mini sacrifices accuracy on complex reasoning tasks. Google's published benchmarks show a 15-20 percentage point drop on multi-step reasoning problems compared to full reasoning mode. For enterprises, this means building routing logic that accurately classifies query complexity before selecting a reasoning tier.
Getting that routing wrong in either direction is expensive. Over-classifying simple queries as complex wastes compute. Under-classifying complex queries as simple produces unreliable outputs that erode user trust.
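That asymmetry can be made concrete with a toy expected-cost model. Every number here (the relative compute costs, the error penalty, the share of simple queries) is an assumption chosen for illustration, not a measured value:

```python
# Hypothetical model of routing error costs. All constants are
# assumptions chosen to illustrate the asymmetry between over- and
# under-classification, not measured values.
COST_MINI, COST_FULL = 0.3, 1.0  # relative compute cost per query
ERROR_PENALTY = 5.0              # assumed cost of a bad answer (rework, lost trust)

def expected_cost(p_over: float, p_under: float,
                  share_simple: float = 0.85) -> float:
    """Expected per-query cost given over/under-classification rates."""
    # Simple queries: misrouting them upward just pays the full-tier price.
    simple = share_simple * ((1 - p_over) * COST_MINI + p_over * COST_FULL)
    # Complex queries: misrouting them downward pays the cheap tier
    # plus the penalty for an unreliable answer.
    complex_ = (1 - share_simple) * (
        (1 - p_under) * COST_FULL + p_under * (COST_MINI + ERROR_PENALTY))
    return simple + complex_
```

Under these assumptions, a 10% under-classification rate adds more expected cost than a 10% over-classification rate, even though complex queries are a small minority, because a bad answer on a complex query is assumed to cost far more than the wasted compute on a simple one.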
What to Watch
The real test for adjustable reasoning is not benchmark performance. It is whether enterprise developers can build reliable routing systems that consistently select the right tier. Google has released preliminary routing guidelines, but the tooling for automated complexity classification is still in preview.
Competing model providers will likely ship their own adjustable reasoning features within two to three quarters. The first-mover advantage here is not the feature itself but the ecosystem of routing tools, monitoring dashboards, and cost optimization frameworks that Google is building around it.
Enterprise buyers evaluating Gemini 3.1 Pro should run pilot deployments focused on mixed workloads where the cost distribution between simple and complex queries is well understood. The value proposition is strongest when 80% or more of queries can run on the lightest tier.
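One way to frame that pilot analysis is a quick sensitivity check: how do savings move as the workload mix shifts? The relative prices below are the same kind of assumption as before (full reasoning normalized to 1.0), not published rates.

```python
# Back-of-the-envelope check for pilot planning. Prices are assumed
# relative rates (full reasoning = 1.0), not published figures.
def blended_savings(share_mini: float, share_full: float,
                    price_mini: float = 0.3, price_mid: float = 0.6,
                    price_full: float = 1.0) -> float:
    """Savings vs. uniform full reasoning, given the workload mix."""
    share_mid = 1.0 - share_mini - share_full
    blended = (share_mini * price_mini
               + share_mid * price_mid
               + share_full * price_full)
    return 1.0 - blended
```

At an 80/10/10 mini/mid/full split these assumptions yield 60% savings, versus roughly 47% at a 50/30/20 split, which is why a pilot should start by measuring the actual mix rather than assuming one.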