Verify critical details — pricing, licensing, availability — with the model's source before business decisions. Full methodology →
Devstral Small 2 24B Instruct
Model family: devstral
- llm
- open-weight
- commercial-friendly
- mid
- long-context
- coding
- agentic
- laptop-friendly
- vision
- eu-based
- apache-licensed
Quick Take
Mistral's laptop-class coding specialist — 24B parameters, Apache 2.0, runs on a single consumer GPUA GPU designed for desktop PCs and gaming — typically Nvidia RTX 3090, 4090, 5090 or similar. Consumer GPUs have 8-32GB of VRAM and cost a few thousand dollars each. Capable of running small and medium models, especially when quantized. The boundary between "runs on a consumer GPU" and "needs a datacenter GPU" roughly separates small from large models in the catalog., and beats 70B-class competitors on software-engineering benchmarks.
Plain-English Description
Devstral is Mistral's coding specialist family, and Devstral Small 2 24B Instruct is the version most teams should actually care about. The "Devstral 2" flagship at 123B parameters gets most of the benchmark headlines — it hits 72.2% on SWE-Bench Verified — but it carries a custom "modified MIT" license with commercial restrictions that limit its deployability. Devstral Small 2 24B is the cleanly-licensed Apache 2.0 option: smaller, still very capable, and specifically engineered to run on hardware you already have.
The sizing isn't accidental. Mistral explicitly built Devstral Small 2 to fit on a single RTX 4090 or a MacBook with 32GB of unified memory. That's a meaningful product decision for coding models because code is sensitive — teams often can't, or won't, ship source code out to third-party APIs for inferenceRunning a model to get outputs — as opposed to training it. When you send a prompt to ChatGPT, that's inference. Inference is much cheaper than training per operation but adds up quickly at scale. Pricing pages almost always refer to inference costs (per million tokens, per request, etc.), not training costs.. With Devstral Small 2 quantized to 4-bit GGUF, you can run an entirely local coding agent on a developer's own laptop without any code leaving the device. At 68% on SWE-Bench Verified, that local deployment gives you coding capability that was proprietary-only a year earlier.
Devstral is purpose-built for agentic coding, not just code completion. The model is trained and instruction-tuned to operate tool-using software-engineering agents — exploring codebases across many files, running terminal commands, making coherent multi-file edits, recovering from errors, and holding long plans in context across hundreds of tool calls. Mistral recommends the OpenHands scaffolding and ships a companion CLI called Mistral Vibe for terminal-based development workflows. The 256K-tokenThe basic unit of text a model reads and writes. Tokens are roughly three-quarters of a word in English — so 100 tokens is about 75 words. Models don't see letters or words directly; they see tokens. Pricing is almost always quoted per million tokens, and context windows are measured in tokens rather than words. context windowThe maximum amount of text the model can "see" at once — prompt plus prior conversation plus any documents you give it. Measured in tokens (which are roughly three-quarters of a word each). A 128K context window is about 96,000 words of input — roughly a 400-page book. Larger context windows let the model work with bigger documents but cost more to run. is specifically there to support whole-repository reasoning. Compared to IDE-completion-focused coding models (Codestral, the original one), Devstral is explicitly aimed at autonomous coding agents that do real engineering work over long sessions.
Best For
- Private, local-first coding assistants for enterprise developers. Run it on developer laptops or a shared internal GPUThe specialized chip that runs most AI models. Originally designed for 3D graphics, GPUs turned out to be excellent at the math AI requires. Nvidia dominates the AI GPU market; common datacenter models include the H100, H200, and B200. Running an AI model without a GPU is possible but painfully slow for anything but the smallest models.. Code never leaves the organization.
- Agentic coding workflows requiring long-context reasoning. Whole-repository edits, multi-file refactors, autonomous bug-fix agents — the 256K context windowThe maximum amount of text the model can "see" at once — prompt plus prior conversation plus any documents you give it. Measured in tokens (which are roughly three-quarters of a word each). A 128K context window is about 96,000 words of input — roughly a 400-page book. Larger context windows let the model work with bigger documents but cost more to run. and the agentic post-trainingAny training that happens after pretraining to make a base model useful for real tasks. Includes instruction tuning, chat tuning, and alignment work. Post-training is dramatically cheaper than pretraining — thousands to low millions rather than tens of millions. Most of what distinguishes GPT-4 from Llama 3.1 as a product, rather than as a base capability, is post-training. are there for this.
- Cost-optimized hosted coding APIs. At $0.10 input / $0.30 output, Devstral Small 2 on Mistral's API is among the cheapest capable coding models available. For high-volume coding agent deployments where tokenThe basic unit of text a model reads and writes. Tokens are roughly three-quarters of a word in English — so 100 tokens is about 75 words. Models don't see letters or words directly; they see tokens. Pricing is almost always quoted per million tokens, and context windows are measured in tokens rather than words. cost matters, this is the economics play.
- Teams who want an Apache 2.0 coding model. The cleanly-licensed alternative to Devstral 2 (123B, custom license) and to proprietary U.S. coding APIs. Modify, fine-tuneA model that has been further trained on additional data to specialize it for a particular task, domain, or style. Fine-tuning a general model on medical literature produces a medical specialist; fine-tuning on your company's support tickets produces a support assistant that sounds like your team. Fine-tunes are much cheaper to create than training a model from scratch., redistribute without friction.
- Fine-tuning on proprietary codebases. The combination of open weightsThe numerical values inside a trained model that encode everything it has learned. A model is, functionally, a giant list of weights — tens of billions of numbers for a mid-sized model, hundreds of billions for a frontier model. "Open-weight" means those numbers are published. "Downloading the weights" means getting the actual file you'd need to run the model yourself., permissive license, and manageable size (24B fits on a single node for full fine-tuning) makes it the practical choice for teams wanting to specialize a coding model on their own code.
Not For
- Absolute top-tier SWE-Bench performance. The 123B Devstral 2 flagship scores 72.2% vs Devstral Small 2's 68%. If you need the highest benchmark number and can accept the custom license's commercial restrictions, the larger flagship is stronger. For most teams, the 4-point gap is worth the license clarity.
- General-purpose chat, reasoning, or vision tasks. Devstral Small 2 has vision and general language capability but is specifically post-trained for coding. For mixed workloads, Mistral Small 4 (which absorbs Devstral's coding capability into a general-purpose model) is a better default.
- Teams without engineering scaffolding. Devstral is designed to operate inside agentic scaffolds like OpenHands, Kilo Code, Aider, or Mistral Vibe. Using it as a naked chat modelShorthand for an instruct-tuned model specifically designed for back-and-forth conversation rather than single-shot tasks. Chat models remember earlier turns in the conversation (within the context window) and respond in a conversational register. GPT-4, Claude, and most Llama Instruct variants are chat models. In practice, "chat model" and "instruct-tuned model" often mean the same thing. without tool-use orchestration leaves most of its capability on the table.
- Extremely constrained hardware (less than 16GB VRAMThe memory built into a GPU. VRAM size determines what models you can load and run — a model's weights must fit in VRAM (or be cleverly swapped in and out). A 7B model in 4-bit quantization needs about 6GB of VRAM; a 70B model in 4-bit needs about 40GB; full-precision frontier models need multiple high-end GPUs. When people talk about a model "fitting" on a GPU, they mean VRAM.). At aggressive quantizationCompressing a model by reducing the numerical precision of its stored weights — for example, from 16-bit numbers to 4-bit numbers. The compressed model uses roughly a quarter of the memory and runs faster on most hardware, at the cost of slight accuracy loss. Quantization is what makes big models runnable on laptops — a 70B model in 4-bit quantization can fit on hardware that couldn't load the full-precision version. Devstral Small 2 can run on 12GB, but performance degrades. For truly small hardware, reach for Ministral 3 8B or 3B instead.
License — Plain-English Summary
Apache 2.0. Commercial use allowed, modifications allowed, redistribution allowed, include the license file. No conditions, no revenue caps, no special terms. This is the permissive Devstral. Do not confuse with the larger 123B Devstral 2 flagship, which uses a different license ("modified MIT") with commercial use restrictions tied to revenue — a meaningfully different legal posture.
How It Compares
- vs. Devstral 2 (123B) — The 123B flagship is more capable on SWE-Bench (72.2% vs 68%) but carries a custom "modified MIT" license that restricts commercial use above a revenue threshold. For any commercial deployment where license clarity matters, Devstral Small 2 is the better starting point.
- vs. Mistral Small 4 — Small 4 is a general-purpose model that absorbs Devstral's coding capability plus reasoning, vision, and agentic behavior. If you need mixed workloads, Small 4 is better. If you specifically need a coding-focused model with its full capability weighted toward software engineering, Devstral Small 2 is the specialist.
- vs. Qwen 3 Coder Flash (30B) — Mistral claims Devstral Small 2 outperforms Qwen 3 Coder Flash on agentic coding benchmarks despite being smaller. Both are Apache 2.0. Close competitors; evaluate on your own workload.
Under the Hood
Devstral Small 2 is a 24B-parameter dense transformerThe core model architecture that powers nearly every modern AI language model. Introduced by Google researchers in 2017, it uses a mechanism called attention to process text by looking at every word in context with every other word simultaneously, rather than one at a time. "Transformer" is the T in GPT, BERT, and most other model names. post-trained for agentic software engineering. Architecturally it shares Ministral 3's structure with rope-scaling (inspired by Llama 4) and scalable-softmax attentionThe mechanism inside a Transformer that lets the model weigh which parts of the input matter most when processing each word. When you read "the cat sat on the mat," attention is how the model knows that "it" in a later sentence refers back to the cat, not the mat. Attention is what made modern language models possible.. The 256K context windowThe maximum amount of text the model can "see" at once — prompt plus prior conversation plus any documents you give it. Measured in tokens (which are roughly three-quarters of a word each). A 128K context window is about 96,000 words of input — roughly a 400-page book. Larger context windows let the model work with bigger documents but cost more to run. uses attention optimizations to avoid quadratic blow-up. The model supports Mistral's function-calling format natively and is compatible with OpenHands, Mistral Vibe, Kilo Code, Aider, and Cline as agentic scaffolds.
Benchmark performance as of launch: 68.0% on SWE-Bench Verified (real-world GitHub issues), noted by Hugging Face's Head of Product as potentially "the new local coding king." On the Trelis and Unsloth community evaluations, Devstral Small 2 generalizes well to fine-tuning and retains its agentic behavior through LoRAA lightweight fine-tuning method that adds a small number of new parameters to a frozen base model rather than retraining the whole thing. LoRA adapters are tiny (often a few hundred megabytes versus the base model's tens of gigabytes), fast to train, and can be swapped in and out. Useful when you want many specialized variants of the same base model without storing a full copy for each. training when the audio tower is frozen.
Available on Mistral's API as devstral-small-2, on Hugging Face as mistralai/Devstral-Small-2-24B-Instruct-2512, and as GGUF quantizations via unsloth/Devstral-Small-2-24B-Instruct-2512-GGUF and similar community releases for direct llama.cpp / Ollama / LM Studio use.
Cost
- Self-hosted cost
- $0.00 beyond compute
- API input (per 1M tokens)
- $0.10
- API output (per 1M tokens)
- $0.30
- API providers
- mistral, openrouter, fireworks
- Notes
- Same API pricing as Mistral Small 3.1 per Mistral's launch positioning. Self-hosting is free beyond compute costs; runs on a single RTX 4090 at Q4 quantization or on a 32GB Mac.
Hardware requirements
- Min VRAM
- 16 GB
- Recommended VRAM
- 48 GB
- Runs on laptop
- Yes
- Notes
- Q4-quantized GGUF runs comfortably on a single consumer GPU (RTX 4090, RTX 3090, or similar). Full BF16 precision needs ~48GB VRAM. 32GB unified-memory Apple Silicon Macs handle it through llama.cpp / LM Studio.