Verify critical details (pricing, licensing, availability) with the model's source before making business decisions.
Mistral Small 4 Eagle
Model family: mistral-small
Eagle speculative-decoding head for Mistral Small 4. Pair it with the base model for higher inference throughput. An architectural extension, not a standalone model.
Listing Notes
This isn't a standalone model; it's an Eagle-architecture speculative-decoding head designed to accelerate inference on Mistral Small 4. Speculative decoding works by having a small, fast draft model predict several tokens ahead of the main model. The main model then checks those draft tokens in parallel, keeping the ones it agrees with and discarding the rest. The net effect is higher tokens-per-second throughput on the same hardware. It is catalogued as a separate listing (rather than collapsed into Mistral Small 4's access_methods) because it's an architectural extension with its own checkpoint, not a quantization of the base weights. Pair it with mistralai/Mistral-Small-4-119B-2603 and use it via vLLM's speculative decoding support.
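The draft-then-verify loop described above can be sketched with toy stand-ins for both models. This is a minimal illustration, not Mistral's or vLLM's implementation: `toy_target_next`, `toy_draft_next`, `speculative_step`, and `generate` are hypothetical names, verification is simplified to greedy exact-match, and the draft is a separate function rather than an Eagle head (which, as published, drafts from the main model's hidden features). A real engine also scores all draft positions in one batched forward pass instead of the Python loop used here.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy next-token function


def toy_target_next(ctx: List[Token]) -> Token:
    """Hypothetical 'main model': counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft_next(ctx: List[Token]) -> Token:
    """Hypothetical 'draft head': agrees with the target except after a 4."""
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


def speculative_step(prefix: List[Token], draft_next: NextToken,
                     target_next: NextToken, k: int) -> List[Token]:
    """Run one draft-then-verify round and return the tokens it yields."""
    # 1. The draft proposes k tokens autoregressively (the cheap part).
    drafts: List[Token] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        drafts.append(tok)
        ctx.append(tok)

    # 2. The target checks each drafted position in order.
    accepted: List[Token] = []
    ctx = list(prefix)
    for tok in drafts:
        expected = target_next(ctx)
        if tok != expected:
            # First disagreement: keep the target's token, drop the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3. Every draft was accepted, so the target adds one extra token.
    accepted.append(target_next(ctx))
    return accepted


def generate(prefix: List[Token], draft_next: NextToken,
             target_next: NextToken, k: int, max_new: int) -> List[Token]:
    """Repeat speculative rounds until max_new tokens have been produced."""
    out = list(prefix)
    while len(out) - len(prefix) < max_new:
        out.extend(speculative_step(out, draft_next, target_next, k))
    return out[: len(prefix) + max_new]


if __name__ == "__main__":
    print(generate([0], toy_draft_next, toy_target_next, k=3, max_new=8))
```

Each round yields between 1 and k + 1 tokens for a single verification pass of the main model, which is where the throughput gain comes from. With greedy verification the output is identical to what the main model would produce on its own.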
Identity
- Creator
- Mistral AI
- Model family
- mistral-small
- Release date
- 2026-04-07
Technical specs
- Parameter count
- Small speculative-decoding head (typically hundreds of millions of parameters) designed to predict draft tokens ahead of the main Mistral Small 4 model. Not usable standalone — must be paired with the base Small 4 checkpoint.
- Context window
- 262K tokens
- Modalities
- Image Input
- Text
- Primary capabilities
- Chat
- Instruction Following
License
- License
- Apache 2.0
- Commercial use
- Allowed
- Terms
- Modification ✓
- Redistribution ✓
- Attribution ✓
Access
- Openness
- Open Weight
- Access methods
- Local Runtime (vLLM)
- Weights Download (Hugging Face)
- Cost tier
- Self Hosted Only
Tags
- llm
- open-weight
- commercial-friendly
- inference-acceleration
- speculative-decoding
- eu-based
- apache-licensed
- architectural-extension
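As a rough illustration of the pairing described in the listing notes, the sketch below builds the engine arguments one might pass to vLLM. It rests on several assumptions: that the installed vLLM version accepts a `speculative_config` dictionary with `method`, `model`, and `num_speculative_tokens` keys (check the vLLM documentation for your version), that `EAGLE_HEAD` is replaced with the real location of the Eagle checkpoint (the value here is a hypothetical placeholder, since the listing does not give one), and that three draft tokens is a sensible starting point. `build_engine_kwargs` is a hypothetical helper, not part of vLLM.

```python
from typing import Any, Dict

BASE_MODEL = "mistralai/Mistral-Small-4-119B-2603"  # base checkpoint named in the listing
EAGLE_HEAD = "path/or/repo-id-of-eagle-head"        # hypothetical placeholder, not a real repo id


def build_engine_kwargs(base_model: str, draft_head: str,
                        num_draft_tokens: int = 3) -> Dict[str, Any]:
    """Assemble keyword arguments pairing a base model with a draft head.

    The key names inside "speculative_config" are an assumption about the
    vLLM version in use; verify them against the vLLM docs before relying
    on this.
    """
    if num_draft_tokens < 1:
        raise ValueError("num_draft_tokens must be at least 1")
    return {
        "model": base_model,
        "speculative_config": {
            "method": "eagle",
            "model": draft_head,
            "num_speculative_tokens": num_draft_tokens,
        },
    }


kwargs = build_engine_kwargs(BASE_MODEL, EAGLE_HEAD)

# With vLLM installed and enough GPU memory for the base model, these
# arguments would be passed to the engine roughly like this:
#   from vllm import LLM
#   llm = LLM(**kwargs)
```

The number of draft tokens is the main tuning knob. More drafts per round help only while the head's predictions keep being accepted, so it is worth measuring throughput on representative prompts.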