← Back to hard AIs

Verify critical details — pricing, licensing, availability — with the model's source before business decisions. Full methodology →

Models · OpenAI

gpt-oss-20b

Model family: gpt-oss

Size
mid (21.0B params)
Context
131,072 tokens
Released
2025-08-04
Openness
open-weight
License
Cost tier
mixed
Rating
4.0 — An excellent on-device reasoning model — o3-mini-class quality on 16GB of memory, clean Apache 2.0, with tool use and adjustable reasoning. Text-only and naturally limited by its size, hence 4.0.
Modalities
text
Capabilities
chat, coding, function-calling, instruction-following, long-context, reasoning, tool-use
Access
local-runtime-llama-cpp, local-runtime-lm-studio, local-runtime-mlx, local-runtime-ollama, local-runtime-vllm, weights-download-hf

Quick Take

OpenAI's on-deviceRunning a model directly on a consumer device — a laptop, a phone, a smart speaker — rather than in a data center. On-device inference keeps data private by never leaving the device, and works offline. Small models (under ~10B parameters, often quantized) can run on-device; larger models cannot yet. open model: o3-mini-class reasoning that runs locally on 16GB of memory, under clean Apache 2.0 — download it and run it on a good laptop.

Plain-English Description

gpt-oss-20b is the smaller of OpenAI's two open-weightA model where the trained weights are freely downloadable — you can run it yourself without contacting the creator. Llama, Mistral, Qwen, and Gemma are open-weight. Open-weight does not mean open-source: the training data and code often stay private. The license still governs what you can do with the weights, including whether you can use them commercially. models, built for local and on-deviceRunning a model directly on a consumer device — a laptop, a phone, a smart speaker — rather than in a data center. On-device inference keeps data private by never leaving the device, and works offline. Small models (under ~10B parameters, often quantized) can run on-device; larger models cannot yet. use. Where the 120b needs a datacenter GPUA GPU designed for server and cloud use, typically Nvidia H100, H200, B200, or A100. Datacenter GPUs have 40-192GB of VRAM and cost tens of thousands of dollars each. Required for training frontier models and for running the largest models in full precision. Most cloud AI APIs run on datacenter GPUs behind the scenes., the 20b runs on 16GB of memory — a high-end laptop or desktop — while delivering reasoning quality OpenAI compares to its o3-mini model. It's a mixture-of-experts design (21B total, ~3.6B active per tokenThe basic unit of text a model reads and writes. Tokens are roughly three-quarters of a word in English — so 100 tokens is about 75 words. Models don't see letters or words directly; they see tokens. Pricing is almost always quoted per million tokens, and context windows are measured in tokens rather than words.), which is how it stays light.

Like its larger sibling it's text-only and built for reasoning and agentic tasks: adjustable reasoning effort, full chain-of-thought, and native tool use (function calling, browsing, Python). The point is to put capable reasoning on hardware people already have, with no API, no per-token cost, and no data leaving the device.

For privacy-sensitive local applications, rapid prototyping, or embedding reasoning into a product without infrastructure spend, it's one of the stronger small open models — and the Apache 2.0 license makes it free to build on commercially.

Best For

  • On-deviceRunning a model directly on a consumer device — a laptop, a phone, a smart speaker — rather than in a data center. On-device inference keeps data private by never leaving the device, and works offline. Small models (under ~10B parameters, often quantized) can run on-device; larger models cannot yet. and local reasoning where data stays on the machine.
  • Privacy-first or offline applications with no API dependency.
  • Rapid prototyping and iteration without inferenceRunning a model to get outputs — as opposed to training it. When you send a prompt to ChatGPT, that's inference. Inference is much cheaper than training per operation but adds up quickly at scale. Pricing pages almost always refer to inference costs (per million tokens, per request, etc.), not training costs. costs.
  • Embedding reasoning and tool use into products on consumer hardware.

Not For

  • The strongest reasoning — step up to gpt-oss-120b or a closed flagship.
  • MultimodalA model that can handle more than one type of input or output — typically text plus images, sometimes plus audio or video. "GPT-4 Vision" and "Llama 3.2 11B Vision" are multimodal models that accept both text and images. A text-only model is called "unimodal" but nobody uses that term; text-only is the assumed default. tasks — it's text-only.
  • Workloads needing the largest context or deepest knowledge.
  • Anyone wanting frontier quality from an on-deviceRunning a model directly on a consumer device — a laptop, a phone, a smart speaker — rather than in a data center. On-device inference keeps data private by never leaving the device, and works offline. Small models (under ~10B parameters, often quantized) can run on-device; larger models cannot yet. model.

License — Plain-English Summary

Apache 2.0 — unrestricted commercial use, modification, fine-tuning, and redistribution, no royalties or carve-outs; keep the notices. OpenAI's short "gpt-oss usage policy" covers acceptable use without restricting commercial deployment. Running locally, it keeps all data on-deviceRunning a model directly on a consumer device — a laptop, a phone, a smart speaker — rather than in a data center. On-device inference keeps data private by never leaving the device, and works offline. Small models (under ~10B parameters, often quantized) can run on-device; larger models cannot yet. — ideal for privacy-sensitive products. Among the cleanest licenses available for an on-device model.

How It Compares

Against gpt-oss-120b, the 20b trades capability for portability — laptop-class versus datacenter-GPUThe specialized chip that runs most AI models. Originally designed for 3D graphics, GPUs turned out to be excellent at the math AI requires. Nvidia dominates the AI GPU market; common datacenter models include the H100, H200, and B200. Running an AI model without a GPU is possible but painfully slow for anything but the smallest models.. Against Google's on-deviceRunning a model directly on a consumer device — a laptop, a phone, a smart speaker — rather than in a data center. On-device inference keeps data private by never leaving the device, and works offline. Small models (under ~10B parameters, often quantized) can run on-device; larger models cannot yet. open models like Gemma 4 E4B, gpt-oss-20b competes on reasoning and tool use under the same Apache 2.0 license, though Gemma adds multimodality. Against any cloud API, the difference isn't raw capability — it's that gpt-oss-20b runs entirely on your own device, free of per-tokenThe basic unit of text a model reads and writes. Tokens are roughly three-quarters of a word in English — so 100 tokens is about 75 words. Models don't see letters or words directly; they see tokens. Pricing is almost always quoted per million tokens, and context windows are measured in tokens rather than words. cost and data-routing concerns.

Cost

Self-hosted cost
$0.00 beyond compute
Notes
Free to self-host under Apache 2.0; runs locally via Ollama, LM Studio, llama.cpp, and similar. Adjustable reasoning effort (low / medium / high).

Hardware requirements

Min VRAM
16 GB
Recommended VRAM
24 GB
Runs on laptop
Yes
Notes
Runs on 16GB of memory — high-end laptops and desktops — making it a practical on-device reasoning model.

Comparable models

Commercial-use conditions

Apache 2.0 permits unrestricted commercial use, modification, fine-tuning, and redistribution. OpenAI attaches a short "gpt-oss usage policy" covering acceptable use; it doesn't restrict commercial deployment.

Sources