Verify critical details (pricing, licensing, availability) with the model's source before making business decisions.
Qwen3.6-27B
Model family: qwen3-6
- llm
- open-weight
- commercial-friendly
- mid-size
- coding
- long-context
- self-hostable
- on-device
- china-based
- apache-2-0
Quick Take
The best open model you can actually run yourself: a dense 27B that edges out Qwen's own 397B flagship on several agentic-coding benchmarks while fitting on a single consumer GPU, under Apache 2.0.
Plain-English Description
Qwen3.6-27B, released in April 2026, is the model to point most self-hosters at. It's a "dense" model, meaning every one of its 27 billion parameters is used on every token, unlike mixture-of-experts designs that switch on only a slice. Dense models are simpler to run predictably, and at 27B this one is small enough, once quantized to about 16.8GB, to fit on a single 24GB consumer graphics card or a high-end laptop.
The surprising part is how good it is for its size. On several agentic-coding benchmarks it edges out Qwen's own 397-billion-parameter flagship: around 77.2 on SWE-bench Verified, for example, versus the bigger model's 76.2. That's the payoff of a focused, well-trained dense model: you get near-frontier coding capability without needing a server full of GPUs. For a business that wants a capable coding or agent model running entirely in-house, on hardware it already owns, this is close to ideal.
Like the rest of the open Qwen tier, it's Apache 2.0, so there are no licensing strings on commercial use, modification, or fine-tuning. The main thing to know is that it's a focused workhorse, not a do-everything frontier model — for the absolute top scores you'd reach for the closed Max flagship or the much larger 397B.
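The memory figures in this description can be sanity-checked with simple arithmetic. The sketch below is illustrative only: it assumes 1 GB means 10^9 bytes and counts weights alone, ignoring runtime overhead, and the function names are made up for this example. It shows why 16-bit weights would not fit on a 24GB card, and what average bits per weight the quoted 16.8GB implies.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (assuming 1 GB = 10**9 bytes)."""
    return params_billions * bits_per_weight / 8


def implied_bits_per_weight(params_billions: float, size_gb: float) -> float:
    """Average bits per weight implied by a given file size."""
    return size_gb * 8 / params_billions


PARAMS_B = 27.0  # 27 billion parameters, from the description above

print(weight_memory_gb(PARAMS_B, 16))  # 54.0 GB at 16-bit precision
print(weight_memory_gb(PARAMS_B, 4))   # 13.5 GB at a flat 4 bits per weight
print(round(implied_bits_per_weight(PARAMS_B, 16.8), 2))  # 4.98
```

The quoted 16.8GB works out to roughly 5 bits per weight, not a flat 4. That is plausible for practical quantization formats, which typically keep some tensors at higher precision and store extra scaling data.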
Best For
- Self-hosted coding and agentic workflows where you want strong capability on hardware you control.
- Privacy-sensitive teams that need everything to stay in-house on a single GPU: no API, no data leaving the building.
- Local development, on-device assistants, and edge deployments where a 27B dense model is the sweet spot.
- Cost-conscious teams avoiding per-token API fees entirely.
- Fine-tuning on a budget — small enough to adapt without a cluster.
Not For
- Maximum capability regardless of cost — the closed Qwen3.7-Max and the open Qwen3.5-397B-A17B both go higher.
- Multimodal work: this is a text model, so for image or video understanding use the 397B or a dedicated vision model.
- Phone-class or very low-memory devices — for those, drop to the smaller Qwen3 dense sizes (8B, 4B, and down).
- Teams that specifically want a mixture-of-experts model for throughput reasons.
License — Plain-English Summary
Apache 2.0, the same clean, permissive license as the rest of the open Qwen tier: commercial use, modification, fine-tuning, and redistribution all allowed, no royalties, no user-count carve-out. Keep the notices, flag significant changes if you redistribute, and you're done. Because it's so easy to self-host, this is also the Qwen model with the cleanest privacy story: run it on your own GPU and no data ever leaves your environment.
How It Compares
Against Qwen3.5-397B-A17B, the 27B is far easier to run (one GPU versus eight) and actually wins on some coding suites, while giving up the bigger model's multimodality and broad-knowledge edge. Against Qwen3-Coder-30B-A3B, the dedicated coder, the 27B is a stronger generalist of similar size; pick the Coder for pure code-tooling workflows and the 27B when you want one capable local model for mixed work. Against other "best local model" picks like the mid-size Llama and DeepSeek-R1 distills, Qwen3.6-27B currently sets the bar for near-frontier capability on a single consumer GPU.
Cost
- Self-hosted cost: $0.00 beyond compute
- Notes: Free to self-host under Apache 2.0; this is the model's whole pitch. Some third-party hosts also serve it for per-token pricing. Context window here follows the current Qwen generation's 256K-class standard.
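To weigh "free beyond compute" against a third-party host's per-token pricing, a rough break-even calculation can help. This is a minimal sketch: the hardware price and per-million-token rate below are hypothetical placeholders, not figures from this page, and the calculation ignores electricity and operations effort.

```python
def api_cost_usd(tokens: float, usd_per_million_tokens: float) -> float:
    """API spend for a given token volume."""
    return tokens / 1_000_000 * usd_per_million_tokens


def breakeven_tokens(hardware_usd: float, usd_per_million_tokens: float) -> float:
    """Token volume at which API spend equals the up-front hardware cost."""
    return hardware_usd / usd_per_million_tokens * 1_000_000


# Hypothetical placeholder inputs; substitute real quotes before deciding.
HARDWARE_USD = 2000.0
USD_PER_MILLION_TOKENS = 0.50

print(breakeven_tokens(HARDWARE_USD, USD_PER_MILLION_TOKENS))  # 4000000000.0
```

Below the break-even volume a hosted API is cheaper on these assumptions; above it, owning the hardware wins, before counting power and maintenance.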
Hardware requirements
- Min VRAM: 18 GB
- Recommended VRAM: 24 GB
- Runs on laptop: Yes
- Notes: Fits in about 16.8GB at Q4_K_M quantization. A single 24GB consumer GPU (e.g. an RTX-class card) runs it comfortably, and high-end laptops can manage it. This is the easiest near-frontier model to run on your own hardware.
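The thresholds above can be turned into a quick check for a given card. The sketch below uses only the numbers listed in this section (18 GB minimum, 24 GB recommended, about 16.8GB of weights). The tier labels and function names are invented for illustration, and the headroom figure is a rough estimate of what remains for context and runtime overhead.

```python
MIN_VRAM_GB = 18.0          # minimum VRAM listed above
RECOMMENDED_VRAM_GB = 24.0  # recommended VRAM listed above
WEIGHTS_GB = 16.8           # approximate Q4_K_M size listed above


def classify_gpu(vram_gb: float) -> str:
    """Label a card against the listed thresholds (labels are illustrative)."""
    if vram_gb >= RECOMMENDED_VRAM_GB:
        return "comfortable"
    if vram_gb >= MIN_VRAM_GB:
        return "tight"
    return "too small"


def headroom_gb(vram_gb: float) -> float:
    """VRAM left over after loading the quantized weights."""
    return round(vram_gb - WEIGHTS_GB, 1)


for vram in (16.0, 20.0, 24.0):
    print(vram, classify_gpu(vram), headroom_gb(vram))
```

A 24GB card leaves about 7.2GB beyond the weights, which is why it is listed as the comfortable option.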