Verify critical details (pricing, licensing, availability) with the model's source before making business decisions.
Gemini 3.5 Flash
Model family: gemini-3-5
- llm
- closed-api
- frontier
- multimodal
- long-context
- coding
- agentic
- us-based
- proprietary
Quick Take
Google's new default model: frontier-tier coding and reasoning, full text-image-audio-video input, a million-token memory, and Flash-tier speed. It is free in Google's apps and cheap on the API.
Plain-English Description
Gemini 3.5 Flash, launched at Google I/O in May 2026, is the model now powering the Gemini app and AI Mode in Google Search for hundreds of millions of people — and it's Google's recommended default for developers too. "Flash" used to mean "the cheap, fast, weaker option," but Google flipped that: this Flash actually outperforms the previous generation's Pro model on coding and agentic tasks while running about four times faster. The headline framing from Google is that the everyday model is now strong enough to handle work that used to require the flagship.
It's natively multimodal in the fullest sense: a single request can mix text, images, audio, video, and PDFs, with text out, and video understanding in particular is a Google strength. It carries a one-million-token context window (enough for hundreds of pages or a large codebase at once) and a "dynamic thinking" mode that decides how much step-by-step reasoning to spend, with manual levels if you want to tune the cost/quality trade-off.
The practical catch is that it's closed: there are no weights to download, so everything runs through Google's API or apps. For individuals that's a feature (it's free in the Gemini app); for businesses with strict data-residency needs it means relying on Google Cloud's terms rather than self-hosting. At $1.50 in / $9 out per million tokens it's mid-priced among frontier models: cheaper than the top closed flagships, pricier than the budget tiers.
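To make the size of the window concrete, here is a rough sketch of checking whether a document fits. It assumes the common rule of thumb that one English token is about three-quarters of a word; that ratio is an approximation, not a Gemini tokenizer guarantee, and the function names are illustrative only.

```python
import math

# Context window size as stated for Gemini 3.5 Flash.
CONTEXT_WINDOW_TOKENS = 1_000_000

# Assumption: rough English average, not an exact tokenizer figure.
WORDS_PER_TOKEN = 0.75


def estimate_tokens(word_count: int) -> int:
    """Estimate how many tokens a text of `word_count` words would use."""
    return math.ceil(word_count / WORDS_PER_TOKEN)


def fits_in_window(word_count: int, reserve_tokens: int = 0) -> bool:
    """Check whether the text, plus tokens reserved for instructions or
    conversation history, stays within the context window."""
    return estimate_tokens(word_count) + reserve_tokens <= CONTEXT_WINDOW_TOKENS


if __name__ == "__main__":
    # A 300,000-word codebase dump with 10,000 tokens reserved for the prompt.
    print(estimate_tokens(300_000), fits_in_window(300_000, reserve_tokens=10_000))
```

For real budgeting, count tokens with the provider's own tooling; non-text inputs such as video and audio are tokenized differently and are not covered by this estimate.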
Best For
- General-purpose frontier work where you want one fast, capable, multimodal model for most tasks.
- Agentic and coding workflows — it's tuned for tool use and parallel agent loops.
- Multimodal jobs involving video or audio, where Gemini's understanding leads the field.
- Long-document and large-codebase analysis that benefits from the 1M-token window.
- Individuals and teams already in the Google ecosystem (Workspace, Cloud, Search) who want tight integration — and free access in the app.
Not For
- Anyone who needs to self-host or keep data fully in-house — it's closed and API-only; use Gemma 4 31B for that.
- The very hardest deep-reasoning or precise long-context retrieval, where the Pro tier still leads — see Gemini 3.1 Pro (and the imminent 3.5 Pro on our watchlist).
- Cost-minimizing high-volume workloads where a cheaper budget tier or an open self-hosted model would do; at $1.50/$9 it's not the cheapest option.
- Teams wanting to fine-tune or own the model.
License — Plain-English Summary
There's no open license here. Gemini 3.5 Flash is proprietary, accessed through Google's API under Google Cloud's terms. You get commercial rights to what you build with the outputs, but no rights to the model itself: no weights, no modification, no redistribution. Treat it like any closed frontier API. Your due diligence covers Google's Generative AI Prohibited Use Policy and the data-handling terms of the Gemini API / Vertex AI, which (being Google Cloud) include enterprise data-residency and processing options. If owning or self-hosting matters, Gemma is the Google line to look at instead.
How It Compares
Against Gemini 3.1 Pro, Flash is faster and cheaper and now beats it on coding and agentic benchmarks, but 3.1 Pro still leads on the hardest academic reasoning and precise long-context retrieval. Pick Flash for most work, Pro for the deep end (until Gemini 3.5 Pro ships). Against Gemma 4 31B, the open Google option, Flash is more capable and fully managed but closed; Gemma is the choice when you need to self-host. Against the other closed frontier models (GPT-5.5, Claude Opus 4.7, and the China-based closed flagships), Gemini's edge is native multimodality (especially video), the 1M context window, and Google ecosystem integration, at a price that undercuts the top Western flagships.
Under the Hood
Gemini 3.5 Flash is the first model in the Gemini 3.5 family; the API model ID is gemini-3.5-flash (no preview suffix), internal version 3.5-flash-05-2026, with a January 2026 knowledge cutoff. Independent measurement put it around 55 on the Artificial Analysis Intelligence Index at launch, with roughly 186 output tokens/second, which is fast for a reasoning-capable model. Reported agentic/coding scores include 76.2% on Terminal-Bench 2.1, 83.6% on MCP Atlas (tool use), and 84.2% on CharXiv reasoning (multimodal charts). It supports function calling, structured output, search-as-a-tool, and code execution, and is distributed across the Gemini API, Google AI Studio, Vertex AI, Google Antigravity, Android Studio, the Gemini app, and AI Mode in Search.
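As an illustration of where the model ID goes, the sketch below assembles a basic text request in the shape the Gemini API's REST generateContent endpoint has used for earlier models. It only builds the request and does not send it. Treat the v1beta base URL as an assumption for this model, and the helper name build_generate_request as hypothetical; check Google's current API reference before relying on either.

```python
import json

# Assumption: the public Gemini API base path used by earlier Gemini models.
API_BASE = "https://generativelanguage.googleapis.com/v1beta"

# Model ID as given above (no preview suffix).
MODEL_ID = "gemini-3.5-flash"


def build_generate_request(prompt: str, api_key: str):
    """Return (url, headers, body) for a single-turn text request.

    Hypothetical helper for illustration; nothing is sent over the network.
    """
    url = f"{API_BASE}/models/{MODEL_ID}:generateContent"
    headers = {
        "Content-Type": "application/json",
        "x-goog-api-key": api_key,
    }
    body = {"contents": [{"parts": [{"text": prompt}]}]}
    return url, headers, json.dumps(body)


if __name__ == "__main__":
    url, headers, body = build_generate_request("Summarize this repo.", "YOUR_API_KEY")
    print(url)
    print(body)
```

Sending the request is then a plain HTTPS POST with any HTTP client. Features such as function calling, structured output, and thinking levels add further fields to the body, which are left out here because their exact names for this model are not given above.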
Cost
- API input (per 1M tokens): $1.50
- API output (per 1M tokens): $9.00
- API providers: google-gemini-api, google-vertex-ai, openrouter
- Notes: $1.50 input / $9.00 output per million tokens; cached input $0.15 (non-global regions $1.65 / $9.90). Free for consumers in the Gemini app and AI Mode in Search. Max output 65,536 tokens. No self-hosting; this is a closed model.
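The per-million rates above translate into a per-request cost with simple arithmetic. The following is a minimal sketch using only the listed prices; the Rates class and estimate_cost function are illustrative names, and a cached rate for non-global regions is not listed, so the sketch refuses to guess one.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rates:
    """USD per one million tokens."""
    input_per_m: float
    output_per_m: float
    cached_input_per_m: Optional[float] = None


# Prices as listed above.
GLOBAL = Rates(input_per_m=1.50, output_per_m=9.00, cached_input_per_m=0.15)
# No cached-input price is listed for non-global regions, so it stays unset.
NON_GLOBAL = Rates(input_per_m=1.65, output_per_m=9.90)


def estimate_cost(input_tokens: int, output_tokens: int,
                  cached_tokens: int = 0, rates: Rates = GLOBAL) -> float:
    """Estimate the USD cost of one request.

    `cached_tokens` is the portion of `input_tokens` billed at the cached rate.
    """
    if cached_tokens > input_tokens:
        raise ValueError("cached_tokens cannot exceed input_tokens")
    if cached_tokens and rates.cached_input_per_m is None:
        raise ValueError("no cached-input rate known for these rates")
    fresh_tokens = input_tokens - cached_tokens
    total = fresh_tokens * rates.input_per_m + output_tokens * rates.output_per_m
    if cached_tokens:
        total += cached_tokens * rates.cached_input_per_m
    return total / 1_000_000


if __name__ == "__main__":
    # A full 1M-token prompt with a 100K-token answer.
    print(round(estimate_cost(1_000_000, 100_000), 2))
    # Same request with 800K of the prompt served from cache.
    print(round(estimate_cost(1_000_000, 100_000, cached_tokens=800_000), 2))
```

At these rates a full-window prompt with a long answer costs a few dollars per call, which is why caching repeated context matters for high-volume workloads.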
Commercial-use conditions
Commercial use is permitted through the Gemini API / Vertex AI under Google Cloud's terms. You're buying access, not the model — no weights, no modification, no redistribution.