Verify critical details — pricing, licensing, availability — with the model's source before business decisions. Full methodology →
Llama 3.3 70B Instruct
Model family: llama-3-3
- llm
- open-weight
- commercial-friendly
- large
- long-context
- multilingual
- us-based
- tool-use
- reasoning
Quick Take
Meta's efficiency milestone — a 70-billion-parameter model that Meta claims matches their much larger 405B model on most tasks, at a fraction of the cost to run.
Plain-English Description
Llama 3.3 70B Instruct is Meta's December 2024 entry in the Llama family, and its pitch is straightforward: roughly the same capability as the flagship Llama 3.1 405B, in a model that's about six times smaller. For anyone running open-weightA model where the trained weights are freely downloadable — you can run it yourself without contacting the creator. Llama, Mistral, Qwen, and Gemma are open-weight. Open-weight does not mean open-source: the training data and code often stay private. The license still governs what you can do with the weights, including whether you can use them commercially. models in production — or anyone paying API fees for the 405B — that's a meaningful shift. Meta achieved it through improvements in training techniques and post-trainingAny training that happens after pretraining to make a base model useful for real tasks. Includes instruction tuning, chat tuning, and alignment work. Post-training is dramatically cheaper than pretraining — thousands to low millions rather than tens of millions. Most of what distinguishes GPT-4 from Llama 3.1 as a product, rather than as a base capability, is post-training. refinement rather than architectural changes; the model uses the same Grouped-Query AttentionThe mechanism inside a Transformer that lets the model weigh which parts of the input matter most when processing each word. When you read "the cat sat on the mat," attention is how the model knows that "it" in a later sentence refers back to the cat, not the mat. Attention is what made modern language models possible. and 128K context windowThe maximum amount of text the model can "see" at once — prompt plus prior conversation plus any documents you give it. Measured in tokens (which are roughly three-quarters of a word each). A 128K context window is about 96,000 words of input — roughly a 400-page book. Larger context windows let the model work with bigger documents but cost more to run. as other Llama 3.1/3.3 models.
"70B" means 70 billion parameters, which puts this firmly in the "large open-weight model" category. In practical terms, this is a model you run on serious server-class GPUs or rent from an API provider — it's not going to run usefully on a laptop or even a single consumer GPUA GPU designed for desktop PCs and gaming — typically Nvidia RTX 3090, 4090, 5090 or similar. Consumer GPUs have 8-32GB of VRAM and cost a few thousand dollars each. Capable of running small and medium models, especially when quantized. The boundary between "runs on a consumer GPU" and "needs a datacenter GPU" roughly separates small from large models in the catalog. without heavy compromises. The tradeoff is that the capability is genuinely strong: on Artificial AnalysisAn independent benchmarking site that runs standardized tests across commercial and open-weight models and publishes comparable results on capability, speed, and cost. Widely cited for API provider comparisons — if you want to know whether Llama 3.3 70B is faster on Groq or Together, Artificial Analysis is the reference.'s independent Intelligence Index, Llama 3.3 70B scores 14, above the 8B model's 12 and competitive with other models of similar scale. Meta's own benchmarks show substantial gains over the predecessor 3.1 70B on reasoning, coding, math, and instruction-following.
For a business deciding between model sizes in the Llama family, this is the sweet spot for anyone who needs real capability and can justify the infrastructure — either by hosting it themselves on capable hardware, or by paying API rates that are still cheaper than frontier closed models while getting open-weight benefits (no vendor lock-in, data stays on your infrastructure, ability to fine-tuneA model that has been further trained on additional data to specialize it for a particular task, domain, or style. Fine-tuning a general model on medical literature produces a medical specialist; fine-tuning on your company's support tickets produces a support assistant that sounds like your team. Fine-tunes are much cheaper to create than training a model from scratch. for your specific use case).
Best For
- Production AI features where you've outgrown small models but don't want to pay frontier-model API prices
- Self-hostedRunning a model on hardware you control — your own servers, your own cloud instance, or your own laptop — rather than paying to access it through someone else's API. Self-hosting gives you full control over data and predictable costs, but requires the hardware and operational effort to run the model. Only possible with open-weight models. deployments on existing GPUThe specialized chip that runs most AI models. Originally designed for 3D graphics, GPUs turned out to be excellent at the math AI requires. Nvidia dominates the AI GPU market; common datacenter models include the H100, H200, and B200. Running an AI model without a GPU is possible but painfully slow for anything but the smallest models. infrastructure — when you already have the servers and want to stop paying per-tokenThe basic unit of text a model reads and writes. Tokens are roughly three-quarters of a word in English — so 100 tokens is about 75 words. Models don't see letters or words directly; they see tokens. Pricing is almost always quoted per million tokens, and context windows are measured in tokens rather than words. fees
- Complex reasoning, coding, and math tasks where the 8B model falls short
- Long-context document analysis — contracts, research reports, multi-document synthesis with the 128K context windowThe maximum amount of text the model can "see" at once — prompt plus prior conversation plus any documents you give it. Measured in tokens (which are roughly three-quarters of a word each). A 128K context window is about 96,000 words of input — roughly a 400-page book. Larger context windows let the model work with bigger documents but cost more to run.
- Fine-tuning for a specialized domain where you need capability, not just economy
Not For
- Laptop or single-consumer-GPUThe specialized chip that runs most AI models. Originally designed for 3D graphics, GPUs turned out to be excellent at the math AI requires. Nvidia dominates the AI GPU market; common datacenter models include the H100, H200, and B200. Running an AI model without a GPU is possible but painfully slow for anything but the smallest models. deployment — not realistic at this size without severe quality loss
- Small businesses without existing cloud/GPU infrastructure who just want a chatbot — Llama 3.1 8B or an API is a better starting point
- Vision, audio, or multimodalA model that can handle more than one type of input or output — typically text plus images, sometimes plus audio or video. "GPT-4 Vision" and "Llama 3.2 11B Vision" are multimodal models that accept both text and images. A text-only model is called "unimodal" but nobody uses that term; text-only is the assumed default. work — this is text-only (Llama 4 Scout is Meta's current multimodal option)
- Real-time latency-critical applications at high volume without significant optimization work
- Organizations above 700M monthly active users without a separate Meta license
License — Plain-English Summary
Same commercial terms as the rest of the Llama 3 family. Free to use commercially unless your product had over 700 million monthly active users when 3.3 launched in December 2024. Modify it, redistribute it, fine-tuneA model that has been further trained on additional data to specialize it for a particular task, domain, or style. Fine-tuning a general model on medical literature produces a medical specialist; fine-tuning on your company's support tickets produces a support assistant that sounds like your team. Fine-tunes are much cheaper to create than training a model from scratch. it, and ship products on it — you just need to credit Meta and include the license file. For the overwhelming majority of businesses, this is a permissive commercial license. Derivative model names must start with "Llama".
How It Compares
- Llama 3.1 8B Instruct (see Llama 3.1 8B Instruct — same family, same license, much smaller, much cheaper to run; the right choice when capability isn't the ceiling)
- Llama 3.1 405B Instruct (the predecessor flagship; Llama 3.3 70B is designed to replace it for most uses at lower cost)
- Llama 4 Scout (see Llama 4 Scout Instruct — newer, multimodalA model that can handle more than one type of input or output — typically text plus images, sometimes plus audio or video. "GPT-4 Vision" and "Llama 3.2 11B Vision" are multimodal models that accept both text and images. A text-only model is called "unimodal" but nobody uses that term; text-only is the assumed default., uses mixture-of-experts architecture; different strengths and a longer context windowThe maximum amount of text the model can "see" at once — prompt plus prior conversation plus any documents you give it. Measured in tokens (which are roughly three-quarters of a word each). A 128K context window is about 96,000 words of input — roughly a 400-page book. Larger context windows let the model work with bigger documents but cost more to run. but restricted in the EU)
Under the Hood
Llama 3.3 70B is a dense decoderThe part of a model that generates output, one token at a time, from an internal representation. Chat models are almost all decoder-only architectures — they take your prompt, process it, and stream out a response token by token. "Decoder-only" is the technical name for the family most people just call "chatbots."-only transformerThe core model architecture that powers nearly every modern AI language model. Introduced by Google researchers in 2017, it uses a mechanism called attention to process text by looking at every word in context with every other word simultaneously, rather than one at a time. "Transformer" is the T in GPT, BERT, and most other model names. with Grouped-Query AttentionThe mechanism inside a Transformer that lets the model weigh which parts of the input matter most when processing each word. When you read "the cat sat on the mat," attention is how the model knows that "it" in a later sentence refers back to the cat, not the mat. Attention is what made modern language models possible. (GQA), 128K context windowThe maximum amount of text the model can "see" at once — prompt plus prior conversation plus any documents you give it. Measured in tokens (which are roughly three-quarters of a word each). A 128K context window is about 96,000 words of input — roughly a 400-page book. Larger context windows let the model work with bigger documents but cost more to run., and the same overall architecture as Llama 3.1 70B. Meta's gains over the predecessor came from improved training data curation and significantly enhanced post-trainingAny training that happens after pretraining to make a base model useful for real tasks. Includes instruction tuning, chat tuning, and alignment work. Post-training is dramatically cheaper than pretraining — thousands to low millions rather than tens of millions. Most of what distinguishes GPT-4 from Llama 3.1 as a product, rather than as a base capability, is post-training. — supervised fine-tuningA post-training method where the model is trained on example pairs of input and desired output. SFT is typically the first post-training step after pretraining — the base model sees many examples of "here's an instruction, here's a good response" and learns to follow that pattern. Often followed by RLHF for further polish. on over 25 million synthetically generated examples, plus RLHFA post-training method where humans rate the model's outputs and the model learns to produce outputs that humans prefer. RLHF is what makes instruct-tuned models feel helpful and polite rather than robotic. It's also what most people mean when they talk about "alignment" — shaping the model's behavior to match human preferences.. PretrainingThe first and most expensive phase of training a model, where it learns general language and knowledge from enormous datasets — typically trillions of tokens of text scraped from the internet, books, code, and other sources. Pretraining produces a base model. Major labs spend millions to hundreds of millions of dollars on a single pretraining run. data cutoff is December 2023, consistent with the rest of Llama 3.x. The model supports function calling, structured outputs, and the full 128K context. It was trained on roughly 15 trillion tokens and is available in both standard BF16 and FP8-dynamic variants for efficient inferenceRunning a model to get outputs — as opposed to training it. When you send a prompt to ChatGPT, that's inference. Inference is much cheaper than training per operation but adds up quickly at scale. Pricing pages almost always refer to inference costs (per million tokens, per request, etc.), not training costs.. Fine-tuning is broadly supported across the Hugging Face and open-sourceA stricter standard than open-weight: the weights, the training code, and the training data are all released publicly. Very few large language models meet the full open-source bar — most "open" models in the AI world are actually open-weight. When in doubt, check the license file and the creator's documentation. tooling ecosystem.
Cost
- Self-hosted cost
- $0.00 beyond compute
- API input (per 1M tokens)
- $0.58
- API output (per 1M tokens)
- $0.71
- API providers
- together, groq, fireworks, openrouter
- Notes
- Roughly 10x the API cost of Llama 3.1 8B, reflecting the larger model. Self-hosted is still free beyond compute, but compute is substantially more expensive at this size. Figures are representative as of verification date.
Hardware requirements
- Min VRAM
- 40 GB
- Recommended VRAM
- 80 GB
- Runs on laptop
- No
- Notes
- 4-bit quantized needs ~40GB VRAM (a single A100 or 2x consumer cards with model sharding). Full precision wants 140GB+, typically 2x A100 or H100. Not practical on consumer hardware without aggressive quantization and performance tradeoffs.
Comparable models
Commercial-use conditions
Free for commercial use unless your product had more than 700 million monthly active users on December 6, 2024 (the Llama 3.3 release date). Past that threshold, a separate Meta license is required.