← Back to hard AIs

Verify critical details — pricing, licensing, availability — with the model's source before business decisions. Full methodology →

Models · Meta · Llama 3.1 8B Instruct

Feature-frozen. The creator has frozen feature development on this model (critical fixes only).

Hermes 3 — Llama 3.1 8B

fine-tune derivative of Llama 3.1 8B Instruct by Nous Research

Full-parameter fine-tune of Llama 3.1 8B (base, not Instruct) produced by Nous Research. Adds improved function calling, structured output (JSON mode), better roleplaying behavior, stronger steerability via system prompts, and ChatML prompt format. Claimed to be competitive with or superior to Meta's own Llama 3.1 8B Instruct on most general capabilities.

Size
small (8.0B params)
Context
131,072 tokens
Released
2024-08-14
Openness
open-weight
License
Llama 3.1 Community License (inherited) · commercial: conditional
Cost tier
mixed
Rating
4.0 — A high-quality fine-tune that meaningfully improves on Meta's own Instruct version for steerability, function calling, and roleplay — but it's still fundamentally an 8B model, so it inherits the capability ceiling of its base. Choose this over Meta's Instruct when you want more control over behavior; stick with Meta's when first-party alignment matters more.
Modalities
text
Capabilities
chat, function-calling, instruction-following, long-context, multilingual, tool-use
Access
api-third-party, local-runtime-llama-cpp, local-runtime-lm-studio, local-runtime-ollama, local-runtime-vllm, weights-download-hf

Quick Take

Nous Research's full-parameter fine-tuneA model that has been further trained on additional data to specialize it for a particular task, domain, or style. Fine-tuning a general model on medical literature produces a medical specialist; fine-tuning on your company's support tickets produces a support assistant that sounds like your team. Fine-tunes are much cheaper to create than training a model from scratch. of Llama 3.1 8B — trades first-party Meta alignment for better steerability, stronger function calling, and a more flexible prompt format.

Plain-English Description

Hermes 3 is Nous Research's fine-tuned version of Meta's Llama 3.1 8B base modelA model straight out of pretraining, before any fine-tuning for chat or specific tasks. Base models predict the next token but don't follow instructions well — they'll continue your prompt rather than respond to it. Most people never use base models directly; they use the instruct-tuned or chat versions built on top. Useful mostly for researchers and people doing their own fine-tuning.. The quick version: same 8 billion parameters, same 128K context windowThe maximum amount of text the model can "see" at once — prompt plus prior conversation plus any documents you give it. Measured in tokens (which are roughly three-quarters of a word each). A 128K context window is about 96,000 words of input — roughly a 400-page book. Larger context windows let the model work with bigger documents but cost more to run., same general capability ceiling, but different personality and different strengths in how it responds to you. Nous took Meta's base Llama 3.1 8B (not the Instruct variant — they started from the raw pretrained model) and did their own instruction tuning on it, with a specific focus on function calling, structured output (JSON mode), and steerability via system prompts.

"Steerability" is the word that matters most here. Meta's own Llama 3.1 8B Instruct is tuned to Meta's standards for what a helpful, harmless assistant should be — reasonable defaults but relatively opinionated about what it will and won't do, and not especially responsive to attempts to change its voice through system prompts. Hermes is tuned in the opposite direction: more willing to adopt whatever persona, role, or behavioral rules you specify in the system prompt, and less likely to break character in the middle of a session. That's a feature if you're building an application where you want tight control over the model's behavior, and potentially a problem if you need the model to consistently refuse certain categories of request regardless of what your users prompt it with.

The other practical difference is the prompt format. Hermes uses ChatML — the same prompt format OpenAI's API uses — which makes it drop-in compatible with a lot of tooling that already expects that format. Meta's Instruct versions use their own Llama prompt format. If you're switching between multiple models in your stack, the ChatML compatibility is convenient.

Best For

  • Applications where system-prompt steerability is a feature — you want the model to take on specific personas, follow detailed behavioral rules, or operate as part of an agent framework with defined roles
  • Function calling and tool use — Nous has invested specifically in making these reliable, and the model ships with documented function-calling templates
  • Developers already building on OpenAI-compatible tooling who want a drop-in open-weightA model where the trained weights are freely downloadable — you can run it yourself without contacting the creator. Llama, Mistral, Qwen, and Gemma are open-weight. Open-weight does not mean open-source: the training data and code often stay private. The license still governs what you can do with the weights, including whether you can use them commercially. alternative with matching prompt format
  • Roleplay, interactive fiction, and creative applications where Meta's first-party alignment is too restrictive for the use case
  • Self-hostedRunning a model on hardware you control — your own servers, your own cloud instance, or your own laptop — rather than paying to access it through someone else's API. Self-hosting gives you full control over data and predictable costs, but requires the hardware and operational effort to run the model. Only possible with open-weight models. deployment where you want a fine-tuneA model that has been further trained on additional data to specialize it for a particular task, domain, or style. Fine-tuning a general model on medical literature produces a medical specialist; fine-tuning on your company's support tickets produces a support assistant that sounds like your team. Fine-tunes are much cheaper to create than training a model from scratch. that's widely discussed and documented in the open-sourceA stricter standard than open-weight: the weights, the training code, and the training data are all released publicly. Very few large language models meet the full open-source bar — most "open" models in the AI world are actually open-weight. When in doubt, check the license file and the creator's documentation. community

Not For

  • Consumer-facing applications in regulated industries where you need the stricter refusal behavior of a first-party instruction-tuned model
  • Any use case where "Llama 3.1 8B was too small" — Hermes is the same model underneath, with the same capability ceiling
  • Teams who don't want to think about license inheritance — the Llama 3.1 Community License still governs this model, and you still need to display "Built with Llama" attribution
  • Organizations above 700M monthly active users (the underlying Llama license applies)
  • Businesses that need long-term support guarantees — Nous Research is a well-funded lab, but Hermes 3 is feature-frozen and the next-generation Hermes 4 series has already started shipping elsewhere

License — Plain-English Summary

The license situation here is actually simple, once you understand the inheritance: Nous Research publishes Hermes 3 under the Llama 3.1 Community License from Meta. They didn't add new restrictions; they didn't relicense it under something more permissive; the license is Meta's, and all of Meta's terms apply. That means: free for commercial use unless you had more than 700M monthly active users on July 23, 2024; must display "Built with Llama" attribution; must include the license file when redistributing; no using it to train non-Llama foundation models; standard prohibited uses (CSAM, illegal activity, military weapons development). For the vast majority of businesses, this is permissive commercial use.

How It Compares

  • Llama 3.1 8B Instruct (see Llama 3.1 8B Instruct — Meta's own fine-tuneA model that has been further trained on additional data to specialize it for a particular task, domain, or style. Fine-tuning a general model on medical literature produces a medical specialist; fine-tuning on your company's support tickets produces a support assistant that sounds like your team. Fine-tunes are much cheaper to create than training a model from scratch. of the same base modelA model straight out of pretraining, before any fine-tuning for chat or specific tasks. Base models predict the next token but don't follow instructions well — they'll continue your prompt rather than respond to it. Most people never use base models directly; they use the instruct-tuned or chat versions built on top. Useful mostly for researchers and people doing their own fine-tuning.; tighter default alignment, same hardware, same license, Meta's own prompt format instead of ChatML)
  • Llama 3.3 70B Instruct (see Llama 3.3 70B Instruct — if Hermes 8B's capability ceiling is the issue, the larger base model is the answer, not a different fine-tune of the same base)
  • Other Hermes 3 variants (70B and 405B versions exist, built on Llama 3.1's larger models — same Nous Research post-trainingAny training that happens after pretraining to make a base model useful for real tasks. Includes instruction tuning, chat tuning, and alignment work. Post-training is dramatically cheaper than pretraining — thousands to low millions rather than tens of millions. Most of what distinguishes GPT-4 from Llama 3.1 as a product, rather than as a base capability, is post-training. approach scaled up; same license inheritance pattern)

Under the Hood

Hermes 3 is a full-parameter fine-tuneA model that has been further trained on additional data to specialize it for a particular task, domain, or style. Fine-tuning a general model on medical literature produces a medical specialist; fine-tuning on your company's support tickets produces a support assistant that sounds like your team. Fine-tunes are much cheaper to create than training a model from scratch. of Llama 3.1 8B base, not a LoRAA lightweight fine-tuning method that adds a small number of new parameters to a frozen base model rather than retraining the whole thing. LoRA adapters are tiny (often a few hundred megabytes versus the base model's tens of gigabytes), fast to train, and can be swapped in and out. Useful when you want many specialized variants of the same base model without storing a full copy for each. or adapter. Training used Nous Research's own post-trainingAny training that happens after pretraining to make a base model useful for real tasks. Includes instruction tuning, chat tuning, and alignment work. Post-training is dramatically cheaper than pretraining — thousands to low millions rather than tens of millions. Most of what distinguishes GPT-4 from Llama 3.1 as a product, rather than as a base capability, is post-training. pipeline with a focus on instruction-following, agentic behaviors, ChatML formatting, and function calling. The model retains Llama 3.1's architectural details — dense decoderThe part of a model that generates output, one token at a time, from an internal representation. Chat models are almost all decoder-only architectures — they take your prompt, process it, and stream out a response token by token. "Decoder-only" is the technical name for the family most people just call "chatbots."-only transformerThe core model architecture that powers nearly every modern AI language model. Introduced by Google researchers in 2017, it uses a mechanism called attention to process text by looking at every word in context with every other word simultaneously, rather than one at a time. "Transformer" is the T in GPT, BERT, and most other model names., Grouped-Query AttentionThe mechanism inside a Transformer that lets the model weigh which parts of the input matter most when processing each word. When you read "the cat sat on the mat," attention is how the model knows that "it" in a later sentence refers back to the cat, not the mat. Attention is what made modern language models possible., 128K context windowThe maximum amount of text the model can "see" at once — prompt plus prior conversation plus any documents you give it. Measured in tokens (which are roughly three-quarters of a word each). A 128K context window is about 96,000 words of input — roughly a 400-page book. Larger context windows let the model work with bigger documents but cost more to run. — and the December 2023 pretrainingThe first and most expensive phase of training a model, where it learns general language and knowledge from enormous datasets — typically trillions of tokens of text scraped from the internet, books, code, and other sources. Pretraining produces a base model. Major labs spend millions to hundreds of millions of dollars on a single pretraining run. knowledge cutoff of the base. Native function calling is supported via a documented JSON schema approach (see Nous Research's Hermes-Function-Calling GitHub repository for current templates). ChatML is the default prompt format. Official GGUF quantizations are published by Nous Research themselves, making Ollama and llama.cpp deployment straightforward. The Hermes 3 Technical Report (arXiv:2408.11857) documents the training approach and evaluation methodology.

Cost

Self-hosted cost
$0.00 beyond compute
API providers
openrouter, lambda-labs, fireworks, together
Notes
API pricing varies by provider and is not consistently published for Hermes variants; check providers directly. Self-hosting is the more common deployment pattern for Hermes models. Official GGUF quantizations are published by Nous Research themselves for direct download.

Hardware requirements

Min VRAM
6 GB
Recommended VRAM
16 GB
Runs on laptop
Yes
Notes
Same hardware profile as Llama 3.1 8B — 4-bit quantized runs on 6GB cards, full precision wants ~16GB. Official GGUF quantizations from Nous Research are available for direct use in llama.cpp, Ollama, and LM Studio.

Comparable models

Commercial-use conditions

Licensing inherits directly from Meta's Llama 3.1 base model. Free for commercial use unless your product had more than 700 million monthly active users on July 23, 2024. Past that threshold, a separate Meta license is required. Nous Research has not added restrictions beyond the base Llama license.

Sources