TokenMix Research Lab · 2026-06-22

all-MiniLM-L6-v2: Free Local Embedding Model Guide 2026

Last Updated: 2026-06-22
Author: TokenMix Research Lab
Data verified: 2026-06-22 (Hugging Face sentence-transformers model card, sbert.net pretrained models docs, MTEB leaderboard, BAAI bge-small and gte-small cards, OpenAI embedding pricing, third-party RAG evaluations)

all-MiniLM-L6-v2 is a free, Apache-2.0 sentence-embedding model that turns text into 384-dimension vectors, has about 23 million parameters, and runs locally on a laptop CPU with no API key (Hugging Face model card). It is the default embedding model in much of the open-source RAG stack, but it trails newer same-dimension models like bge-small and gte-small by roughly 5 to 6 MTEB points, caps usable input at 256 tokens, and scores below a paid API like OpenAI's text-embedding-3-small out of the box (OpenAI embeddings).

This guide separates confirmed specs from approximate community figures and tags each claim Confirmed, Likely, or False. Core specifications come from the official Hugging Face model card; comparison scores come from the MTEB leaderboard and individual model cards, which do not all use the same MTEB version, so treat cross-model point gaps as directional rather than exact.

Quick Verdict

all-MiniLM-L6-v2 is the best free, fast, runs-anywhere embedding model for getting a RAG prototype working, and the one to outgrow once retrieval quality matters. It costs nothing and is small; it is also several MTEB points behind the newer alternatives compared below.

| Claim | Status | Source |
| --- | --- | --- |
| Outputs 384-dimension embeddings | Confirmed | HF model card |
| ~22.7M parameters | Confirmed | HF model card |
| Apache-2.0, free for commercial use | Confirmed | HF model card |
| Caps usable input at 256 word-pieces | Confirmed | HF model card |
| Runs locally on CPU, no API key | Confirmed | sbert.net docs |
| ~5x faster than all-mpnet-base-v2 | Confirmed | sbert.net docs |
| MTEB average around 56 | Likely (aggregator) | Galileo RAG guide |
| Beaten by bge-small and gte-small on MTEB | Confirmed | bge-small card, gte-small card |
| It is a paid hosted API model | False | It is a self-hosted open model, no per-token cost |
| TokenMix serves all-MiniLM-L6-v2 | False | Self-hosted models are not relayed; TokenMix lists hosted API models (TokenMix models) |

The short answer: use all-MiniLM-L6-v2 when free, local, and fast beat last-mile accuracy. Switch to bge-small, gte-small, or a paid API like text-embedding-3-small when retrieval quality starts costing you answers.

What all-MiniLM-L6-v2 Is

all-MiniLM-L6-v2 is a sentence-transformer model that maps a sentence or short paragraph to a single 384-dimension vector for semantic search, clustering, and RAG retrieval. It was built by the sentence-transformers team, distilled from a 12-layer BERT teacher into a 6-layer MiniLM student, and trained on more than one billion sentence pairs across 28-plus datasets including Reddit comments, S2ORC, MS MARCO, and Stack Exchange (Hugging Face model card).

The reason it became the default is simple: it is tiny, fast, permissively licensed, and good enough. It ships as the out-of-the-box choice in many LangChain, Chroma, and sentence-transformers tutorials, which is why "all-minilm-l6-v2" gets thousands of monthly searches even though newer models score higher. For the paid-API side of this decision, see the text-embedding-3-small developer guide and the full text embedding models comparison.

Specifications

The spec sheet is the model's whole pitch: small, normalized, and cheap to run. Every figure below is from the official model card unless noted.

| Field | Value | Status |
| --- | --- | --- |
| Embedding dimension | 384 | Confirmed |
| Parameters | ~22.7M | Confirmed |
| Base architecture | MiniLM, 6 layers (distilled from BERT) | Confirmed |
| Max input | 256 word-pieces (longer is truncated) | Confirmed |
| Output | L2-normalized (cosine-ready) | Confirmed |
| License | Apache-2.0 | Confirmed |
| Training data | 1B+ sentence pairs, 28+ datasets | Confirmed |
| Model size on disk | ~90 MB | Likely (aggregator) |
| Languages | English-focused | Confirmed |

One trap to flag: the tokenizer config technically allows 512 tokens, but the model card states the model truncates beyond 256, and it was trained at 128, so treat 256 as the real ceiling and chunk longer documents accordingly. The L2-normalized output means you can use a plain dot product as cosine similarity, which simplifies vector-DB setup.
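One way to respect the 256 word-piece ceiling is to split long documents into overlapping windows before encoding. The following is a minimal sketch, not a recommended configuration: it assumes whitespace-separated words are a rough stand-in for word-pieces, and the helper name `chunk_words`, the 180-word budget, and the 30-word overlap are illustrative assumptions rather than values from the model card. Word-piece counts usually run higher than word counts, so the budget is deliberately conservative.

```python
def chunk_words(text, max_words=180, overlap=30):
    """Split text into overlapping word windows (hypothetical helper).

    Whitespace words are only a proxy for word-pieces, so max_words is kept
    well under the 256 word-piece ceiling. Both defaults are assumptions.
    """
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks


# A synthetic 400-word document, just to exercise the splitter.
doc = " ".join(f"w{i}" for i in range(400))
chunks = chunk_words(doc)
```

For exact budgets, count tokens with the model's own tokenizer instead of whitespace words.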

Performance and Benchmarks

On quality, all-MiniLM-L6-v2 is a solid efficiency-class baseline, not a leader, and the honest framing is "fast and free, a few points behind." The headline trade is speed: around 14,000 sentences per second on a GPU, roughly 5x faster than the heavier all-mpnet-base-v2 (sbert.net docs).

| Metric | Value | Status | Source |
| --- | --- | --- | --- |
| MTEB average | ~56 | Likely | Galileo RAG guide |
| ArguAna (retrieval) | 50.17 | Confirmed | HF model card |
| Encoding speed (GPU) | ~14,000 sentences/sec | Confirmed | sbert.net docs |
| Speed vs all-mpnet-base-v2 | ~5x faster | Confirmed | sbert.net docs |
| STS quality vs mpnet | ~84-85% vs ~87-88% | Likely | Milvus comparison |

The MTEB average near 56 is an aggregator figure, not on the official card, so read it as approximate. The pattern holds across sources: all-MiniLM gives up some retrieval accuracy for a large speed and size advantage. For most prototypes that trade is correct; for production retrieval where a missed chunk means a wrong answer, the gap matters.
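To put the speed figure in concrete terms, here is some back-of-envelope arithmetic. It assumes the quoted ~14,000 sentences/sec GPU rate holds for your hardware and text lengths, which it may not, and it derives the all-mpnet-base-v2 rate from the ~5x ratio rather than from a measured number. The function name is hypothetical.

```python
def encode_seconds(num_sentences, sentences_per_sec):
    """Estimated wall-clock seconds to embed a corpus at a fixed throughput."""
    return num_sentences / sentences_per_sec


MINILM_RATE = 14_000          # sentences/sec, GPU figure quoted above
MPNET_RATE = MINILM_RATE / 5  # implied by the ~5x speed ratio (an estimate)

minilm_s = encode_seconds(1_000_000, MINILM_RATE)
mpnet_s = encode_seconds(1_000_000, MPNET_RATE)
print(f"1M sentences: ~{minilm_s:.0f}s on all-MiniLM, ~{mpnet_s:.0f}s on all-mpnet")
```

Under those assumptions, a million sentences take a little over a minute on all-MiniLM versus about six minutes on the heavier model; CPU numbers will be much lower for both.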

all-MiniLM vs bge-small vs gte-small vs OpenAI

Compared head-to-head with its natural rivals, all-MiniLM-L6-v2 is the smallest and fastest but the lowest scoring, and both same-dimension free peers beat it. All four below output small vectors; three are free and self-hosted, one is a paid API.

| Model | Dims | Params | Max tokens | MTEB avg | Cost |
| --- | --- | --- | --- | --- | --- |
| all-MiniLM-L6-v2 | 384 | 22.7M | 256 | ~56 | Free / self-host |
| gte-small | 384 | 33.4M | 512 | 61.36 | Free / self-host |
| bge-small-en-v1.5 | 384 | 33.4M | 512 | 62.17 | Free / self-host |
| text-embedding-3-small | 1536 (reducible to 256) | n/a | 8191 | 62.3 | $0.02 / 1M tokens |

The cleanest upgrade inside the free tier is bge-small or gte-small: same 384 dimensions, but roughly 5 to 6 MTEB points higher and double the context at 512 tokens, for a modest size increase (bge-small card, gte-small card). If you want the highest quality and do not mind paying, text-embedding-3-small tops the group and handles 8,191 tokens of context. See the bge embeddings guide and Gemini vs OpenAI embeddings for the deeper cuts.

Cost: Free Local vs Paid API

all-MiniLM-L6-v2 has a $0 per-token cost, which is its single biggest advantage at scale; the only cost is the compute you already run. The comparison that matters is your own hardware versus a paid embedding API's per-token bill.

| Workload | all-MiniLM (local) | text-embedding-3-small (API) |
| --- | --- | --- |
| Embed 1M chunks (~100 tokens each) | $0 + compute | ~$2.00 |
| Embed 10M docs (~100 tokens each) | $0 + compute | ~$20.00 |
| Embed 1B tokens total | $0 + compute | ~$20.00 |
| Re-embed entire corpus monthly | $0 + compute | recurring API bill |

The math is blunt: embedding one billion tokens costs about $20 on text-embedding-3-small and $0 in API fees on all-MiniLM, paid only in your own CPU or GPU time. For privacy-sensitive data, air-gapped systems, or high-volume re-embedding pipelines, free local inference is the rational default. The counterpoint: at small scale, a few dollars of API spend buys higher accuracy and zero infrastructure, so the local model only wins clearly once volume or privacy is the binding constraint.
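The API-side figures in the table follow from one multiplication, sketched here so you can plug in your own corpus size. It assumes the $0.02 per 1M tokens rate quoted above and ignores retries, taxes, and any volume pricing; the function name is hypothetical.

```python
PRICE_PER_MILLION_TOKENS = 0.02  # USD, text-embedding-3-small rate quoted above


def api_cost_usd(num_items, tokens_per_item, price_per_million=PRICE_PER_MILLION_TOKENS):
    """API embedding cost in USD for num_items inputs of a given average length."""
    total_tokens = num_items * tokens_per_item
    return round(total_tokens / 1_000_000 * price_per_million, 2)


print(api_cost_usd(1_000_000, 100))    # 1M chunks at ~100 tokens each
print(api_cost_usd(10_000_000, 100))   # 10M docs at ~100 tokens each
print(api_cost_usd(1_000_000_000, 1))  # 1B tokens total
```

The local column has no equivalent formula because its cost depends entirely on hardware you already run.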

How to Use It

Getting embeddings out of all-MiniLM-L6-v2 takes three lines of Python, and there are lighter runtimes for CPU and browser. The default path is the sentence-transformers library.

| Path | What it gives you | Best for |
| --- | --- | --- |
| sentence-transformers | encode() one-liner, pooling handled | Most Python projects |
| transformers | Manual mean-pooling + L2 norm | Custom pipelines |
| ONNX / onnxruntime | Lightweight CPU and browser inference | Edge, serverless, JS |
| GGUF / CT2 builds | Quantized local runners | Constrained hardware |

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
vectors = model.encode([
    "How do I reset my API key?",
    "Where can I rotate credentials?",
])
# vectors.shape -> (2, 384), already L2-normalized
```
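
For the transformers path in the table, the step that sentence-transformers handles for you is masked mean pooling followed by L2 normalization. The sketch below shows that arithmetic on toy data in plain Python, so it runs without downloading the model. The token vectors and helper names are made up for illustration; a real pipeline would apply the same two steps to the model's per-token outputs.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average token vectors, skipping padding positions (mask value 0)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask excludes every token")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def l2_normalize(vector):
    """Scale a vector to unit length so dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else vector


# Toy example: three token vectors of dimension 2; the last one is padding.
tokens = [[3.0, 0.0], [3.0, 8.0], [100.0, 100.0]]
mask = [1, 1, 0]

sentence_vector = l2_normalize(mean_pool(tokens, mask))
```

The padding vector is ignored by the mask, which is why its large values do not distort the result.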

For a JavaScript or serverless stack, the ONNX build via transformers.js (Xenova/all-MiniLM-L6-v2) runs the same model in-browser or in an edge function with no Python at all. Store the 384-dim vectors in any vector database and use cosine or dot-product similarity.
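
Because the vectors are unit length, ranking stored chunks against a query reduces to a dot product. Here is a minimal in-memory sketch, with made-up 3-dimension vectors standing in for real 384-dimension embeddings. The `top_k` helper and the chunk IDs are hypothetical, and a vector database would replace the linear scan.

```python
def dot(a, b):
    """Dot product; equals cosine similarity when both vectors are unit length."""
    return sum(x * y for x, y in zip(a, b))


def top_k(query_vec, index, k=2):
    """Return the k (chunk_id, score) pairs with the highest similarity."""
    scored = [(chunk_id, dot(query_vec, vec)) for chunk_id, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy unit-length vectors (illustrative values, not real embeddings).
index = {
    "reset-api-key": [1.0, 0.0, 0.0],
    "rotate-credentials": [0.8, 0.6, 0.0],
    "billing-faq": [0.0, 0.0, 1.0],
}
query = [0.6, 0.8, 0.0]

results = top_k(query, index)
```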

When to Use It and When to Upgrade

Use all-MiniLM-L6-v2 by default for prototypes and low-stakes retrieval, and upgrade on a clear trigger rather than by reflex. The decision is mostly about whether retrieval accuracy is currently costing you correct answers.

| Situation | Recommendation |
| --- | --- |
| RAG prototype, small corpus | Use all-MiniLM-L6-v2 |
| Latency-critical, high QPS | Use all-MiniLM-L6-v2 |
| Privacy / air-gapped data | Use all-MiniLM-L6-v2 (local) |
| Retrieval misses hurting answers | Upgrade to bge-small / gte-small |
| Long documents (>256 tokens) | Upgrade (512-token model) |
| Need best accuracy, can pay | Use text-embedding-3-small |
| Multilingual retrieval | Use a multilingual model (e.g. bge-m3) |

The honest rule: start free and local, measure retrieval quality on your own data, and only pay or switch when the numbers say the misses are real. Many teams never need to upgrade; the ones that do usually hit either the 256-token ceiling or an accuracy wall on domain-specific text.
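
Measuring retrieval quality on your own data can start with a hit rate at k over a small hand-labeled query set: the share of queries whose top-k results include at least one relevant chunk. The sketch below assumes you already have ranked chunk IDs per query from your retriever. The function name, query IDs, and labels are all hypothetical.

```python
def hit_rate_at_k(retrieved, relevant, k=5):
    """Fraction of labeled queries whose top-k results contain a relevant ID."""
    if not relevant:
        raise ValueError("need at least one labeled query")
    hits = 0
    for query_id, relevant_ids in relevant.items():
        top = retrieved.get(query_id, [])[:k]
        if any(doc_id in relevant_ids for doc_id in top):
            hits += 1
    return hits / len(relevant)


# Hypothetical labels and ranked retrieval output for three queries.
relevant = {"q1": {"d1"}, "q2": {"d7"}, "q3": {"d3", "d4"}}
retrieved = {
    "q1": ["d1", "d9", "d2"],
    "q2": ["d5", "d6", "d7"],
    "q3": ["d8", "d2", "d0"],
}

score_at_1 = hit_rate_at_k(retrieved, relevant, k=1)
score_at_3 = hit_rate_at_k(retrieved, relevant, k=3)
```

Run it once per candidate model on the same labeled set, and treat a switch as justified only when the gap is consistent across queries that matter to you.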

Where all-MiniLM-L6-v2 Loses

all-MiniLM-L6-v2 loses on accuracy ceiling, context length, and multilingual coverage. These are the predictable costs of a 23M-parameter English model.

| Weak spot | Evidence | Pick instead |
| --- | --- | --- |
| Lower MTEB than peers | ~56 vs ~61 to 62 for bge/gte/OpenAI | bge-small-en-v1.5, gte-small |
| 256-token input cap | Model card truncation note | 512-token or 8K-token model |
| English-focused | Trained mainly on English | bge-m3 or multilingual-e5 |
| Older architecture | 2021-era MiniLM base | Newer small embedding models |
| No managed API | Self-host only | text-embedding-3-small for hands-off |

None of these make it a bad choice; they make it a starting choice. The model's job is to get you to a working retrieval pipeline at zero cost, and it does that better than almost anything. The day retrieval quality, document length, or another language becomes the bottleneck is the day to move.

Use Case Matrix

Match the model to the stakes: prototypes and high-volume jobs favor all-MiniLM, accuracy-critical production favors an upgrade.

| Use case | all-MiniLM fit | Better alternative | Why |
| --- | --- | --- | --- |
| RAG prototype | Strong | none needed | Free, fast, three lines of code |
| Semantic search (English) | Strong | bge-small if accuracy matters | Good baseline retrieval |
| High-volume embedding pipeline | Strong | none needed | $0 per-token at scale |
| Edge / browser inference | Strong | none needed | ONNX build is tiny |
| Domain-specific retrieval | Medium | bge-small / fine-tuned model | Accuracy ceiling shows |
| Long-document chunks | Weak | 512-token model | 256-token cap |
| Multilingual app | Weak | bge-m3 / multilingual-e5 | English-focused training |
| Hands-off managed API | Weak | text-embedding-3-small | No infra, higher score |

If your real problem is choosing and routing across many AI models and APIs rather than one embedding model, pair this with the text embedding models comparison and the AI API gateway guide.

Final Recommendation

Use all-MiniLM-L6-v2 as the default for prototypes, latency-critical retrieval, privacy-bound data, and high-volume embedding where its $0 per-token cost compounds. Upgrade within the free tier to bge-small or gte-small the moment retrieval accuracy or the 256-token cap starts hurting, and move to a paid API like text-embedding-3-small when you want the highest score with zero infrastructure. It is the right place to start and the wrong place to stop if quality becomes the bottleneck.

FAQ

What is all-MiniLM-L6-v2 used for?

It converts sentences or short paragraphs into 384-dimension vectors for semantic search, retrieval-augmented generation, clustering, and sentence similarity. It is one of the most widely used default embedding models in open-source RAG stacks.

Is all-MiniLM-L6-v2 free?

Yes. It is released under the Apache-2.0 license, free for commercial use, and runs locally with no API key or per-token cost. Your only cost is the compute you run it on.

How many dimensions does all-MiniLM-L6-v2 output?

384 dimensions. The vectors are L2-normalized, so you can use a dot product directly as cosine similarity in any vector database.

What is the max input length for all-MiniLM-L6-v2?

The model card states it truncates input beyond 256 word-pieces, and it was trained at 128 tokens. Although the tokenizer config allows 512, treat 256 as the practical ceiling and chunk longer text.

Is all-MiniLM-L6-v2 better than OpenAI embeddings?

No, not on accuracy. OpenAI's text-embedding-3-small scores higher on MTEB (about 62 vs 56) and handles far longer context. all-MiniLM wins on cost (free) and on local, private, low-latency inference.

all-MiniLM-L6-v2 vs bge-small: which should I use?

bge-small-en-v1.5 scores about 5 to 6 MTEB points higher at the same 384 dimensions and supports 512-token input, for a small size increase. Use all-MiniLM for maximum speed and minimum size; use bge-small when retrieval accuracy matters more.

Can I run all-MiniLM-L6-v2 in the browser?

Yes. The ONNX build via transformers.js runs the model in-browser or in serverless edge functions with no Python, which is one reason it is popular for lightweight and client-side semantic search.

Does TokenMix provide all-MiniLM-L6-v2?

No. all-MiniLM is a self-hosted open model, not a relayed API model. TokenMix routes hosted LLM, image, and video APIs through one endpoint; for embeddings you would self-host all-MiniLM or call a hosted embedding API.

About TokenMix

TokenMix.ai is an AI API relay that routes Claude, OpenAI, Gemini, DeepSeek, Qwen, and other large language models through a single OpenAI-compatible endpoint at https://api.tokenmix.ai/v1. Current model availability and per-token rates are listed on the pricing page and the model catalog. Integration uses the standard OpenAI SDK; details in the OpenAI compatibility reference. all-MiniLM-L6-v2 is a free self-hosted model rather than a relayed API, so this guide is published as independent model intelligence.
