TokenMix Research Lab · 2026-06-22

all-MiniLM-L6-v2: Free Local Embedding Model Guide 2026
Last Updated: 2026-06-22 · Author: TokenMix Research Lab · Data verified: 2026-06-22 against the Hugging Face sentence-transformers model card, sbert.net pretrained models docs, MTEB leaderboard, BAAI bge-small and gte-small cards, OpenAI embedding pricing, and third-party RAG evaluations
all-MiniLM-L6-v2 is a free, Apache-2.0 sentence-embedding model that turns text into 384-dimension vectors, has about 23 million parameters, and runs locally on a laptop CPU with no API key (Hugging Face model card). It is the default embedding model in much of the open-source RAG stack, but it has three limits: it trails newer same-dimension models like bge-small and gte-small by roughly 5 to 6 MTEB points, it caps usable input at 256 tokens, and a paid API like OpenAI's text-embedding-3-small scores higher out of the box (OpenAI embeddings).
This guide separates confirmed specs from approximate community figures and tags each claim Confirmed, Likely, or vendor-reported. Core specifications come from the official Hugging Face model card; comparison scores come from the MTEB leaderboard and individual model cards, which do not all use the same MTEB version, so treat cross-model point gaps as directional rather than exact.
Table of Contents
- Quick Verdict
- What all-MiniLM-L6-v2 Is
- Specifications
- Performance and Benchmarks
- all-MiniLM vs bge-small vs gte-small vs OpenAI
- Cost: Free Local vs Paid API
- How to Use It
- When to Use It and When to Upgrade
- Where all-MiniLM-L6-v2 Loses
- Use Case Matrix
- Final Recommendation
- FAQ
- About TokenMix
- Sources
- Related Articles
Quick Verdict
all-MiniLM-L6-v2 is the best free, fast, runs-anywhere embedding model for getting a RAG prototype working, and the one to outgrow once retrieval quality matters. It costs nothing and is small; it is also roughly 5 to 6 MTEB points behind each of the newer alternatives compared in this guide.
| Claim | Status | Source |
|---|---|---|
| Outputs 384-dimension embeddings | Confirmed | HF model card |
| ~22.7M parameters | Confirmed | HF model card |
| Apache-2.0, free for commercial use | Confirmed | HF model card |
| Caps usable input at 256 word-pieces | Confirmed | HF model card |
| Runs locally on CPU, no API key | Confirmed | sbert.net docs |
| ~5x faster than all-mpnet-base-v2 | Confirmed | sbert.net docs |
| MTEB average around 56 | Likely (aggregator) | Galileo RAG guide |
| Beaten by bge-small and gte-small on MTEB | Confirmed | bge-small card, gte-small card |
| It is a paid hosted API model | False | It is a self-hosted open model, no per-token cost |
| TokenMix serves all-MiniLM-L6-v2 | False | Self-hosted models are not relayed; TokenMix lists hosted API models (TokenMix models) |
The short answer: use all-MiniLM-L6-v2 when free, local, and fast beat last-mile accuracy. Switch to bge-small, gte-small, or a paid API like text-embedding-3-small when retrieval quality starts costing you answers.
What all-MiniLM-L6-v2 Is
all-MiniLM-L6-v2 is a sentence-transformer model that maps a sentence or short paragraph to a single 384-dimension vector for semantic search, clustering, and RAG retrieval. It was built by the sentence-transformers team, distilled from a 12-layer BERT teacher into a 6-layer MiniLM student, and trained on more than one billion sentence pairs across 28-plus datasets including Reddit comments, S2ORC, MS MARCO, and Stack Exchange (Hugging Face model card).
The reason it became the default is simple: it is tiny, fast, permissively licensed, and good enough. It ships as the out-of-the-box choice in many LangChain, Chroma, and sentence-transformers tutorials, which is why "all-minilm-l6-v2" gets thousands of monthly searches even though newer models score higher. For the paid-API side of this decision, see the text-embedding-3-small developer guide and the full text embedding models comparison.
Specifications
The spec sheet is the model's whole pitch: small, normalized, and cheap to run. Every figure below is from the official model card unless noted.
| Field | Value | Status |
|---|---|---|
| Embedding dimension | 384 | Confirmed |
| Parameters | ~22.7M | Confirmed |
| Base architecture | MiniLM, 6 layers (distilled from BERT) | Confirmed |
| Max input | 256 word-pieces (longer is truncated) | Confirmed |
| Output | L2-normalized (cosine-ready) | Confirmed |
| License | Apache-2.0 | Confirmed |
| Training data | 1B+ sentence pairs, 28+ datasets | Confirmed |
| Model size on disk | ~90 MB | Likely (aggregator) |
| Languages | English-focused | Confirmed |
One trap to flag: the tokenizer config technically allows 512 tokens, but the model card states the model truncates beyond 256, and it was trained at 128, so treat 256 as the real ceiling and chunk longer documents accordingly. The L2-normalized output means you can use a plain dot product as cosine similarity, which simplifies vector-DB setup.
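To make "chunk longer documents accordingly" concrete, here is a minimal sketch of sliding-window chunking. It is an illustration under stated assumptions, not the model's own preprocessing: `chunk_words`, `max_words`, and `overlap` are hypothetical names, and whitespace-separated words stand in for word-pieces. Because one word can split into several word-pieces, the 180-word budget is a guess chosen to stay under 256, so verify real counts with the model's tokenizer before relying on it.

```python
def chunk_words(text, max_words=180, overlap=30):
    """Split text into overlapping windows of at most max_words words.

    Assumption: whitespace words approximate word-pieces. The default
    budget of 180 is a hypothetical safety margin below the 256 cap.
    """
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once this window has reached the end of the text
        if start + max_words >= len(words):
            break
    return chunks


# Toy document of 400 words: w0 w1 ... w399
document = " ".join(f"w{i}" for i in range(400))
chunks = chunk_words(document)
print(len(chunks), [len(c.split()) for c in chunks])
```

The overlap keeps a sentence that straddles a boundary visible in both neighboring chunks, at the cost of embedding some text twice.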
Performance and Benchmarks
On quality, all-MiniLM-L6-v2 is a solid efficiency-class baseline, not a leader, and the honest framing is "fast and free, a few points behind." The headline trade is speed: around 14,000 sentences per second on a GPU, roughly 5x faster than the heavier all-mpnet-base-v2 (sbert.net docs).
| Metric | Value | Status | Source |
|---|---|---|---|
| MTEB average | ~56 | Likely | Galileo RAG guide |
| ArguAna (retrieval) | 50.17 | Confirmed | HF model card |
| Encoding speed (GPU) | ~14,000 sentences/sec | Confirmed | sbert.net docs |
| Speed vs all-mpnet-base-v2 | ~5x faster | Confirmed | sbert.net docs |
| STS quality vs mpnet | ~84-85% vs ~87-88% | Likely | Milvus comparison |
The MTEB average near 56 is an aggregator figure, not on the official card, so read it as approximate. The pattern holds across sources: all-MiniLM gives up some retrieval accuracy for a large speed and size advantage. For most prototypes that trade is correct; for production retrieval where a missed chunk means a wrong answer, the gap matters.
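As a rough worked example of what the speed gap means, the sketch below turns the quoted throughput into wall-clock time for a hypothetical 10-million-sentence corpus. The 14,000 sentences-per-second figure and the 5x ratio are the approximate numbers cited above; real throughput depends on your GPU, batch size, and text length, so treat the output as an order-of-magnitude estimate.

```python
GPU_SENTENCES_PER_SEC = 14_000  # approximate figure quoted above (sbert.net docs)
SPEEDUP_VS_MPNET = 5            # approximate ratio quoted above


def encode_seconds(num_sentences, sentences_per_sec=GPU_SENTENCES_PER_SEC):
    """Estimated encoding time in seconds at a fixed throughput."""
    return num_sentences / sentences_per_sec


corpus_size = 10_000_000  # hypothetical corpus size
minilm_minutes = encode_seconds(corpus_size) / 60
mpnet_minutes = encode_seconds(
    corpus_size, GPU_SENTENCES_PER_SEC / SPEEDUP_VS_MPNET
) / 60

print(f"all-MiniLM-L6-v2: ~{minilm_minutes:.0f} min")
print(f"all-mpnet-base-v2: ~{mpnet_minutes:.0f} min")
```

Under these assumptions the smaller model finishes in about 12 minutes against roughly an hour for the heavier one.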
all-MiniLM vs bge-small vs gte-small vs OpenAI
Compared head-to-head with its natural rivals, all-MiniLM-L6-v2 is the smallest and fastest but the lowest scoring: both free same-dimension peers beat it. Three of the four models below are free, self-hosted, and output 384-dimension vectors; the fourth is a paid API whose 1536-dimension vectors can be shortened.
| Model | Dims | Params | Max tokens | MTEB avg | Cost |
|---|---|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | 22.7M | 256 | ~56 | Free / self-host |
| gte-small | 384 | 33.4M | 512 | 61.36 | Free / self-host |
| bge-small-en-v1.5 | 384 | 33.4M | 512 | 62.17 | Free / self-host |
| text-embedding-3-small | 1536 (to 256) | n/a | 8191 | 62.3 | $0.02 / 1M tokens |
The cleanest upgrade inside the free tier is bge-small or gte-small: same 384 dimensions, but roughly 5 to 6 MTEB points higher and double the context at 512 tokens, for a modest size increase (bge-small card, gte-small card). If you want a managed option and do not mind paying, text-embedding-3-small posts the highest reported score in the table and handles 8,191 tokens of context, although its 0.13-point lead over bge-small is too small to read as a real difference given the differing MTEB versions. See the bge embeddings guide and Gemini vs OpenAI embeddings for the deeper cuts.
Cost: Free Local vs Paid API
all-MiniLM-L6-v2 has a $0 per-token cost, which is its single biggest advantage at scale; the only cost is the compute you already run. The comparison that matters is your own hardware versus a paid embedding API's per-token bill.
| Workload | all-MiniLM (local) | text-embedding-3-small (API) |
|---|---|---|
| Embed 1M chunks (~100 tokens each) | $0 + compute | ~$2.00 |
| Embed 10M docs (~100 tokens each) | $0 + compute | ~$20.00 |
| Embed 1B tokens total | $0 + compute | ~$20.00 |
| Re-embed entire corpus monthly | $0 + compute | recurring API bill |
The math is blunt: embedding one billion tokens costs about $20 on text-embedding-3-small and $0 in API fees on all-MiniLM, paid only in your own CPU or GPU time. For privacy-sensitive data, air-gapped systems, or high-volume re-embedding pipelines, free local inference is the rational default. The counterpoint: at small scale, a few dollars of API spend buys higher accuracy and zero infrastructure, so the local model only wins clearly once volume or privacy is the binding constraint.
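The API column in the table above is simple arithmetic, sketched here so you can plug in your own corpus size. The $0.02 per million tokens rate is the text-embedding-3-small price from the comparison table; `api_embedding_cost` is a hypothetical helper name, and local compute cost is deliberately left out because it depends on your hardware.

```python
PRICE_PER_MILLION_TOKENS_USD = 0.02  # text-embedding-3-small rate from the table above


def api_embedding_cost(num_chunks, tokens_per_chunk,
                       price_per_million=PRICE_PER_MILLION_TOKENS_USD):
    """API bill in USD for embedding num_chunks chunks once."""
    total_tokens = num_chunks * tokens_per_chunk
    return total_tokens / 1_000_000 * price_per_million


cost_1m_chunks = api_embedding_cost(1_000_000, 100)    # 100M tokens
cost_10m_docs = api_embedding_cost(10_000_000, 100)    # 1B tokens
yearly_monthly_reembed = 12 * cost_10m_docs            # re-embed 1B tokens every month

print(f"1M chunks: ${cost_1m_chunks:.2f}")
print(f"10M docs: ${cost_10m_docs:.2f}")
print(f"Monthly re-embed of 1B tokens, per year: ${yearly_monthly_reembed:.2f}")
```

The last line shows why re-embedding pipelines are the case where the recurring API bill starts to matter.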
How to Use It
Getting embeddings out of all-MiniLM-L6-v2 takes three lines of Python, and there are lighter runtimes for CPU and browser. The default path is the sentence-transformers library.
| Path | What it gives you | Best for |
|---|---|---|
| sentence-transformers | encode() one-liner, pooling handled | Most Python projects |
| transformers | Manual mean-pooling + L2 norm | Custom pipelines |
| ONNX / onnxruntime | Lightweight CPU and browser inference | Edge, serverless, JS |
| GGUF / CT2 builds | Quantized local runners | Constrained hardware |
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
vectors = model.encode([
    "How do I reset my API key?",
    "Where can I rotate credentials?",
])
# vectors.shape -> (2, 384), already L2-normalized
```
For a JavaScript or serverless stack, the ONNX build via transformers.js (Xenova/all-MiniLM-L6-v2) runs the same model in-browser or in an edge function with no Python at all. Store the 384-dim vectors in any vector database and use cosine or dot-product similarity.
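If you do not have a vector database yet, similarity search over a small corpus can be sketched in plain Python. This is a brute-force illustration, not how a vector database works internally: the helper names are hypothetical, and toy 3-dimension vectors stand in for the model's 384-dimension output. Because the vectors are L2-normalized, the dot product is the cosine similarity.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (the model's output already is)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def top_k(query_vec, doc_vecs, k=3):
    """Return (doc_index, score) pairs for the k highest dot products."""
    scored = [(dot(query_vec, d), i) for i, d in enumerate(doc_vecs)]
    scored.sort(reverse=True)
    return [(i, score) for score, i in scored[:k]]


# Toy 3-dim vectors standing in for 384-dim embeddings
docs = [l2_normalize(v) for v in (
    [1.0, 0.0, 0.0],
    [0.8, 0.6, 0.0],
    [0.0, 0.0, 1.0],
)]
query = l2_normalize([0.9, 0.1, 0.0])

results = top_k(query, docs, k=2)
print(results)
```

A linear scan like this is fine for thousands of vectors; past that, an approximate-nearest-neighbor index in a vector database is the usual choice.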
When to Use It and When to Upgrade
Use all-MiniLM-L6-v2 by default for prototypes and low-stakes retrieval, and upgrade on a clear trigger rather than by reflex. The decision is mostly about whether retrieval accuracy is currently costing you correct answers.
| Situation | Recommendation |
|---|---|
| RAG prototype, small corpus | Use all-MiniLM-L6-v2 |
| Latency-critical, high QPS | Use all-MiniLM-L6-v2 |
| Privacy / air-gapped data | Use all-MiniLM-L6-v2 (local) |
| Retrieval misses hurting answers | Upgrade to bge-small / gte-small |
| Long documents (>256 tokens) | Upgrade (512-token model) |
| Need best accuracy, can pay | Use text-embedding-3-small |
| Multilingual retrieval | Use a multilingual model (e.g. bge-m3) |
The honest rule: start free and local, measure retrieval quality on your own data, and only pay or switch when the numbers say the misses are real. Many teams never need to upgrade; the ones that do usually hit either the 256-token ceiling or an accuracy wall on domain-specific text.
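One way to "measure retrieval quality on your own data" is recall@k over a small hand-labeled set of queries. The sketch below assumes you already have, for each query, the ranked chunk IDs your retriever returned and the set of chunk IDs a human judged relevant; the function names and the toy IDs are hypothetical.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant IDs that appear in the top-k retrieved IDs."""
    if not relevant_ids:
        raise ValueError("each query needs at least one relevant ID")
    top = set(retrieved_ids[:k])
    return len(top & set(relevant_ids)) / len(relevant_ids)


def mean_recall_at_k(runs, k):
    """Average recall@k over (retrieved_ids, relevant_ids) pairs."""
    return sum(recall_at_k(ret, rel, k) for ret, rel in runs) / len(runs)


# Toy labeled queries: (ranked retriever output, human-judged relevant set)
runs = [
    (["c1", "c7", "c3"], {"c1"}),        # relevant chunk found
    (["c4", "c2", "c9"], {"c9", "c5"}),  # one of two found
    (["c8", "c6", "c0"], {"c2"}),        # miss
]

score = mean_recall_at_k(runs, k=3)
print(f"mean recall@3 = {score:.2f}")
```

Running the same labeled set against all-MiniLM-L6-v2 and a candidate replacement gives a like-for-like number on your own data, which is more informative than a leaderboard gap.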
Where all-MiniLM-L6-v2 Loses
all-MiniLM-L6-v2 loses on accuracy ceiling, context length, and multilingual coverage. These are the predictable costs of a 23M-parameter English model.
| Weak spot | Evidence | Pick instead |
|---|---|---|
| Lower MTEB than peers | ~56 vs ~61 to 62 for gte/bge/OpenAI | bge-small-en-v1.5, gte-small |
| 256-token input cap | Model card truncation note | 512-token or 8K-token model |
| English-focused | Trained mainly on English | bge-m3 or multilingual-e5 |
| Older architecture | 2021-era MiniLM base | Newer small embedding models |
| No managed API | Self-host only | text-embedding-3-small for hands-off |
None of these make it a bad choice; they make it a starting choice. The model's job is to get you to a working retrieval pipeline at zero cost, and it does that better than almost anything. The day retrieval quality, document length, or another language becomes the bottleneck is the day to move.
Use Case Matrix
Match the model to the stakes: prototypes and high-volume jobs favor all-MiniLM, accuracy-critical production favors an upgrade.
| Use case | all-MiniLM fit | Better alternative | Why |
|---|---|---|---|
| RAG prototype | Strong | none needed | Free, fast, three lines of code |
| Semantic search (English) | Strong | bge-small if accuracy matters | Good baseline retrieval |
| High-volume embedding pipeline | Strong | none needed | $0 per-token at scale |
| Edge / browser inference | Strong | none needed | ONNX build is tiny |
| Domain-specific retrieval | Medium | bge-small / fine-tuned model | Accuracy ceiling shows |
| Long-document chunks | Weak | 512-token model | 256-token cap |
| Multilingual app | Weak | bge-m3 / multilingual-e5 | English-focused training |
| Hands-off managed API | Weak | text-embedding-3-small | No infra, higher score |
If your real problem is choosing and routing across many AI models and APIs rather than one embedding model, pair this with the text embedding models comparison and the AI API gateway guide.
Final Recommendation
Use all-MiniLM-L6-v2 as the default for prototypes, latency-critical retrieval, privacy-bound data, and high-volume embedding where its $0 per-token cost compounds. Upgrade within the free tier to bge-small or gte-small the moment retrieval accuracy or the 256-token cap starts hurting, and move to a paid API like text-embedding-3-small when you want the highest score with zero infrastructure. It is the right place to start and the wrong place to stop if quality becomes the bottleneck.
FAQ
What is all-MiniLM-L6-v2 used for?
It converts sentences or short paragraphs into 384-dimension vectors for semantic search, retrieval-augmented generation, clustering, and sentence similarity. It is one of the most widely used default embedding models in open-source RAG stacks.
Is all-MiniLM-L6-v2 free?
Yes. It is released under the Apache-2.0 license, free for commercial use, and runs locally with no API key or per-token cost. Your only cost is the compute you run it on.
How many dimensions does all-MiniLM-L6-v2 output?
384 dimensions. The vectors are L2-normalized, so you can use a dot product directly as cosine similarity in any vector database.
What is the max input length for all-MiniLM-L6-v2?
The model card states it truncates input beyond 256 word-pieces, and it was trained at 128 tokens. Although the tokenizer config allows 512, treat 256 as the practical ceiling and chunk longer text.
Is all-MiniLM-L6-v2 better than OpenAI embeddings?
No, not on accuracy. OpenAI's text-embedding-3-small scores higher on MTEB (about 62 vs 56) and handles far longer context. all-MiniLM wins on cost (free) and on local, private, low-latency inference.
all-MiniLM-L6-v2 vs bge-small: which should I use?
bge-small-en-v1.5 scores about 5 to 6 MTEB points higher at the same 384 dimensions and supports 512-token input, for a small size increase. Use all-MiniLM for maximum speed and minimum size; use bge-small when retrieval accuracy matters more.
Can I run all-MiniLM-L6-v2 in the browser?
Yes. The ONNX build via transformers.js runs the model in-browser or in serverless edge functions with no Python, which is one reason it is popular for lightweight and client-side semantic search.
Does TokenMix provide all-MiniLM-L6-v2?
No. all-MiniLM is a self-hosted open model, not a relayed API model. TokenMix routes hosted LLM, image, and video APIs through one endpoint; for embeddings you would self-host all-MiniLM or call a hosted embedding API.
About TokenMix
TokenMix.ai is an AI API relay that routes Claude, OpenAI, Gemini, DeepSeek, Qwen, and other large language models through a single OpenAI-compatible endpoint at https://api.tokenmix.ai/v1. Current model availability and per-token rates are listed on the pricing page and the model catalog. Integration uses the standard OpenAI SDK; details in the OpenAI compatibility reference. all-MiniLM-L6-v2 is a free self-hosted model rather than a relayed API, so this guide is published as independent model intelligence.
Sources
- Hugging Face - sentence-transformers/all-MiniLM-L6-v2 model card - dimensions, params, max length, license, training data
- Sentence-Transformers docs - pretrained models - speed and size comparison
- Hugging Face - BAAI/bge-small-en-v1.5 - same-dimension upgrade, MTEB score
- Hugging Face - thenlper/gte-small - same-dimension upgrade, MTEB score
- OpenAI - new embedding models and API updates - text-embedding-3-small pricing and score
- Galileo - mastering RAG, selecting an embedding model - MTEB aggregation and RAG attribution lift
- Milvus - MiniLM vs mpnet comparison - quality vs speed trade
- Hugging Face - Xenova/all-MiniLM-L6-v2 (ONNX) - browser and serverless runtime