TokenMix Research Lab · 2026-06-22

all-MiniLM-L6-v2: Free Local Embedding Model Guide 2026

Last Updated: 2026-06-22
Author: TokenMix Research Lab
Data verified: 2026-06-22 (Hugging Face sentence-transformers model card, sbert.net pretrained models docs, MTEB leaderboard, BAAI bge-small and gte-small cards, OpenAI embedding pricing, third-party RAG evaluations)

all-MiniLM-L6-v2 is a free, Apache-2.0 sentence-embedding model that turns text into 384-dimension vectors, has about 23 million parameters, and runs locally on a laptop CPU with no API key (Hugging Face model card). It is the default embedding model in much of the open-source RAG stack, but it trails newer same-dimension models like bge-small and gte-small by roughly 5 to 6 MTEB points and caps usable input at 256 tokens. A paid API like OpenAI's text-embedding-3-small also scores higher out of the box (OpenAI embeddings).

This guide separates confirmed specs from approximate community figures and tags each claim Confirmed, Likely, or False. Core specifications come from the official Hugging Face model card; comparison scores come from the MTEB leaderboard and individual model cards, which do not all use the same MTEB version, so treat cross-model point gaps as directional rather than exact.

Quick Verdict
What all-MiniLM-L6-v2 Is
Specifications
Performance and Benchmarks
all-MiniLM vs bge-small vs gte-small vs OpenAI
Cost: Free Local vs Paid API
How to Use It
When to Use It and When to Upgrade
Where all-MiniLM-L6-v2 Loses
Use Case Matrix
Final Recommendation
FAQ
About TokenMix
Sources
Related Articles

Quick Verdict

all-MiniLM-L6-v2 is the best free, fast, runs-anywhere embedding model for getting a RAG prototype working, and the one to outgrow once retrieval quality matters. It costs nothing and is small; it is also a few points behind the newer alternatives compared below.

Claim	Status	Source
Outputs 384-dimension embeddings	Confirmed	HF model card
~22.7M parameters	Confirmed	HF model card
Apache-2.0, free for commercial use	Confirmed	HF model card
Caps usable input at 256 word-pieces	Confirmed	HF model card
Runs locally on CPU, no API key	Confirmed	sbert.net docs
~5x faster than all-mpnet-base-v2	Confirmed	sbert.net docs
MTEB average around 56	Likely (aggregator)	Galileo RAG guide
Beaten by bge-small and gte-small on MTEB	Confirmed	bge-small card, gte-small card
It is a paid hosted API model	False	It is a self-hosted open model, no per-token cost
TokenMix serves all-MiniLM-L6-v2	False	Self-hosted models are not relayed; TokenMix lists hosted API models (TokenMix models)

The short answer: use all-MiniLM-L6-v2 when free, local, and fast beat last-mile accuracy. Switch to bge-small, gte-small, or a paid API like text-embedding-3-small when retrieval quality starts costing you answers.

What all-MiniLM-L6-v2 Is

all-MiniLM-L6-v2 is a sentence-transformer model that maps a sentence or short paragraph to a single 384-dimension vector for semantic search, clustering, and RAG retrieval. It was built by the sentence-transformers team by fine-tuning a pretrained 6-layer MiniLM checkpoint (MiniLM itself being a compact model distilled from a larger BERT-style teacher) on more than one billion sentence pairs across 28-plus datasets, including Reddit comments, S2ORC, MS MARCO, and Stack Exchange (Hugging Face model card).

The reason it became the default is simple: it is tiny, fast, permissively licensed, and good enough. It ships as the out-of-the-box choice in many LangChain, Chroma, and sentence-transformers tutorials, which is why "all-minilm-l6-v2" gets thousands of monthly searches even though newer models score higher. For the paid-API side of this decision, see the text-embedding-3-small developer guide and the full text embedding models comparison.

Specifications

The spec sheet is the model's whole pitch: small, normalized, and cheap to run. Every figure below is from the official model card unless noted.

Field	Value	Status
Embedding dimension	384	Confirmed
Parameters	~22.7M	Confirmed
Base architecture	MiniLM, 6 layers (distilled from BERT)	Confirmed
Max input	256 word-pieces (longer is truncated)	Confirmed
Output	L2-normalized (cosine-ready)	Confirmed
License	Apache-2.0	Confirmed
Training data	1B+ sentence pairs, 28+ datasets	Confirmed
Model size on disk	~90 MB	Likely (aggregator)
Languages	English-focused	Confirmed

One trap to flag: the tokenizer config technically allows 512 tokens, but the model card states the model truncates beyond 256, and it was trained at 128, so treat 256 as the real ceiling and chunk longer documents accordingly. The L2-normalized output means you can use a plain dot product as cosine similarity, which simplifies vector-DB setup.
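One way to respect the 256 word-piece ceiling is to split text before encoding. The sketch below is a rough, dependency-free approximation: it counts whitespace-separated words rather than real word-pieces, so the budget of 180 words and the overlap of 30 are illustrative assumptions, not values from the model card. The helper name `chunk_words` is hypothetical. For exact limits, count tokens with the model's own tokenizer instead.

```python
def chunk_words(text: str, max_words: int = 180, overlap: int = 30) -> list[str]:
    """Split text into overlapping chunks of at most max_words words.

    Word count is only a proxy for word-piece count: one word can map to
    several word-pieces, so the budget is kept well under 256 (assumption).
    """
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this window already reached the end of the text
    return chunks


# Synthetic 400-word document, used only to show the window sizes.
document = " ".join(f"w{i}" for i in range(400))
chunks = chunk_words(document)
print([len(c.split()) for c in chunks])
```

The overlap keeps a sentence that straddles a boundary visible in both neighboring chunks, at the cost of embedding some words twice.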

Performance and Benchmarks

On quality, all-MiniLM-L6-v2 is a solid efficiency-class baseline, not a leader, and the honest framing is "fast and free, a few points behind." The headline trade is speed: around 14,000 sentences per second on a GPU, roughly 5x faster than the heavier all-mpnet-base-v2 (sbert.net docs).

Metric	Value	Status	Source
MTEB average	~56	Likely	Galileo RAG guide
ArguAna (retrieval)	50.17	Confirmed	HF model card
Encoding speed (GPU)	~14,000 sentences/sec	Confirmed	sbert.net docs
Speed vs all-mpnet-base-v2	~5x faster	Confirmed	sbert.net docs
STS quality vs mpnet	~84-85% vs ~87-88%	Likely	Milvus comparison

The MTEB average near 56 is an aggregator figure, not on the official card, so read it as approximate. The pattern holds across sources: all-MiniLM gives up some retrieval accuracy for a large speed and size advantage. For most prototypes that trade is correct; for production retrieval where a missed chunk means a wrong answer, the gap matters.

all-MiniLM vs bge-small vs gte-small vs OpenAI

Compared head-to-head with its natural rivals, all-MiniLM-L6-v2 is the smallest and fastest but the lowest scoring. All four below output small vectors; three are free and self-hosted, one is a paid API.

Model	Dims	Params	Max tokens	MTEB avg	Cost
all-MiniLM-L6-v2	384	22.7M	256	~56	Free / self-host
gte-small	384	33.4M	512	61.36	Free / self-host
bge-small-en-v1.5	384	33.4M	512	62.17	Free / self-host
text-embedding-3-small	1536 (to 256)	n/a	8191	62.3	$0.02 / 1M tokens

The cleanest upgrade inside the free tier is bge-small or gte-small: same 384 dimensions, but roughly 5 to 6 MTEB points higher and double the context at 512 tokens, for a modest size increase (bge-small card, gte-small card). If you want a managed API and do not mind paying, text-embedding-3-small narrowly tops the group on MTEB (62.3 vs 62.17 for bge-small, a gap small enough to treat as directional) and handles 8,191 tokens of context. See the bge embeddings guide and Gemini vs OpenAI embeddings for the deeper cuts.

Cost: Free Local vs Paid API

all-MiniLM-L6-v2 has a $0 per-token cost, which is its single biggest advantage at scale; the only cost is the compute you already run. The comparison that matters is your own hardware versus a paid embedding API's per-token bill.

Workload	all-MiniLM (local)	text-embedding-3-small (API)
Embed 1M chunks (~100 tokens each)	$0 + compute	~$2.00
Embed 10M docs (~100 tokens each)	$0 + compute	~$20.00
Embed 1B tokens total	$0 + compute	~$20.00
Re-embed entire corpus monthly	$0 + compute	recurring API bill

The math is blunt: embedding one billion tokens costs about $20 on text-embedding-3-small and $0 in API fees on all-MiniLM, paid only in your own CPU or GPU time. For privacy-sensitive data, air-gapped systems, or high-volume re-embedding pipelines, free local inference is the rational default. The counterpoint: at small scale, a few dollars of API spend buys higher accuracy and zero infrastructure, so the local model only wins clearly once volume or privacy is the binding constraint.
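The arithmetic behind the table is a single multiplication, sketched here so you can plug in your own volumes. It uses the $0.02 per 1M tokens rate quoted in this guide; the function name `api_cost_usd` is hypothetical, and the local side is left out because compute cost depends entirely on your hardware.

```python
PRICE_PER_MILLION_TOKENS_USD = 0.02  # text-embedding-3-small rate quoted in this guide


def api_cost_usd(num_items: int, avg_tokens_per_item: int,
                 price_per_million: float = PRICE_PER_MILLION_TOKENS_USD) -> float:
    """Return the API bill in USD for embedding num_items texts once."""
    total_tokens = num_items * avg_tokens_per_item
    return total_tokens / 1_000_000 * price_per_million


# The three workloads from the cost table above.
for label, items, tokens in [
    ("1M chunks x 100 tokens", 1_000_000, 100),
    ("10M docs x 100 tokens", 10_000_000, 100),
    ("1B tokens total", 1_000_000_000, 1),
]:
    print(f"{label}: ${api_cost_usd(items, tokens):.2f}")
```

For a recurring re-embedding job, multiply the result by the number of runs per year to see how the API bill compounds.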

How to Use It

Getting embeddings out of all-MiniLM-L6-v2 takes three lines of Python, and there are lighter runtimes for CPU and browser. The default path is the sentence-transformers library.

Path	What it gives you	Best for
sentence-transformers	`encode()` one-liner, pooling handled	Most Python projects
transformers	Manual mean-pooling + L2 norm	Custom pipelines
ONNX / onnxruntime	Lightweight CPU and browser inference	Edge, serverless, JS
GGUF / CT2 builds	Quantized local runners	Constrained hardware

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
vectors = model.encode([
    "How do I reset my API key?",
    "Where can I rotate credentials?",
])
# vectors.shape -> (2, 384), already L2-normalized
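
If you take the raw transformers path from the table above instead, you have to do the pooling yourself. The sketch below shows that step on toy data in plain Python so it runs without downloading the model: average the token vectors where the attention mask is 1, then divide by the L2 norm. The function names and the 3-dimension toy vectors are illustrative assumptions; a real pipeline would apply the same arithmetic to the model's token embeddings as tensors.

```python
import math


def mean_pool(token_vectors: list[list[float]], attention_mask: list[int]) -> list[float]:
    """Average token vectors, ignoring padding positions (mask == 0)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def l2_normalize(vector: list[float]) -> list[float]:
    """Scale a vector to unit length so dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return vector[:]  # leave an all-zero vector unchanged
    return [x / norm for x in vector]


# Toy input: three token vectors, the last one is padding.
tokens = [[1.0, 2.0, 2.0], [3.0, 2.0, 6.0], [9.0, 9.0, 9.0]]
mask = [1, 1, 0]

sentence_vector = l2_normalize(mean_pool(tokens, mask))
print(sentence_vector)
```

Skipping the mask is the usual mistake here: padding vectors would be averaged in and shift every short sentence's embedding.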

For a JavaScript or serverless stack, the ONNX build via transformers.js (Xenova/all-MiniLM-L6-v2) runs the same model in-browser or in an edge function with no Python at all. Store the 384-dim vectors in any vector database and use cosine or dot-product similarity.
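
Because the vectors are L2-normalized, a plain dot product ranks results the same way cosine similarity would. Here is a minimal in-memory sketch of that lookup, using short made-up unit vectors in place of real 384-dimension embeddings. The `top_k` helper and the sample data are hypothetical; a vector database performs the same ranking with an index.

```python
def dot(a: list[float], b: list[float]) -> float:
    """Dot product; equals cosine similarity when both vectors are unit length."""
    return sum(x * y for x, y in zip(a, b))


def top_k(query: list[float], corpus: dict[str, list[float]],
          k: int = 2) -> list[tuple[str, float]]:
    """Return the k corpus entries with the highest dot product against query."""
    scored = [(doc_id, dot(query, vec)) for doc_id, vec in corpus.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up unit-length vectors standing in for real embeddings.
corpus = {
    "reset-api-key": [1.0, 0.0, 0.0],
    "rotate-credentials": [0.8, 0.6, 0.0],
    "billing-faq": [0.0, 0.0, 1.0],
}
query = [0.6, 0.8, 0.0]

results = top_k(query, corpus, k=2)
print(results)
```

A brute-force scan like this is adequate for a few thousand chunks; beyond that, an approximate-nearest-neighbor index is the usual next step.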

When to Use It and When to Upgrade

Use all-MiniLM-L6-v2 by default for prototypes and low-stakes retrieval, and upgrade on a clear trigger rather than by reflex. The decision is mostly about whether retrieval accuracy is currently costing you correct answers.

Situation	Recommendation
RAG prototype, small corpus	Use all-MiniLM-L6-v2
Latency-critical, high QPS	Use all-MiniLM-L6-v2
Privacy / air-gapped data	Use all-MiniLM-L6-v2 (local)
Retrieval misses hurting answers	Upgrade to bge-small / gte-small
Long documents (>256 tokens)	Upgrade (512-token model)
Need best accuracy, can pay	Use text-embedding-3-small
Multilingual retrieval	Use a multilingual model (e.g. bge-m3)

The honest rule: start free and local, measure retrieval quality on your own data, and only pay or switch when the numbers say the misses are real. Many teams never need to upgrade; the ones that do usually hit either the 256-token ceiling or an accuracy wall on domain-specific text.
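
Measuring retrieval quality on your own data can be as simple as a hit rate at k over a small labeled set: for each query, check whether a known-relevant chunk appears in the top k results. The sketch below assumes you already have ranked result IDs per query from whichever model you are testing; the function name, the query IDs, and the sample rankings are all hypothetical.

```python
def hit_rate_at_k(ranked: dict[str, list[str]], relevant: dict[str, set[str]],
                  k: int) -> float:
    """Fraction of queries with at least one relevant chunk in the top k results."""
    if not relevant:
        raise ValueError("need at least one labeled query")
    hits = 0
    for query_id, gold in relevant.items():
        top = ranked.get(query_id, [])[:k]  # a query with no results counts as a miss
        if gold.intersection(top):
            hits += 1
    return hits / len(relevant)


# Hypothetical labels and rankings for three test queries.
relevant = {"q1": {"c7"}, "q2": {"c2", "c9"}, "q3": {"c4"}}
ranked = {
    "q1": ["c7", "c1", "c3"],
    "q2": ["c5", "c9", "c2"],
    "q3": ["c8", "c6", "c4"],
}

for k in (1, 3):
    print(f"hit rate@{k} = {hit_rate_at_k(ranked, relevant, k):.2f}")
```

Running the same labeled set through all-MiniLM-L6-v2 and a candidate upgrade gives a like-for-like number, which is a sturdier basis for switching than a leaderboard gap.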

Where all-MiniLM-L6-v2 Loses

all-MiniLM-L6-v2 loses on accuracy ceiling, context length, and multilingual coverage. These are the predictable costs of a 23M-parameter English model.

Weak spot	Evidence	Pick instead
Lower MTEB than peers	~56 vs roughly 61-62 for bge/gte/OpenAI	bge-small-en-v1.5, gte-small
256-token input cap	Model card truncation note	512-token or 8K-token model
English-focused	Trained mainly on English	bge-m3 or multilingual-e5
Older architecture	2021-era MiniLM base	Newer small embedding models
No managed API	Self-host only	text-embedding-3-small for hands-off

None of these make it a bad choice; they make it a starting choice. The model's job is to get you to a working retrieval pipeline at zero cost, and it does that better than almost anything. The day retrieval quality, document length, or another language becomes the bottleneck is the day to move.

Use Case Matrix

Match the model to the stakes: prototypes and high-volume jobs favor all-MiniLM, while accuracy-critical production favors an upgrade.

Use case	all-MiniLM fit	Better alternative	Why
RAG prototype	Strong	none needed	Free, fast, three lines of code
Semantic search (English)	Strong	bge-small if accuracy matters	Good baseline retrieval
High-volume embedding pipeline	Strong	none needed	$0 per-token at scale
Edge / browser inference	Strong	none needed	ONNX build is tiny
Domain-specific retrieval	Medium	bge-small / fine-tuned model	Accuracy ceiling shows
Long-document chunks	Weak	512-token model	256-token cap
Multilingual app	Weak	bge-m3 / multilingual-e5	English-focused training
Hands-off managed API	Weak	text-embedding-3-small	No infra, higher score

If your real problem is choosing and routing across many AI models and APIs rather than one embedding model, pair this with the text embedding models comparison and the AI API gateway guide.

Final Recommendation

Use all-MiniLM-L6-v2 as the default for prototypes, latency-critical retrieval, privacy-bound data, and high-volume embedding where its $0 per-token cost compounds. Upgrade within the free tier to bge-small or gte-small the moment retrieval accuracy or the 256-token cap starts hurting, and move to a paid API like text-embedding-3-small when you want the highest score with zero infrastructure. It is the right place to start and the wrong place to stop if quality becomes the bottleneck.

FAQ

What is all-MiniLM-L6-v2 used for?

It converts sentences or short paragraphs into 384-dimension vectors for semantic search, retrieval-augmented generation, clustering, and sentence similarity. It is one of the most widely used default embedding models in open-source RAG stacks.

Is all-MiniLM-L6-v2 free?

Yes. It is released under the Apache-2.0 license, free for commercial use, and runs locally with no API key or per-token cost. Your only cost is the compute you run it on.

How many dimensions does all-MiniLM-L6-v2 output?

384 dimensions. The vectors are L2-normalized, so you can use a dot product directly as cosine similarity in any vector database.

What is the max input length for all-MiniLM-L6-v2?

The model card states it truncates input beyond 256 word-pieces, and it was trained at 128 tokens. Although the tokenizer config allows 512, treat 256 as the practical ceiling and chunk longer text.

Is all-MiniLM-L6-v2 better than OpenAI embeddings?

No, not on accuracy. OpenAI's text-embedding-3-small scores higher on MTEB (about 62 vs 56) and handles far longer context. all-MiniLM wins on cost (free) and on local, private, low-latency inference.

all-MiniLM-L6-v2 vs bge-small: which should I use?

bge-small-en-v1.5 scores about 5 to 6 MTEB points higher at the same 384 dimensions and supports 512-token input, for a small size increase. Use all-MiniLM for maximum speed and minimum size; use bge-small when retrieval accuracy matters more.

Can I run all-MiniLM-L6-v2 in the browser?

Yes. The ONNX build via transformers.js runs the model in-browser or in serverless edge functions with no Python, which is one reason it is popular for lightweight and client-side semantic search.

Does TokenMix provide all-MiniLM-L6-v2?

No. all-MiniLM is a self-hosted open model, not a relayed API model. TokenMix routes hosted LLM, image, and video APIs through one endpoint; for embeddings you would self-host all-MiniLM or call a hosted embedding API.

About TokenMix

TokenMix.ai is an AI API relay that routes Claude, OpenAI, Gemini, DeepSeek, Qwen, and other large language models through a single OpenAI-compatible endpoint at https://api.tokenmix.ai/v1. Current model availability and per-token rates are listed on the pricing page and the model catalog. Integration uses the standard OpenAI SDK; details in the OpenAI compatibility reference. all-MiniLM-L6-v2 is a free self-hosted model rather than a relayed API, so this guide is published as independent model intelligence.

Sources

Hugging Face - sentence-transformers/all-MiniLM-L6-v2 model card - dimensions, params, max length, license, training data
Sentence-Transformers docs - pretrained models - speed and size comparison
Hugging Face - BAAI/bge-small-en-v1.5 - same-dimension upgrade, MTEB score
Hugging Face - thenlper/gte-small - same-dimension upgrade, MTEB score
OpenAI - new embedding models and API updates - text-embedding-3-small pricing and score
Galileo - mastering RAG, selecting an embedding model - MTEB aggregation and RAG attribution lift
Milvus - MiniLM vs mpnet comparison - quality vs speed trade
Hugging Face - Xenova/all-MiniLM-L6-v2 (ONNX) - browser and serverless runtime