
Text Generation Inference OpenAI-Compatible API 2026 Guide
Last Updated: 2026-04-30
Author: TokenMix Research Lab
Data checked: 2026-04-30
Text Generation Inference supports an OpenAI-compatible chat completions API. Use it when you want to serve open LLMs behind /v1/chat/completions.
Hugging Face's TGI Messages API docs state that Text Generation Inference supports a Messages API fully compatible with the OpenAI Chat Completion API, starting from TGI 1.4.0. The practical route is clear: deploy a model with TGI, expose /v1/chat/completions, point the OpenAI SDK at your TGI base_url, and set model to tgi. The caveat is also clear: TGI gives you control over open models and infrastructure, but you own GPU sizing, scaling, latency, uptime, and model quality.
Table of Contents
- Quick Answer
- Confirmed vs Caveat
- Architecture Snapshot
- Local Docker Setup
- Python OpenAI SDK Example
- Node TypeScript Example
- Hugging Face Inference Endpoints
- TGI vs vLLM vs Ollama vs TokenMix.ai
- Operational Cost Math
- Production Checklist
- When To Use TokenMix.ai Instead
- Final Recommendation
- FAQ
- Related Articles
- Sources
Quick Answer
TGI's OpenAI-compatible API means you can call a TGI-served open model with OpenAI-style chat.completions.create() requests. The endpoint is usually:
http://localhost:8080/v1/chat/completions
or, for Hugging Face Inference Endpoints:
https://your-endpoint.endpoints.huggingface.cloud/v1/
Use TGI when you need open-model control, dedicated infrastructure, and self-managed inference. Use TokenMix.ai when you need one OpenAI-compatible API across hosted models without owning GPU operations.
Confirmed vs Caveat
| Claim | Status | Source / note |
|---|---|---|
| TGI supports OpenAI-compatible Messages API | Confirmed | Hugging Face TGI docs |
| Support starts from TGI 1.4.0 | Confirmed | TGI Messages API docs |
| Endpoint path is /v1/chat/completions | Confirmed | TGI docs and GitHub README |
| OpenAI Python client works | Confirmed | Official examples |
| Streaming works | Confirmed | Official examples |
| Hugging Face Inference Endpoints can expose TGI Messages API | Confirmed | Official docs |
| Every Hugging Face model works | No | Model needs chat template / compatible serving path |
| You avoid infrastructure work | No | TGI means you operate GPUs or endpoints |
| Tool calling always works | Caveat | Depends on model, serving path, and client expectations |
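A quick way to test the chat-template caveat before deploying is to inspect the model's tokenizer configuration. A minimal sketch with the transformers library (an assumption: transformers is installed and the repo is accessible to you):
from transformers import AutoTokenizer

# Models whose tokenizer ships no chat template generally cannot serve the
# Messages API cleanly, since TGI uses the template to format the messages array.
tok = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
print("has chat template:", tok.chat_template is not None)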
Architecture Snapshot
| Layer | TGI self-hosted | Hugging Face Inference Endpoint | TokenMix.ai |
|---|---|---|---|
| API shape | OpenAI-compatible chat completions | OpenAI-compatible chat completions | OpenAI-compatible multi-model API |
| Model source | Hugging Face model repo / local model | Hugging Face model repo | Hosted provider/model catalog |
| Infrastructure | You run it | Hugging Face managed endpoint | TokenMix.ai managed gateway |
| GPU responsibility | You | Hugging Face endpoint config | Provider/gateway side |
| Scaling | You configure | Endpoint-managed options | Gateway/provider routing |
| Best for | Dedicated open-model serving | Managed dedicated open-model serving | Fast multi-model app integration |
| Main risk | Ops burden | Endpoint cost and cold starts | Provider route capability differs by model |
Local Docker Setup
Hugging Face's TGI GitHub README shows the Docker path as the easiest starting point.
model=HuggingFaceH4/zephyr-7b-beta
volume=$PWD/data
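# For private or gated models, also pass your Hugging Face token to the
# container by adding: -e HF_TOKEN=$HF_TOKEN (assumes the token is exported)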
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
ghcr.io/huggingface/text-generation-inference:3.3.5 \
--model-id $model
Then call the OpenAI-compatible route:
curl http://localhost:8080/v1/chat/completions \
-X POST \
-H "Content-Type: application/json" \
-d '{
"model": "tgi",
"messages": [
{"role": "system", "content": "You are a concise API assistant."},
{"role": "user", "content": "Explain TGI in one sentence."}
],
"stream": false,
"max_tokens": 100
}'
| Setup detail | Why it matters |
|---|---|
| --gpus all | TGI is designed for GPU inference |
| --shm-size 1g | Shared memory helps NCCL / tensor-parallel behavior |
| -p 8080:80 | Exposes TGI on local port 8080 |
| --model-id | Determines the served model |
| HF_TOKEN | Needed for private or gated models |
| /v1/chat/completions | OpenAI-compatible Messages API route |
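Before pointing an SDK at the server, it is worth confirming the container is actually ready. A minimal sketch against TGI's health and info routes (assumes the requests package is installed; /info field names can vary across TGI versions):
import requests

BASE = "http://localhost:8080"

# /health returns 200 once the model is loaded and the server can accept traffic.
ready = requests.get(f"{BASE}/health", timeout=5).status_code == 200
print("ready:", ready)

# /info reports server and model metadata, including the served model id.
info = requests.get(f"{BASE}/info", timeout=5).json()
print("model:", info.get("model_id"))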
Python OpenAI SDK Example
pip install openai
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8080/v1",
api_key="-",
)
completion = client.chat.completions.create(
model="tgi",
messages=[
{"role": "system", "content": "You are a concise API assistant."},
{"role": "user", "content": "What is TGI?"},
],
max_tokens=120,
stream=False,
)
print(completion.choices[0].message.content)
Streaming:
stream = client.chat.completions.create(
model="tgi",
messages=[{"role": "user", "content": "List three TGI use cases."}],
stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="")
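Self-hosted endpoints fail differently from hosted APIs: the usual failure modes are connection refusals during restarts and non-2xx responses while a model loads or a request exceeds limits. A defensive-calling sketch with the same client (exception classes are from the openai v1 Python SDK):
import openai
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="-")

try:
    completion = client.chat.completions.create(
        model="tgi",
        messages=[{"role": "user", "content": "Ping."}],
        max_tokens=10,
    )
    print(completion.choices[0].message.content)
except openai.APIConnectionError:
    # Container down, wrong port mapping, or model still loading.
    print("TGI endpoint unreachable; check the container and port 8080.")
except openai.APIStatusError as err:
    # Non-2xx from TGI, e.g. validation errors on oversized requests.
    print(f"TGI returned HTTP {err.status_code}")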
Node TypeScript Example
npm install openai
import OpenAI from "openai";
const client = new OpenAI({
baseURL: "http://localhost:8080/v1",
apiKey: "-", // any non-empty placeholder; local TGI does not validate it
});
const completion = await client.chat.completions.create({
model: "tgi",
messages: [
{ role: "system", content: "You are a concise API assistant." },
{ role: "user", content: "Explain the TGI OpenAI-compatible API." },
],
max_tokens: 120,
});
console.log(completion.choices[0].message.content);
Hugging Face Inference Endpoints
For managed endpoints, Hugging Face's docs say every endpoint using TGI with an LLM that has a chat template can be used with the Messages API.
Critical detail: include the /v1/ suffix in base_url.
from openai import OpenAI
client = OpenAI(
base_url="https://your-endpoint.region.aws.endpoints.huggingface.cloud/v1/",
api_key="hf_your_token",
)
completion = client.chat.completions.create(
model="tgi",
messages=[
{"role": "user", "content": "Summarize this endpoint architecture."}
],
stream=True,
)
for chunk in completion:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="")
| Endpoint mistake | Result |
|---|---|
| Missing /v1/ suffix | OpenAI SDK calls the wrong path |
| Wrong Hugging Face token | 401 or auth failure |
| Model lacks chat template | Messages API may fail or format poorly |
| Too-small GPU | Slow prefill/generation or OOM |
| No autoscaling policy | Cost waste or capacity shortage |
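The first mistake in that table is common enough to guard against in code. A purely illustrative helper (the function name is ours, not part of any SDK):
def tgi_base_url(endpoint: str) -> str:
    # The OpenAI SDK joins request paths onto base_url, so /v1/ must be present.
    return endpoint.rstrip("/") + "/v1/"

print(tgi_base_url("https://your-endpoint.region.aws.endpoints.huggingface.cloud"))
# -> https://your-endpoint.region.aws.endpoints.huggingface.cloud/v1/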
TGI vs vLLM vs Ollama vs TokenMix.ai
| Option | Best for | Not ideal for |
|---|---|---|
| TGI | Production open-model serving on dedicated GPUs | Teams that do not want infra work |
| vLLM | High-throughput serving and OpenAI-compatible endpoints | Teams wanting Hugging Face-native endpoint flow |
| Ollama | Local development and laptop/server experimentation | Multi-tenant production API traffic |
| Hugging Face Inference Endpoints | Managed dedicated open-model hosting | Lowest possible cost without endpoint management |
| TokenMix.ai | One API across many hosted providers/models | Deep custom self-hosted GPU tuning |
TGI is an inference server. TokenMix.ai is an API gateway and model access layer. They solve different problems.
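Because both expose the OpenAI chat completions shape, the calling code can stay identical and only the base URL, key, and model name change. A sketch; the TokenMix.ai base URL and model id below are placeholders, not documented values:
import os
from openai import OpenAI

backend = os.environ.get("LLM_BACKEND", "tgi")

if backend == "tgi":
    # Self-hosted TGI: fixed model name, dummy key.
    client = OpenAI(base_url="http://localhost:8080/v1", api_key="-")
    model = "tgi"
else:
    # Hypothetical gateway config; substitute the real base URL and model id.
    client = OpenAI(
        base_url="https://api.tokenmix.example/v1",
        api_key=os.environ["TOKENMIX_API_KEY"],
    )
    model = "provider/model-name"

resp = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": "Hello."}],
    max_tokens=20,
)
print(resp.choices[0].message.content)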
Operational Cost Math
TGI cost depends on GPU hours, utilization, and engineering time. Do not compare it to a pay-per-token API by headline model price only.
| Scenario | Formula | Example result |
|---|---|---|
| Always-on endpoint | GPU hourly price x 730 hours | At an assumed $2.00/hr: about $1,460/month |
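To make the comparison concrete, a small worked sketch; every price and traffic figure here is an illustrative assumption, not a quote:
# All figures below are illustrative assumptions.
gpu_hourly = 2.00                  # assumed dedicated-GPU price, $/hour
hours_per_month = 730
fixed_monthly = gpu_hourly * hours_per_month        # $1,460/month always-on

tokens_per_month = 50_000_000      # assumed traffic
price_per_million = 0.50           # assumed hosted price, $ per 1M tokens
usage_monthly = tokens_per_month / 1_000_000 * price_per_million   # $25/month

print(f"Always-on TGI endpoint: ${fixed_monthly:,.0f}/month")
print(f"Pay-per-token API:      ${usage_monthly:,.2f}/month")

# Traffic at which the always-on endpoint matches the per-token bill:
break_even_tokens = fixed_monthly / price_per_million * 1_000_000
print(f"Break-even: ~{break_even_tokens / 1e9:.1f}B tokens/month")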