TokenMix Research Lab · 2026-04-30

Text Generation Inference OpenAI-Compatible API 2026 Guide
Last Updated: 2026-04-30
Author: TokenMix Research Lab
Data checked: 2026-04-30
Text Generation Inference supports an OpenAI-compatible chat completions API. Use it when you want to serve open LLMs behind /v1/chat/completions.
Hugging Face's TGI Messages API docs say Text Generation Inference supports a Messages API fully compatible with the OpenAI Chat Completion API starting from TGI 1.4.0. The practical route is clear: deploy a model with TGI, expose /v1/chat/completions, point the OpenAI SDK at your TGI base_url, and set model to tgi. The caveat is also clear: TGI gives you control over open models and infrastructure, but you own GPU sizing, scaling, latency, uptime, and model quality.
Table of Contents
- Quick Answer
- Confirmed vs Caveat
- Architecture Snapshot
- Local Docker Setup
- Python OpenAI SDK Example
- Node TypeScript Example
- Hugging Face Inference Endpoints
- TGI vs vLLM vs Ollama vs TokenMix.ai
- Operational Cost Math
- Production Checklist
- When To Use TokenMix.ai Instead
- Final Recommendation
- FAQ
- Related Articles
- Sources
Quick Answer
TGI's OpenAI-compatible API lets you call a TGI-served open model with OpenAI-style chat.completions.create() requests. The endpoint is usually:
http://localhost:8080/v1/chat/completions
or, for Hugging Face Inference Endpoints:
https://your-endpoint.endpoints.huggingface.cloud/v1/
Use TGI when you need open-model control, dedicated infrastructure, and self-managed inference. Use TokenMix.ai when you need one OpenAI-compatible API across hosted models without owning GPU operations.
Confirmed vs Caveat
| Claim | Status | Source / note |
|---|---|---|
| TGI supports OpenAI-compatible Messages API | Confirmed | Hugging Face TGI docs |
| Support starts from TGI 1.4.0 | Confirmed | TGI Messages API docs |
| Endpoint path is /v1/chat/completions | Confirmed | TGI docs and GitHub README |
| OpenAI Python client works | Confirmed | Official examples |
| Streaming works | Confirmed | Official examples |
| Hugging Face Inference Endpoints can expose TGI Messages API | Confirmed | Official docs |
| Every Hugging Face model works | No | Model needs chat template / compatible serving path |
| You avoid infrastructure work | No | TGI means you operate GPUs or endpoints |
| Tool calling always works | Caveat | Depends on model, serving path, and client expectations |
Architecture Snapshot
| Layer | TGI self-hosted | Hugging Face Inference Endpoint | TokenMix.ai |
|---|---|---|---|
| API shape | OpenAI-compatible chat completions | OpenAI-compatible chat completions | OpenAI-compatible multi-model API |
| Model source | Hugging Face model repo / local model | Hugging Face model repo | Hosted provider/model catalog |
| Infrastructure | You run it | Hugging Face managed endpoint | TokenMix.ai managed gateway |
| GPU responsibility | You | Hugging Face endpoint config | Provider/gateway side |
| Scaling | You configure | Endpoint-managed options | Gateway/provider routing |
| Best for | Dedicated open-model serving | Managed dedicated open-model serving | Fast multi-model app integration |
| Main risk | Ops burden | Endpoint cost and cold starts | Provider route capability differs by model |
Local Docker Setup
Hugging Face's TGI GitHub README shows the Docker path as the easiest starting point.
model=HuggingFaceH4/zephyr-7b-beta
volume=$PWD/data
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:3.3.5 \
    --model-id $model
Then call the OpenAI-compatible route:
curl http://localhost:8080/v1/chat/completions \
    -X POST \
    -H "Content-Type: application/json" \
    -d '{
      "model": "tgi",
      "messages": [
        {"role": "system", "content": "You are a concise API assistant."},
        {"role": "user", "content": "Explain TGI in one sentence."}
      ],
      "stream": false,
      "max_tokens": 100
    }'
| Setup detail | Why it matters |
|---|---|
| --gpus all | TGI is designed for GPU inference |
| --shm-size 1g | Shared memory helps NCCL / tensor parallel behavior |
| -p 8080:80 | Exposes TGI on local port 8080 |
| --model-id | Determines the served model |
| HF_TOKEN | Needed for private or gated models |
| /v1/chat/completions | OpenAI-compatible Messages API route |
Python OpenAI SDK Example
pip install openai
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="-",  # placeholder; the SDK requires a non-empty value
)

completion = client.chat.completions.create(
    model="tgi",
    messages=[
        {"role": "system", "content": "You are a concise API assistant."},
        {"role": "user", "content": "What is TGI?"},
    ],
    max_tokens=120,
    stream=False,
)
print(completion.choices[0].message.content)
Streaming:
stream = client.chat.completions.create(
    model="tgi",
    messages=[{"role": "user", "content": "List three TGI use cases."}],
    stream=True,
)
for chunk in stream:
    # Each chunk carries an incremental text fragment; it can be None.
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="")
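If you need the full reply as one string rather than printing fragments, the loop above can be wrapped in a small helper. This is a minimal sketch: collect_stream and fake_chunk are hypothetical names, and fake_chunk only mimics the chunk shape (choices[0].delta.content) so the helper can be tried without a running server. Skipping chunks with an empty choices list is a defensive assumption, since some OpenAI-compatible servers may send them.

```python
from types import SimpleNamespace
from typing import Iterable


def collect_stream(chunks: Iterable) -> str:
    """Join streamed delta fragments into one string, skipping empty chunks."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:  # defensive: tolerate chunks with no choices
            continue
        delta = chunk.choices[0].delta.content
        if delta:  # content may be None or empty on some chunks
            parts.append(delta)
    return "".join(parts)


def fake_chunk(text):
    """Hypothetical stand-in for a streamed chunk, for offline testing only."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


# With a real client: text = collect_stream(stream)
sample = [fake_chunk("TG"), fake_chunk(None), SimpleNamespace(choices=[]), fake_chunk("I")]
print(collect_stream(sample))  # prints "TGI"
```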
Node TypeScript Example
npm install openai
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:8080/v1",
  apiKey: "-",
});

const completion = await client.chat.completions.create({
  model: "tgi",
  messages: [
    { role: "system", content: "You are a concise API assistant." },
    { role: "user", content: "Explain the TGI OpenAI-compatible API." },
  ],
  max_tokens: 120,
});
console.log(completion.choices[0].message.content);
Hugging Face Inference Endpoints
For managed endpoints, Hugging Face's docs say every endpoint using TGI with an LLM that has a chat template can be used with the Messages API.
Critical detail: include the v1/ suffix in base_url.
from openai import OpenAI

client = OpenAI(
    base_url="https://your-endpoint.region.aws.endpoints.huggingface.cloud/v1/",
    api_key="hf_your_token",
)

completion = client.chat.completions.create(
    model="tgi",
    messages=[
        {"role": "user", "content": "Summarize this endpoint architecture."}
    ],
    stream=True,
)
for chunk in completion:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="")
| Endpoint mistake | Result |
|---|---|
| Missing /v1/ suffix | OpenAI SDK calls the wrong path |
| Wrong Hugging Face token | 401 or auth failure |
| Model lacks chat template | Messages API may fail or format poorly |
| Too-small GPU | Slow prefill/generation or OOM |
| No autoscaling policy | Cost waste or capacity shortage |
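The missing-suffix mistake in the table can be guarded against in code before the URL reaches the SDK. The helper below is a sketch under the assumption that your endpoint serves the Messages API under /v1/; normalize_base_url is a hypothetical name, and example.invalid is a placeholder host.

```python
def normalize_base_url(url: str) -> str:
    """Return the URL ending in exactly one /v1/ suffix."""
    trimmed = url.strip().rstrip("/")
    # Callers sometimes paste the full route; strip it back to the API root.
    route = "/chat/completions"
    if trimmed.endswith(route):
        trimmed = trimmed[: -len(route)]
    if not trimmed.endswith("/v1"):
        trimmed += "/v1"
    return trimmed + "/"


print(normalize_base_url("https://example.invalid"))
# prints https://example.invalid/v1/
print(normalize_base_url("https://example.invalid/v1/chat/completions"))
# prints https://example.invalid/v1/
```

Pass the result as base_url when constructing the OpenAI client so every environment resolves to the same path.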
TGI vs vLLM vs Ollama vs TokenMix.ai
| Option | Best for | Not ideal for |
|---|---|---|
| TGI | Production open-model serving on dedicated GPUs | Teams that do not want infra work |
| vLLM | High-throughput serving and OpenAI-compatible endpoints | Teams wanting Hugging Face-native endpoint flow |
| Ollama | Local development and laptop/server experimentation | Multi-tenant production API traffic |
| Hugging Face Inference Endpoints | Managed dedicated open-model hosting | Lowest possible cost without endpoint management |
| TokenMix.ai | One API across many hosted providers/models | Deep custom self-hosted GPU tuning |
TGI is an inference server. TokenMix.ai is an API gateway and model access layer. They solve different problems.
Operational Cost Math
TGI cost depends on GPU hours, utilization, and engineering time. Do not compare it to a pay-per-token API by headline model price only.
| Scenario | Formula | Example result |
|---|---|---|
| Always-on endpoint | GPU hourly price x 730 hours | At $1.50/hour, about $1,095/month |
| Business-hours endpoint | GPU hourly price x 220 hours | At $1.50/hour, about $330/month |
| Scale-to-zero endpoint | active hours x GPU hourly price + cold-start cost | Good for bursty workloads |
| Token API route | input tokens x input price + output tokens x output price | Better for uneven traffic |
The example GPU price is illustrative. Use your actual cloud, region, GPU, reserved/spot terms, and endpoint policy.
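The formulas in the table can be checked with a few lines of arithmetic. This sketch reuses the illustrative $1.50/hour GPU price from above; the token prices (quoted per million tokens) and the token volumes are hypothetical placeholders, not real pricing. The break-even figure ignores GPU throughput limits and engineering time, so treat it as a rough lower bound.

```python
def monthly_gpu_cost(hourly_price: float, hours: float) -> float:
    """GPU hourly price x hours the endpoint is running."""
    return hourly_price * hours


def token_api_cost(input_tokens: float, output_tokens: float,
                   input_price_per_m: float, output_price_per_m: float) -> float:
    """input tokens x input price + output tokens x output price (prices per 1M tokens)."""
    return (input_tokens / 1e6) * input_price_per_m + (output_tokens / 1e6) * output_price_per_m


def break_even_tokens(gpu_monthly: float, blended_price_per_m: float) -> float:
    """Monthly token volume at which a pay-per-token bill equals the GPU bill."""
    return gpu_monthly / blended_price_per_m * 1e6


always_on = monthly_gpu_cost(1.50, 730)       # 1095.0
business_hours = monthly_gpu_cost(1.50, 220)  # 330.0

# Hypothetical workload and prices: 100M input and 20M output tokens per month
# at $0.50 and $1.50 per million tokens.
api_bill = token_api_cost(100_000_000, 20_000_000, 0.50, 1.50)  # 80.0

print(always_on, business_hours, api_bill)
print(break_even_tokens(always_on, 1.00))  # tokens/month at a hypothetical $1.00 per 1M blended price
```

With these placeholder numbers the token route is far cheaper at low volume, which matches the decision table that follows: dedicated GPUs only win once traffic is stable and high.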
Decision table:
| Traffic pattern | Better first choice |
|---|---|
| Low traffic, unpredictable usage | TokenMix.ai or hosted pay-per-token API |
| Stable high traffic | TGI or dedicated endpoint can win |
| Sensitive model customization | TGI or managed dedicated endpoint |
| Many providers and fallback | TokenMix.ai |
| Need one OpenAI SDK path quickly | TokenMix.ai, TGI, or HF endpoint, depending on model ownership |
Production Checklist
| Check | Why it matters |
|---|---|
| Confirm TGI version is 1.4.0+ | Messages API support starts there |
| Confirm model chat template | Chat completions require chat formatting |
| Set max input/output lengths | Prevent OOM and runaway cost |
| Test streaming parser | Streaming chunks differ across clients |
| Load test prefill and decode | Long prompts can bottleneck prefill |
| Add health checks | Production routing needs liveness |
| Export metrics | TGI exposes Prometheus metrics and OpenTelemetry tracing for monitoring |
| Plan scale-to-zero | Dedicated endpoints can burn idle cost |
| Protect gated model token | Use HF_TOKEN safely |
| Add fallback route | Do not let one endpoint be your only production path |
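The fallback item in the checklist can be sketched as an ordered list of routes tried in turn. Everything here is illustrative: complete_with_fallback and make_route are hypothetical names, and a route is assumed to be any callable that takes a messages list and returns the reply text. Because TGI and a hosted gateway both accept OpenAI-style requests, the same wrapper shape can serve both.

```python
from typing import Callable, Sequence


def make_route(client, model: str) -> Callable[[list], str]:
    """Wrap an OpenAI-style client (assumed shape) as a route callable."""
    def route(messages: list) -> str:
        resp = client.chat.completions.create(model=model, messages=messages)
        return resp.choices[0].message.content
    return route


def complete_with_fallback(routes: Sequence[Callable[[list], str]], messages: list) -> str:
    """Try each route in order and return the first successful reply."""
    errors = []
    for route in routes:
        try:
            return route(messages)
        except Exception as exc:  # in production, catch narrower error types
            errors.append(exc)
    raise RuntimeError(f"all {len(routes)} routes failed: {errors!r}")


# Example wiring (clients are assumptions, not defined here):
# routes = [make_route(tgi_client, "tgi"), make_route(backup_client, "backup-model")]
# reply = complete_with_fallback(routes, [{"role": "user", "content": "ping"}])
```

Keeping the routes as plain callables makes the fallback order easy to test with stubs before pointing it at live endpoints.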
When To Use TokenMix.ai Instead
Use TokenMix.ai when the goal is app-level API routing, not GPU serving.
| Need | TGI | TokenMix.ai |
|---|---|---|
| Serve a specific open model yourself | Strong | Not the point |
| Avoid GPU operations | Weak | Strong |
| Route OpenAI, Claude, Gemini, DeepSeek, Qwen, Kimi | Manual | Strong |
| OpenAI SDK compatibility | Yes | Yes |
| Provider fallback | You build it | Gateway-level capability |
| Fast launch | Medium | Strong |
| Deep model/runtime tuning | Strong | Not the point |
If your engineering team wants to own inference, use TGI. If your product team wants model access, fallback, pricing comparison, and one endpoint, start with TokenMix.ai.
Final Recommendation
Use TGI's OpenAI-compatible API when you need dedicated open-model inference and can operate GPUs. Use Hugging Face Inference Endpoints when you want TGI with less infrastructure work.
Use TokenMix.ai when your real requirement is not self-hosting, but one OpenAI-compatible API across many hosted models.
FAQ
Does TGI support the OpenAI API?
Yes. TGI supports a Messages API compatible with the OpenAI Chat Completion API starting from TGI 1.4.0.
What endpoint path should I use?
Use /v1/chat/completions. For the OpenAI SDK, set base_url or baseURL to the server root ending in /v1.
What model name should I use?
For TGI examples, Hugging Face uses model="tgi". The served model is selected when you launch the TGI server with --model-id.
Does TGI support streaming?
Yes. TGI's Messages API examples include streaming with the OpenAI Python client.
Can I use TGI through Hugging Face Inference Endpoints?
Yes. Hugging Face docs say TGI-backed Inference Endpoints with chat-template LLMs can use the Messages API. Include the v1/ suffix in the endpoint URL.
Is TGI cheaper than hosted APIs?
Sometimes. It depends on utilization. Always-on GPUs can be expensive when traffic is low; dedicated serving can win when traffic is stable and high.
Does TGI replace TokenMix.ai?
No. TGI serves models. TokenMix.ai routes across hosted models and providers. They solve different layers of the stack.
Should I use TGI for production?
Use TGI for production if you can operate GPU infrastructure, monitor latency, manage scaling, and handle fallback. Otherwise use a managed endpoint or gateway first.
Related Articles
- OpenAI-Compatible API Guide 2026: One SDK, Many Models
- Ollama OpenAI-Compatible API: Local Setup and Hosted Alternatives
- Gemini OpenAI-Compatible API: Use Gemini with the OpenAI SDK
- Anthropic OpenAI-Compatible API 2026: Claude SDK Setup Guide
- LLM API Gateway 2026: Routing, Fallbacks, Spend Control
- OpenRouter Alternatives 2026: Pricing, Models, Payments
- LLM API Pricing Comparison 2026: GPT, Claude, Gemini, DeepSeek
Sources
- Hugging Face TGI Messages API: https://huggingface.co/docs/text-generation-inference/main/messages_api
- Hugging Face TGI GitHub README: https://github.com/huggingface/text-generation-inference
- Hugging Face InferenceClient guide: https://github.com/huggingface/huggingface_hub/blob/main/docs/source/en/guides/inference.md
- Hugging Face TGI Messages API blog: https://huggingface.co/blog/tgi-messages-api