
Text Generation Inference OpenAI-Compatible API 2026 Guide


Last Updated: 2026-04-30
Author: TokenMix Research Lab
Data checked: 2026-04-30

Text Generation Inference supports an OpenAI-compatible chat completions API. Use it when you want to serve open LLMs behind /v1/chat/completions.

Hugging Face's TGI Messages API docs say Text Generation Inference supports a Messages API fully compatible with the OpenAI Chat Completion API starting from TGI 1.4.0. The practical route is clear: deploy a model with TGI, expose /v1/chat/completions, point the OpenAI SDK at your TGI base_url, and set model to tgi. The caveat is also clear: TGI gives you control over open models and infrastructure, but you own GPU sizing, scaling, latency, uptime, and model quality.


Quick Answer

TGI's OpenAI-compatible API means you can call a TGI-served open model with OpenAI-style chat.completions.create() requests. The endpoint is usually:

http://localhost:8080/v1/chat/completions

or, for Hugging Face Inference Endpoints:

https://your-endpoint.endpoints.huggingface.cloud/v1/

Use TGI when you need open-model control, dedicated infrastructure, and self-managed inference. Use TokenMix.ai when you need one OpenAI-compatible API across hosted models without owning GPU operations.
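A minimal client sketch, assuming the local Docker setup shown later in this guide; for a managed Inference Endpoint, swap in that URL with the /v1/ suffix:

from openai import OpenAI

# Local TGI server; for Hugging Face Inference Endpoints use
# "https://your-endpoint.endpoints.huggingface.cloud/v1/" instead.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="-")

reply = client.chat.completions.create(
    model="tgi",  # the served model is fixed at launch with --model-id
    messages=[{"role": "user", "content": "Ping"}],
)
print(reply.choices[0].message.content)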

Confirmed vs Caveat

| Claim | Status | Source / note |
| --- | --- | --- |
| TGI supports OpenAI-compatible Messages API | Confirmed | Hugging Face TGI docs |
| Support starts from TGI 1.4.0 | Confirmed | TGI Messages API docs |
| Endpoint path is /v1/chat/completions | Confirmed | TGI docs and GitHub README |
| OpenAI Python client works | Confirmed | Official examples |
| Streaming works | Confirmed | Official examples |
| Hugging Face Inference Endpoints can expose TGI Messages API | Confirmed | Official docs |
| Every Hugging Face model works | No | Model needs chat template / compatible serving path |
| You avoid infrastructure work | No | TGI means you operate GPUs or endpoints |
| Tool calling always works | Caveat | Depends on model, serving path, and client expectations (see the sketch after this table) |
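The tool-calling caveat is worth a concrete sketch. The request below uses the standard OpenAI tools shape; the get_weather function and its schema are made up for illustration, and whether the served model actually emits tool calls depends on the model and the TGI version you deploy.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="-")

# Hypothetical tool schema for illustration only.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

completion = client.chat.completions.create(
    model="tgi",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)

choice = completion.choices[0]
if choice.message.tool_calls:
    # The model chose to call a tool; inspect the name and JSON arguments.
    call = choice.message.tool_calls[0]
    print(call.function.name, call.function.arguments)
else:
    # The model answered directly, or the serving path ignored the tools.
    print(choice.message.content)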

Architecture Snapshot

| Layer | TGI self-hosted | Hugging Face Inference Endpoint | TokenMix.ai |
| --- | --- | --- | --- |
| API shape | OpenAI-compatible chat completions | OpenAI-compatible chat completions | OpenAI-compatible multi-model API |
| Model source | Hugging Face model repo / local model | Hugging Face model repo | Hosted provider/model catalog |
| Infrastructure | You run it | Hugging Face managed endpoint | TokenMix.ai managed gateway |
| GPU responsibility | You | Hugging Face endpoint config | Provider/gateway side |
| Scaling | You configure | Endpoint-managed options | Gateway/provider routing |
| Best for | Dedicated open-model serving | Managed dedicated open-model serving | Fast multi-model app integration |
| Main risk | Ops burden | Endpoint cost and cold starts | Provider route capability differs by model |

Local Docker Setup

Hugging Face's TGI GitHub README shows the Docker path as the easiest starting point.

model=HuggingFaceH4/zephyr-7b-beta
volume=$PWD/data

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
  ghcr.io/huggingface/text-generation-inference:3.3.5 \
  --model-id $model

Then call the OpenAI-compatible route:

curl http://localhost:8080/v1/chat/completions \
  -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tgi",
    "messages": [
      {"role": "system", "content": "You are a concise API assistant."},
      {"role": "user", "content": "Explain TGI in one sentence."}
    ],
    "stream": false,
    "max_tokens": 100
  }'

| Setup detail | Why it matters |
| --- | --- |
| --gpus all | TGI is designed for GPU inference |
| --shm-size 1g | Shared memory helps NCCL / tensor parallel behavior |
| -p 8080:80 | Exposes TGI on local port 8080 |
| --model-id | Determines the served model |
| HF_TOKEN | Needed for private or gated models |
| /v1/chat/completions | OpenAI-compatible Messages API route |
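Model loading can take a while after docker run, so it helps to wait for the server before sending traffic. This is a minimal sketch that polls TGI's /health route on the port mapped above; confirm the route against the API docs of your TGI version.

import time
import urllib.request
import urllib.error

def wait_for_tgi(base="http://localhost:8080", timeout_s=180):
    """Poll the /health route until the server answers 200 or the timeout expires."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(f"{base}/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server still starting or model still loading; retry
        time.sleep(2)
    return False

print("TGI ready:", wait_for_tgi())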

Python OpenAI SDK Example

pip install openai
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="-",
)

completion = client.chat.completions.create(
    model="tgi",
    messages=[
        {"role": "system", "content": "You are a concise API assistant."},
        {"role": "user", "content": "What is TGI?"},
    ],
    max_tokens=120,
    stream=False,
)

print(completion.choices[0].message.content)

Streaming:

stream = client.chat.completions.create(
    model="tgi",
    messages=[{"role": "user", "content": "List three TGI use cases."}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="")

Node TypeScript Example

npm install openai
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:8080/v1",
  apiKey: "-",
});

const completion = await client.chat.completions.create({
  model: "tgi",
  messages: [
    { role: "system", content: "You are a concise API assistant." },
    { role: "user", content: "Explain the TGI OpenAI-compatible API." },
  ],
  max_tokens: 120,
});

console.log(completion.choices[0].message.content);

Hugging Face Inference Endpoints

For managed endpoints, Hugging Face's docs say every endpoint using TGI with an LLM that has a chat template can be used with the Messages API.

Critical detail: include the v1/ suffix in base_url.

from openai import OpenAI

client = OpenAI(
    base_url="https://your-endpoint.region.aws.endpoints.huggingface.cloud/v1/",
    api_key="hf_your_token",
)

completion = client.chat.completions.create(
    model="tgi",
    messages=[
        {"role": "user", "content": "Summarize this endpoint architecture."}
    ],
    stream=True,
)

for chunk in completion:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="")

| Endpoint mistake | Result |
| --- | --- |
| Missing /v1/ suffix | OpenAI SDK calls the wrong path (see the helper after this table) |
| Wrong Hugging Face token | 401 or auth failure |
| Model lacks chat template | Messages API may fail or format poorly |
| Too-small GPU | Slow prefill/generation or OOM |
| No autoscaling policy | Cost waste or capacity shortage |
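One way to avoid the most common mistake is to normalize the URL before building the client. normalize_base_url below is a hypothetical helper, not part of any SDK:

def normalize_base_url(url: str) -> str:
    """Make sure the endpoint URL ends with /v1 so the SDK hits /v1/chat/completions."""
    url = url.rstrip("/")
    if not url.endswith("/v1"):
        url += "/v1"
    return url

# A raw Inference Endpoint URL without the suffix gets corrected:
print(normalize_base_url("https://your-endpoint.endpoints.huggingface.cloud"))
# -> https://your-endpoint.endpoints.huggingface.cloud/v1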

TGI vs vLLM vs Ollama vs TokenMix.ai

| Option | Best for | Not ideal for |
| --- | --- | --- |
| TGI | Production open-model serving on dedicated GPUs | Teams that do not want infra work |
| vLLM | High-throughput serving and OpenAI-compatible endpoints | Teams wanting a Hugging Face-native endpoint flow |
| Ollama | Local development and laptop/server experimentation | Multi-tenant production API traffic |
| Hugging Face Inference Endpoints | Managed dedicated open-model hosting | Lowest possible cost without endpoint management |
| TokenMix.ai | One API across many hosted providers/models | Deep custom self-hosted GPU tuning |

TGI is an inference server. TokenMix.ai is an API gateway and model access layer. They solve different problems.

Operational Cost Math

TGI cost depends on GPU hours, utilization, and engineering time. Do not compare it to a pay-per-token API by headline model price only.

| Scenario | Formula | Example result |
| --- | --- | --- |
| Always-on endpoint | GPU hourly price x 730 hours | At $1.50/hour, about $1,095/month |
| Business-hours endpoint | GPU hourly price x 220 hours | At $1.50/hour, about $330/month |
| Scale-to-zero endpoint | active hours x GPU hourly price + cold-start cost | Good for bursty workloads |
| Token API route | input tokens x input price + output tokens x output price | Better for uneven traffic |

The example GPU price is illustrative. Use your actual cloud, region, GPU, reserved/spot terms, and endpoint policy.
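To make the comparison concrete, here is a small sketch of the arithmetic. Every number below is a placeholder to replace with your real GPU rate, token prices, and traffic:

# Placeholder numbers; substitute your own cloud pricing and traffic profile.
gpu_hourly = 1.50              # $/hour, the illustrative rate from the table above
always_on_hours = 730          # roughly one month

input_price = 0.30 / 1_000_000   # $ per input token (hypothetical hosted rate)
output_price = 1.20 / 1_000_000  # $ per output token (hypothetical hosted rate)
monthly_requests = 200_000
tokens_in, tokens_out = 800, 300  # average tokens per request

dedicated = gpu_hourly * always_on_hours
per_token = monthly_requests * (tokens_in * input_price + tokens_out * output_price)

print(f"Always-on TGI endpoint: ${dedicated:,.0f}/month")  # ~$1,095
print(f"Pay-per-token route:    ${per_token:,.0f}/month")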

Decision table:

| Traffic pattern | Better first choice |
| --- | --- |
| Low traffic, unpredictable usage | TokenMix.ai or a hosted pay-per-token API |
| Stable high traffic | TGI or a dedicated endpoint can win |
| Sensitive model customization | TGI or a managed dedicated endpoint |
| Many providers and fallback | TokenMix.ai |
| Need one OpenAI SDK path quickly | TokenMix.ai, TGI, or an HF endpoint, depending on model ownership |

Production Checklist

| Check | Why it matters |
| --- | --- |
| Confirm TGI version is 1.4.0+ | Messages API support starts there |
| Confirm model chat template | Chat completions require chat formatting |
| Set max input/output lengths | Prevent OOM and runaway cost |
| Test streaming parser | Streaming chunks differ across clients |
| Load test prefill and decode | Long prompts can bottleneck prefill |
| Add health checks | Production routing needs liveness |
| Export metrics | TGI supports production observability patterns |
| Plan scale-to-zero | Dedicated endpoints can burn idle cost |
| Protect gated model token | Use HF_TOKEN safely |
| Add fallback route | Do not let one endpoint be your only production path (see the sketch after this table) |
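As the last row suggests, a simple client-side fallback keeps one endpoint from being your only production path. This is a sketch under assumed names: the internal TGI URL, the hosted base URL, and the model identifiers are placeholders for your own routes (a TokenMix.ai endpoint would slot into the fallback position).

from openai import OpenAI

# Placeholder routes; substitute your real TGI host and hosted fallback.
PRIMARY = {"base_url": "http://tgi.internal:8080/v1", "api_key": "-", "model": "tgi"}
FALLBACK = {"base_url": "https://gateway.example/v1", "api_key": "YOUR_KEY", "model": "your-hosted-model"}

def chat_with_fallback(messages):
    """Try the self-hosted TGI route first, then fall back to the hosted route."""
    last_err = None
    for route in (PRIMARY, FALLBACK):
        client = OpenAI(base_url=route["base_url"], api_key=route["api_key"], timeout=30)
        try:
            return client.chat.completions.create(model=route["model"], messages=messages)
        except Exception as err:  # timeouts, connection errors, 5xx responses
            last_err = err
    raise last_err

resp = chat_with_fallback([{"role": "user", "content": "Ping"}])
print(resp.choices[0].message.content)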

When To Use TokenMix.ai Instead

Use TokenMix.ai when the goal is app-level API routing, not GPU serving.

| Need | TGI | TokenMix.ai |
| --- | --- | --- |
| Serve a specific open model yourself | Strong | Not the point |
| Avoid GPU operations | Weak | Strong |
| Route OpenAI, Claude, Gemini, DeepSeek, Qwen, Kimi | Manual | Strong |
| OpenAI SDK compatibility | Yes | Yes |
| Provider fallback | You build it | Gateway-level capability |
| Fast launch | Medium | Strong |
| Deep model/runtime tuning | Strong | Not the point |

If your engineering team wants to own inference, use TGI. If your product team wants model access, fallback, pricing comparison, and one endpoint, start with TokenMix.ai.

Final Recommendation

Use TGI OpenAI-compatible API when you need dedicated open-model inference and can operate GPUs. Use Hugging Face Inference Endpoints when you want TGI with less infrastructure work.

Use TokenMix.ai when your real requirement is not self-hosting, but one OpenAI-compatible API across many hosted models.

FAQ

Does TGI support the OpenAI API?

Yes. TGI supports a Messages API compatible with the OpenAI Chat Completion API starting from TGI 1.4.0.

What endpoint path should I use?

Use /v1/chat/completions. For the OpenAI SDK, set base_url or baseURL to the server root ending in /v1.

What model name should I use?

For TGI examples, Hugging Face uses model="tgi". The served model is selected when you launch the TGI server with --model-id.

Does TGI support streaming?

Yes. TGI's Messages API examples include streaming with the OpenAI Python client.

Can I use TGI through Hugging Face Inference Endpoints?

Yes. Hugging Face docs say TGI-backed Inference Endpoints with chat-template LLMs can use the Messages API. Include the v1/ suffix in the endpoint URL.

Is TGI cheaper than hosted APIs?

Sometimes. It depends on utilization. Always-on GPUs can be expensive when traffic is low; dedicated serving can win when traffic is stable and high.

Does TGI replace TokenMix.ai?

No. TGI serves models. TokenMix.ai routes across hosted models and providers. They solve different layers of the stack.

Should I use TGI for production?

Use TGI for production if you can operate GPU infrastructure, monitor latency, manage scaling, and handle fallback. Otherwise use a managed endpoint or gateway first.
