

Overview

chat/completions is the most commonly used LLM API endpoint. It takes a conversation, expressed as a list of messages, as input and returns the model's response. The endpoint follows the OpenAI Chat Completions API format, so existing OpenAI-compatible code can be integrated with little or no change.
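Because the endpoint follows the OpenAI format, a request body is a JSON object containing a model name and a messages list, plus optional parameters. The sketch below builds and sends such a body using only the Python standard library. The helper names `build_chat_request` and `send_chat_request`, the placeholder base URL, and Bearer-token authentication (the OpenAI convention) are assumptions for illustration; check the API reference for WisGate's actual base URL and authentication scheme.

```python
import json
import urllib.request

# Hypothetical placeholder: replace with the base URL for your WisGate account
BASE_URL = "https://api.example.com/v1"


def build_chat_request(model, messages, **params):
    """Build an OpenAI-style chat/completions request body (hypothetical helper)."""
    body = {"model": model, "messages": messages}
    # Optional parameters such as max_tokens or stream; support varies by model
    body.update(params)
    return body


def send_chat_request(body, api_key, base_url=BASE_URL):
    """POST the body to {base_url}/chat/completions and return the parsed JSON.

    Assumes Bearer-token authentication, as used by the OpenAI API.
    """
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


body = build_chat_request(
    "gpt-4",
    [
        {"role": "system", "content": "You are a helpful assistant"},
        {"role": "user", "content": "What is Python?"},
    ],
    max_tokens=200,
)
print(json.dumps(body, indent=2))
```

In the OpenAI response format, the reply text of a non-streaming call is found at `response["choices"][0]["message"]["content"]`.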

Important Notes

Model Differences: Different model providers may support different request parameters and return different response fields. Check the model catalog for each model's complete parameter list and usage instructions.
Response Pass-through Principle: Apart from any necessary format conversion, WisGate typically does not modify model responses, so the content you receive is consistent with what the original API provider returns.
Streaming Support: WisGate supports Server-Sent Events (SSE) for streaming responses. Set "stream": true in your request to receive the response incrementally as it is generated, which is useful for chat applications.
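On the wire, an OpenAI-format SSE stream is a series of `data: {...}` lines separated by blank lines, where each JSON chunk carries a `delta`, and the stream ends with `data: [DONE]`. Assuming WisGate passes this framing through unchanged, the sketch below reassembles the reply text from raw stream lines. The function name `collect_stream_text` and the sample chunks are illustrative, and the sketch assumes a single choice per request (n=1).

```python
import json


def collect_stream_text(lines):
    """Concatenate delta content from OpenAI-style SSE lines (assumes n=1)."""
    parts = []
    for raw in lines:
        line = raw.strip()
        # Skip blank separator lines and non-data fields such as ": keep-alive"
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":  # end-of-stream sentinel
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                parts.append(content)
    return "".join(parts)


sample = [
    'data: {"choices": [{"index": 0, "delta": {"role": "assistant", "content": ""}}]}',
    "",
    'data: {"choices": [{"index": 0, "delta": {"content": "Hello"}}]}',
    "",
    'data: {"choices": [{"index": 0, "delta": {"content": ", world"}}]}',
    "",
    'data: {"choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]}',
    "",
    "data: [DONE]",
]
print(collect_stream_text(sample))  # Hello, world
```

The official OpenAI SDKs do this parsing for you; a hand-written parser like this is only needed when calling the endpoint over raw HTTP.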

Auto-Generated Documentation: The request parameters and response format are generated automatically from the OpenAPI specification. Every parameter's type, description, default, and example is pulled directly from openapi.json. Scroll down to see the interactive API reference.

FAQ

How to handle rate limits?

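When a request fails with 429 Too Many Requests, retry it with exponential backoff:
```python
import random
import time

from openai import OpenAI, RateLimitError

# Reads OPENAI_API_KEY and, if set, OPENAI_BASE_URL from the environment;
# point OPENAI_BASE_URL at your WisGate API base URL
client = OpenAI()


def chat_with_retry(messages, max_retries=3):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model="gpt-4",
                messages=messages,
            )
        except RateLimitError:
            # Give up and re-raise after the final attempt
            if attempt == max_retries - 1:
                raise
            # Wait 1s, 2s, 4s, ... plus up to 1s of random jitter
            time.sleep(2 ** attempt + random.random())
```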

How to maintain conversation context?

The API does not remember earlier requests, so send the complete conversation history in the messages array on every call:
```python
from openai import OpenAI

# Configure the API key and base URL as for any OpenAI-compatible client
client = OpenAI()

conversation_history = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "What is Python?"},
    {"role": "assistant", "content": "Python is a programming language..."},
    {"role": "user", "content": "What are its advantages?"},
]

response = client.chat.completions.create(
    model="gpt-4",
    messages=conversation_history,
)
```
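In a multi-turn chat, each assistant reply has to be appended to the history before the next user message is sent, and long histories are often trimmed to save tokens. The sketch below shows both steps. The helper names `add_turn` and `trim_history` are hypothetical, and `get_reply` stands in for any function that sends the message list and returns the reply text (for example, a wrapper that returns `client.chat.completions.create(...).choices[0].message.content`); a stub is used here so the example runs offline.

```python
def add_turn(history, user_message, get_reply):
    """Append a user turn, request a reply, and record the reply in history."""
    history.append({"role": "user", "content": user_message})
    reply = get_reply(history)  # sends the full history, not just the new message
    history.append({"role": "assistant", "content": reply})
    return reply


def trim_history(history, max_messages):
    """Keep all system messages plus the most recent max_messages other messages."""
    system = [m for m in history if m["role"] == "system"]
    others = [m for m in history if m["role"] != "system"]
    if max_messages <= 0:
        return system
    return system + others[-max_messages:]


def fake_reply(messages):
    # Offline stand-in for a real API call
    return f"(reply to message {len(messages)})"


history = [{"role": "system", "content": "You are a helpful assistant"}]
add_turn(history, "What is Python?", fake_reply)
add_turn(history, "What are its advantages?", fake_reply)
print(len(history))  # 5
print(trim_history(history, 2))
```

Dropping the oldest turns is the simplest trimming policy; it keeps the system prompt intact but loses earlier details, so summarizing old turns may suit some applications better.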

What does finish_reason mean?

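| Value | Meaning |
|---|---|
| stop | The model reached a natural stopping point or a provided stop sequence |
| length | The output reached the max_tokens limit |
| content_filter | Content was omitted because it triggered a content filter |
| tool_calls | The model called one or more tools |
| function_call | The model called a function (legacy; superseded by tool_calls) |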
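A common use of finish_reason is detecting replies that were cut off or filtered before showing them to users. The sketch below classifies a parsed, non-streaming response in the OpenAI format. The status labels and the names `STATUS_BY_REASON` and `classify_completion` are hypothetical; `tool_calls` is the value current OpenAI models return in place of the legacy `function_call`.

```python
# Hypothetical status labels for each finish_reason value
STATUS_BY_REASON = {
    "stop": "complete",
    "length": "truncated",          # hit max_tokens; consider raising the limit
    "content_filter": "filtered",
    "tool_calls": "tool_call",
    "function_call": "tool_call",   # legacy function-calling value
}


def classify_completion(response):
    """Return (reply_text, status) for a parsed chat/completions response."""
    choice = response["choices"][0]
    reason = choice.get("finish_reason")
    # content can be null, for example when the model only calls tools
    text = choice.get("message", {}).get("content") or ""
    return text, STATUS_BY_REASON.get(reason, "unknown")


response = {
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "Python is a"},
            "finish_reason": "length",
        }
    ]
}
print(classify_completion(response))  # ('Python is a', 'truncated')
```

Because some providers may return values not listed in the table, the sketch falls back to "unknown" instead of raising an error.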

How to control costs?

  1. Use max_tokens to limit output length
  2. Choose an appropriate model (e.g., GPT-3.5 Turbo is more economical than GPT-4)
  3. Keep prompts concise and avoid redundant context
  4. Monitor token consumption through the usage field in each response
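In the OpenAI response format, the usage field reports prompt_tokens, completion_tokens, and total_tokens, which is enough to estimate the cost of each request. The sketch below does this arithmetic. The per-1,000-token prices are illustrative placeholders, not WisGate's actual pricing, and the names `PRICES_PER_1K` and `estimate_cost` are hypothetical; look up real rates in the model catalog.

```python
# Illustrative placeholder prices in USD per 1,000 tokens (NOT real WisGate rates)
PRICES_PER_1K = {
    "gpt-4": {"prompt": 0.03, "completion": 0.06},
    "gpt-3.5-turbo": {"prompt": 0.0005, "completion": 0.0015},
}


def estimate_cost(model, usage):
    """Estimate the cost of one request from the response's usage field."""
    price = PRICES_PER_1K[model]
    prompt_cost = usage["prompt_tokens"] / 1000 * price["prompt"]
    completion_cost = usage["completion_tokens"] / 1000 * price["completion"]
    return prompt_cost + completion_cost


usage = {"prompt_tokens": 1000, "completion_tokens": 500, "total_tokens": 1500}
print(f"${estimate_cost('gpt-4', usage):.4f}")  # $0.0600
```

Prompt and completion tokens are priced separately here because many providers charge different rates for input and output.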

How to use streaming?

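Enable streaming by setting "stream": true in the request (stream=True in the Python SDK):
```python
from openai import OpenAI

client = OpenAI()

stream = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True,
)

for chunk in stream:
    # Some chunks (e.g. a final usage-only chunk) may carry no choices
    if chunk.choices and chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```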