Text Generation Guide

Learn how to use our powerful text generation tools and API features.

View API Reference

Powered by Aphrodite-Engine and vLLM

Arli AI Text Generation is powered by both Aphrodite-Engine and vLLM, depending on the model. As such, most of our available generation parameters are similar to those available in Aphrodite-Engine.

https://github.com/aphrodite-engine/aphrodite-engine
https://github.com/vllm-project/vllm

Authentication & Usage

All Text Generation API endpoints require authentication using a Bearer token in the Authorization header. Replace {ARLIAI_API_KEY} in the examples with your actual API key.
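As a sketch, assuming an OpenAI-compatible chat completions endpoint at `https://api.arliai.com/v1/chat/completions` and an illustrative model name (check the API Reference for the actual base URL and available models), an authenticated request in Python might look like:

```python
# Minimal sketch of an authenticated request. The endpoint URL and model
# name are assumptions; only the Bearer header is mandated by the docs.
import json
import urllib.request

ARLIAI_API_KEY = "{ARLIAI_API_KEY}"  # placeholder, replace with your key

payload = {
    "model": "Meta-Llama-3.1-8B-Instruct",  # hypothetical model name
    "messages": [{"role": "user", "content": "Hello!"}],
}
request = urllib.request.Request(
    "https://api.arliai.com/v1/chat/completions",  # assumed base URL
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Authorization": f"Bearer {ARLIAI_API_KEY}",
        "Content-Type": "application/json",
    },
)
# response = urllib.request.urlopen(request)  # uncomment to send
```

The only hard requirement described above is the `Authorization: Bearer` header; the rest follows the usual chat completions shape.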

Ensure you have access granted to the specific Text Generation models you intend to use. Free accounts are able to use each model for a maximum of 5 requests every 2 days for testing purposes. Text generation requests are subject to rate limits and concurrency limits based on your account plan. Exceeding limits may result in temporary account restrictions.

API Key parameter overrides (set in your account settings) will merge with and take precedence over parameters sent in the request body for allowed parameters.
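In effect, the merge described above behaves like a dictionary update in which the key's overrides win. This is an illustrative sketch of the semantics, not the server's actual implementation:

```python
# Sketch of the override semantics: key-level overrides take precedence
# over request-body values for the parameters they set.
def apply_key_overrides(request_params: dict, key_overrides: dict) -> dict:
    """Merge request parameters with API-key overrides; overrides win."""
    merged = dict(request_params)
    merged.update(key_overrides)
    return merged

# Example: the key pins temperature, so the request's value is ignored.
merged = apply_key_overrides(
    {"temperature": 1.2, "top_p": 0.9},
    {"temperature": 0.7},
)
# merged == {"temperature": 0.7, "top_p": 0.9}
```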

Parallel Requests

The number of requests you can make to a model at the same time is determined by the parallel-request limit of your account.

If you send more parallel requests than your limit allows, the excess requests will be blocked.
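If you manage your own request fan-out, a client-side semaphore is one way to stay under the limit. This is a hypothetical client-side pattern; the `MAX_PARALLEL` value below is illustrative and should match your plan's actual limit:

```python
# Client-side throttle sketch: a bounded semaphore caps in-flight requests
# so concurrent calls never exceed the plan's parallel-request limit.
import threading

MAX_PARALLEL = 4  # hypothetical plan limit
_slots = threading.BoundedSemaphore(MAX_PARALLEL)

def send_request(make_call):
    """Run make_call() only when a request slot is free."""
    with _slots:
        return make_call()
```

Threads that call `send_request` simply wait for a free slot instead of triggering server-side blocking.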

API Key Features

Your API keys are more than just for authentication. From the Account page, you can configure powerful overrides and settings that apply to every API request made with that key. This is perfect for using our API with third-party clients that may not support all of our unique features.

Parameter Overrides

You can set default generation parameters directly on your API key. These settings will override any parameters sent in an API request. This allows you to enforce specific settings or use advanced features not exposed in other interfaces.

  • Standard Parameters: Set defaults for temperature, top_p, top_k, repetition_penalty, and more.
  • Multi-Model (Arli AI Custom): Provide a list of models in the multi_models field. The API will randomly select one model from this list for each request, which is great for variety or A/B testing.
  • Multiplier (Arli AI Custom): For fine-tuned models, the multiplier adjusts the LoRA alpha value, controlling the strength of the fine-tune.
  • Hide Thinking (Arli AI Custom): Enable the hide_thinking checkbox to ensure the model's reasoning process (content within <think>...</think> tags) is stripped from the final output.
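The `hide_thinking` behavior can be approximated client-side as stripping `<think>...</think>` spans from the output, for example when a third-party client returns raw text. A rough sketch (the server-side implementation may differ):

```python
# Approximation of hide_thinking: remove <think>...</think> spans,
# plus any whitespace immediately after them, from model output.
import re

THINK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

def strip_thinking(text: str) -> str:
    return THINK_RE.sub("", text)

strip_thinking("<think>Let me reason...</think>The answer is 4.")
# -> "The answer is 4."
```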

Model Filtering

Use the "Model Filter" option to specify a whitelist of models that can be accessed with a particular API key. When this key is used to query the /v1/models/textgen-models endpoint, only the models from your filtered list will be returned.

API Key Settings

Advanced Chat

The Advanced Chat is a powerful interface for interacting with a single model. It offers extensive control over the chat process and session management.

Generation Parameters

These are the core parameters that control the text generation process. They are available in both Advanced Chat and Arena Chat modes.

  • System Prompt: Define a custom system prompt to guide the AI's behavior, personality, and response format. This is a powerful tool for creating specialized assistants.
  • Temperature: Controls the randomness of the output. A higher value (e.g., 1.0) results in more creative and diverse responses, while a lower value (e.g., 0.1) makes the output more deterministic and focused.
  • Top-P (Nucleus Sampling): An alternative truncation method. It samples from the smallest set of tokens whose cumulative probability exceeds the value `p`. This can produce more coherent text than high-temperature sampling.
  • Min-P: Sets a minimum probability for tokens to be considered. Tokens with a probability below this threshold are excluded from sampling.
  • Top-K: Restricts sampling to the `k` most likely next tokens. A value of -1 disables this feature.
  • Repetition Penalty: Penalizes tokens that have already appeared in the conversation, reducing the likelihood of the model repeating itself. A value greater than 1.0 will discourage repetition.
  • Presence Penalty: Penalizes new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
  • Frequency Penalty: Penalizes new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
  • Max Tokens: The maximum number of tokens (word fragments, words, and punctuation) that the model can generate in a single response.
  • Seed: An integer used to initialize the random number generator. Using the same seed will produce the same output for the same prompt and parameters. Leave empty for a random seed.
  • Smoothing Factor & Curve: Applies a transformation to the logits to flatten the probability distribution, which can help in reducing the model's tendency to get stuck on repetitive loops.
  • Top-A: Another sampling method that removes tokens whose probability is less than `a * p_max^2`, where `a` is the Top-A value and `p_max` is the probability of the most likely token.
  • TFS (Tail Free Sampling): A sampling technique that removes tokens from the tail end of the probability distribution, helping to eliminate unlikely but possible tokens.
  • Skew: A parameter that can be used to skew the probability distribution towards more or less likely tokens.
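To make the truncation samplers above concrete, here is a simplified sketch of how Top-K, Top-P, and Min-P prune a next-token distribution. Real engines operate on logits and renormalize afterwards; this uses plain probabilities for clarity:

```python
# Simplified view of three truncation samplers over a token distribution.
def filter_candidates(probs: dict, top_k: int = -1, top_p: float = 1.0,
                      min_p: float = 0.0) -> dict:
    """Return the token -> probability pairs that survive truncation."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]          # Top-K: keep the k most likely
    if top_p < 1.0:
        kept, total = [], 0.0
        for tok, p in ranked:            # Top-P: smallest set whose
            kept.append((tok, p))        # cumulative probability
            total += p                   # exceeds p
            if total >= top_p:
                break
        ranked = kept
    if min_p > 0.0:                      # Min-P: drop tokens below floor
        ranked = [(t, p) for t, p in ranked if p >= min_p]
    return dict(ranked)

dist = {"the": 0.5, "a": 0.3, "cat": 0.15, "zzz": 0.05}
filter_candidates(dist, top_p=0.75)  # keeps "the" and "a"
filter_candidates(dist, top_k=2)     # keeps the two most likely tokens
```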

Advanced Sampling (DRY)

DRY (Don't Repeat Yourself) sampling is a set of parameters designed to prevent the model from repeating sequences of tokens.

  • DRY Multiplier: Controls the strength of the DRY penalty.
  • DRY Base: The base value for the DRY penalty calculation.
  • DRY Allowed Length: The length of a repeated token sequence that is tolerated before the penalty is applied.
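A common formulation of the DRY penalty scales exponentially with how far a repeated sequence exceeds the allowed length: `multiplier * base ** (n - allowed_length)`. The sketch below shows only this scaling; the default values are illustrative, not our server's settings:

```python
# Sketch of DRY penalty scaling: once a repeated suffix of length n exceeds
# allowed_length, the penalty grows exponentially in (n - allowed_length).
def dry_penalty(match_length: int, multiplier: float = 0.8,
                base: float = 1.75, allowed_length: int = 2) -> float:
    """Penalty applied to continuing a repeat of the given length."""
    if match_length <= allowed_length:
        return 0.0  # short repeats are tolerated
    return multiplier * base ** (match_length - allowed_length)

dry_penalty(2)  # 0.0 — within the allowed length, no penalty
dry_penalty(4)  # longer repeats are penalized exponentially
```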

Other Features

  • Session Management: All your chats are saved locally. You can create, rename, delete, and switch between chats from the right-hand sidebar.
  • Import/Export: Save and load your chat history as a JSON file.
  • Message Editing & Regeneration: Edit any message in the conversation or regenerate a response at any point.

Advanced Chat Settings

Arena Chat

The Arena Chat provides a split-screen interface to compare two models or two different sets of settings simultaneously.

How to Use

  1. Configure each model pane (Model 1 and Model 2) with its own API key, endpoint, model, and system prompt.
  2. Adjust the common generation parameters in the sidebar that will apply to both models.
  3. Type a single message in the input box at the bottom.
  4. Click send to receive responses from both models in their respective panes, allowing for a direct comparison of their output.
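Conceptually, each send builds two payloads that share the sidebar parameters but keep per-pane settings separate. A sketch with hypothetical model names:

```python
# Sketch of Arena request construction: one prompt, two payloads that
# share generation parameters but differ in model and system prompt.
SHARED_PARAMS = {"temperature": 0.8, "top_p": 0.95}

def build_arena_payloads(prompt: str, pane_configs: list) -> list:
    """Combine shared generation parameters with each pane's own settings."""
    payloads = []
    for pane in pane_configs:
        payloads.append({
            "model": pane["model"],
            "messages": [
                {"role": "system", "content": pane["system_prompt"]},
                {"role": "user", "content": prompt},
            ],
            **SHARED_PARAMS,
        })
    return payloads

payloads = build_arena_payloads(
    "Explain recursion.",
    [
        {"model": "model-a", "system_prompt": "Be concise."},
        {"model": "model-b", "system_prompt": "Be thorough."},
    ],
)
```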

Key Settings

  • Independent Panes: Each side (Model 1 and Model 2) has its own independent settings for the API key, endpoint, model selection, and system prompt. This allows you to compare completely different models or providers.
  • Shared Parameters: The generation parameters in the main sidebar (Temperature, Top-P, etc.) are applied to *both* models simultaneously, ensuring a fair, side-by-side comparison of their responses under the same conditions.