Feature Overview
The online inference service controls how models are consumed, including free quota usage, whether pay-per-token billing is enabled, security policies, and rate limiting policies.
Service Types
Online inference services are categorized into two types:
1. Default
The platform creates online inference services for all supported models by default. Users can enable Pay-as-you-go in the service list on the Model Gallery or Online Inference page to start using it.
2. Custom
If you need to customize the billing policy for model services, or create multiple services to track usage statistics and manage permissions by team, you can create custom inference services in Online Inference. Custom inference services support a wider range of billing options, such as TPM Reservation. In the future, the platform plans to support further capabilities in custom services, such as intelligent routing, rate limiting rules, and enabling or disabling plug-ins, to give you more flexible service management and governance.
Service Status
Each online inference service has a status, as described below:
| Status | Description |
| --- | --- |
| Not enabled | Default-type inference services remain in the Not enabled state until the user activates them. After pay-as-you-go (postpaid) billing is enabled, the service transitions to the Running state. |
| Creating | When a service is enabled for the first time, it is briefly in the Creating state and is expected to transition to Running within 5 seconds. |
| Running | The service is accessible. |
| Stopped | 1. When the account has an overdue payment, pay-as-you-go services become Stopped; once the account balance is replenished, the service automatically returns to Running. 2. When pay-as-you-go (postpaid) billing is manually disabled, the service status changes to Stopped. Users need to manually re-enable it on the Online Inference page to restore the service. |
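The transitions described in the table can be sketched as a small state machine. This is only an illustrative model, not the platform's implementation: the class name `InferenceService`, the method names, and the `stop_reason` field are all hypothetical and were introduced here to distinguish an overdue stop (which recovers automatically) from a manual stop (which does not).

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    NOT_ENABLED = "Not enabled"
    CREATING = "Creating"
    RUNNING = "Running"
    STOPPED = "Stopped"


@dataclass
class InferenceService:
    """Hypothetical model of a service's status; not a platform API."""
    status: Status = Status.NOT_ENABLED
    stop_reason: Optional[str] = None  # "overdue" or "manual" (assumed labels)

    def enable_pay_as_you_go(self) -> None:
        if self.status is Status.NOT_ENABLED:
            # First activation passes briefly through Creating.
            self.status = Status.CREATING
        elif self.status is Status.STOPPED and self.stop_reason == "manual":
            # A manually stopped service must be re-enabled by the user.
            self.status, self.stop_reason = Status.RUNNING, None

    def creation_finished(self) -> None:
        if self.status is Status.CREATING:
            self.status = Status.RUNNING

    def account_overdue(self) -> None:
        if self.status is Status.RUNNING:
            self.status, self.stop_reason = Status.STOPPED, "overdue"

    def balance_replenished(self) -> None:
        # Only an overdue stop recovers automatically.
        if self.status is Status.STOPPED and self.stop_reason == "overdue":
            self.status, self.stop_reason = Status.RUNNING, None

    def disable_pay_as_you_go(self) -> None:
        if self.status is Status.RUNNING:
            self.status, self.stop_reason = Status.STOPPED, "manual"
```

In this sketch, calling `balance_replenished()` on a manually stopped service leaves it in Stopped, which mirrors the requirement that users re-enable such a service themselves.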
Billing Mode
The billing mode indicates how the current service is billed, as described below:
| Billing mode | Description |
| --- | --- |
| Pay-as-you-go | The service has pay-as-you-go (postpaid) billing enabled and is charged by token usage. |
| TPM Reservation | The service has TPM Reservation enabled. Traffic exceeding the reserved TPM limit is billed by token. |
| None | When pay-as-you-go (postpaid) billing is not enabled, the service has no billing mode and is stopped. |
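As a rough illustration of the TPM Reservation row, the following sketch computes how many tokens would fall outside a reservation and be billed by token. It assumes that usage is measured in one-minute windows and that only the excess in each window is charged; the function names, the per-1,000-token price, and the sample numbers are hypothetical, and the reservation fee itself is not modeled.

```python
from typing import Iterable


def overflow_tokens(tokens_used_in_minute: int, reserved_tpm: int) -> int:
    """Tokens in one minute that exceed the reserved TPM (never negative)."""
    return max(0, tokens_used_in_minute - reserved_tpm)


def overflow_charge(
    per_minute_usage: Iterable[int],
    reserved_tpm: int,
    price_per_1k_tokens: float,  # hypothetical unit price
) -> float:
    """Pay-per-token charge for traffic above the reservation."""
    excess = sum(overflow_tokens(u, reserved_tpm) for u in per_minute_usage)
    return excess / 1000 * price_per_1k_tokens


# Three minutes of usage against a 10,000 TPM reservation:
# the excess is 2,000 + 0 + 5,000 = 7,000 tokens.
charge = overflow_charge([12_000, 8_000, 15_000], 10_000, 0.5)
```

Under these assumptions, minutes that stay within the reservation contribute nothing to the pay-per-token charge.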