llama-gguf-inference Documentation
GGUF model inference server using llama.cpp
Welcome to the documentation for llama-gguf-inference, a production-ready inference server for GGUF models.
Quick Start
# Run with Docker
docker run --gpus all \
-v /path/to/models:/data/models \
-e MODEL_NAME=your-model.gguf \
-p 8000:8000 \
ghcr.io/zepfu/llama-gguf-inference
Features
OpenAI-compatible API - Drop-in replacement
Authentication - API key-based access control
Streaming support - Real-time token streaming
Health monitoring - Separate health check port
Platform agnostic - Works anywhere Docker runs
API Reference
Gateway Module
gateway.py – Async HTTP gateway for llama.cpp llama-server
Features: - API key authentication with health endpoint exemption - Proper SSE/streaming support for chat completions - /ping and /health endpoints (no auth required) - Request timeout handling - Basic metrics tracking - Graceful error handling - Per-key access logging
Note: /ping and /health return 200 immediately without backend checks or authentication to enable scale-to-zero in serverless environments.
- Environment Variables:
GATEWAY_PORT - Port to listen on (default: 8000) BACKEND_HOST - llama-server host (default: 127.0.0.1) PORT_BACKEND - llama-server port (default: 8080) BACKEND_API_KEY - API key for backend authentication (optional) REQUEST_TIMEOUT - Max request time in seconds (default: 300) HEALTH_TIMEOUT - Health check timeout in seconds (default: 2) AUTH_ENABLED - Enable API key authentication (default: true) AUTH_KEYS_FILE - Path to API keys file (default: $DATA_DIR/api_keys.txt)
- class gateway.Metrics(requests_total: int = 0, requests_success: int = 0, requests_error: int = 0, requests_active: int = 0, requests_authenticated: int = 0, requests_unauthorized: int = 0, bytes_sent: int = 0, start_time: float = <factory>)[source]
Bases:
object- requests_total: int = 0
- requests_success: int = 0
- requests_error: int = 0
- requests_active: int = 0
- requests_authenticated: int = 0
- requests_unauthorized: int = 0
- bytes_sent: int = 0
- start_time: float
- __init__(requests_total: int = 0, requests_success: int = 0, requests_error: int = 0, requests_active: int = 0, requests_authenticated: int = 0, requests_unauthorized: int = 0, bytes_sent: int = 0, start_time: float = <factory>) None
- async gateway.backend_health_check() dict[source]
Check backend health via /health endpoint.
Returns health status dict or error.
- async gateway.handle_ping(writer: StreamWriter)[source]
Handle /ping endpoint for RunPod health checks.
Always returns 200 OK without authentication or backend checks. For detailed backend status, use /health endpoint instead.
- async gateway.handle_health(writer: StreamWriter)[source]
Handle /health endpoint with detailed backend status.
No authentication required.
- async gateway.handle_metrics(writer: StreamWriter)[source]
Handle /metrics endpoint.
No authentication required.
- async gateway.proxy_request(method: str, path: str, headers: dict, body: bytes | None, writer: StreamWriter, key_id: str = 'unknown')[source]
Proxy a request to the backend with streaming support.
- Parameters:
method – HTTP method (GET, POST, etc.)
path – Request path to forward to backend
headers – Request headers dict (lowercase keys)
body – Request body bytes, or None for bodyless requests
writer – asyncio StreamWriter for the client connection
key_id – The authenticated key_id for logging
Authentication Module
auth.py - API Key Authentication Module for Gateway
File-based authentication system that enforces API keys while maintaining OpenAI compatibility. Uses key_id:api_key format for easy management and auditing.
- Usage:
from auth import api_validator, authenticate_request, log_access
# In your request handler: api_key_info = await authenticate_request(writer, headers) if api_key_info is None:
# 401 response already sent return
key_id = api_key_info # The key_id for logging/tracking
# Process request… await log_access(method, path, key_id, status_code)
- Environment Variables:
AUTH_ENABLED - Enable/disable authentication (default: true) AUTH_KEYS_FILE - Path to keys file (default: $DATA_DIR/api_keys.txt) MAX_REQUESTS_PER_MINUTE - Rate limit per key_id (default: 100)
- Keys File Format:
key_id:api_key
Example
production:sk-prod-abc123def456 alice-laptop:sk-alice-xyz789 development:sk-dev-test123
- class auth.APIKeyValidator[source]
Bases:
objectValidates API keys for incoming requests.
Features: - File-based configuration (key_id:api_key format) - Rate limiting per key_id - Key format validation - Audit trail with key_id tracking
- async auth.authenticate_request(writer: StreamWriter, headers: dict) str | None[source]
Authenticate an incoming request.
If authentication fails, sends a 401 response and returns None. If authentication succeeds, returns the key_id.
- Parameters:
writer – asyncio StreamWriter to send response
headers – Request headers (lowercase keys)
- Returns:
key_id if valid, None if invalid (401 response already sent)
- async auth.send_rate_limit_error(writer: StreamWriter)[source]
Send OpenAI-compatible rate limit error (429).
- async auth.log_access(method: str, path: str, key_id: str, status_code: int)[source]
Log API access for auditing.
Logs to: /data/logs/api_access.log
Format: ISO8601_timestamp | key_id | method path | status_code Example: 2024-02-06T14:30:22.123456 | production | POST /v1/chat/completions | 200
- Parameters:
method – HTTP method
path – Request path
key_id – The key identifier (e.g., “production”, “alice-laptop”)
status_code – HTTP status code
Health Server Module
health_server.py — Ultra-lightweight health check server for RunPod
This server runs on PORT_HEALTH (separate from the main gateway) and provides a minimal health check endpoint that doesn’t interact with the backend.
This prevents RunPod’s periodic health checks from keeping the worker “active” and allows proper scale-to-zero behavior in serverless environments.
- Environment Variables:
PORT_HEALTH - Port to listen on (default: 8001)