llama-gguf-inference Documentation

GGUF model inference server using llama.cpp

Welcome to the documentation for llama-gguf-inference, a production-ready inference server for GGUF models.

Quick Start

# Run with Docker
docker run --gpus all \
  -v /path/to/models:/data/models \
  -e MODEL_NAME=your-model.gguf \
  -p 8000:8000 \
  ghcr.io/zepfu/llama-gguf-inference

Features

OpenAI-compatible API - Drop-in replacement
Authentication - API key-based access control
Streaming support - Real-time token streaming
Health monitoring - Separate health check port
Platform agnostic - Works anywhere Docker runs

API Reference

Gateway Module

gateway.py – Async HTTP gateway for llama.cpp llama-server

Features: - API key authentication with health endpoint exemption - Proper SSE/streaming support for chat completions - /ping and /health endpoints (no auth required) - Request timeout handling - Basic metrics tracking - Graceful error handling - Per-key access logging

Note: /ping and /health return 200 immediately without backend checks or authentication to enable scale-to-zero in serverless environments.

Environment Variables:: GATEWAY_PORT - Port to listen on (default: 8000) BACKEND_HOST - llama-server host (default: 127.0.0.1) PORT_BACKEND - llama-server port (default: 8080) BACKEND_API_KEY - API key for backend authentication (optional) REQUEST_TIMEOUT - Max request time in seconds (default: 300) HEALTH_TIMEOUT - Health check timeout in seconds (default: 2) AUTH_ENABLED - Enable API key authentication (default: true) AUTH_KEYS_FILE - Path to API keys file (default: $DATA_DIR/api_keys.txt)

gateway.log(msg: str)[source]: Simple logging to stderr.

class gateway.Metrics(requests_total: int = 0, requests_success: int = 0, requests_error: int = 0, requests_active: int = 0, requests_authenticated: int = 0, requests_unauthorized: int = 0, bytes_sent: int = 0, start_time: float = <factory>)[source]

Bases: object

requests_total: int = 0

requests_success: int = 0

requests_error: int = 0

requests_active: int = 0

requests_authenticated: int = 0

requests_unauthorized: int = 0

bytes_sent: int = 0

start_time: float

to_dict()[source]

__init__(requests_total: int = 0, requests_success: int = 0, requests_error: int = 0, requests_active: int = 0, requests_authenticated: int = 0, requests_unauthorized: int = 0, bytes_sent: int = 0, start_time: float = <factory>) → None

gateway.backend_tcp_ready() → bool[source]: Check if backend is accepting TCP connections.

async gateway.backend_health_check() → dict[source]

Check backend health via /health endpoint.

Returns health status dict or error.

async gateway.handle_ping(writer: StreamWriter)[source]

Handle /ping endpoint for RunPod health checks.

Always returns 200 OK without authentication or backend checks. For detailed backend status, use /health endpoint instead.

async gateway.handle_health(writer: StreamWriter)[source]

Handle /health endpoint with detailed backend status.

No authentication required.

async gateway.handle_metrics(writer: StreamWriter)[source]

Handle /metrics endpoint.

No authentication required.

async gateway.proxy_request(method: str, path: str, headers: dict, body: bytes | None, writer: StreamWriter, key_id: str = 'unknown')[source]

Proxy a request to the backend with streaming support.

Parameters:

method – HTTP method (GET, POST, etc.)
path – Request path to forward to backend
headers – Request headers dict (lowercase keys)
body – Request body bytes, or None for bodyless requests
writer – asyncio StreamWriter for the client connection
key_id – The authenticated key_id for logging

async gateway.handle_client(reader: StreamReader, writer: StreamWriter)[source]: Handle an incoming client connection.

async gateway.main()[source]: Start the gateway server.

Authentication Module

auth.py - API Key Authentication Module for Gateway

File-based authentication system that enforces API keys while maintaining OpenAI compatibility. Uses key_id:api_key format for easy management and auditing.

Usage:

from auth import api_validator, authenticate_request, log_access

# In your request handler: api_key_info = await authenticate_request(writer, headers) if api_key_info is None:

# 401 response already sent return

key_id = api_key_info # The key_id for logging/tracking

# Process request… await log_access(method, path, key_id, status_code)

Environment Variables:

AUTH_ENABLED - Enable/disable authentication (default: true) AUTH_KEYS_FILE - Path to keys file (default: $DATA_DIR/api_keys.txt) MAX_REQUESTS_PER_MINUTE - Rate limit per key_id (default: 100)

Keys File Format:

key_id:api_key

Example

production:sk-prod-abc123def456 alice-laptop:sk-alice-xyz789 development:sk-dev-test123

class auth.APIKeyValidator[source]

Bases: object

Validates API keys for incoming requests.

Features: - File-based configuration (key_id:api_key format) - Rate limiting per key_id - Key format validation - Audit trail with key_id tracking

__init__()[source]

validate(headers: dict) → Tuple[bool, str][source]

Validate API key from request headers.

Parameters:: headers – Request headers (lowercase keys)
Returns:: (is_valid, key_id_or_error_message) - If valid: (True, key_id) - If invalid: (False, error_message)

get_metrics() → dict[source]

Get current rate limiter metrics per key_id.

Returns:: dict with request counts and limits per key_id

async auth.authenticate_request(writer: StreamWriter, headers: dict) → str | None[source]

Authenticate an incoming request.

If authentication fails, sends a 401 response and returns None. If authentication succeeds, returns the key_id.

Parameters:

writer – asyncio StreamWriter to send response
headers – Request headers (lowercase keys)

Returns:

key_id if valid, None if invalid (401 response already sent)

async auth.send_rate_limit_error(writer: StreamWriter)[source]: Send OpenAI-compatible rate limit error (429).

async auth.log_access(method: str, path: str, key_id: str, status_code: int)[source]

Log API access for auditing.

Logs to: /data/logs/api_access.log

Parameters:

method – HTTP method
path – Request path
key_id – The key identifier (e.g., “production”, “alice-laptop”)
status_code – HTTP status code

Health Server Module

health_server.py — Ultra-lightweight health check server for RunPod

This server runs on PORT_HEALTH (separate from the main gateway) and provides a minimal health check endpoint that doesn’t interact with the backend.

This prevents RunPod’s periodic health checks from keeping the worker “active” and allows proper scale-to-zero behavior in serverless environments.

Environment Variables:: PORT_HEALTH - Port to listen on (default: 8001)

class health_server.HealthHandler(request, client_address, server)[source]

Bases: BaseHTTPRequestHandler

Minimal health check handler - just returns 200 OK.

do_GET()[source]: Handle GET requests - all paths return 200.

log_message(format, *args)[source]: Suppress request logging to reduce noise.

health_server.main()[source]: Start the health server.

llama-gguf-inference Documentation

Quick Start

Features

API Reference

Gateway Module

Authentication Module

Health Server Module

Indices and tables