llama-gguf-inference Documentation

GGUF model inference server using llama.cpp

Welcome to the documentation for llama-gguf-inference, a production-ready inference server for GGUF models.

Quick Start

# Run with Docker
docker run --gpus all \
  -v /path/to/models:/data/models \
  -e MODEL_NAME=your-model.gguf \
  -p 8000:8000 \
  ghcr.io/zepfu/llama-gguf-inference

Features

  • OpenAI-compatible API - Drop-in replacement

  • Authentication - API key-based access control

  • Streaming support - Real-time token streaming

  • Health monitoring - Separate health check port

  • Platform agnostic - Works anywhere Docker runs

API Reference

Gateway Module

gateway.py – Async HTTP gateway for llama.cpp llama-server

Features: - API key authentication with health endpoint exemption - Proper SSE/streaming support for chat completions - /ping and /health endpoints (no auth required) - Request timeout handling - Basic metrics tracking - Graceful error handling - Per-key access logging

Note: /ping and /health return 200 immediately without backend checks or authentication to enable scale-to-zero in serverless environments.

Environment Variables:

GATEWAY_PORT - Port to listen on (default: 8000) BACKEND_HOST - llama-server host (default: 127.0.0.1) PORT_BACKEND - llama-server port (default: 8080) BACKEND_API_KEY - API key for backend authentication (optional) REQUEST_TIMEOUT - Max request time in seconds (default: 300) HEALTH_TIMEOUT - Health check timeout in seconds (default: 2) AUTH_ENABLED - Enable API key authentication (default: true) AUTH_KEYS_FILE - Path to API keys file (default: $DATA_DIR/api_keys.txt)

gateway.log(msg: str)[source]

Simple logging to stderr.

class gateway.Metrics(requests_total: int = 0, requests_success: int = 0, requests_error: int = 0, requests_active: int = 0, requests_authenticated: int = 0, requests_unauthorized: int = 0, bytes_sent: int = 0, start_time: float = <factory>)[source]

Bases: object

requests_total: int = 0
requests_success: int = 0
requests_error: int = 0
requests_active: int = 0
requests_authenticated: int = 0
requests_unauthorized: int = 0
bytes_sent: int = 0
start_time: float
to_dict()[source]
__init__(requests_total: int = 0, requests_success: int = 0, requests_error: int = 0, requests_active: int = 0, requests_authenticated: int = 0, requests_unauthorized: int = 0, bytes_sent: int = 0, start_time: float = <factory>) None
gateway.backend_tcp_ready() bool[source]

Check if backend is accepting TCP connections.

async gateway.backend_health_check() dict[source]

Check backend health via /health endpoint.

Returns health status dict or error.

async gateway.handle_ping(writer: StreamWriter)[source]

Handle /ping endpoint for RunPod health checks.

Always returns 200 OK without authentication or backend checks. For detailed backend status, use /health endpoint instead.

async gateway.handle_health(writer: StreamWriter)[source]

Handle /health endpoint with detailed backend status.

No authentication required.

async gateway.handle_metrics(writer: StreamWriter)[source]

Handle /metrics endpoint.

No authentication required.

async gateway.proxy_request(method: str, path: str, headers: dict, body: bytes | None, writer: StreamWriter, key_id: str = 'unknown')[source]

Proxy a request to the backend with streaming support.

Parameters:
  • method – HTTP method (GET, POST, etc.)

  • path – Request path to forward to backend

  • headers – Request headers dict (lowercase keys)

  • body – Request body bytes, or None for bodyless requests

  • writer – asyncio StreamWriter for the client connection

  • key_id – The authenticated key_id for logging

async gateway.handle_client(reader: StreamReader, writer: StreamWriter)[source]

Handle an incoming client connection.

async gateway.main()[source]

Start the gateway server.

Authentication Module

auth.py - API Key Authentication Module for Gateway

File-based authentication system that enforces API keys while maintaining OpenAI compatibility. Uses key_id:api_key format for easy management and auditing.

Usage:

from auth import api_validator, authenticate_request, log_access

# In your request handler: api_key_info = await authenticate_request(writer, headers) if api_key_info is None:

# 401 response already sent return

key_id = api_key_info # The key_id for logging/tracking

# Process request… await log_access(method, path, key_id, status_code)

Environment Variables:

AUTH_ENABLED - Enable/disable authentication (default: true) AUTH_KEYS_FILE - Path to keys file (default: $DATA_DIR/api_keys.txt) MAX_REQUESTS_PER_MINUTE - Rate limit per key_id (default: 100)

Keys File Format:

key_id:api_key

Example

production:sk-prod-abc123def456 alice-laptop:sk-alice-xyz789 development:sk-dev-test123

class auth.APIKeyValidator[source]

Bases: object

Validates API keys for incoming requests.

Features: - File-based configuration (key_id:api_key format) - Rate limiting per key_id - Key format validation - Audit trail with key_id tracking

__init__()[source]
validate(headers: dict) Tuple[bool, str][source]

Validate API key from request headers.

Parameters:

headers – Request headers (lowercase keys)

Returns:

(is_valid, key_id_or_error_message) - If valid: (True, key_id) - If invalid: (False, error_message)

get_metrics() dict[source]

Get current rate limiter metrics per key_id.

Returns:

dict with request counts and limits per key_id

async auth.authenticate_request(writer: StreamWriter, headers: dict) str | None[source]

Authenticate an incoming request.

If authentication fails, sends a 401 response and returns None. If authentication succeeds, returns the key_id.

Parameters:
  • writer – asyncio StreamWriter to send response

  • headers – Request headers (lowercase keys)

Returns:

key_id if valid, None if invalid (401 response already sent)

async auth.send_rate_limit_error(writer: StreamWriter)[source]

Send OpenAI-compatible rate limit error (429).

async auth.log_access(method: str, path: str, key_id: str, status_code: int)[source]

Log API access for auditing.

Logs to: /data/logs/api_access.log

Format: ISO8601_timestamp | key_id | method path | status_code Example: 2024-02-06T14:30:22.123456 | production | POST /v1/chat/completions | 200

Parameters:
  • method – HTTP method

  • path – Request path

  • key_id – The key identifier (e.g., “production”, “alice-laptop”)

  • status_code – HTTP status code

Health Server Module

health_server.py — Ultra-lightweight health check server for RunPod

This server runs on PORT_HEALTH (separate from the main gateway) and provides a minimal health check endpoint that doesn’t interact with the backend.

This prevents RunPod’s periodic health checks from keeping the worker “active” and allows proper scale-to-zero behavior in serverless environments.

Environment Variables:

PORT_HEALTH - Port to listen on (default: 8001)

class health_server.HealthHandler(request, client_address, server)[source]

Bases: BaseHTTPRequestHandler

Minimal health check handler - just returns 200 OK.

do_GET()[source]

Handle GET requests - all paths return 200.

log_message(format, *args)[source]

Suppress request logging to reduce noise.

health_server.main()[source]

Start the health server.

Indices and tables