llama-gguf-inference Documentation =================================== **GGUF model inference server using llama.cpp** Welcome to the documentation for llama-gguf-inference, a production-ready inference server for GGUF models. .. toctree:: :maxdepth: 2 :caption: Contents: ARCHITECTURE api/gateway api/auth api/health_server Quick Start ----------- .. code-block:: bash # Run with Docker docker run --gpus all \ -v /path/to/models:/data/models \ -e MODEL_NAME=your-model.gguf \ -p 8000:8000 \ ghcr.io/zepfu/llama-gguf-inference Features -------- * **OpenAI-compatible API** - Drop-in replacement * **Authentication** - API key-based access control * **Streaming support** - Real-time token streaming * **Health monitoring** - Separate health check port * **Platform agnostic** - Works anywhere Docker runs API Reference ------------- Gateway Module ~~~~~~~~~~~~~~ .. automodule:: gateway :members: :undoc-members: :show-inheritance: Authentication Module ~~~~~~~~~~~~~~~~~~~~~ .. automodule:: auth :members: :undoc-members: :show-inheritance: Health Server Module ~~~~~~~~~~~~~~~~~~~~ .. automodule:: health_server :members: :undoc-members: :show-inheritance: Indices and tables ================== * :ref:`genindex` * :ref:`modindex` * :ref:`search`