Octelium enables you to seamlessly provide secure access to a llama.cpp server, both via the private client-based mode over WireGuard/QUIC and via the public clientless secure access mode (read more about the clientless access mode here), for lightweight open models such as Google's Gemma 3, DeepSeek R1, Meta's Llama, and Qwen.
Here is a simple example where you deploy a llama.cpp server as a managed container and serve it as an Octelium Service (read more about managed containers here):
```yaml
kind: Service
metadata:
  name: llama
spec:
  port: 8080
  mode: HTTP
  isPublic: true
  config:
    upstream:
      container:
        port: 8080
        image: ghcr.io/ggml-org/llama.cpp:server
        args:
          - "--model-url"
          - "https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct-GGUF/resolve/main/qwen2.5-0.5b-instruct-q8_0.gguf"
          - "--host"
          - "0.0.0.0"
          - "--port"
          - "8080"
          - "-c"
          - "2048"
        resourceLimit:
          cpu:
            millicores: 2000
          memory:
            megabytes: 3000
```
If your underlying Kubernetes installation supports requesting and scheduling GPUs (read more here), you can modify the above configuration as follows:
```yaml
kind: Service
metadata:
  name: llama
spec:
  port: 8080
  mode: HTTP
  isPublic: true
  config:
    upstream:
      container:
        port: 8080
        image: ghcr.io/ggml-org/llama.cpp:server-cuda
        args:
          - "--model-url"
          - "https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct-GGUF/resolve/main/qwen2.5-0.5b-instruct-q8_0.gguf"
          - "--host"
          - "0.0.0.0"
          - "--port"
          - "8080"
          - "-c"
          - "2048"
        resourceLimit:
          cpu:
            millicores: 2000
          memory:
            megabytes: 3000
          ext:
            # Change this according to your Kubernetes cluster available values
            nvidia.com/gpu: "1"
```
You can now create the Service by applying the configuration as follows (read more here):
```bash
octeliumctl apply /PATH/TO/SERVICE.YAML
```
You can now access the llama.cpp server as an ordinary OpenAI-compatible HTTP server, either in the client-based mode via octelium connect, or in the clientless mode using OAuth2 client credentials or bearer access tokens directly. In the client-based access mode, Users can simply access the Service at its private host http://llama:8080.
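For instance, once your device is connected via octelium connect, you can point the OpenAI SDK directly at the private host. The following is a minimal sketch: the apiKey value is just a non-empty placeholder (the llama.cpp server in the configuration above is not started with an API key), and the model name and prompt are only illustrative:

```typescript
import OpenAI from "openai";

async function main() {
  // Assumes the device is already connected via `octelium connect`,
  // so the Service is reachable at its private host.
  const client = new OpenAI({
    // The SDK requires a non-empty apiKey; the llama.cpp server in the
    // configuration above does not enforce one.
    apiKey: "unused",
    baseURL: "http://llama:8080",
  });

  const chatCompletion = await client.chat.completions.create({
    messages: [{ role: "user", content: "What is a reverse proxy?" }],
    model: "qwen2.5-0.5b-instruct",
  });

  console.log("Result", chatCompletion);
}

main();
```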
As for the clientless mode, WORKLOAD Users can access the Service via the standard OAuth2 client credentials flow, which lets your workloads, written in any programming language, access the Service without using any special SDKs or external clients. All you need is to create an OAUTH2 Credential as illustrated here. Here is an example written in TypeScript:
```typescript
import OpenAI from "openai";

import { OAuth2Client } from "@badgateway/oauth2-client";

async function main() {
  const oauth2Client = new OAuth2Client({
    server: "https://<DOMAIN>/",
    clientId: "spxg-cdyx",
    clientSecret: "AQpAzNmdEcPIfWYR2l2zLjMJm....",
    tokenEndpoint: "/oauth2/token",
    authenticationMethod: "client_secret_post",
  });

  const oauth2Creds = await oauth2Client.clientCredentials();

  const client = new OpenAI({
    apiKey: oauth2Creds.accessToken,
    baseURL: "https://llama.<DOMAIN>",
  });

  const chatCompletion = await client.chat.completions.create({
    messages: [
      { role: "user", content: "How do I write a Golang HTTP reverse proxy?" },
    ],
    model: "qwen3:0.6b",
  });

  console.log("Result", chatCompletion);
}

main();
```
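Alternatively, if you access the Service directly with a bearer access token rather than OAuth2 client credentials, you can pass the token as the apiKey, since the OpenAI SDK sends it as a standard Bearer token. Here is a minimal sketch, assuming the token is supplied via a hypothetical OCTELIUM_ACCESS_TOKEN environment variable:

```typescript
import OpenAI from "openai";

async function main() {
  const client = new OpenAI({
    // The access token is assumed to be provided via an environment
    // variable; the SDK sends it as an Authorization: Bearer header.
    apiKey: process.env.OCTELIUM_ACCESS_TOKEN!,
    baseURL: "https://llama.<DOMAIN>",
  });

  const chatCompletion = await client.chat.completions.create({
    messages: [
      { role: "user", content: "How do I write a Golang HTTP reverse proxy?" },
    ],
    model: "qwen3:0.6b",
  });

  console.log("Result", chatCompletion);
}

main();
```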
Octelium also provides OpenTelemetry-ready, application-layer (L7) aware visibility and access logging in real time (see an example for HTTP here). You can read more about visibility here.
Here are a few more features that you might be interested in:
- Request/response header manipulation (read more here).
- Application-layer aware ABAC access control via policy-as-code using CEL and Open Policy Agent (read more here).
- Exposing the API publicly for anonymous access (read more here).
- OpenTelemetry-ready, application-layer (L7) aware auditing and visibility (read more here).