Octelium enables you to seamlessly provide secure access to a llama.cpp server, both via the private client-based mode over WireGuard/QUIC and via the public clientless secure access mode (read more about the clientless access mode here), for lightweight open models such as Google's Gemma 3, DeepSeek R1, Meta's Llama, and Qwen.
Here is a simple example where you deploy a llama.cpp server as a managed container and serve it as an Octelium Service (read more about managed containers here):
```yaml
kind: Service
metadata:
  name: llama
spec:
  port: 8080
  mode: HTTP
  isPublic: true
  config:
    upstream:
      container:
        port: 8080
        image: ghcr.io/ggml-org/llama.cpp:server
        args:
          - "--model-url"
          - "https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct-GGUF/resolve/main/qwen2.5-0.5b-instruct-q8_0.gguf"
          - "--host"
          - "0.0.0.0"
          - "--port"
          - "8080"
          - "-c"
          - "2048"
        resourceLimit:
          cpu:
            millicores: 2000
          memory:
            megabytes: 3000
```
If your underlying Kubernetes installation supports requesting and scheduling GPUs (read more here), you can modify the above configuration as follows:
```yaml
kind: Service
metadata:
  name: llama
spec:
  port: 8080
  mode: HTTP
  isPublic: true
  config:
    upstream:
      container:
        port: 8080
        image: ghcr.io/ggml-org/llama.cpp:server-cuda
        args:
          - "--model-url"
          - "https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct-GGUF/resolve/main/qwen2.5-0.5b-instruct-q8_0.gguf"
          - "--host"
          - "0.0.0.0"
          - "--port"
          - "8080"
          - "-c"
          - "2048"
        resourceLimit:
          cpu:
            millicores: 2000
          memory:
            megabytes: 3000
          ext:
            # Change this according to your Kubernetes cluster available values
            nvidia.com/gpu: "1"
```
You can now create the Service by applying the configuration as follows (read more here):
```bash
octeliumctl apply /PATH/TO/SERVICE.YAML
```
You can now access the llama.cpp server as an ordinary OpenAI-compatible HTTP server, either in the client-based mode via octelium connect, or in the clientless mode using OAuth2 client credentials or bearer access tokens directly. In the client-based access mode, Users can simply access the Service at its private host http://llama:8080.
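For instance, once your device is connected via octelium connect, you can point the OpenAI SDK directly at the private host. The following is a minimal sketch: the apiKey value is just a non-empty placeholder (the llama.cpp server in the configuration above is not started with an API key), and the model name and prompt are only illustrative:

```typescript
import OpenAI from "openai";

async function main() {
  // Assumes the device is already connected via `octelium connect`,
  // so the Service is reachable at its private host.
  const client = new OpenAI({
    // The SDK requires a non-empty apiKey; the llama.cpp server in the
    // configuration above does not enforce one.
    apiKey: "unused",
    baseURL: "http://llama:8080",
  });

  const chatCompletion = await client.chat.completions.create({
    messages: [{ role: "user", content: "What is a reverse proxy?" }],
    model: "qwen2.5-0.5b-instruct",
  });

  console.log("Result", chatCompletion);
}

main();
```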
As for the clientless mode, WORKLOAD Users can access the Service via the standard OAuth2 client credentials flow, which lets your workloads, written in any programming language, access the Service without using any special SDKs or external clients. All you need is to create an OAUTH2 Credential as illustrated here. Here is an example written in TypeScript:
```typescript
import OpenAI from "openai";

import { OAuth2Client } from "@badgateway/oauth2-client";

async function main() {
  const oauth2Client = new OAuth2Client({
    server: "https://<DOMAIN>/",
    clientId: "spxg-cdyx",
    clientSecret: "AQpAzNmdEcPIfWYR2l2zLjMJm....",
    tokenEndpoint: "/oauth2/token",
    authenticationMethod: "client_secret_post",
  });

  const oauth2Creds = await oauth2Client.clientCredentials();

  const client = new OpenAI({
    apiKey: oauth2Creds.accessToken,
    baseURL: "https://llama.<DOMAIN>",
  });

  const chatCompletion = await client.chat.completions.create({
    messages: [
      { role: "user", content: "How do I write a Golang HTTP reverse proxy?" },
    ],
    model: "qwen3:0.6b",
  });

  console.log("Result", chatCompletion);
}

main();
```
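Alternatively, if you access the Service directly with a bearer access token rather than OAuth2 client credentials, you can pass the token as the apiKey, since the OpenAI SDK sends it as a standard Bearer token. Here is a minimal sketch, assuming the token is supplied via a hypothetical OCTELIUM_ACCESS_TOKEN environment variable:

```typescript
import OpenAI from "openai";

async function main() {
  const client = new OpenAI({
    // The access token is assumed to be provided via an environment
    // variable; the SDK sends it as an Authorization: Bearer header.
    apiKey: process.env.OCTELIUM_ACCESS_TOKEN!,
    baseURL: "https://llama.<DOMAIN>",
  });

  const chatCompletion = await client.chat.completions.create({
    messages: [
      { role: "user", content: "How do I write a Golang HTTP reverse proxy?" },
    ],
    model: "qwen3:0.6b",
  });

  console.log("Result", chatCompletion);
}

main();
```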
Octelium also provides OpenTelemetry-ready, application-layer (L7) aware visibility and access logging in real time (see an example for HTTP here). You can read more about visibility here.
Here are a few more features that you might be interested in:
- Request/response header manipulation (read more here).
- Application-layer aware ABAC access control via policy-as-code using CEL and Open Policy Agent (read more here).
- Exposing the API publicly for anonymous access (read more here).
- OpenTelemetry-ready, application-layer (L7) aware auditing and visibility (read more here).