Self-Hosted Secure Remote Access to llama.cpp

Octelium enables you to seamlessly provide secure access to a llama.cpp server, both via the private client-based mode over WireGuard/QUIC and via the public clientless secure access mode (read more about the clientless access mode here), for lightweight open models such as Google's Gemma 3, DeepSeek R1, Meta's Llama, and Qwen.

Here is a simple example where you can seamlessly deploy a llama.cpp server as a managed container and serve it as an Octelium Service (read more about managed containers here):

```yaml
kind: Service
metadata:
  name: llama
spec:
  port: 8080
  mode: HTTP
  isPublic: true
  config:
    upstream:
      container:
        port: 8080
        image: ghcr.io/ggml-org/llama.cpp:server
        args:
          - "--model-url"
          - "https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct-GGUF/resolve/main/qwen2.5-0.5b-instruct-q8_0.gguf"
          - "--host"
          - "0.0.0.0"
          - "--port"
          - "8080"
          - "-c"
          - "2048"
        resourceLimit:
          cpu:
            millicores: 2000
          memory:
            megabytes: 3000
```

If your underlying Kubernetes installation supports requesting and scheduling GPUs (read more here), you can modify the above configuration as follows:

```yaml
kind: Service
metadata:
  name: llama
spec:
  port: 8080
  mode: HTTP
  isPublic: true
  config:
    upstream:
      container:
        port: 8080
        image: ghcr.io/ggml-org/llama.cpp:server-cuda
        args:
          - "--model-url"
          - "https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct-GGUF/resolve/main/qwen2.5-0.5b-instruct-q8_0.gguf"
          - "--host"
          - "0.0.0.0"
          - "--port"
          - "8080"
          - "-c"
          - "2048"
        resourceLimit:
          cpu:
            millicores: 2000
          memory:
            megabytes: 3000
          ext:
            # Change this according to your Kubernetes cluster available values
            nvidia.com/gpu: "1"
```

You can now apply the Service as follows (read more here):

```shell
octeliumctl apply /PATH/TO/SERVICE.YAML
```

Now you can access the llama.cpp server as an ordinary OpenAI-compatible HTTP server, either in the client-based mode via octelium connect, or in the clientless mode using OAuth2 client credentials or bearer access tokens. In the client-based access mode, Users can simply access the Service at its private host http://llama:8080.
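For the direct bearer-token route, a minimal sketch using the built-in fetch API might look as follows. Note that the <DOMAIN> host, the token value, and the model name here are placeholders for illustration, not real values:

```typescript
// Build the Authorization header carrying an Octelium bearer access token.
function authHeaders(token: string): Record<string, string> {
  return {
    Authorization: `Bearer ${token}`,
    "Content-Type": "application/json",
  };
}

// Send a chat completion request directly to the Service's public host
// using the standard OpenAI-compatible endpoint exposed by llama.cpp.
async function chat(token: string, prompt: string): Promise<unknown> {
  const res = await fetch("https://llama.<DOMAIN>/v1/chat/completions", {
    method: "POST",
    headers: authHeaders(token),
    body: JSON.stringify({
      model: "qwen2.5-0.5b-instruct",
      messages: [{ role: "user", content: prompt }],
    }),
  });
  return res.json();
}
```

Since the request is plain HTTPS with a bearer token, the same pattern works from any language or HTTP client without additional SDKs.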

As for the clientless mode, WORKLOAD Users can access the Service via the standard OAuth2 client credentials flow, allowing your workloads, written in any programming language, to access the Service without special SDKs or external clients. All you need is to create an OAuth2 Credential as illustrated here. Here is an example written in TypeScript:

```typescript
import OpenAI from "openai";
import { OAuth2Client } from "@badgateway/oauth2-client";

async function main() {
  const oauth2Client = new OAuth2Client({
    server: "https://<DOMAIN>/",
    clientId: "spxg-cdyx",
    clientSecret: "AQpAzNmdEcPIfWYR2l2zLjMJm....",
    tokenEndpoint: "/oauth2/token",
    authenticationMethod: "client_secret_post",
  });

  // Obtain an access token via the OAuth2 client credentials flow.
  const oauth2Creds = await oauth2Client.clientCredentials();

  // Use the access token as the API key against the Service's public host.
  const client = new OpenAI({
    apiKey: oauth2Creds.accessToken,
    baseURL: "https://llama.<DOMAIN>",
  });

  const chatCompletion = await client.chat.completions.create({
    messages: [
      { role: "user", content: "How do I write a Golang HTTP reverse proxy?" },
    ],
    model: "qwen3:0.6b",
  });

  console.log("Result", chatCompletion);
}

main();
```

Octelium also provides OpenTelemetry-ready, application-layer L7 aware visibility and access logging in real time (see an example for HTTP here). You can read more about visibility here.

Here are a few more features that you might be interested in:

  • Request/response header manipulation (read more here).
  • Application layer-aware ABAC access control via policy-as-code using CEL and Open Policy Agent (read more here).
  • Exposing the API publicly for anonymous access (read more here).
  • OpenTelemetry-ready, application-layer L7 aware auditing and visibility (read more here).
© 2026 Octelium Labs, LLC. All rights reserved.
Octelium and Octelium logo are trademarks of Octelium Labs, LLC.
WireGuard is a registered trademark of Jason A. Donenfeld