Self-Hosted Secure Remote Access to llama.cpp
Octelium enables you to seamlessly provide secure access to a llama.cpp server, both in the private client-based mode over WireGuard/QUIC and in the public clientless mode (read more about the clientless access mode here), for lightweight open models such as Google's Gemma 3, DeepSeek R1, Meta's Llama, Qwen, and others.
Here is a simple example where you can seamlessly deploy a llama.cpp server as a managed container and serve it as an Octelium Service (read more about managed containers here):
kind: Service
metadata:
  name: llama
spec:
  port: 8080
  mode: HTTP
  isPublic: true
  config:
    upstream:
      container:
        port: 8080
        image: ghcr.io/ggml-org/llama.cpp:server
        args:
          - "--model-url"
          - "https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct-GGUF/resolve/main/qwen2.5-0.5b-instruct-q8_0.gguf"
          - "--host"
          - "0.0.0.0"
          - "--port"
          - "8080"
          - "-c"
          - "2048"
        resourceLimit:
          cpu:
            millicores: 2000
          memory:
            megabytes: 3000

If your underlying Kubernetes installation supports requesting and scheduling GPUs (read more here), you can modify the above configuration as follows:
kind: Service
metadata:
  name: llama
spec:
  port: 8080
  mode: HTTP
  isPublic: true
  config:
    upstream:
      container:
        port: 8080
        image: ghcr.io/ggml-org/llama.cpp:server-cuda
        args:
          - "--model-url"
          - "https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct-GGUF/resolve/main/qwen2.5-0.5b-instruct-q8_0.gguf"
          - "--host"
          - "0.0.0.0"
          - "--port"
          - "8080"
          - "-c"
          - "2048"
        resourceLimit:
          cpu:
            millicores: 2000
          memory:
            megabytes: 3000
          ext:
            # Change this according to the values available in your Kubernetes cluster
            nvidia.com/gpu: "1"

You can now apply the creation of the Service as follows (read more here):
octeliumctl apply /PATH/TO/SERVICE.YAML

You can now access the llama.cpp server as an ordinary OpenAI-compatible HTTP server, either in the client-based mode via octelium connect, or in the clientless mode via OAuth2 client credentials or bearer access tokens. In the client-based mode, Users can simply access the Service at its private host http://llama:8080.
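To make both access modes concrete, here is a minimal sketch using curl against the llama.cpp server's standard OpenAI-compatible endpoints; the <DOMAIN> and <ACCESS_TOKEN> placeholders are illustrative assumptions, not values produced by the steps above:

```shell
# Client-based mode: once connected via `octelium connect`,
# reach the Service directly at its private host.
curl http://llama:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello"}]}'

# Clientless mode: call the public host with a bearer access token
# (create one as described in the docs; <ACCESS_TOKEN> is a placeholder).
curl https://llama.<DOMAIN>/v1/chat/completions \
  -H "Authorization: Bearer <ACCESS_TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello"}]}'
```

Since llama.cpp exposes the standard /v1 endpoints, any OpenAI-compatible client works the same way in either mode.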
As for the clientless mode, WORKLOAD Users can access the Service via the standard OAuth2 client credentials flow, so that your workloads, written in any programming language, can access the Service without having to use any special SDKs or external clients. All you need is to create an OAUTH2 Credential as illustrated here. Here is an example written in TypeScript:
import OpenAI from "openai";
import { OAuth2Client } from "@badgateway/oauth2-client";

async function main() {
  const oauth2Client = new OAuth2Client({
    server: "https://<DOMAIN>/",
    clientId: "spxg-cdyx",
    clientSecret: "AQpAzNmdEcPIfWYR2l2zLjMJm....",
    tokenEndpoint: "/oauth2/token",
    authenticationMethod: "client_secret_post",
  });
  const oauth2Creds = await oauth2Client.clientCredentials();

  const client = new OpenAI({
    apiKey: oauth2Creds.accessToken,
    baseURL: "https://llama.<DOMAIN>",
  });

  const chatCompletion = await client.chat.completions.create({
    messages: [
      { role: "user", content: "How do I write a Golang HTTP reverse proxy?" },
    ],
    model: "qwen3:0.6b",
  });
  console.log("Result", chatCompletion);
}

main();

Regarding authentication for HUMAN Users (read more here), you can seamlessly authenticate to the Cluster via your web browser using an IdentityProvider (read more here). There are currently 3 types:
GitHub OAuth IdentityProvider as shown in detail here
OpenID Connect IdentityProviders (e.g. Okta, Auth0, etc.) as shown here.
SAML 2.0 IdentityProviders (e.g. Okta, Entra ID, etc.) as shown here.
Furthermore, you can register a FIDO2 Authenticator (e.g. YubiKeys) in order to directly log in later via Passkey (read more here) without having to use an IdentityProvider.
Octelium also provides OpenTelemetry-ready, application-layer L7 aware visibility and access logging in real time (see an example for HTTP here). You can read more about visibility here.
Here are a few more features that you might be interested in:
Request/response header manipulation (read more here).
Application layer-aware ABAC access control via policy-as-code using CEL and Open Policy Agent (read more here).
Exposing the API publicly for anonymous access (read more here).
OpenTelemetry-ready, application-layer L7 aware auditing and visibility (read more here).
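As an illustration of the anonymous-access feature above, once a Service is configured to allow anonymous access, any HTTP client can reach its OpenAI-compatible endpoints without credentials; a minimal sketch, where the <DOMAIN> placeholder follows the examples above:

```shell
# With anonymous access enabled, no Authorization header is needed;
# /v1/models is one of the standard endpoints served by llama.cpp.
curl https://llama.<DOMAIN>/v1/models
```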