Self-Hosted Remote Ollama

Octelium enables you to seamlessly provide secure access to Ollama, both via the private client-based mode over WireGuard/QUIC and via the public client-less mode (read more about the client-less BeyondCorp access mode here), for lightweight open models such as Google's Gemma 3, DeepSeek R1, Meta's Llama, and others.

Here is a simple example where you can seamlessly deploy an Ollama server as a managed container and serve it as an Octelium Service (read more about managed containers here):

kind: Service
metadata:
  name: ollama
spec:
  port: 11434
  mode: HTTP
  isPublic: true
  config:
    upstream:
      container:
        port: 11434
        image: ollama/ollama
        resourceLimit:
          cpu:
            millicores: 3000
          memory:
            megabytes: 4000

The above configuration runs in CPU-only mode. If your underlying Kubernetes installation supports requesting and scheduling GPUs (read more here), you can modify the configuration as follows:

kind: Service
metadata:
  name: ollama
spec:
  port: 11434
  mode: HTTP
  isPublic: true
  config:
    upstream:
      container:
        port: 11434
        image: ollama/ollama
        resourceLimit:
          cpu:
            millicores: 3000
          memory:
            megabytes: 4000
          ext:
            # Change this according to your Kubernetes cluster's available values
            nvidia.com/gpu: "1"
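Before setting the GPU limit, you might want to confirm that your nodes actually advertise the extended resource. Assuming NVIDIA GPUs with the device plugin installed, a quick check could look like:

kubectl describe nodes | grep -i nvidia.com/gpu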

You can now create the Service by applying the configuration as follows (read more here):

octeliumctl apply /PATH/TO/SERVICE.YAML
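To verify that the Service has been created, you can list the Cluster's Services (assuming your octeliumctl version provides the get subcommand):

octeliumctl get service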

You can also expose an Ollama server that is hosted by a connected User (read more here) as follows:

kind: Service
metadata:
  name: ollama
spec:
  port: 11434
  mode: HTTP
  isPublic: true
  config:
    upstream:
      url: http://localhost:11434
      user: ollama-server
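In this mode, the connected User simply runs a standard Ollama server on their own machine, listening on the default localhost:11434 address that the upstream URL above points to:

ollama serve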

To point the Ollama client (you can download the client here) at our Service running the Ollama server, you need to set the OLLAMA_HOST environment variable to the address of the Service. For client-based access (read more about connecting to the Cluster here), you set the environment variable as follows:

export OLLAMA_HOST=ollama:11434

Now, from your machine, you can run a model such as Gemma 3 as follows:

ollama run gemma3:1b
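The same private, client-based access also works programmatically. Here is a minimal sketch using the official ollama Python library against the Service's private address; since you are already connected to the Cluster, no bearer token is needed (the model name is just an example):

from ollama import Client

# Private client-based access: the Service is reachable at its internal
# address once connected to the Cluster, so no Authorization header is needed.
client = Client(host='http://ollama:11434')

response = client.chat(model='gemma3:1b', messages=[
    {'role': 'user', 'content': 'What is Octelium?'},
])
print(response['message']['content'])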

You can also access Ollama via the client-less BeyondCorp mode from within your applications, written in any programming language, without having to use the octelium client or any special SDK. For example, you can use the OAuth2 client credentials flow (read more here) and use its bearer access token as in the following Python code example:

from ollama import Client

client = Client(
    host='https://ollama.<DOMAIN>',
    headers={'authorization': 'Bearer XXXX'}
)

response = client.chat(model='llama3.2', messages=[
    {
        'role': 'user',
        'content': 'What is Octelium?',
    },
])
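For completeness, here is a hedged sketch of obtaining the bearer token itself via the standard OAuth2 client credentials grant using the requests library. The token endpoint path and the CLIENT_ID/CLIENT_SECRET placeholders are assumptions; substitute the actual values from your Cluster's OAuth2 Credential:

import requests
from ollama import Client

# Assumed token endpoint; replace with your Cluster's actual OAuth2 token URL.
TOKEN_URL = 'https://<DOMAIN>/oauth2/token'

# Standard OAuth2 client credentials grant (RFC 6749, section 4.4).
resp = requests.post(TOKEN_URL, data={
    'grant_type': 'client_credentials',
    'client_id': 'CLIENT_ID',          # placeholder
    'client_secret': 'CLIENT_SECRET',  # placeholder
})
resp.raise_for_status()
access_token = resp.json()['access_token']

client = Client(
    host='https://ollama.<DOMAIN>',
    headers={'authorization': f'Bearer {access_token}'},
)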
NOTE

As an alternative to the OAuth2 client credentials flow, you can also generate an access token Credential and use it directly as a bearer token. Read more here.

Here are a few more features that you might be interested in:

  • Request/response header manipulation (read more here).
  • Application layer-aware ABAC access control via policy-as-code using CEL and Open Policy Agent (read more here).
  • Exposing the API publicly for anonymous access (read more here).