Self-Hosted Remote Ollama

Octelium enables you to seamlessly provide secure access to Ollama, both via the private client-based mode over WireGuard/QUIC and via the public client-less mode (read more about the client-less BeyondCorp access mode here), for lightweight open models such as Google's Gemma 3, DeepSeek R1, Meta's Llama, and others.

Here is a simple example where you can seamlessly deploy an Ollama server as a managed container and serve it as an Octelium Service (read more about managed containers here):

kind: Service
metadata:
  name: ollama
spec:
  port: 11434
  mode: HTTP
  isPublic: true
  config:
    upstream:
      container:
        port: 11434
        image: ollama/ollama
        resourceLimit:
          cpu:
            millicores: 3000
          memory:
            megabytes: 4000

The above configuration runs in CPU-only mode. If your underlying Kubernetes installation supports requesting and scheduling GPUs (read more here), you can modify the configuration as follows:

kind: Service
metadata:
  name: ollama
spec:
  port: 11434
  mode: HTTP
  isPublic: true
  config:
    upstream:
      container:
        port: 11434
        image: ollama/ollama
        resourceLimit:
          cpu:
            millicores: 3000
          memory:
            megabytes: 4000
          ext:
            # Change this according to your Kubernetes cluster's available values
            nvidia.com/gpu: "1"
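Before setting the GPU limit, you might want to confirm that your nodes actually advertise the extended resource. Assuming NVIDIA GPUs with the device plugin installed, a quick check could look like:

kubectl describe nodes | grep -i nvidia.com/gpu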

You can now create the Service by applying the configuration as follows (read more here):

octeliumctl apply /PATH/TO/SERVICE.YAML
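To verify that the Service has been created, you can list the Cluster's Services (assuming your octeliumctl version provides the get subcommand):

octeliumctl get service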

You can also expose an Ollama server that is hosted by a connected User (read more here) as follows:

kind: Service
metadata:
  name: ollama
spec:
  port: 11434
  mode: HTTP
  isPublic: true
  config:
    upstream:
      url: http://localhost:11434
      user: ollama-server
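In this mode, the connected User simply runs a standard Ollama server on their own machine, listening on the default localhost:11434 address that the upstream URL above points to:

ollama serve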

To point the Ollama client (you can download the client here) at our Service running the Ollama server, you need to set the OLLAMA_HOST environment variable to the address of the Service. For client-based access (read more about connecting to the Cluster here), you set the environment variable as follows:

export OLLAMA_HOST=ollama:11434

Now, from your machine, you can run a model such as Gemma 3 as follows:

ollama run gemma3:1b
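The same private, client-based access also works programmatically. Here is a minimal sketch using the official ollama Python library against the Service's private address; since you are already connected to the Cluster, no bearer token is needed (the model name is just an example):

from ollama import Client

# Private client-based access: the Service is reachable at its internal
# address once connected to the Cluster, so no Authorization header is needed.
client = Client(host='http://ollama:11434')

response = client.chat(model='gemma3:1b', messages=[
    {'role': 'user', 'content': 'What is Octelium?'},
])
print(response['message']['content'])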

You can also access Ollama via the client-less BeyondCorp mode from within your applications, written in any programming language, without having to use the octelium client or any special SDK. For example, you can use the OAuth2 client credentials flow (read more here) and use its bearer access token as in the following Python code example:

from ollama import Client

client = Client(
    host='https://ollama.<DOMAIN>',
    headers={'authorization': 'Bearer XXXX'}
)

response = client.chat(model='llama3.2', messages=[
    {
        'role': 'user',
        'content': 'What is Octelium?',
    },
])
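For completeness, here is a hedged sketch of obtaining the bearer token itself via the standard OAuth2 client credentials grant using the requests library. The token endpoint path and the CLIENT_ID/CLIENT_SECRET placeholders are assumptions; substitute the actual values from your Cluster's OAuth2 Credential:

import requests
from ollama import Client

# Assumed token endpoint; replace with your Cluster's actual OAuth2 token URL.
TOKEN_URL = 'https://<DOMAIN>/oauth2/token'

# Standard OAuth2 client credentials grant (RFC 6749, section 4.4).
resp = requests.post(TOKEN_URL, data={
    'grant_type': 'client_credentials',
    'client_id': 'CLIENT_ID',          # placeholder
    'client_secret': 'CLIENT_SECRET',  # placeholder
})
resp.raise_for_status()
access_token = resp.json()['access_token']

client = Client(
    host='https://ollama.<DOMAIN>',
    headers={'authorization': f'Bearer {access_token}'},
)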
NOTE

As an alternative to the OAuth2 client credentials flow, you can also generate an access token Credential and use it directly as a bearer token. Read more here.

Here are a few more features that you might be interested in:

  • Request/response header manipulation (read more here).
  • Application layer-aware ABAC access control via policy-as-code using CEL and Open Policy Agent (read more here).
  • Exposing the API publicly for anonymous access (read more here).