Building a Modern On-Premises OpenShift Platform: GitOps, Operators, Zero Trust, Observability and AI

Building a Modern On-Premises OpenShift Platform

A while back, a customer came to us with a familiar tension: they wanted the speed of the cloud — self-service, automation, ship whenever you’re ready — but their data wasn’t allowed to leave the building. Regulated industry, strict sovereignty rules, auditors who ask hard questions.

So we built them an OpenShift platform that lives entirely inside their own datacenter, yet still feels like a modern cloud to the people using it. GitOps, CI/CD, single sign-on, observability, multi-cluster governance, even model serving for the data-science team — all of it running on hardware they own and control.

This post walks through how the pieces fit together, and why we made the calls we did along the way.

LEMINNOV OpenShift Enterprise Platform Architecture


1. Why OpenShift Instead of Plain Kubernetes?

Kubernetes vs OpenShift

We get asked this a lot: why not just run upstream Kubernetes and skip the licensing? It’s a fair question, and for a small team with deep Kubernetes muscle it can be the right answer. But on a platform that several teams lean on, the work doesn’t stop at kubectl apply. Someone still has to wire in authentication, monitoring, logging, a registry and an upgrade path — and then keep all of it patched, forever. OpenShift ships those pieces already integrated and supported, which means our time goes into the platform we’re building instead of the plumbing under it.

Here’s how the two compare in the day-to-day:

Capability Kubernetes OpenShift
Installation Manual or distribution-dependent Automated and enterprise-supported
Security Add-ons required Built-in SCC, integrated OAuth, security policies
Monitoring Third-party stack required Integrated Prometheus stack
Logging Third-party stack required OpenShift Logging / Loki
CI/CD External tooling OpenShift Pipelines / Tekton
GitOps External install OpenShift GitOps / Argo CD
Operators Optional Native OLM ecosystem
Upgrades Manual planning CVO-managed lifecycle
Support Community/vendor-dependent Red Hat enterprise support

2. CI/CD with Tekton Pipelines

Tekton CI/CD Pipeline

Our build pipelines run on Tekton, so CI/CD is just more Kubernetes — same API, same RBAC, same kubectl, no separate build server to babysit. The flow itself is the one you’d expect: pull the code, build the image, push it to the registry, roll it out to the cluster.

apiVersion: tekton.dev/v1
kind: Pipeline
metadata:
  name: build-and-deploy
  namespace: cicd
spec:
  workspaces:
    - name: shared-workspace

  tasks:
    - name: clone-source
      taskRef:
        name: git-clone
      workspaces:
        - name: output
          workspace: shared-workspace

    - name: build-and-push
      runAfter:
        - clone-source
      taskRef:
        name: buildah
      workspaces:
        - name: source
          workspace: shared-workspace

    - name: deploy-to-openshift
      runAfter:
        - build-and-push
      taskRef:
        name: openshift-client

Example PipelineRun:

apiVersion: tekton.dev/v1
kind: PipelineRun
metadata:
  generateName: build-and-deploy-
  namespace: cicd
spec:
  pipelineRef:
    name: build-and-deploy
  workspaces:
    - name: shared-workspace
      persistentVolumeClaim:
        claimName: pipeline-workspace

3. GitOps with Argo CD

GitOps Factory

From there, Git takes over as the source of truth. Argo CD keeps an eye on both the repository and the cluster at the same time, and whenever the two drift apart it quietly pulls the cluster back in line with what’s in Git.

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: platform-monitoring
  namespace: openshift-gitops
spec:
  project: default

  source:
    repoURL: https://gitlab.example.com/platform/gitops.git
    targetRevision: main
    path: clusters/production/monitoring

  destination:
    server: https://kubernetes.default.svc
    namespace: openshift-monitoring

  syncPolicy:
    automated:
      prune: true
      selfHeal: true

The payoff is mostly peace of mind. Nobody touches production by hand, so there are no mystery edits to reverse-engineer at 2 a.m. When something does look off, the Git history says exactly what changed and when — and rolling back is a git revert, not a rescue mission. Because the same manifests rebuild the same environment every time, audits turn into a fairly boring afternoon, which is the highest compliment we can pay them.


4. Multi-Cluster Management with ACM

ACM Multi-Cluster Management

One cluster is easy. The trouble starts at three or four, when each one quietly grows its own snowflake config and you lose track of which is which. Red Hat Advanced Cluster Management gives us a single control plane over all of them — the on-prem clusters alongside whatever runs in the public cloud:

  • AWS
  • Azure
  • On-Premises datacenter
  • Edge location
  • DR site

Example MultiClusterHub:

apiVersion: operator.open-cluster-management.io/v1
kind: MultiClusterHub
metadata:
  name: multiclusterhub
  namespace: open-cluster-management
spec:
  availabilityConfig: High

Example governance policy:

apiVersion: policy.open-cluster-management.io/v1
kind: Policy
metadata:
  name: restrict-image-registries
  namespace: policies
spec:
  remediationAction: enforce
  disabled: false
  policy-templates:
    - objectDefinition:
        apiVersion: policy.open-cluster-management.io/v1
        kind: ConfigurationPolicy
        metadata:
          name: restrict-image-registries
        spec:
          remediationAction: enforce
          severity: high
          object-templates:
            - complianceType: musthave
              objectDefinition:
                apiVersion: config.openshift.io/v1
                kind: Image
                metadata:
                  name: cluster
                spec:
                  registrySources:
                    allowedRegistries:
                      - registry.redhat.io
                      - quay.io
                      - image-registry.openshift-image-registry.svc:5000

The governance side is what we lean on most. A policy written once on the hub is enforced everywhere, so a rule like “only pull images from registries we actually trust” can’t quietly lapse on the one cluster nobody’s looking at this week.


5. Why Everything Is Managed as Operators and Code

OpenShift Operators OLM CVO Architecture

Early on we set ourselves one rule: nothing gets installed by hand. Every capability on the platform is described as code and reconciled by something — Argo CD for our own workloads, the Operator Lifecycle Manager (OLM) for the operators, and the Cluster Version Operator (CVO) for OpenShift’s own components. The operators then take it from there, quietly managing the day-two details of the things they own.

Example operator subscription:

apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: openshift-pipelines-operator
  namespace: openshift-operators
spec:
  channel: latest
  name: openshift-pipelines-operator-rh
  source: redhat-operators
  sourceNamespace: openshift-marketplace
  installPlanApproval: Automatic

It sounds strict, and it is. But the alternative — a wiki page titled “How to install X” that went stale two versions ago — is exactly how platforms rot. With everything declared, a rebuild is reproducible, upgrades stop being adventures, and when someone asks who changed what, the answer is already sitting in the commit log.


6. OpenShift AI and MLOps

OpenShift AI MLOps

The data-science team used to hand us a model file and a hopeful look; getting it actually served in production was a small project of its own. OpenShift AI shortened that path a lot. They keep developing in notebooks like before, then serve the model through KServe with a short manifest — no bespoke Flask app, no hand-rolled container, no ticket to us in the middle.

Example InferenceService:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: fraud-detection
  namespace: ai
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: s3://models/fraud-detection
    resources:
      limits:
        cpu: "2"
        memory: 4Gi

7. Observability Stack

You can’t operate what you can’t see, so observability went in on day one rather than after the first ugly outage. The stack is the usual open-source crew, each piece doing one job well:

  • Prometheus for metrics
  • Loki for logs
  • Tempo for traces
  • Grafana for dashboards (deployed with the Grafana Operator)
  • Alertmanager for alerts
  • Thanos for long-term metrics storage

OpenShift Observability Stack

Example ServiceMonitor:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: application-monitor
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: myapp
  endpoints:
    - port: http
      path: /metrics
      interval: 30s

8. Zero Trust Access with Cloudflare

Nothing on this platform is reachable from the open internet. There are no public load balancers and no inbound ports to scan, which already removes a whole category of problems.

authentication

Instead, a cloudflared tunnel dials out to Cloudflare, and every request comes back in through Cloudflare Access, which handles SSO and MFA before traffic ever reaches a pod. If you’re not authenticated, you don’t even get to knock.

Network isolation is only half the story, though. Inside the cluster, OpenShift constrains how workloads run through Security Context Constraints (SCCs) — no privileged containers, no privilege escalation, no host namespaces, and capabilities dropped by default. So even a container that gets compromised has very little room to move.

apiVersion: security.openshift.io/v1
kind: SecurityContextConstraints
metadata:
  name: leminnov-restricted
allowPrivilegedContainer: false
allowPrivilegeEscalation: false
allowHostNetwork: false
allowHostPID: false
allowHostIPC: false
readOnlyRootFilesystem: false

requiredDropCapabilities:
  - ALL

runAsUser:
  type: MustRunAsRange

seLinuxContext:
  type: MustRunAs

fsGroup:
  type: MustRunAs

supplementalGroups:
  type: RunAsAny

Network Policies enforce communication boundaries between workloads and namespaces. Because Cloudflare Tunnel terminates at the in-cluster cloudflared pod rather than at an exposed inbound port, application traffic originates from that pod. The policy below therefore admits ingress only from the cloudflared workload, not from external IP ranges.

Example NetworkPolicy:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-cloudflared-ingress
  namespace: myapp
spec:
  podSelector:
    matchLabels:
      app: myapp
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: cloudflare
          podSelector:
            matchLabels:
              app: cloudflared
      ports:
        - protocol: TCP
          port: 8080

No single one of these layers is enough on its own. Put together — no inbound exposure, authenticated access, tightly constrained workloads, and network paths spelled out explicitly — they mean one mistake rarely turns into a breach. That’s really the whole point.


Lessons that stuck

If we had to boil the whole project down to a few things we’d tell our past selves, it’s these three.

Declare everything. Anything installed by hand will drift, and it’ll pick the worst possible moment to break. Lean on operators — they turn the tedious, error-prone parts of lifecycle management into someone else’s solved problem, and that someone is usually Red Hat. And treat Git as the operating model rather than just a place to park YAML: it’s the audit trail, the rollback button and the single version of the truth, all at once. None of this is glamorous, but it’s what lets a small team run a serious platform without living in firefighting mode.


Conclusion

In the end, what we delivered isn’t really “an OpenShift cluster” — it’s a way of working. The teams that use it ship faster because the boring parts are automated, and they sleep better because the guardrails are real, not a slide. It happens to live in a private datacenter, but day to day it doesn’t feel like it.

If you’re caught in the same tug-of-war between control and speed, that’s exactly the kind of thing we enjoy building. Come talk to us — we’re happy to compare notes.

LEMINNOV
Cloud Native • OpenShift • DevSecOps • AI • Platform Engineering

Semantic Routing Under the Microscope: Tracing Latency with OpenTelemetry and Coroot