Deploying Loki on Kubernetes: SimpleScalable on Nutanix Objects

Where We Left Off

Last article we covered the production Alloy DaemonSet — the agent collecting logs and metrics on every node of the cluster. The Loki pieces of that config — loki.source.kubernetes, loki.process, the dual-write loki.write endpoints — all point somewhere. This is the somewhere.

We’re going to deploy Loki on the same RKE2 cluster the Alloy DaemonSet runs on, configure it to land log chunks in a Nutanix Objects bucket, and walk through every Helm value that matters. By the end you’ll have Loki running and accepting writes from your Alloy DaemonSet. Next post (4b) covers the production side — labels, LogQL, retention policy, and the troubleshooting lessons.


Why SimpleScalable, Not Distributed

Loki has three deployment modes: monolithic, SimpleScalable, and fully distributed. The Grafana docs do a good job of explaining all three, but the choice for us came down to one fact: we were building net-new.

The arguments for fully distributed Loki — scale read and write paths independently, isolate per-component failures, fine-grained resource tuning — all matter at scale. They also assume you have somebody whose full-time job is operating Loki. We don’t. The team that runs this platform also runs Nutanix, the network, CNPG, and a dozen other things. Operational complexity has a real cost.

SimpleScalable splits Loki into three logical roles (write, read, and backend), each deployed as its own workload and scaled independently. You get the read/write path isolation that’s the main win of distributed mode without the proliferation of components. Ingester, distributor, querier, query-frontend, query-scheduler, compactor, index gateway, ruler: in SimpleScalable they’re collapsed into three pod types, each running multiple Loki targets.

For our scale (single-digit terabytes of logs per year, internal-team query load) SimpleScalable was the right size. If we ever outgrow it, the Helm chart supports flipping deploymentMode: SimpleScalable to Distributed and rolling out — it’s not a one-way door.

The pizza analogy holds: a small neighborhood place runs a single oven with a few stations. A regional chain runs one specialized line per task. We’re the neighborhood place. The kitchen fits in a smaller building, the crew fits in a smaller shift, and the food is still good.


The Three Roles: Write, Read, Backend

The write path takes incoming log pushes from Alloy, indexes them, builds chunks in memory, and flushes the chunks to object storage. This is where the distributor and ingester components live in distributed mode. In SimpleScalable, they run together inside the write pod.

Alloy → loki.write → write StatefulSet → S3 (Nutanix Objects)

The read path serves LogQL queries. It fans out across the chunks in S3 and the in-memory chunks still on the write pods, merges the results, and returns them to the user. Querier, query-frontend, and query-scheduler all run inside the read pod.

Grafana → read pods → S3 + write pods (recent data) → response

The backend role is the bookkeeping: compactor, index gateway, ruler. It runs the background jobs that keep storage and retention honest. Two replicas in our config because the index gateway serves read traffic and we want it surviving a single pod loss without taking the query latency hit.

backend StatefulSet:
  - compactor (consolidates index files, applies retention)
  - index gateway (serves index lookups to read pods)
  - ruler (evaluates LogQL alerting and recording rules)

Three roles, each scaled independently. If write volume jumps we add write replicas. If a long-range query starts dominating, we add read replicas. The backend mostly stays where it is.


Storage: Nutanix Objects via S3

The storage decision was already made in article 2. To recap the relevant bits for Loki:

  • We have a dedicated bucket on Nutanix Objects for Loki chunks.
  • A separate bucket holds ruler state (alerting and recording rule definitions persisted across restarts).
  • Credentials come from a Kubernetes secret, mounted as env vars, referenced from the Loki config via ${VAR} expansion (requires -config.expand-env=true).
  • TLS is real — no insecure_skip_verify. The Nutanix Objects endpoint serves a cert signed by our internal CA, which is mounted into every pod that talks to S3.

Loki connects via the S3 API. The endpoint is configured per data center in our values-east.yaml / values-west.yaml files because the local Objects endpoint differs per DC. The shared values-common.yaml doesn’t pin the URL — only the bucket layout, schema, and retention rules that don’t change between DCs.

One thing worth flagging if you’re coming from older Loki: storage configuration is an easy place to trip between Loki 2.x and 3.x. A storage_config.named_stores block with an s3 entry, carried over from an older config, won’t load; Loki fails with field s3 not found in type storage.NamedStores. The Loki 3.x / chart v6.x approach is to put S3 config under loki.storage.s3 in the chart values. Our values-common.yaml even has an inline comment warning against the older syntax, because we tripped over it once.
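
To make the loki.storage.s3 layout concrete, here is a minimal sketch of what a per-DC values file could look like. It is not our production file. It is written against the upstream chart layout, so under a wrapper chart everything nests one level deeper under the subchart key. The chunks bucket name, the Secret name, and the environment variable names are assumptions made up for illustration, and the key names should be checked against the chart version you deploy.

```yaml
# values-east.yaml (sketch): per-DC S3 settings.
loki:
  storage:
    type: s3
    bucketNames:
      chunks: loki-chunks                       # hypothetical bucket name
      ruler: loki-ruler
    s3:
      endpoint: objects.example.com             # placeholder; differs per DC
      accessKeyId: ${LOKI_S3_ACCESS_KEY_ID}     # hypothetical env var, expanded by -config.expand-env=true
      secretAccessKey: ${LOKI_S3_SECRET_ACCESS_KEY}
      s3ForcePathStyle: true                    # assumption: path-style bucket addressing
      insecure: false                           # keep TLS on

# Expose the Secret's keys as env vars on every role that talks to S3.
write:
  extraEnvFrom:
    - secretRef:
        name: loki-s3-credentials               # hypothetical Secret name
read:
  extraEnvFrom:
    - secretRef:
        name: loki-s3-credentials
backend:
  extraEnvFrom:
    - secretRef:
        name: loki-s3-credentials
```

Mounting the internal CA into the pods is left out of this sketch; how you do that depends on how the CA bundle is distributed in your cluster.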


Schema Configuration

The schema tells Loki how to organize its index. Our config:

loki:
  schemaConfig:
    configs:
      - from: "2024-01-01"
        store: tsdb
        object_store: s3
        schema: v13
        index:
          prefix: loki_index_
          period: 24h

A few things worth knowing:

  • store: tsdb — Loki’s modern index implementation. Replaces the older BoltDB-Shipper. Faster, more efficient on storage, supports structured metadata (which we’ll use in the next article for high-cardinality fields like trace IDs).
  • schema: v13 — Current. Required for structured metadata, smaller chunks, better compaction.
  • period: 24h — Index tables roll daily. Matches the compactor’s daily cadence. Don’t shorten this — you’ll end up with thousands of tiny index files that the compactor never quite catches up on.
  • from: "2024-01-01" — When the schema takes effect. For a brand-new deployment, anything in the past works. If you’re migrating from an older schema, keep the existing entry and add a second entry whose from date is in the future. Loki uses the old schema for data before that date and the new schema after the cutover.
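
For the migration case, a schema config with two periods might look like the sketch below. We did not need this ourselves, since we were building net-new. The older entry (boltdb-shipper, v12) and both dates are hypothetical and only illustrate the shape.

```yaml
loki:
  schemaConfig:
    configs:
      # Existing period: leave it untouched so historical data stays readable.
      - from: "2022-06-01"          # hypothetical original start date
        store: boltdb-shipper
        object_store: s3
        schema: v12
        index:
          prefix: loki_index_
          period: 24h
      # New period: the from date must still be in the future when you roll this out.
      - from: "2025-01-01"          # hypothetical cutover date
        store: tsdb
        object_store: s3
        schema: v13
        index:
          prefix: loki_index_
          period: 24h
```

Entries are never edited after their from date has passed; a schema change is always a new entry appended to the list.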

This is one of those configs you set once at deployment time and never touch again. Get it right on day one and you save yourself a multi-schema migration later.


Single-Tenant by Design

Our Loki runs with auth_enabled: false. Single-tenant.

This sometimes surprises people, because every Loki tutorial covers multi-tenancy as a major feature. Here’s why we’re not using it:

Multi-tenancy in Loki is for cases where you’re running Loki as a service for parties who shouldn’t see each other’s data — Grafana Cloud’s customers, an internal platform team serving multiple unrelated business units, a hosting provider. The tenant ID gets passed in the X-Scope-OrgID header on every read and write, and Loki keeps the data and resource limits isolated per tenant.

For us, Loki is the internal observability backend. Everyone with access to Grafana is on the same team or has the same operational relationship to the data. We don’t need to charge a tenant by ingestion rate. We don’t need to prevent team A from querying team B’s logs. We have one tenant: the platform.

Multi-tenancy has a cost too. Every query must include the header. Every Alloy push must include it. Cross-tenant queries require a special syntax. Per-tenant overrides require a runtime config file. If you don’t need the isolation, single-tenant is one fewer moving part.

If you do need it — say you’re hosting Loki for a couple of distinct teams with different retention requirements and you want them to truly not see each other — flip auth_enabled: true and configure X-Scope-OrgID on every Alloy loki.write block. The other 95% of the deployment doesn’t change.


The Helm Values We Run

Here is the production values-common.yaml, trimmed for readability:

loki:
  deploymentMode: SimpleScalable
  global:
    dnsService: rke2-coredns-rke2-coredns  # RKE2-specific
    extraArgs:
      - "-config.expand-env=true"           # Allow ${VAR} expansion for S3 creds

  loki:
    auth_enabled: false                     # Single-tenant
    schemaConfig:
      configs:
        - from: "2024-01-01"
          store: tsdb
          object_store: s3
          schema: v13
          index:
            prefix: loki_index_
            period: 24h
    storage:
      type: s3                              # Per-DC s3.endpoint in values-east.yaml
    limits_config:
      retention_period: 365d
      max_query_series: 10000
      max_query_parallelism: 32
      ingestion_rate_strategy: global       # NOT local — see note below
      ingestion_rate_mb: 50
      ingestion_burst_size_mb: 100
      per_stream_rate_limit: 10MB
      per_stream_rate_limit_burst: 30MB
      max_global_streams_per_user: 50000
    compactor:
      retention_enabled: true
      delete_request_store: s3
    ruler:
      storage:
        type: s3
        s3:
          endpoint: objects.example.com
          bucketnames: loki-ruler
      alertmanager_url: http://alertmanager.observability.svc:9093

  write:
    replicas: 3
    maxUnavailable: 1
    persistence:
      storageClass: local-path
      size: 20Gi
    resources:
      requests:
        cpu: 1000m
        memory: 2Gi
      limits:
        cpu: 4000m
        memory: 16Gi

  read:
    replicas: 3
    maxUnavailable: 1
    resources:
      requests:
        cpu: 1000m
        memory: 1Gi
      limits:
        cpu: 4000m
        memory: 4Gi

  backend:
    replicas: 2
    maxUnavailable: 1
    persistence:
      storageClass: local-path
      size: 20Gi

A few things worth calling out:

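ingestion_rate_strategy: global. We pin this explicitly, because the two strategies are easy to mix up. Per the Loki docs, local enforces ingestion_rate_mb on each distributor independently, so the effective cluster-wide cap is the configured value multiplied by the number of write replicas and moves every time you scale. global makes ingestion_rate_mb the cluster-wide cap: each distributor enforces ingestion_rate_mb / replicas, and that share is adjusted automatically as replicas come and go. The thing to watch with global is that it assumes traffic is spread evenly. kube-proxy’s L4 load balancing can pin a long-lived connection to a single backend pod. When that happens, most of your Alloy traffic lands on one write pod, which is allowed only its ingestion_rate_mb / replicas share, and you start getting 429 rate-limit errors despite plenty of headroom in aggregate. Size ingestion_rate_mb with that stickiness in mind.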

max_global_streams_per_user: 50000. This is the cardinality ceiling. A stream is a unique combination of label values, so this caps how many you can have. We sized this at 50k specifically with our Windows fleet onboarding wave in mind — 650 hosts × ~24 streams per host projects ~16k, and 50k is 3x headroom against label changes we haven’t planned for. We’ll cover label strategy in much more depth in the next post.

retention_period: 365d. This is the global default; per-stream overrides sit on top of it (we’ll cover those in 4b). Setting it once at the limits level, together with compactor.retention_enabled: true, means the compactor will reclaim space without you having to think about it.
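
As a preview of what a per-stream override looks like, here is a rough sketch using Loki’s retention_stream setting. The selectors are hypothetical label matchers rather than our real label scheme, and the periods only borrow two of the retention figures mentioned at the end of this post.

```yaml
loki:
  limits_config:
    retention_period: 365d                  # global default
    retention_stream:
      - selector: '{job="kube-audit"}'      # hypothetical label matcher
        priority: 1
        period: 2160h                       # 90 days
      - selector: '{app="calico-node"}'     # hypothetical label matcher
        priority: 1
        period: 720h                        # 30 days
```

Streams that match no selector fall back to retention_period.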

Local-path PVC for write and backend. RKE2 ships local-path as a built-in storage class. We use it because object storage holds the durable data — the PVCs only carry the WAL and a small amount of working state. If a node dies, the WAL on its local-path PVC is lost, but the chunks already flushed to S3 are not. The replication factor of 3 inside the write StatefulSet covers the in-flight WAL data.

That last point has a sharp edge though, which is the next section.


The Required Pod Anti-Affinity

The backend section in the values block above is trimmed, and it hides a requirement we discovered the hard way:

backend:
  replicas: 2
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app.kubernetes.io/component: backend
              app.kubernetes.io/name: loki
          topologyKey: kubernetes.io/hostname

requiredDuringSchedulingIgnoredDuringExecution means: pods of this type must schedule on different nodes. Required, not preferred. Without it, both backend replicas can land on the same node over time. With local-path PVC pinning, this creates a drain catch-22.

Here’s how the trap springs:

  1. Both backend replicas drift onto the same node, call it node-1c.
  2. You go to drain node-1c for an OS patch or RKE2 upgrade.
  3. Drain evicts backend-0. backend-0 goes Pending — its local-path PVC is pinned to node-1c, which is now cordoned, so it can’t reschedule.
  4. The PDB now has 0 disruption budget (only 1 of 2 replicas is available, and maxUnavailable: 1 is already consumed).
  5. Drain blocks trying to evict backend-1.
  6. Eventually drain times out at 10 minutes. You’re stuck.
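
The PDB in step 4 is the piece that turns a Pending pod into a blocked drain. A sketch of roughly what it amounts to, assuming the chart renders something close to this for the backend role (the resource name and namespace are hypothetical; the labels are the ones used in the affinity selector above):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: loki-backend                # hypothetical name
  namespace: observability          # assumption
spec:
  maxUnavailable: 1                 # at most one backend pod may be down at a time
  selector:
    matchLabels:
      app.kubernetes.io/name: loki
      app.kubernetes.io/component: backend
# With 2 replicas and backend-0 stuck Pending, one pod is already unavailable,
# so the budget allows 0 further disruptions and the eviction of backend-1 is refused.
```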

We discovered this during an OS patching rehearsal that hit the same trap with tempo-ingester. Once we knew what to look for, the fix was easy — require anti-affinity at the scheduler level so the two replicas are structurally unable to land on the same node. N replicas, max one per node, drain only ever has to move one at a time per the PDB.

We applied the same fix to Loki backend, alertmanager, and tempo-ingester. Mimir’s ingester chart already had this configured correctly (it was unaffected during the rehearsal), so the pattern is from upstream — we just had to plumb it into the wrappers that didn’t get it.

Tip if you’re applying this same fix and you hit upstream tempo-distributed: that chart’s ingester.affinity value is consumed as a string passed through tpl, not a YAML mapping. Overriding it as a mapping silently fails. The Loki chart takes the normal mapping form, so what’s shown above works directly.
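
A sketch of the string form for tempo-distributed, assuming the behavior described above. The label values in the selector are assumptions; check them against the labels your ingester pods actually carry.

```yaml
ingester:
  # Block scalar on purpose: the chart passes this string through tpl.
  affinity: |
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app.kubernetes.io/name: tempo            # assumed label value
              app.kubernetes.io/component: ingester    # assumed label value
          topologyKey: kubernetes.io/hostname
```

The only difference from the Loki block is the | after affinity:, which is easy to miss in review.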


Memcached: 192 GB of Chunks Cache

Loki’s read path latency is dominated by S3 fetch time. The fix is memcached. The Loki chart bundles a memcached deployment for chunks, results, and (separately) the index. We run two of them:

chunksCache:
  enabled: true
  replicas: 3
  allocatedMemory: 65536               # 64 GB per replica
  maxItemMemory: 2                     # MB — log chunks 256KB–1.5MB
  connectionLimit: 16384
  resources:
    requests:
      cpu: 1000m
      memory: 78Gi                     # allocatedMemory × 1.2 for overhead
    limits:
      cpu: 4000m
      memory: 78Gi

resultsCache:
  enabled: true
  replicas: 2
  allocatedMemory: 8192                # 8 GB per replica
  maxItemMemory: 5
  connectionLimit: 16384
  defaultValidity: 336h                # 14 days

chunksCache is 3 replicas × 64 GB = 192 GB total. It caches log chunks pulled from S3, distributed via consistent hashing across the replicas. The first query for a given time range pays the S3 fetch cost; subsequent queries within the cache window hit memcached at memory speed.

resultsCache is smaller — 2 × 8 GB = 16 GB — and caches the post-query result so repeat dashboard refreshes don’t re-execute the LogQL. The defaultValidity: 336h (14 days) is intentional: the cache typically sits below 5% fill, and LRU eviction handles capacity. Letting cached results live 14 days means the dashboards your team opens every morning are basically free.

Sizing memcached is a tradeoff between cluster RAM and query latency. The numbers above work for our query volume. If you’re running on resource-constrained hardware, you can scale chunksCache replicas and per-replica memory down — Loki will just go to S3 more often. The numbers should never be zero though. Loki queries against unindexed log lines without a chunks cache are slow enough to make people stop trusting the platform.
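
For resource-constrained hardware, a scaled-down chunks cache could look like the sketch below. These numbers are illustrative only; we have not run this configuration, and the right size depends on your query volume.

```yaml
chunksCache:
  enabled: true
  replicas: 2
  allocatedMemory: 8192             # 8 GB per replica, in MB
  maxItemMemory: 2                  # MB, same chunk-size reasoning as above
  resources:
    requests:
      cpu: 500m
      memory: 10Gi                  # 8192 MB × 1.2 ≈ 9.6 GiB, rounded up
    limits:
      memory: 10Gi
```

That gives 2 × 8 GB = 16 GB of chunks cache, which trades more S3 fetches for a much smaller RAM footprint.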


Wrapping Up

That’s the deployment. SimpleScalable mode, three write replicas, three read replicas, two backend replicas, all on RKE2 with local-path PVCs for working state, all writing chunks and reading them back via Nutanix Objects.

The shape of it:

  • SimpleScalable mode — write/read/backend split, scale each independently, no separate distributor/ingester/querier pods.
  • TSDB index, schema v13, daily index period — the current standard. Set once at deployment, don’t revisit.
  • Single-tenant (auth_enabled: false). We don’t need tenant isolation; you can flip it on later without rebuilding.
  • ingestion_rate_strategy: global, pinned explicitly. global makes ingestion_rate_mb a cluster-wide cap by giving each distributor an equal share, so kube-proxy connection stickiness can still produce rate-limit errors on one busy write pod while the cluster as a whole has headroom. Size the cap with that in mind.
  • Required pod anti-affinity on backend — without it, both replicas can drift onto the same node, and kubectl drain deadlocks against the PDB during an OS patch or RKE2 upgrade.
  • 192 GB memcached chunks cache — Loki without a chunks cache is slow enough to make people stop opening dashboards. Size it generously.

Next post (4b) is where the operational decisions live: label strategy, the per-stream retention table we run (it’s a great list — kube-audit at 90 days, switch syslog at 365 days, calico-node CNI noise at 30 days, plus a dozen more), LogQL patterns we actually use, and the troubleshooting lessons from running this stack day to day.

Happy automating!