Building a Talos Kubernetes Cluster from Scratch

Building a Kubernetes cluster from scratch on bare metal is one of those projects that teaches you more about infrastructure than any managed service ever could. This is the story of K8S-CLUSTER — a 6-node Talos Linux cluster running on mini PCs, with full disk encryption, Cilium eBPF networking, and Longhorn distributed storage.

Why Talos?

Talos Linux is a minimal, immutable OS purpose-built for Kubernetes. There's no SSH, no shell, no package manager — everything is managed through a declarative API. This makes it ideal for a homelab where you want production-grade infrastructure without the maintenance burden of traditional Linux nodes.

The tradeoff is steep: you can't just apt install something when you need it. Every system extension must be baked into the installer image at provision time via Image Factory. But what you get in return is a cluster that's reproducible, auditable, and resistant to configuration drift.

Cluster Architecture

The cluster runs on 6 bare-metal mini PCs, all acting as combined control-plane and worker nodes:

Node   | IP        | CPU | RAM  | Role
node01 | 10.x.x.16 | 4C  | 8GB  | CP + Worker
node02 | 10.x.x.17 | 6C  | 8GB  | CP + Worker + Storage
node03 | 10.x.x.18 | 4C  | 16GB | CP + Worker + Storage
node04 | 10.x.x.19 | 4C  | 16GB | CP + Worker + Storage
node05 | 10.x.x.20 | 4C  | 16GB | CP + Worker + Storage
node06 | 10.x.x.21 | 4C  | 16GB | CP + Worker + Storage

A virtual IP at 10.x.x.10 floats across the control-plane nodes, providing a stable API endpoint without an external load balancer. KubePrism complements this by giving every node a local, load-balanced Kubernetes API endpoint on port 7445, so in-cluster clients survive the loss of any single control-plane node.
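Per node, the VIP is declared on the uplink interface in that node's patch — a sketch for node01, where the interface name (eth0) and exact layout are assumptions:

```yaml
machine:
  network:
    interfaces:
      - interface: eth0        # assumed interface name; varies per node
        dhcp: false
        addresses:
          - 10.x.x.16/24       # this node's static IP
        vip:
          ip: 10.x.x.10        # shared API VIP, floats across control-plane nodes
```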

Config Architecture: Base + Patches

One of the most important design decisions was separating the machine configuration into composable layers:

  • controlplane.yaml — Cluster-wide settings: API server config, PKI certificates, audit policy, PodSecurity admission, kubelet settings. Contains zero node-specific configuration.
  • Per-node patches — Each node gets its own patch with hostname, static IP, VIP assignment, and disk layout.
  • Feature patches — Cilium CNI, LUKS encryption, Longhorn extensions, and control-plane scheduling are each separate patches.

The apply command stacks them:

talosctl apply-config --insecure --nodes 10.x.x.16 \
  --file controlplane.yaml \
  --config-patch @patches/k8s-node01.yaml \
  --config-patch @patches/cilium-cni.yaml \
  --config-patch @patches/allow-scheduling-cp.yaml \
  --config-patch @patches/disk-encryption.yaml \
  --config-patch @patches/longhorn-extensions.yaml

This separation means adding a new feature (like disk encryption) is a single patch applied to all nodes, and per-node hardware differences (interface names, disk paths) are isolated to their own files. No merge conflicts, no configuration drift.

LUKS2 Full Disk Encryption

Both the STATE partition (Talos OS state) and EPHEMERAL partition (container runtime data) are encrypted with LUKS2:

apiVersion: v1alpha1
kind: VolumeConfig
name: STATE
encryption:
  provider: luks2
  keys:
    - nodeID: {}
      slot: 0
---
apiVersion: v1alpha1
kind: VolumeConfig
name: EPHEMERAL
encryption:
  provider: luks2
  keys:
    - nodeID: {}
      slot: 0

The nodeID key type derives the encryption key from the machine's unique identifier — no manual key management, no key distribution problem. Each node's encrypted volumes appear as /dev/dm-0 and /dev/dm-1. Performance impact on NVMe is negligible.
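A quick way to confirm the encryption took effect — a sketch, since the exact resource output varies by Talos version:

```shell
# VolumeStatus resources report each partition and its encryption provider
talosctl -n 10.x.x.16 get volumestatus

# The device-mapper targets (dm-0, dm-1) backing the encrypted partitions
talosctl -n 10.x.x.16 read /proc/partitions
```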

Cilium eBPF Networking

The cluster uses Cilium v1.19.1 as the CNI with full kube-proxy replacement via eBPF. The base config sets:

cluster:
  network:
    cni:
      name: none    # Talos bootstraps without CNI; Cilium installed via Helm post-bootstrap
  proxy:
    disabled: true  # Cilium replaces kube-proxy entirely

Cilium is deployed post-bootstrap via Helm with L2 load balancing, Hubble observability, and 2 operator replicas for HA. The eBPF dataplane handles all packet forwarding at the kernel level — no iptables rules to manage or debug.
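The post-bootstrap install likely resembles the invocation from the official Talos Cilium guide, pointed at the KubePrism endpoint (localhost:7445), with the HA and observability settings described above; treat the exact value set as a sketch rather than this cluster's literal command:

```shell
helm repo add cilium https://helm.cilium.io/
helm install cilium cilium/cilium \
  --namespace kube-system \
  --set ipam.mode=kubernetes \
  --set kubeProxyReplacement=true \
  --set k8sServiceHost=localhost \
  --set k8sServicePort=7445 \
  --set cgroup.autoMount.enabled=false \
  --set cgroup.hostRoot=/sys/fs/cgroup \
  --set securityContext.capabilities.ciliumAgent="{CHOWN,KILL,NET_ADMIN,NET_RAW,IPC_LOCK,SYS_ADMIN,SYS_RESOURCE,DAC_OVERRIDE,FOWNER,SETGID,SETUID}" \
  --set securityContext.capabilities.cleanCiliumState="{NET_ADMIN,SYS_ADMIN,SYS_RESOURCE}" \
  --set l2announcements.enabled=true \
  --set operator.replicas=2 \
  --set hubble.relay.enabled=true \
  --set hubble.ui.enabled=true
```

The cgroup and capabilities overrides are Talos-specific: the agent cannot mount cgroup2 itself on an immutable host, so the chart is told where the hierarchy already lives.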

One critical gotcha: Cilium's eBPF kube-proxy replacement breaks Tailscale Service annotations. The eBPF dataplane performs ClusterIP DNAT at the traffic-control layer, causing asymmetric routing that confuses Tailscale's proxy. The fix is to use Tailscale Ingress resources (L7) instead of Service annotations (L4). More on this in a future post.
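The working L7 shape, assuming the Tailscale Kubernetes operator is installed (the service and hostname here are hypothetical):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp
spec:
  ingressClassName: tailscale    # handled by the Tailscale operator, bypassing the L4 DNAT path
  defaultBackend:
    service:
      name: myapp                # hypothetical backend Service
      port:
        number: 80
  tls:
    - hosts:
        - myapp                  # exposed as myapp on the tailnet
```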

Longhorn Distributed Storage

Persistent storage runs on Longhorn v1.8.1 with 3-way replication across 5 storage nodes (node02-06). Each storage node has a SATA SSD mounted at /var/lib/longhorn via Talos' machine.disks config.
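The disk setup presumably combines machine.disks with the kubelet bind mount Longhorn's documentation calls for — a sketch, where the device path is an assumption:

```yaml
machine:
  disks:
    - device: /dev/sda                     # assumed SATA SSD device path
      partitions:
        - mountpoint: /var/lib/longhorn    # Longhorn's default data path
  kubelet:
    extraMounts:
      - destination: /var/lib/longhorn     # propagate the mount into the kubelet's namespace
        type: bind
        source: /var/lib/longhorn
        options: [bind, rshared, rw]
```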

The raw capacity is ~1.8 TB (1x 1TB + 4x 200GB), giving ~600 GB usable with 3-way replication. Longhorn is set as the default StorageClass so all PVC requests automatically get distributed, replicated storage.

Longhorn requires iSCSI tools to manage block devices. Since Talos has no package manager, these are baked into the installer image via Image Factory:

machine:
  install:
    image: factory.talos.dev/installer/your-schematic-hash....:v1.12.5

That schematic hash includes iscsi-tools and util-linux-tools. If you need to add or remove extensions, you generate a new schematic and re-provision.
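For reference, the Image Factory schematic behind such a hash is itself a small YAML document; POSTing it to factory.talos.dev returns the hash used in the installer URL:

```yaml
customization:
  systemExtensions:
    officialExtensions:
      - siderolabs/iscsi-tools        # iscsiadm and friends for Longhorn block devices
      - siderolabs/util-linux-tools   # linux-utils binaries Longhorn expects on the host
```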

Security Hardening

The cluster enforces multiple security layers:

  • PodSecurity Admission — Baseline enforcement cluster-wide, with exemptions only for kube-system and longhorn-system
  • Seccomp — RuntimeDefault profile enforced by kubelet for all containers
  • Audit Policy — RequestResponse logging for secrets and RBAC writes, Metadata for everything else
  • CiliumNetworkPolicy — Per-namespace egress/ingress rules (default deny)
  • Encrypted etcd backups — Daily snapshots encrypted with age, 30-day retention

The audit policy is deliberately scoped: full request/response for sensitive operations (secrets, RBAC bindings), metadata-only for everything else. This keeps etcd write pressure low while capturing the events that matter for security forensics.
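That scoping can be expressed as a Kubernetes audit Policy along these lines (a sketch; the exact RBAC resource list is an assumption):

```yaml
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  # Full request/response bodies only for sensitive writes
  - level: RequestResponse
    resources:
      - group: ""
        resources: ["secrets"]
      - group: "rbac.authorization.k8s.io"
        resources: ["roles", "rolebindings", "clusterroles", "clusterrolebindings"]
  # Everything else: metadata only, keeping etcd write pressure low
  - level: Metadata
```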

etcd Backup Strategy

A cron job on the management host runs daily at 2 AM:

talosctl -n 10.x.x.16 etcd snapshot etcd-backup.snapshot
age -r age1xx... -o backups/etcd-$(date +%F).snapshot.age etcd-backup.snapshot
rm etcd-backup.snapshot
find backups/ -name "*.age" -mtime +30 -delete

Snapshots are taken from node01 (the first control plane), encrypted with age, and stored locally with 30-day retention. Recovery requires the private key stored in .age-key.txt (not in git, not on any node).
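Recovery is roughly the reverse — the filenames here are illustrative, but --recover-from is talosctl's documented bootstrap flag:

```shell
# Decrypt the snapshot with the offline age key, then bootstrap etcd from it
age -d -i .age-key.txt -o etcd-backup.snapshot etcd-backup.snapshot.age
talosctl -n 10.x.x.16 bootstrap --recover-from=etcd-backup.snapshot
```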

Lessons Learned

After building and migrating this cluster (including a full subnet migration from 192.168.x.0/24 to 10.x.x.0/24), here are the gotchas worth knowing:

  1. Old disk signatures block partitioning — LVM/bluestore signatures on secondary disks prevent Talos from partitioning them. Wipe first with a privileged pod: dd if=/dev/zero of=/dev/sda bs=1M count=10
  2. Device names shift after wipe — A disk at /dev/sdb can become /dev/sda after removing LVM device mapper. Standardize your config after cleanup.
  3. etcd doesn't auto-update peer URLs on IP change — Subnet migration requires removing the etcd member, resetting STATE+EPHEMERAL, and re-applying config in maintenance mode.
  4. Image Factory schematics are immutable — Need a new extension? New schematic hash, new installer image, full node re-provision.
  5. HostnameConfig overrides machine.network.hostname — If both exist, HostnameConfig wins silently. Remove it from your base config.
  6. Regenerate kubeconfig after reboot — Talos rotates certificates on boot. Run talosctl kubeconfig --force to stay current.
  7. Single-node etcd reset needs --graceful=false — Without it, etcd hangs waiting for quorum that doesn't exist.
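The disk wipe from gotcha #1 can be run as a throwaway privileged pod — a sketch, where the node name, image, and device path are placeholders for the disk in question:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: disk-wipe
spec:
  nodeName: node02                  # pin to the node whose disk needs wiping
  restartPolicy: Never
  containers:
    - name: wipe
      image: alpine                 # any image that ships dd
      command: ["dd", "if=/dev/zero", "of=/dev/sda", "bs=1M", "count=10"]
      securityContext:
        privileged: true            # required for raw block-device access
      volumeMounts:
        - name: dev
          mountPath: /dev
  volumes:
    - name: dev
      hostPath:
        path: /dev
```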

What's Running

The cluster currently hosts a mix of web applications, databases, internal tools, and a full monitoring stack. All services are exposed via Tailscale Ingress for private access, with selected services also available publicly through Cloudflare Tunnel.

Total resource usage is modest: ~2.1 CPU cores requested, ~2.2 GB RAM requested, ~30 GB storage across all workloads. There's plenty of headroom for growth on this 26-core, 80 GB cluster.