How I Run My Home Lab Like a Small Production Environment

I run my home lab as a small production environment. Not because the house needs enterprise architecture, but because the habits transfer: monitoring, backups, safe access, recovery paths, update policy, and enough automation that a reboot does not become a manual rebuild.

This is a sanitised version of the setup. I have removed hostnames, addresses, domains, paths, usernames, room IDs, credentials, and anything else that would identify the environment. I have also left out the media download stack on purpose. Plex is included because it is the user-facing media service; the automation behind it does not need to be public.

The stack at a glance

| Area | Apps and systems | What they do |
| --- | --- | --- |
| Network edge | UniFi gateway | Routing, Wi-Fi, firewall rules, port forwarding, segmentation, and gateway-level blocking. |
| DNS and ingress | Cloudflare DNS, Cloudflare Tunnel | Public DNS, secure ingress for selected services, edge filtering, and fewer open ports. |
| Primary storage | NAS with Docker | Main storage, application hosting, backup control, monitoring parent, and general self-hosted workloads. |
| Secondary storage and proxy | NAS with Docker, Nginx, Certbot | Reverse proxy, TLS renewal, communications services, secondary DNS, and selected public-facing routes. |
| Storage expansion | QNAP NAS | New storage target and migration path for the media library. |
| DNS filtering | AdGuard Home | Network DNS filtering, tracker blocking, and internal DNS overrides. |
| Remote access | Tailscale | Private admin access without exposing management ports. |
| Password vault | Vaultwarden | Self-hosted password management with controlled account creation. |
| Photos | Immich | Photo uploads, thumbnails, search, local ML features, and private library ownership. |
| Media | Plex, FileFlows | Media library, streaming, and file processing workflows, backed by storage, monitoring, and a migration plan. |
| Home automation | Home Assistant | Device control, dashboards, sensors, automations, alerts, and integration testing. |
| Streams | go2rtc | Local camera and real-time stream handling. |
| Household tools | KitchenOwl | Shared household and grocery management. |
| Communications | Matrix Synapse, Element Web, Synapse Admin, mautrix bridges | Private messaging, admin tooling, platform bridges, and the alert bus for the lab. |
| Monitoring | Netdata, Uptime Kuma | Host metrics, container metrics, uptime checks, and endpoint monitoring. |
| Alerting | Matrix bot accounts | Backup results, uptime failures, update notifications, system events, and security alerts. |
| Backups | Backrest, Restic, pg_dump, volume export scripts | Snapshot backups, database dumps, named-volume exports, retention policy, and restore workflows. |
| Updates | Watchtower | Automatic updates for low-risk containers and monitor-only alerts for fragile services. |
| Security | CrowdSec, Cloudflare bouncer, UniFi bouncer, firewall bouncer | Log analysis, ban decisions, gateway blocking, edge blocking, host blocking, and notifications. |
| Local AI | Hermes Agent, vLLM-compatible server, Docker | Private OpenAI-compatible LLM API, agent runtime, tool execution, and local automation workflows. |
| Image generation | ComfyUI | Local image generation workflows and model testing. |
| Scheduled automation | Cron | Reboot recovery, cleanup tasks, database exports, and small health scripts. |

Why I split the lab across systems

The most useful design choice has been separating storage-heavy workloads from public-edge workloads. The primary NAS does the heavy lifting: data, containers, monitoring, backups, and services that need local disk access. A second system handles reverse proxying, TLS, communications, and public-facing routes.

That split keeps the storage box from becoming the front door for everything. It also gives me a cleaner failure model. If I break a proxy rule, I should not be risking the data layer. If a storage job is busy, it should not make the public edge fragile.

The new storage target is there for migration and expansion. I do not treat that as a one-off copy. Once a device holds important data, it needs to enter the operating model: monitored, backed up, documented, and tested.

The edge should stay small

The gateway does routing, firewall policy, Wi-Fi, and selected forwarding. It does not become the admin interface for the whole lab.

Remote access goes through a private mesh VPN (Tailscale). Selected public services go through Cloudflare Tunnel or a controlled reverse proxy path. Internal DNS overrides keep local traffic local where that makes sense. That matters for services like photo uploads, where sending local mobile traffic out to the internet and back is wasteful and slow.

Nginx acts as the front desk. It terminates TLS, routes requests, and keeps application containers away from the public edge. Certbot renews certificates through DNS validation, so certificate renewal does not require opening extra challenge endpoints.
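Automated renewal is easy to forget about until it quietly stops working. A minimal sketch of an expiry check, assuming the certificate's `notAfter` string has already been read (for example from `getpeercert()` on a TLS socket). The function name and the idea of alerting on a day threshold are illustrative assumptions, not part of the setup described here.

```python
import ssl
import time
from typing import Optional


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Return the days left before a certificate expires.

    not_after uses the format found in ssl getpeercert()["notAfter"],
    e.g. "Jan  5 09:34:43 2018 GMT".
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires_at - current) / 86400
```

A scheduled job could call this and raise an alert when the result drops below whatever margin feels comfortable.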

Home Assistant is the practical control plane

Home Assistant is the part of the lab that turns infrastructure into daily utility. It gives me device control, dashboards, automations, sensors, alerts, and a real testbed for integration work.

The small loops matter most. Sensors feed dashboards. Dashboards expose the controls people need. Automations remove repeated actions. Alerts tell me when something needs attention. Custom integrations fill gaps when commercial products do not talk to each other cleanly.

Home Assistant also punishes sloppy engineering. A bad automation annoys real people. A fragile integration fails in the middle of normal life. If something is too brittle for the house, it needs better error handling.
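One common form of "better error handling" for a flaky device or API is a bounded retry with backoff. This is a generic sketch, not code from any integration mentioned here; `call_with_retry` and its parameters are hypothetical names.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    action: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call action, retrying on any exception with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return action()
        except Exception:
            # Give up on the last attempt and let the caller see the error.
            if attempt == attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))
    raise RuntimeError("unreachable")
```

The retry count is deliberately small. An automation that retries forever hides the failure instead of reporting it.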

Plex is the visible media service

Plex is the part of the media setup people actually see. The rest of the media download automation is deliberately out of scope for this post.

Plex still makes a useful infrastructure test. It depends on reliable storage, sensible library paths, configuration backups, hardware-aware transcoding, and a migration plan when storage changes. Large files, metadata scans, remote clients, and background jobs find weak storage design fast.

Immich made photos an owned service

Immich moved photos out of the category of someone else’s platform and into the lab. Mobile uploads, thumbnails, search, and local ML run on hardware I control.

Photos demand a higher backup standard than replaceable media. The library, database, generated assets, and application config all need a restore path. A photo service is not healthy because the web UI loads. It is healthy when I can restore it and trust the result.

Matrix is the alert bus

Matrix started as a communications service, but it has become the operations room. Synapse provides the homeserver, Element gives me a web client, and bridges connect selected external messaging platforms.

The bigger win is alerting. Backup jobs, uptime checks, container updates, security decisions, SSH events, and system scripts all report into private rooms. I do not want five dashboards open to know if the lab is healthy. I want the lab to tell me when something changed.
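For scripts that report into a room, the Matrix client-server API accepts a plain HTTP `PUT` to send a message event. A minimal sketch that only builds the request, assuming a bot access token is available; the homeserver URL, room ID, and token shown in the usage note are placeholders, and `build_matrix_alert` is a hypothetical helper name.

```python
import json
import urllib.parse
import urllib.request
import uuid
from typing import Optional


def build_matrix_alert(
    homeserver: str,
    room_id: str,
    access_token: str,
    text: str,
    txn_id: Optional[str] = None,
) -> urllib.request.Request:
    """Build a PUT request that sends a text message to a Matrix room."""
    # The transaction ID makes the send idempotent if the request is retried.
    txn = txn_id or uuid.uuid4().hex
    url = "{}/_matrix/client/v3/rooms/{}/send/m.room.message/{}".format(
        homeserver.rstrip("/"),
        urllib.parse.quote(room_id, safe=""),
        urllib.parse.quote(txn, safe=""),
    )
    body = json.dumps({"msgtype": "m.text", "body": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
    )
```

Sending is then `urllib.request.urlopen(build_matrix_alert("https://matrix.example.invalid", "!room:example.invalid", token, "backup finished"), timeout=10)`, with real values substituted for the placeholders.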

Monitoring is not the same as uptime

Netdata and Uptime Kuma answer different questions. Netdata tells me what the hosts and containers are doing. Uptime Kuma tells me whether services can be reached.

You need both. A host can look fine while a login page is broken. A service can respond while the storage layer is running out of space.

I also tune alerts for the environment. Media processing can legitimately use a lot of CPU. Storage caches can create scary-looking backlog patterns during normal operation. Alerting should reduce uncertainty, not create noise.
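The two questions can be shown side by side in a few lines. This is a toy illustration of the distinction, not how Netdata or Uptime Kuma work internally; the function names and the 10% free-space threshold are assumptions.

```python
import shutil
import urllib.error
import urllib.request
from typing import List


def service_reachable(url: str, timeout: float = 5.0) -> bool:
    """Uptime question: does the endpoint answer with a non-error status?"""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError, ValueError):
        return False


def disk_free_fraction(path: str = ".") -> float:
    """Host question: what fraction of the filesystem is still free?"""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def classify(reachable: bool, free_fraction: float, min_free: float = 0.10) -> List[str]:
    """Return the list of problems; an empty list means both checks pass."""
    problems = []
    if not reachable:
        problems.append("service unreachable")
    if free_fraction < min_free:
        problems.append("storage low")
    return problems
```

A service that responds while the disk is nearly full yields `["storage low"]`, which is exactly the case a reachability check alone would miss.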

Backups are only useful if restore is designed

Backrest and Restic handle snapshot backups. The backup plan includes the important data and excludes junk: active working directories, caches, container overlay layers, and anything that would make restores slower without adding value.
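The exclusion idea can be sketched as a simple path filter. The patterns below are hypothetical examples and use Python's `fnmatch` semantics, which are not the same as Restic's own exclude matching, so treat this as an illustration of the policy rather than a drop-in exclude list.

```python
import fnmatch
from typing import Iterable, List

# Hypothetical patterns for data that adds restore time but no value.
EXCLUDE_PATTERNS = [
    "*/cache/*",
    "*/.cache/*",
    "*/tmp/*",
    "*/overlay2/*",
    "*.partial",
]


def is_excluded(path: str, patterns: Iterable[str] = EXCLUDE_PATTERNS) -> bool:
    """Return True if the path matches any exclude pattern (case-sensitive)."""
    return any(fnmatch.fnmatchcase(path, pattern) for pattern in patterns)


def backup_candidates(paths: Iterable[str]) -> List[str]:
    """Keep only the paths that should go into the backup."""
    return [path for path in paths if not is_excluded(path)]
```

Running a candidate list through a filter like this before a backup is a cheap way to review what the exclusions actually remove.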

Some container data needs extra handling. Databases and named volumes do not always become safe backups just because their files were copied. I use export scripts for database dumps and volume archives so the backup system captures something I can restore with confidence.
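A database export script can be as small as a wrapper around `pg_dump`. A minimal sketch, assuming a PostgreSQL database reachable from where the script runs and credentials supplied through the environment or a password file; the database name, export directory, and function names are hypothetical. If the database runs in a container, the command would need to be wrapped accordingly.

```python
import subprocess
from datetime import datetime, timezone
from pathlib import Path
from typing import List, Optional


def dump_path(export_dir: str, database: str, now: Optional[datetime] = None) -> Path:
    """Build a timestamped file name so exports never overwrite each other."""
    moment = now or datetime.now(timezone.utc)
    stamp = moment.strftime("%Y%m%dT%H%M%SZ")
    return Path(export_dir) / f"{database}-{stamp}.dump"


def build_pg_dump_command(
    database: str, output_file: str, host: str = "localhost", user: str = "postgres"
) -> List[str]:
    """Return the pg_dump argument list for a custom-format dump."""
    return [
        "pg_dump",
        "--host", host,
        "--username", user,
        "--format=custom",
        "--file", output_file,
        database,
    ]


def export_database(database: str, export_dir: str) -> Path:
    """Run pg_dump and return the dump path; raises if pg_dump fails."""
    target = dump_path(export_dir, database)
    subprocess.run(build_pg_dump_command(database, str(target)), check=True)
    return target
```

The custom format is restored with `pg_restore`. The snapshot backup then picks up the export directory like any other data.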

The restore path is the product. Stop the service, restore config, restore the database or volume export, start the service, verify it. If I cannot explain that path, the backup is not finished.
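The restore sequence above can be written down as an ordered runbook that stops at the first failed step. This is a sketch of the pattern only; the step names mirror the paragraph, and the actions are placeholders for whatever a given service needs.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], bool]]


def run_restore(steps: List[Step]) -> Tuple[bool, List[str]]:
    """Run steps in order; stop at the first one that reports failure.

    Returns (success, names of the steps that completed).
    """
    completed: List[str] = []
    for name, action in steps:
        if not action():
            return False, completed
        completed.append(name)
    return True, completed
```

A usage example with placeholder actions:

```python
steps = [
    ("stop service", lambda: True),
    ("restore config", lambda: True),
    ("restore database export", lambda: True),
    ("start service", lambda: True),
    ("verify", lambda: True),
]
ok, done = run_restore(steps)
```

Keeping verification as an explicit final step means a restore is never reported as successful just because the service started.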

Updates need risk tiers

Watchtower handles container update checks, but not every container gets the same policy. Low-risk services can update automatically. Fragile services run monitor-only. Locally built or pinned images stay out of automated update checks.

Auto-updating everything is reckless. Manually updating everything sounds safe until you stop doing it. Risk tiers give me a workable middle ground: update the boring stuff, warn me about the risky stuff, and avoid noisy checks for images that should not change.
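One way to make the tiers explicit is to map each tier to Watchtower's container labels. The label keys below are Watchtower's documented ones; whether this lab sets policy through labels is an assumption, and the tier names are hypothetical.

```python
from enum import Enum
from typing import Dict


class UpdateTier(Enum):
    AUTO = "auto"                  # low-risk: update automatically
    MONITOR_ONLY = "monitor-only"  # fragile: notify, do not update
    EXCLUDED = "excluded"          # locally built or pinned: skip checks


def watchtower_labels(tier: UpdateTier) -> Dict[str, str]:
    """Return the container labels that express the given update tier."""
    if tier is UpdateTier.AUTO:
        return {"com.centurylinklabs.watchtower.enable": "true"}
    if tier is UpdateTier.MONITOR_ONLY:
        return {
            "com.centurylinklabs.watchtower.enable": "true",
            "com.centurylinklabs.watchtower.monitor-only": "true",
        }
    return {"com.centurylinklabs.watchtower.enable": "false"}
```

Keeping the mapping in one place makes the policy reviewable: every service has to be assigned a tier before it gets labels.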

Security needs detection and enforcement

CrowdSec gives the lab a feedback loop. It analyses logs, creates decisions, and pushes those decisions to enforcement points.

The useful pattern is layered enforcement. Some decisions can apply at the cloud edge. Some can apply at the gateway. Some can apply at the host. The goal is not to pretend a home lab is an enterprise SOC. The goal is to practise the loop: detect, decide, enforce, notify, review.
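The detect and decide halves of that loop fit in a few lines as a toy model. This is not CrowdSec's engine, which uses its own parsers and scenarios; the log format, regex, and threshold here are assumptions chosen for illustration.

```python
import re
from collections import Counter
from typing import Iterable, List

# Matches OpenSSH-style failed password lines and captures the source IPv4.
FAILED_LOGIN = re.compile(r"Failed password for .* from (\d{1,3}(?:\.\d{1,3}){3})")


def detect(log_lines: Iterable[str]) -> Counter:
    """Count failed login attempts per source IP."""
    counts: Counter = Counter()
    for line in log_lines:
        match = FAILED_LOGIN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def decide(counts: Counter, threshold: int = 3) -> List[str]:
    """Return the IPs that reached the threshold, sorted for stable output."""
    return sorted(ip for ip, seen in counts.items() if seen >= threshold)
```

Enforcement and notification would consume the list from `decide`, with each enforcement point applying it at its own layer.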

Local AI is now part of the lab

The local AI box runs two kinds of workload. One is a vLLM-compatible server that exposes a private OpenAI-style API for LLM experiments and agent workloads. The other is ComfyUI for local image generation.

Those workloads compete for memory, so I use scripts to switch modes. That sounds simple, but it is the kind of operational habit that matters. Make the desired state explicit. Script the transition. Test the health check. Do not rely on remembering which process should be running.
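A minimal sketch of that habit: name the desired mode, stop everything else, start the target, and report the health check. The mode names and the start, stop, and health callables are hypothetical; in practice they would wrap whatever actually launches each workload.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Mode:
    start: Callable[[], None]
    stop: Callable[[], None]
    healthy: Callable[[], bool]


def switch_mode(target: str, modes: Dict[str, Mode]) -> bool:
    """Stop every other mode, start the target, and return its health."""
    if target not in modes:
        raise ValueError(f"unknown mode: {target}")
    # Free memory first: stop everything that is not the desired state.
    for name, mode in modes.items():
        if name != target:
            mode.stop()
    modes[target].start()
    return modes[target].healthy()
```

Returning the health result rather than assuming success lets the caller alert when a switch did not actually land.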

See more: DGX Spark: Running Production AI on a $3,000 Desktop

Hermes is the operator layer

Hermes is where the lab stops being a set of dashboards and starts behaving like an operator. I use it as the agent layer that can read context, call tools, run commands, write code, and turn a problem into a change I can review.

The useful part is the number of systems it can reach. It connects into Home Assistant through MCP-backed tools, works with GitHub for repos, issues, pull requests, and releases, and can query UniFi when I need network state instead of guesswork. It also has access to local files, shell commands, browser checks, scheduled jobs, and the notes/memory layer that carries context between sessions.

That changes the operating model. If an integration breaks, Hermes can inspect the logs, patch the repo, run tests, open the PR, and help me release it. If the house has a problem, it can look at Home Assistant state and the surrounding infrastructure instead of answering from memory. The AI box provides the local model capacity. Hermes gives that capacity hands.

Cron still earns its place

Not every job needs a platform. Cron handles the small tasks that keep the lab from depending on my memory: reboot recovery, cleanup jobs, database exports, and system event checks.

Those scripts are boring right up until they do not exist. Then the same manual fix comes back after every reboot.
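A reboot recovery script mostly compares what should be running with what is running and starts the difference. A minimal sketch, assuming the list and start operations are supplied by the caller; the service names and the crontab path in the comment are hypothetical.

```python
from typing import Callable, Iterable, List

# Hypothetical crontab entry that would run a script like this at boot:
#   @reboot /usr/bin/python3 /path/to/recover.py


def missing_services(expected: Iterable[str], running: Iterable[str]) -> List[str]:
    """Return expected services that are not running, in expected order."""
    running_set = set(running)
    return [name for name in expected if name not in running_set]


def recover(
    expected: Iterable[str],
    list_running: Callable[[], Iterable[str]],
    start: Callable[[str], None],
) -> List[str]:
    """Start each missing service and return the names that were started."""
    started: List[str] = []
    for name in missing_services(expected, list_running()):
        start(name)
        started.append(name)
    return started
```

Because it only starts what is missing, the script is safe to run more than once.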

The operating model matters more than the apps

The apps will change. The hardware will change. The useful part is the operating model.

A home lab becomes valuable when it has clear ingress paths, private admin access, monitored services, working backups, tested restores, alert routing, update policy, security feedback loops, documented recovery steps, and enough automation to survive a normal failure.

That is why I keep building it. It is a place to practise production thinking with real users, real data, and real consequences, without waiting for a business case.
