Architectures for Large-Scale IoT Edge Container Cluster Management - 6 - Personal Recommendations
This article was last updated on: May 17, 2026
Previous Articles
- Architectures for Large-Scale IoT Edge Container Cluster Management - 0 - Introduction to Edge Containers and Architectures
- Architectures for Large-Scale IoT Edge Container Cluster Management - 1 - Rancher + K3s
- Architectures for Large-Scale IoT Edge Container Cluster Management - 2 - HashiCorp Solution: Nomad
- Architectures for Large-Scale IoT Edge Container Cluster Management - 3 - Portainer
- Architectures for Large-Scale IoT Edge Container Cluster Management - 4 - KubeEdge
- Architectures for Large-Scale IoT Edge Container Cluster Management - 5 - Summary
│ 📚️Reference:
│ IoT Edge Computing Article Series
Overview
In the previous articles, I listed the following solutions:
- Rancher + K3s
- HashiCorp Solution — Nomad + Docker
- Portainer + Docker
- KubeEdge
Among these, Rancher + K3s is K8s-based and K8s-compatible. KubeEdge is built on top of K8s, but its core architecture is an entirely different system. The HashiCorp and Portainer solutions are essentially unrelated to K8s and are primarily Docker-based (they can also work with other runtimes such as Podman).
My edge hardware consists mainly of single-board ARM development boards, and I tested each solution hands-on against it.
After in-depth experience with two additional container platforms — HashiCorp Nomad and Portainer — it became clear that compared to K8s/K3s, these two are better suited for IoT scenarios. (Let’s set KubeEdge aside for now — I personally believe KubeEdge is better suited for more complex, business-coupled, or advanced edge computing systems that require edge AI scheduling.)
K8s is not well-suited for IoT for the following reasons:
- High resource consumption
- High network requirements
- Complex networking model
K3s only (partially) addresses the resource consumption issue, while the latter two problems remain.
│ 📝Notes
│
│ K3s is a fully K8s-compatible distribution. Since version 1.23, as K8s features have grown, K3s has also become increasingly heavy.
│ According to the K3s installation requirements:
│
│ * For arm64 devices, the OS must use 4K page sizes;
│ * CPU/Memory minimum is 1 core with 512MB RAM, recommended is 2 cores with 1GB; but in real-world scenarios, requirements are higher — typically starting at 2GB RAM.
│ * More critically, regarding storage requirements: the default K8s images needed to start K3s consume roughly 4–6 GB of space. If running etcd, SD cards — a common storage medium in edge scenarios — simply cannot meet the requirements.
│ Regarding the networking model, K3s recently added Tailscale integration to further simplify the networking model. However, K8s’s built-in Host Network / Service IP / Cluster IP cannot be avoided.
Nomad and Portainer are both lighter than K3s. Their network requirements are also low: a single server instance can manage all devices.
The networking model is also simple — host networking is sufficient.
Additionally, both have special optimizations for IoT scenarios, such as unidirectional network connectivity, optimizations for edge disconnection scenarios, and support for managing non-Docker resources.
Problems with the Rancher + K3s Model at the Edge
In practice, the Rancher + K3s edge deployment architecture looks like this:


The real-world deployment makes it clearer:
- 1 set: A Rancher cluster is deployed in the “cloud”, responsible for managing all K3s clusters at the “edge”. The Rancher cluster can also host cloud-side business applications, handling synchronization with edge-side business systems and dispatching data or commands.
- N sets: K3s is installed on “edge” devices, running edge business applications for “endpoint” devices to connect to. 🐾 Each “edge” requires a full K3s cluster, meaning both K3s Workers and a K3s Master.
- “Endpoints” are the outermost edge of business applications, connecting to the “edge” via network to form a business network centered around the “edge”.
What problems does this architecture have in practice?
Edge Storage Capacity
Many edge devices have only 8GB of storage. After the OS and essential packages, only 4–6 GB remains. K8s also enforces image and container garbage collection once disk usage crosses its thresholds. In extreme cases, the K3s master cannot even pull all of the images it needs, causing system errors and K3s startup failures.
Insufficient Edge Storage Performance
Edge storage is primarily SD cards or eMMC. When several K3s workers are attached to one K3s Master, the Master also runs into insufficient storage I/O performance.
High Network Requirements Between K3s Master and Workers at the Edge
Since K3s is fully K8s-compatible, and K8s’s list-watch mechanism is designed for stable data center environments, K3s also demands high network quality, which the edge simply cannot provide. In practice, the following situations occur:
- Master and Worker disconnected for extended periods
- Intermittent connectivity between Master and Worker
- Worker offline for extended periods
- DNS network anomalies
- Master can reach Worker, but Worker cannot reach Master
- Master cannot reach Worker, but Worker can reach Master
- Very low bandwidth between Master and Worker
- IP address changes between Master and Worker
- …
All kinds of network instability are extremely common. Any of the above situations can cause application failures on K3s workers, K3s master failures, or even complete K3s edge cluster failures.
This is the biggest problem with K8s-based edge architectures.
High Network Requirements Between Cloud and Edge, with No Self-Healing After Failures
The cloud and edge are connected through Rancher’s Agent using the WebSocket protocol.
However, in practice, the cloud-to-edge connection also demands high network quality. Managing the edge from the cloud requires large amounts of real-time data synchronization. Network anomalies can cause the edge to go offline with no ability to self-heal and reconnect.
Poor Edge Self-Healing Capability
Due to the incompatibility between complex edge network conditions and K8s architecture, edge self-healing capability is very poor in practice.
When edge anomalies occur, nine times out of ten, manual login to the edge device is required for recovery.
High CPU and Memory Requirements at the Edge
The edge runs a complete K8s cluster, which means all of these services must be running:
- etcd or K3s kine + SQLite
- K8s API Server
- K8s Scheduler
- Various K8s controllers
- CRI: containerd runtime
- CNI: network plugin
- CSI: at minimum a local-path pod
- CoreDNS
- Metrics Server
- Ingress: Traefik
- kubelet
- kube-proxy
Beyond these components, host-level services are also required:
- iptables or nftables
- Netfilter
- OverlayFS
- …
All these pods and services consume significant CPU and memory resources on edge devices.
Complex Edge Networking Model
Again, this is a problem introduced by K8s. The edge runs a complete K8s/K3s cluster, so the edge networking model naturally includes:
- Host Network
- Service Network
- Cluster Network
- DNS
Implementing the K8s CNI model also introduces overlay networks.
To achieve cloud-edge-endpoint connectivity, you may even need to introduce:
- Tunnel networks
- Edge gateway networks
- …
This further increases network complexity!
It makes troubleshooting extremely difficult — turning simple problems into complex ones.
Summary
With Rancher + K3s, the network topology architecture for edge computing is:
- 1 set: A Rancher cluster deployed in the “cloud”, managing all K3s clusters at the “edge”. The Rancher cluster can also host cloud-side business applications, handling synchronization with edge-side systems and dispatching data or commands.
- N sets: K3s installed on “edge” devices, running edge business applications for “endpoints” to connect to. 🐾 Each “edge” requires a full K3s cluster with both K3s Workers and a K3s Master.
- “Endpoints” are the outermost edge, connecting to the “edge” via network to form a business network centered around the “edge”.
K3s is a K8s-compatible implementation, and due to the network topology architecture, this approach has the following problems:
- Insufficient edge storage capacity
- Insufficient edge storage performance
- Edge cannot meet the high network requirements between K3s Master and Workers
- Excessively high network requirements between cloud and edge, with no self-healing after failures
- Poor edge self-healing capability
- High CPU and memory requirements at the edge
- Complex edge networking model
And each of these problems is virtually unsolvable.
Edge Computing Requirements for Container Orchestration
Based on the scenarios above, here is my understanding of what edge computing needs from container orchestration:
- Ideally just a single agent at the edge — no additional images or components
- Lightweight resource consumption — minimal CPU, memory, and storage usage for the agent
- Unified cloud-side management — no resource-consuming management plane needed at the edge
- Strong adaptability to weak network environments
- Strong self-healing capability
- No additional networking models — ideally runs directly on host networking without hard DNS dependency
The Docker-based edge container cluster management solutions, HashiCorp Nomad and Portainer, perform much better in these regards.
Portainer for Edge Computing

Portainer has specific optimizations for edge scenarios — refer to the Edge architecture on the left side.
- The “cloud” side handles management, with the Portainer server in the cloud
- The “edge” side runs only Edge Agents
- Cloud and edge don’t require bidirectional communication — the edge periodically pulls information from the server via direct connection or tunnel
These are networking model optimizations that make it more suitable for edge computing scenarios.
Additionally, Portainer’s Agent is extremely lightweight, using only about 10MB of memory. Only this single Agent needs to be deployed at the edge — which is why it can run on edge devices with limited hardware resources. The only dependency is that Docker must be installed on the edge device.
Open Source Edition Edge Features
- Edge Agent default network policy: pull
- Edge Compute modules and menus:
  - Edge Groups: Group edge devices/environments statically or dynamically into Edge Groups for batch management
  - Edge Stacks: Similar to Docker Compose; push a set of edge services to one or all Edge Groups
  - Edge Jobs: Similar to crontab; run batch jobs on edge devices
- Edge emergency feature support:
  - Intel OpenAMT
  - FDO
- Manage edge OS-level filesystem (implemented via Docker Linux socket)
With these features, Portainer can:
- Maintain connectivity and management between edge devices and the Portainer Server through pull heartbeats under complex network conditions
- Achieve batch service and job deployment to edge devices through Edge Groups / Edge Stacks / Edge Jobs
However, the open source edition has significantly fewer features compared to the commercial edition. For example, the open source edition lacks one-click batch onboarding and other important edge features.
Commercial Edition Additional Features
- One-click onboarding: batch installation of edge devices
- Secure communication: tunnel-based connections under complex and offline network conditions
- Edge Devices: edge device management
- Waiting Room: selectively accept edge device connection requests to the server
Summary
In my opinion, compared to the Rancher + K3s approach, Portainer’s edge computing solution has the following key advantages:
- Low resource consumption
- Networking model optimized for edge networks
In more detail, these features make it more suitable for edge scenarios than Rancher + K3s:
- Server in the “cloud”, only Agent at the “edge”
- Lightweight Edge Agent — approximately 10MB runtime memory
- Edge Agent pull model — strong adaptability to edge networks
- No additional networking models introduced, no hard DNS dependency
However, the open source edition of Portainer has notable feature gaps. The most significant missing feature is:
- One-click onboarding: batch installation of edge devices
HashiCorp Nomad for Edge Computing

Since version 1.3, HashiCorp Nomad has added many practical features for edge scenarios:
- 1.3 introduced:
  - Nomad native service discovery (eliminating the need for Consul in simple scenarios)
- 1.4 introduced:
  - Health checks
  - Nomad Variables (eliminating the need for Vault in simple scenarios)
- 1.5 introduced:
  - Dynamic node metadata for easier dynamic node management
- 1.6 introduced:
  - Node Pool concept for easier batch node management
These improvements mean that at the edge, there is no longer a dependency on:
- Consul
- Vault
With just the Nomad Agent alone, you can achieve:
- Container orchestration and management
- Basic service discovery and management
- Variables / environment variables / configuration management
Similar to Portainer, it only requires a single Edge Agent. Memory consumption is approximately 20–40 MB.
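As a rough illustration of the agent-only setup described above, a job can register itself through Nomad's native service discovery and target a node pool without Consul or Vault. This is a minimal sketch under stated assumptions, not a tested production spec: the job name, datacenter, node pool, image, port, and health-check path are all hypothetical, and it assumes Nomad 1.6 or later with the Docker driver available on the client.

```hcl
# Hypothetical job: every name, image, and port below is illustrative.
job "sensor-gateway" {
  datacenters = ["edge-dc1"] # assumed datacenter name
  node_pool   = "edge"       # node pools require Nomad 1.6+
  type        = "service"

  group "gateway" {
    count = 1

    network {
      port "http" {
        to = 8080 # port the container is assumed to listen on
      }
    }

    # Native service discovery (Nomad 1.3+): no Consul agent required.
    service {
      name     = "sensor-gateway"
      port     = "http"
      provider = "nomad"

      # Health check for a Nomad-provided service (Nomad 1.4+).
      check {
        type     = "http"
        path     = "/health" # assumed endpoint
        interval = "30s"
        timeout  = "5s"
      }
    }

    task "gateway" {
      driver = "docker"

      config {
        image = "example/sensor-gateway:1.0" # placeholder image
        ports = ["http"]
      }

      resources {
        cpu    = 100 # MHz
        memory = 64  # MB
      }
    }
  }
}
```

Other jobs could then look this service up from a template block with the `nomadService` function, which keeps the whole lookup path inside Nomad itself.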
Nomad supports geographically distant clients, meaning the Nomad server cluster does not need to run near the clients. (K8s simply cannot do this.)
Additionally, disconnected client allocations can reconnect gracefully, handling situations where edge devices experience network latency or temporary connectivity loss.
Two Nomad parameters deserve special mention here:
max_client_disconnect
Without this parameter, Nomad runs its default behavior: when a Nomad client’s heartbeat fails, Nomad marks the client as down and its allocations as lost. Nomad automatically schedules new allocations on another client. However, if the downed client reconnects to the server, it shuts down its existing allocations. This is suboptimal because Nomad stops running allocations on the reconnected client only to place the same allocations again. (K8s behaves the same way, and can only behave this way.)
For many edge workloads — especially those with high latency or unstable network connections — this is disruptive, because a disconnected client doesn’t necessarily mean the client is down. Allocations can continue running on temporarily disconnected clients. For these cases, you need to set the max_client_disconnect parameter to gracefully handle disconnected client allocations.
With max_client_disconnect set, when a client disconnects, Nomad still schedules allocations on another client. However, when the client reconnects:
- Nomad marks the reconnected client as ready.
- If there are multiple job versions, Nomad selects the latest job version and stops all other allocations.
- If Nomad rescheduled the lost allocation to a new client and the new client has a higher node rank, Nomad continues the allocation on the new client and stops all others.
- If the new client has a lower node rank or there is a tie, Nomad resumes the allocation on the reconnected client and stops all others.
This is the preferred behavior for edge workloads with high latency or unstable network connections, especially when disconnected allocations are stateful.
For example:
An edge device is running a web service. The edge device then loses connectivity with the (edge container management) server.
- In K8s, the node enters Unknown or NotReady status. The web service is considered down, and a new instance is started on another edge device. After reconnection, the system finds the latest instance is on the other device, so the original device’s service is shut down. For users of that web service, they may find the service unavailable (shut down by the management plane) after the edge device reconnects.
- In Nomad with this parameter enabled, the node enters disconnected status and the allocated service enters Unknown status. A new web service is started on another edge device. After reconnection, the system finds the web service is still running normally on the original device and shuts down the later-started instance. For users, the experience is uninterrupted.
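The web service scenario above could be expressed roughly as follows. This is a sketch, assuming a Docker-capable client and a one-hour tolerance window; the job name, datacenter, image, ports, and the `1h` value are illustrative choices of mine, not values from a real deployment.

```hcl
# Hypothetical job: names, ports, and the 1h window are assumptions.
job "edge-web" {
  datacenters = ["edge-dc1"]
  type        = "service"

  group "web" {
    count = 1

    # Keep allocations on a disconnected client in "unknown" state for up
    # to one hour instead of treating them as lost right away.
    max_client_disconnect = "1h"

    network {
      port "http" {
        static = 8080 # host port on the edge device
        to     = 80   # port nginx listens on inside the container
      }
    }

    task "nginx" {
      driver = "docker"

      config {
        image = "nginx:alpine"
        ports = ["http"]
      }
    }
  }
}
```

The parameter lives at the group level, so different groups in the same job can tolerate different outage lengths. Newer Nomad releases document a `disconnect` block for the same purpose, so check the job specification reference for the version you run.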
Template change_mode
Another important parameter is change_mode under the Template block.
Set change_mode in the Template block to noop. The default, restart, restarts the task whenever the rendered template changes, and it can cause the task to fail if the client cannot connect to the Nomad server. With noop, if an edge client disconnects from the Nomad server (and thus from service discovery), the service keeps running with the previously rendered template configuration.
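A template block with this setting might look like the sketch below. It assumes an upstream service named `redis` registered through Nomad native service discovery; the job, image, file path, and variable name are hypothetical.

```hcl
# Hypothetical job: the "redis" upstream, image, and paths are assumptions.
job "edge-app" {
  datacenters = ["edge-dc1"]

  group "app" {
    max_client_disconnect = "1h"

    task "app" {
      driver = "docker"

      config {
        image = "example/edge-app:1.0" # placeholder image
      }

      template {
        # Render the upstream address from native service discovery.
        data = <<-EOF
        {{ range nomadService "redis" }}
        REDIS_ADDR={{ .Address }}:{{ .Port }}
        {{ end }}
        EOF

        destination = "local/upstream.env"
        env         = true

        # Do not restart or signal the task when the template re-renders;
        # the task keeps using the values it was started with.
        change_mode = "noop"
      }
    }
  }
}
```

The trade-off is that the task will not pick up a changed upstream address until it is restarted for some other reason, which is usually acceptable for edge workloads that value continuity over freshness.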
Summary
In my opinion, compared to the Rancher + K3s approach, HashiCorp Nomad’s edge computing solution has the following key advantages:
- Low resource consumption — only a Nomad Agent at the edge
- Management server and agents can be physically far apart — edge devices across the globe can be managed from a single cloud center
- Parameters specifically optimized for edge networks — such as max_client_disconnect
In more detail, these features make it more suitable for edge scenarios than Rancher + K3s:
- Server in the “cloud”, only Agent at the “edge”
- Lightweight Nomad Agent — approximately 20–40MB runtime memory
- Heartbeats are initiated by the Nomad Agent toward the server: strong adaptability to edge networks
- Purpose-built max_client_disconnect parameter for edge scenarios
- No additional networking models introduced, no hard DNS dependency
Additionally, compared to the open source edition of Portainer, Nomad has another advantage:
- One-click onboarding: batch installation or even pre-installation of edge devices
Thanks to Nomad’s well-designed architecture, it is inherently built for large-scale container orchestration. It supports one-click batch deployment via Terraform or Ansible, and even pre-installation (flashing). A typical Nomad Agent configuration is only a few lines long.
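As a hedged sketch of how small such a configuration can be, a client-only agent file might look like the following. This is my own illustration rather than a verified listing: the file path, datacenter, server address, node pool, and metadata key are all assumptions.

```hcl
# Hypothetical client config, e.g. /etc/nomad.d/client.hcl
datacenter = "edge-dc1"        # assumed datacenter name
data_dir   = "/opt/nomad/data" # assumed data directory

client {
  enabled   = true
  node_pool = "edge"                     # node pools require Nomad 1.6+
  servers   = ["nomad.example.com:4647"] # placeholder server RPC address

  # Static node metadata that jobs can use in constraints.
  meta {
    site = "warehouse-01"
  }
}
```

Because the file contains nothing device-specific apart from the metadata, the same file can be templated by Ansible or baked into a flashed image, and the agent started with `nomad agent -config=/etc/nomad.d/client.hcl`.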
Conclusion
In the field of IoT edge container cluster management, K8s-based solutions (including Rancher + K3s) are clearly not well-suited. After roughly two years of hands-on experience, I encountered too many pitfalls. The main reasons are:
- High resource consumption
- High network requirements
- Complex networking model
- Poor self-healing capability
- Too many additional components introduced
However, HashiCorp Nomad and Portainer have clear advantages over K8s/K3s in the IoT/edge computing domain. They are worth trying because they are:
- Lightweight: just a single Agent with minimal memory footprint
- Specifically optimized for edge networks
- No complex networking models introduced (no Service Network, Pod Network, Overlay Network… — primarily relying on Host Network, container port mapping, and at most a bridge network), no DNS dependency
- Strong self-healing capability
- No excessive additional components — just one extra Agent
My personal recommendation is to use Nomad (for small-scale or home scenarios, Portainer is a good choice).
That’s all.
If you have better experiences to share, feel free to discuss~