Architectures for Large-Scale IoT Edge Container Cluster Management - 6 - Personal Recommendations

This article was last updated on: June 29, 2026

Previous Articles

Architectures for Large-Scale IoT Edge Container Cluster Management - 0 - Introduction to Edge Containers and Architectures
Architectures for Large-Scale IoT Edge Container Cluster Management - 1 - Rancher + K3s
Architectures for Large-Scale IoT Edge Container Cluster Management - 2 - HashiCorp Solution: Nomad
Architectures for Large-Scale IoT Edge Container Cluster Management - 3 - Portainer
Architectures for Large-Scale IoT Edge Container Cluster Management - 4 - KubeEdge
Architectures for Large-Scale IoT Edge Container Cluster Management - 5 - Summary

│ 📚️Reference:
│ IoT Edge Computing Article Series

Overview

In the previous articles, I listed the following solutions:

- Rancher + K3s
- HashiCorp solution: Nomad + Docker
- Portainer + Docker
- KubeEdge

Among these, Rancher + K3s is a K8s-based and K8s-compatible solution. KubeEdge is built on top of K8s, but its core architecture is an entirely different system. The HashiCorp and Portainer solutions are essentially unrelated to K8s and are primarily Docker-based (they can also work with other runtimes such as Podman).

My edge hardware consists mainly of single-board ARM development boards, and I tested each solution hands-on on that hardware.

After in-depth experience with the two non-K8s container platforms, HashiCorp Nomad and Portainer, it became clear that they are better suited for IoT scenarios than K8s/K3s. (Let’s set KubeEdge aside for now: I personally believe KubeEdge is better suited for more complex, business-coupled, or advanced edge computing systems that require edge AI scheduling.)

K8s is not well-suited for IoT for the following reasons:

- High resource consumption
- High network requirements
- Complex networking model

K3s only (partially) addresses the resource consumption issue, while the latter two problems remain.

│ 📝Notes
│
│ K3s is a fully K8s-compatible distribution. Since version 1.23, as K8s features have grown, K3s has also become increasingly heavy.
│ According to the K3s installation requirements:
│
│ * For arm64 devices, the OS must use a 4K page size.
│ * The minimum CPU/memory is 1 core with 512 MB RAM and the recommendation is 2 cores with 1 GB, but real-world requirements are higher, typically starting at 2 GB RAM.
│ * More critically, regarding storage: the default K8s images needed to start K3s consume roughly 4–6 GB of space. If etcd is used, SD cards (a common storage medium in edge scenarios) simply cannot meet its I/O requirements.
│
│ Regarding the networking model, K3s recently added Tailscale integration to further simplify it. However, K8s’s built-in Host Network / Service IP / Cluster IP layers cannot be avoided.

Both Nomad and Portainer are lighter than K3s. Their network requirements are also low: a single server instance can even manage all devices.

The networking model is simple as well, since host networking is sufficient.

Additionally, both have special optimizations for IoT scenarios, such as unidirectional network connectivity, optimizations for edge disconnection scenarios, and support for managing non-Docker resources.

Problems with the Rancher + K3s Model at the Edge

In practice, the Rancher + K3s edge deployment architecture looks like this:

[Figure: Rancher + K3s Edge Architecture]

[Figure: Rancher + K3s Real-World Deployment]

The real-world deployment makes it clearer:

- 1 set: A Rancher cluster is deployed in the “cloud”, responsible for managing all K3s clusters at the “edge”. The Rancher cluster can also host cloud-side business applications, handling synchronization with edge-side business systems and dispatching data or commands.
- N sets: K3s is installed on “edge” devices, running edge business applications for “endpoint” devices to connect to. 🐾 Each “edge” requires a full K3s cluster, meaning both K3s Workers and a K3s Master.
- “Endpoints” are the outermost edge of business applications, connecting to the “edge” via network to form a business network centered around the “edge”.

What problems does this architecture have in practice?

Edge Storage Capacity

Many edge devices have only 8 GB of storage. After the OS and essential packages are installed, only 4–6 GB remains. K8s also enforces disk-based garbage collection: once disk usage crosses the kubelet’s thresholds, unused images are removed and pods can be evicted. In extreme cases, the K3s master cannot even pull its complete set of images, which causes system errors and K3s startup failures.

Insufficient Edge Storage Performance

Edge storage is primarily SD cards or eMMC. When several K3s workers are attached to one K3s master, the write load on the master’s datastore grows with them, and the master also runs into insufficient storage I/O performance.

High Network Requirements Between K3s Master and Workers at the Edge

Since K3s is fully K8s-compatible, and K8s’s list-watch mechanism is designed for stable data center environments, K3s also demands high network quality between master and workers. That is simply not achievable at the edge. In practice, the following situations occur:

- Master and Worker disconnected for extended periods
- Intermittent connectivity between Master and Worker
- Worker offline for extended periods
- DNS network anomalies
- Master can reach Worker, but Worker cannot reach Master
- Master cannot reach Worker, but Worker can reach Master
- Very low bandwidth between Master and Worker
- IP address changes between Master and Worker
- …

All kinds of network instability are extremely common. Any of the above situations can cause application failures on K3s workers, K3s master failures, or even complete K3s edge cluster failures.

This is the biggest problem with K8s-based edge architectures.

High Network Requirements Between Cloud and Edge, with No Self-Healing After Failures

The cloud and edge are connected through Rancher’s Agent using the WebSocket protocol.

However, in practice, the cloud-to-edge connection also demands high network quality. Managing the edge from the cloud requires large amounts of real-time data synchronization. Network anomalies can cause the edge to go offline with no ability to self-heal and reconnect.

Poor Edge Self-Healing Capability

Due to the incompatibility between complex edge network conditions and K8s architecture, edge self-healing capability is very poor in practice.

When edge anomalies occur, nine times out of ten, manual login to the edge device is required for recovery.

High CPU and Memory Requirements at the Edge

The edge runs a complete K8s cluster, which means all of these services must be running:

- etcd or K3s kine + SQLite
- K8s API Server
- K8s Scheduler
- Various K8s controllers
- CRI: containerd runtime
- CNI: network plugin
- CSI: at minimum a local-path pod
- CoreDNS
- Metrics Server
- Ingress: Traefik
- kubelet
- kube-proxy

Beyond these pods, host-level services are also required:

- iptables or nftables
- Netfilter
- OverlayFS
- …

All these pods and services consume significant CPU and memory resources on edge devices.

Complex Edge Networking Model

Again, this is a problem introduced by K8s. The edge runs a complete K8s/K3s cluster, so the edge networking model naturally includes:

- Host Network
- Service Network
- Cluster Network
- DNS

Implementing the K8s CNI model also introduces overlay networks.

To achieve cloud-edge-endpoint connectivity, you may even need to introduce:

- Tunnel networks
- Edge gateway networks
- …

This further increases network complexity!

It makes troubleshooting extremely difficult — turning simple problems into complex ones.

Summary

With Rancher + K3s, the edge computing topology is:

- 1 set: A Rancher cluster deployed in the “cloud”, managing all K3s clusters at the “edge”. The Rancher cluster can also host cloud-side business applications, handling synchronization with edge-side systems and dispatching data or commands.
- N sets: K3s installed on “edge” devices, running edge business applications for “endpoints” to connect to. 🐾 Each “edge” requires a full K3s cluster with both K3s Workers and a K3s Master.
- “Endpoints” are the outermost edge, connecting to the “edge” via network to form a business network centered around the “edge”.

K3s is a K8s-compatible implementation, and combined with this topology, the approach has the following problems:

- Insufficient edge storage capacity
- Insufficient edge storage performance
- Edge cannot meet the high network requirements between K3s Master and Workers
- Excessively high network requirements between cloud and edge, with no self-healing after failures
- Poor edge self-healing capability
- High CPU and memory requirements at the edge
- Complex edge networking model

And each of these problems is virtually unsolvable.

Edge Computing Requirements for Container Orchestration

Based on the scenarios above, here is my understanding of what edge computing needs from container orchestration:

- Ideally just a single agent at the edge, with no additional images or components
- Lightweight resource consumption: minimal CPU, memory, and storage usage for the agent
- Unified cloud-side management, with no resource-consuming management plane needed at the edge
- Strong adaptability to weak network environments
- Strong self-healing capability
- No additional networking models: ideally runs directly on host networking without a hard DNS dependency

The Docker-based edge container cluster management solutions, HashiCorp Nomad and Portainer, perform much better in these regards.

Portainer for Edge Computing

[Figure: Portainer Architecture]

Portainer has specific optimizations for edge scenarios; see the Edge architecture on the left side of the diagram.

- The “cloud” side handles management, with the Portainer server in the cloud
- The “edge” side runs only Edge Agents
- Cloud and edge don’t require bidirectional communication: the edge periodically pulls information from the server via direct connection or tunnel

These are networking model optimizations that make it more suitable for edge computing scenarios.

Additionally, Portainer’s Agent is extremely lightweight, using only about 10MB of memory. Only this single Agent needs to be deployed at the edge — which is why it can run on edge devices with limited hardware resources. The only dependency is that Docker must be installed on the edge device.

Open Source Edition Edge Features

- Edge Agent default network policy: pull
- Edge Compute modules and menus:
  - Edge Groups: Group edge devices/environments statically or dynamically into Edge Groups for batch management
  - Edge Stacks: Similar to Docker Compose; push a set of edge services to one or all Edge Groups
  - Edge Jobs: Similar to crontab; run batch jobs on edge devices
- Edge emergency feature support:
  - Intel OpenAMT
  - FDO
- Manage edge OS-level filesystem (implemented via Docker Linux socket)

With these features, Portainer can:

- Maintain connectivity and management between edge devices and the Portainer Server through pull heartbeats under complex network conditions
- Achieve batch service and job deployment to edge devices through Edge Groups / Edge Stacks / Edge Jobs

However, the open source edition has significantly fewer features compared to the commercial edition. For example, the open source edition lacks one-click batch onboarding and other important edge features.

Commercial Edition Additional Features

- One-click onboarding: batch installation of edge devices
- Secure communication: tunnel-based connections under complex and offline network conditions
- Edge Devices: edge device management
- Waiting Room: selectively accept edge device connection requests to the server

Summary

In my opinion, compared to the Rancher + K3s approach, Portainer’s edge computing solution has the following key advantages:

- Low resource consumption
- Networking model optimized for edge networks

In more detail, these features make it more suitable for edge scenarios than Rancher + K3s:

- Server in the “cloud”, only Agent at the “edge”
- Lightweight Edge Agent: approximately 10 MB runtime memory
- Edge Agent pull model: strong adaptability to edge networks
- No additional networking models introduced, no hard DNS dependency

However, the open source edition of Portainer has notable feature gaps. The most significant missing feature is:

One-click onboarding: batch installation of edge devices

HashiCorp Nomad for Edge Computing

[Figure: Nomad Edge Reference Architecture]

Since version 1.3, HashiCorp Nomad has added many practical features for edge scenarios:

- 1.3 introduced: Nomad native service discovery (eliminating the need for Consul in simple scenarios)
- 1.4 introduced:
  - Health checks
  - Nomad Variables (eliminating the need for Vault in simple scenarios)
- 1.5 introduced:
  - Dynamic node metadata for easier dynamic node management
- 1.6 introduced:
  - Node Pool concept for easier batch node management

These improvements mean that at the edge, there is no longer a dependency on:

- Consul
- Vault

With just the Nomad Agent alone, you can achieve:

- Container orchestration and management
- Basic service discovery and management
- Variables / environment variables / configuration management
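
To make these three capabilities concrete, here is a minimal job sketch that relies only on the Nomad agent. It is an illustration under assumptions, not a tested production job: the job name edge-web, the datacenter dc1, the nginx:alpine image, the port numbers, and the variable key greeting are all hypothetical, and the variable would have to be created beforehand at the path nomad/jobs/edge-web.

```hcl
job "edge-web" {
  datacenters = ["dc1"] # assumption: default datacenter name

  group "web" {
    network {
      # Plain container port mapping on the host, no overlay network.
      port "http" {
        static = 8080
        to     = 80
      }
    }

    # Native service discovery (Nomad 1.3+): registered in Nomad itself,
    # so no Consul agent is needed on the edge device.
    service {
      name     = "edge-web"
      port     = "http"
      provider = "nomad"

      # Health check for a Nomad-provided service (Nomad 1.4+).
      check {
        type     = "http"
        path     = "/"
        interval = "30s"
        timeout  = "5s"
      }
    }

    task "nginx" {
      driver = "docker"

      config {
        image = "nginx:alpine" # hypothetical workload
        ports = ["http"]
      }

      # Nomad Variables (Nomad 1.4+) rendered as environment variables,
      # covering simple configuration needs without Vault.
      template {
        data        = <<EOT
{{ with nomadVar "nomad/jobs/edge-web" }}GREETING={{ .greeting }}{{ end }}
EOT
        destination = "secrets/app.env"
        env         = true
      }

      resources {
        cpu    = 100
        memory = 64
      }
    }
  }
}
```

The point of the sketch is that orchestration, service registration, health checking, and configuration all live in one job file handled by one agent.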

Similar to Portainer, it only requires a single Edge Agent. Memory consumption is approximately 20–40 MB.

Nomad supports geographically distant clients, meaning the Nomad server cluster does not need to run near the clients. (K8s simply cannot do this.)

Additionally, disconnected client allocations can reconnect gracefully, handling situations where edge devices experience network latency or temporary connectivity loss.

Two Nomad parameters deserve special mention here:

max_client_disconnect

Without this parameter, Nomad runs its default behavior: when a Nomad client’s heartbeat fails, Nomad marks the client as down and its allocations as lost. Nomad automatically schedules new allocations on another client. However, if the downed client reconnects to the server, it shuts down its existing allocations. This is suboptimal because Nomad stops running allocations on the reconnected client only to place the same allocations again. (K8s behaves the same way, and can only behave this way.)

For many edge workloads — especially those with high latency or unstable network connections — this is disruptive, because a disconnected client doesn’t necessarily mean the client is down. Allocations can continue running on temporarily disconnected clients. For these cases, you need to set the max_client_disconnect parameter to gracefully handle disconnected client allocations.

With max_client_disconnect set, when a client disconnects, Nomad still schedules allocations on another client. However, when the client reconnects:

- Nomad marks the reconnected client as ready.
- If there are multiple job versions, Nomad selects the latest job version and stops all other allocations.
- If Nomad rescheduled the lost allocation to a new client and the new client has a higher node rank, Nomad continues the allocation on the new client and stops all others.
- If the new client has a lower node rank or there is a tie, Nomad resumes the allocation on the reconnected client and stops all others.

This is the preferred behavior for edge workloads with high latency or unstable network connections, especially when disconnected allocations are stateful.

For example:

An edge device is running a web service. The edge device then loses connectivity with the (edge container management) server.

- In K8s, the node enters Unknown or NotReady status. The web service is considered down, and a new instance is started on another edge device. After reconnection, the system finds that the latest instance is on the other device, so the service on the original device is shut down. Users of that web service may find it unavailable (shut down by the management plane) just after the edge device reconnects.
- In Nomad with this parameter enabled, the node enters disconnected status and its allocations enter unknown status. A new web service is started on another edge device. After reconnection, the system finds that the web service is still running normally on the original device and shuts down the later-started instance. For users, the experience is uninterrupted.
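
In a job specification, max_client_disconnect is set on the group. The following is a minimal sketch rather than a tested production job: the job name edge-web, the datacenter dc1, the nginx:alpine image, and the one-hour window are all assumptions chosen for illustration.

```hcl
job "edge-web" {
  datacenters = ["dc1"] # assumption: default datacenter name

  group "web" {
    # Tolerate up to one hour without heartbeats from the client.
    # During that window the allocations keep running on the edge
    # device and can reconnect gracefully afterwards.
    max_client_disconnect = "1h"

    task "nginx" {
      driver = "docker"

      config {
        image = "nginx:alpine" # hypothetical workload
      }
    }
  }
}
```

The duration should roughly match the longest outage you expect a device to ride out; once it expires, Nomad falls back to the default behavior described above. As far as I know, newer Nomad releases (1.8 and later) fold this setting into a disconnect block, so check the job specification documentation for your version.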

Template change_mode

Another important parameter is change_mode under the Template block.

Set change_mode in the template block to noop. The default is restart, which causes the task to fail if the client cannot connect to the Nomad server. With noop, if an edge client disconnects from the Nomad server (and thus from service discovery), the service keeps running with the previously rendered template configuration.
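
A template block with this setting might look like the sketch below. It belongs inside a task block; the upstream service name redis and the destination path are assumptions for illustration, and nomadService is the template function for Nomad native service discovery.

```hcl
template {
  # Render the addresses of a service registered with Nomad native
  # service discovery. "redis" is a hypothetical upstream service.
  data        = <<EOT
{{ range nomadService "redis" }}
upstream {{ .Address }}:{{ .Port }}
{{ end }}
EOT
  destination = "local/upstream.conf"

  # Take no action when the rendered content changes.
  # The default is "restart".
  change_mode = "noop"
}
```

The trade-off is that the task will not pick up new upstream addresses on its own, so the application has to re-read the file or be restarted deliberately when that matters.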

Summary

In my opinion, compared to the Rancher + K3s approach, HashiCorp Nomad’s edge computing solution has the following key advantages:

- Low resource consumption: only a Nomad Agent at the edge
- Management server and agents can be physically far apart: edge devices across the globe can be managed from a single cloud center
- Parameters specifically optimized for edge networks, such as max_client_disconnect

In more detail, these features make it more suitable for edge scenarios than Rancher + K3s:

- Server in the “cloud”, only Agent at the “edge”
- Lightweight Nomad Agent: approximately 20–40 MB runtime memory
- The Nomad Agent initiates the heartbeat to the server (pull-based), which adapts well to edge networks
- Purpose-built max_client_disconnect parameter for edge scenarios
- No additional networking models introduced, no hard DNS dependency

Additionally, compared to the open source edition of Portainer, Nomad has another advantage:

One-click onboarding: batch installation or even pre-installation of edge devices

Thanks to Nomad’s well-designed architecture, it is inherently built for large-scale container orchestration. It supports one-click batch deployment via Terraform or Ansible, and even pre-installation (flashing). A typical Nomad Agent configuration is as simple as:

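# Directory where the agent stores its local state
data_dir  = "/opt/nomad/data"
# Listen on all interfaces
bind_addr = "0.0.0.0"

# Run this agent as a client (edge node) only
client {
  enabled = true
  # Replace with the addresses of your Nomad servers
  servers = ["<nomad_server_ip_list>"]
}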
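
For batch onboarding, the same file can also carry the grouping information that the node pool and node metadata features rely on. The variant below is a sketch under assumptions: the server address, the pool name edge, and the metadata key site are hypothetical values that a Terraform or Ansible template would fill in per device.

```hcl
data_dir  = "/opt/nomad/data"
bind_addr = "0.0.0.0"

client {
  enabled = true

  # Hypothetical server address; 4647 is Nomad's default RPC port.
  servers = ["nomad-server.example.com:4647"]

  # Node pool (Nomad 1.6+) used to manage edge nodes as a batch.
  node_pool = "edge"

  # Static node metadata that jobs can reference in constraints.
  meta {
    site = "factory-01"
  }
}
```

A job can then target these devices by setting node_pool = "edge" and, if needed, a constraint on ${meta.site}, so the per-device difference stays limited to one or two templated values.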

Conclusion

In the field of IoT edge container cluster management, K8s-based solutions (including Rancher + K3s) are clearly not well-suited. After roughly two years of hands-on experience, I encountered too many pitfalls. The main reasons are:

- High resource consumption
- High network requirements
- Complex networking model
- Poor self-healing capability
- Too many additional components introduced

However, HashiCorp Nomad and Portainer have clear advantages over K8s/K3s in the IoT/edge computing domain. They are worth trying for the following reasons:

- Lightweight: just a single Agent with a minimal memory footprint
- Specifically optimized for edge networks
- No complex networking models (no Service Network, Pod Network, Overlay Network, and so on; they rely primarily on Host Network, container port mapping, and at most a bridge network) and no hard DNS dependency
- Strong self-healing capability
- No excessive additional components: just one extra Agent

My personal recommendation is to use Nomad (for small-scale or home scenarios, Portainer is a good choice).

That’s all.

If you have better experiences to share, feel free to discuss~


https://e-whisper.com/posts/62154/

Author: east4ming

Posted on: August 26, 2023
