Can AI Replace Ops Engineers?
This article was last updated on the morning of May 17, 2026.
Yes, it can.
With only one prerequisite:
Your company doesn’t adopt a “defensive ops” strategy.
│ 📝 Disclaimer:
│
│ - Artisanal craftsmanship, purely handwritten
│ - This article is 100% written by me manually
│ - This article is NOT AI-generated
Background
The trend of AI + AI IDE/CLI replacing developers is already quite obvious.
As an ops engineer who believes in staying vigilant even in peaceful times, I naturally started thinking seriously 🤔 about this question: Can AI replace ops engineers?
To find out, I handed several real-world cases to AI for execution, including common ops tasks:
- Database migration
- Application upgrade
- Deploying new services
- …
It turned out that I had greatly underestimated AI’s capabilities: the actual outcome was even better than my best-case expectations.
Let’s take a look.
Real-World Cases
Case 1: Upgrading LobeChat from v1 to v2
Introduction to LobeChat
LobeHub (called LobeChat in v1, renamed to LobeHub in v2) is practically tailor-made for tinkerers like us. Honestly, switching back and forth between windows with ChatGPT is too cumbersome. But LobeHub is different — it lets you build your own AI team.
Imagine: you can create an Agent dedicated to writing code, one responsible for document organization, and another to help with data analysis — and they can collaborate with each other! It feels like playing StarCraft, except your “units” are all AIs.
What excites me most is its self-hosting capability. A single Docker command gets the entire service up and running, with data completely under your own control. For those concerned about privacy, this is a godsend.
I mainly use these features: various assistants (roasting people, polishing articles, analyzing international affairs, Wang Yangming’s philosophy of mind teachings, financial management…), plus RAG document resource management capabilities.
If you’re also tired of switching between various AI tools, or want a fully private AI workspace, I strongly recommend trying LobeHub. The 74.4k stars on GitHub aren’t for nothing, and the community is very active.
One-line summary: LobeHub transforms you from “using AI” to “managing an AI team”, fully self-hosted with your data under your own control.
My upgraded LobeChat v2 looks like this:

My Deployment Setup
LobeChat v1 had a Docker Compose deployment option. I rewrote it as K8s Manifests and deployed it on my Homelab. See my previous configuration here: homelab2/apps/lobe-chat at 3855a4c141a4c9cd8c503d891be38a032766bb15 · east4ming/homelab2
What Changed in V2?
│ 📚️ Reference documentation:
│
│ Migrate from v1.x Local Database to … · LobeHub
LobeChat v2 underwent massive changes, making this migration so challenging that even I found it daunting 😖:
- PostgreSQL needed to be upgraded from 16 to 17, and not vanilla PostgreSQL — from pgvector 16 to ParadeDB 17 😱
- Data must not be lost
- LobeChat’s authentication system underwent a major overhaul: switching from NextAuth to Better Auth
- The official documentation (referenced above) is just one page — a descriptive overview rather than detailed step-by-step instructions. And it doesn’t apply to my case since I’m not using Docker Compose for deployment… 😑
AI Enters the Stage
🎉 Despite all these difficulties, with AI’s help, the migration was bumpy but ultimately completed successfully 🎉🎉🎉
I used Claude Code with DeepSeek as the model, called via API. (After trying other models later, I realized DeepSeek isn’t currently in the top tier, but even so it did a great job.) I also used the planning-with-files skill:
task_plan.md → Track phases and progress
findings.md → Store research and findings
progress.md → Session log and test results
I used the planning-with-files skill for three reasons:
- This is a very challenging ops migration task that consumes a large amount of context
- Ops work demands proper planning
- The skill ensures that requirements and design plans exist and, most importantly, that migration progress is tracked task by task in real time, so progress survives even when the model’s context does not
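To make the "document as you go" idea concrete, here is a minimal sketch of a session log that survives a lost context window. It is not the planning-with-files skill's actual implementation: the function names and the entry format are my own assumptions, and only the file name progress.md comes from the skill.

```python
from datetime import datetime, timezone
from pathlib import Path


def log_progress(log_file: Path, phase: str, status: str, note: str = "") -> str:
    """Append one timestamped entry to the session log and return the line written."""
    # Hypothetical entry format: "- [timestamp] phase: status (note)"
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%SZ")
    line = f"- [{stamp}] {phase}: {status}"
    if note:
        line += f" ({note})"
    with log_file.open("a", encoding="utf-8") as handle:
        handle.write(line + "\n")
    return line


def last_entries(log_file: Path, count: int = 5) -> list[str]:
    """Return the most recent entries, e.g. to re-prime a fresh session."""
    if not log_file.exists():
        return []
    lines = log_file.read_text(encoding="utf-8").splitlines()
    return [line for line in lines if line.startswith("- [")][-count:]
```

Because every entry is appended to a file rather than kept in the conversation, a new session can call something like last_entries(Path("progress.md")) and pick up where the previous one stopped.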
planning-with-files
AI first planned these 3 files:
- homelab2/apps/lobe-chat/docs/migration/v2/findings.md at master · east4ming/homelab2
- homelab2/apps/lobe-chat/docs/migration/v2/progress.md at master · east4ming/homelab2
- homelab2/apps/lobe-chat/docs/migration/v2/task_plan.md at master · east4ming/homelab2
I won’t paste the full content here to save readers’ time. If you’re interested, click the links above 👆 to check them out.
findings
Summary of the Lobe Chat v1 to v2 Production Migration Plan
- Core Objectives
- Key Changes and Requirements
- Completed Preparation Work
- Technical Decisions and Risk Mitigation
- Follow-up Plans
progress
Migration Project Summary Report
- Project Overview
- Key Steps and Results
  - Preparation and Assessment (Phase 1)
  - Database Upgrade (Phase 2)
  - Authentication System Migration (Phase 3)
  - Deployment and Verification (Phase 4)
  - Wrap-up and Monitoring (Phase 5)
- Final Status
task_plan
- Task Overview
- Key Progress and Completion Status
- Core Decisions and Considerations
- Environment Information
  - Production domain: west-beta.ts.net
  - Network configuration: Tailscale Ingress and ExternalServices
  - User email
Summary
To sum up, objectively speaking, AI wrote these plans much better than I could have (otherwise I wouldn’t still be an ops engineer 😂), and its considerations were very thorough.
Execution
The execution process followed the planned documentation step by step.
I took a shortcut here: I first manually disabled ArgoCD’s auto-sync feature. Then I had AI modify the K8s manifests and apply them directly with kubectl, or run the PostgreSQL migration commands, without going through a GitOps sync.
In the end, it was indeed completed successfully. 🎉🎉🎉

It also generated additional related documentation:
- homelab2/apps/lobe-chat/docs/migration/v2/生产环境监控检查清单.md at master · east4ming/homelab2 (Production Environment Monitoring Checklist)
- homelab2/apps/lobe-chat/docs/migration/v2/用户反馈收集模板.md at master · east4ming/homelab2 (User Feedback Collection Template)
- homelab2/apps/lobe-chat/docs/migration/v2/用户迁移通知模板.md at master · east4ming/homelab2 (User Migration Notification Template)
- homelab2/apps/lobe-chat/docs/migration/v2/紧急回滚计划.md at master · east4ming/homelab2 (Emergency Rollback Plan)
- homelab2/apps/lobe-chat/docs/migration/v2/迁移总结报告.md at master · east4ming/homelab2 (Migration Summary Report)
Honestly, I could have thought of items 1, 4, and 5, but “User Feedback Collection” and “User Migration Notification” were aspects I genuinely wouldn’t have considered 😂.
👍️ AI Strengths
- AI can complete 90% of the work. The remaining 10% required my intervention, and I don’t think that was AI’s fault: my GitOps repo lacked visibility in certain areas, so AI misunderstood those parts and made misjudgments. More details below
- AI produced very thorough planning
- AI kept documentation updated in real-time before, during, and after the migration (I can only guarantee keeping an Excel todo checklist updated at best)
👎 Weaknesses
- Although most of my deployment information is in Git, there were still gaps, so AI lacked some critical production information (e.g., data backup mechanisms, secrets, data in PVs). This one is my fault
- Although the planning documents repeatedly stressed how important the data was, during execution AI would still casually run commands like rm, delete, and drop. Permissions must therefore be strictly controlled
- AI’s biggest problem at present is that it will deceive. When a step failed, AI would sometimes pretend not to notice, carry on with the following steps, and mark the failed step as ✔️ completed in the documentation. (For example, my database dump restore failed, and it simply skipped it and marked it as complete 😂😂😂)
- The model’s 128K context was exhausted multiple times, which risks losing critical information. You must tell AI to document as it goes
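Two of the weaknesses above, casual destructive commands and steps marked complete without succeeding, can be guarded against mechanically. The following is a minimal sketch under my own assumptions, not something Claude Code provides: the deny-list and the names is_destructive and run_step are hypothetical, and the word matching is deliberately crude.

```python
import re
import subprocess
import sys

# Hypothetical deny-list taken from the commands mentioned above; extend as needed.
DESTRUCTIVE_WORDS = {"rm", "delete", "drop"}


def is_destructive(argv: list[str]) -> bool:
    """Return True if any alphabetic word in the command is on the deny-list."""
    words = re.findall(r"[a-z]+", " ".join(argv).lower())
    return any(word in DESTRUCTIVE_WORDS for word in words)


def run_step(name: str, argv: list[str]) -> str:
    """Run one step and return a checklist line that reflects the real exit code."""
    if is_destructive(argv):
        # Never executed automatically: a human has to approve and run it.
        return f"- [ ] {name} (BLOCKED: needs human approval)"
    result = subprocess.run(argv, capture_output=True, text=True)
    if result.returncode == 0:
        return f"- [x] {name}"
    # A failed step stays unchecked, so it cannot be silently marked complete.
    return f"- [ ] {name} (FAILED: exit code {result.returncode})"


if __name__ == "__main__":
    # A step that fails on purpose, standing in for the failed dump restore.
    print(run_step("Restore database dump", [sys.executable, "-c", "raise SystemExit(3)"]))
```

The word match is conservative and will also block harmless commands such as a flag named --rm. For a production environment I consider that the right trade-off, but real permission control (read-only credentials, RBAC) is still the stronger safeguard.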
🫣 Detailed records are here:
Merge pull request ‘feat(lobe-chat): 实现v2迁移准备工作’ (#312) from lobehub-… · east4ming/homelab2@ba0d16c

Summary
AI spent one full day and ¥6.50 to successfully complete this task.

Case 2: Deploying a New Service — Online Documentation Website
Compared to the previous task, this one was simpler, and AI performed even better: 100% completion with zero human intervention!
Task Overview
This was a project at my company. I have a gitops-monitor repo at work that contains everything I’m responsible for in ops. I wanted to generate an online documentation website using MkDocs, based on the Markdown documents in my repo, with bilingual Chinese-English support.
This time I switched to using the company-provided Kiro IDE.
AI Enters the Stage
Kiro Specs
Kiro Specs first generated 3 documents:
Requirements → Design → Tasks
- Requirements: equivalent to the refined requirements you produce after business/leadership raises a need
- Design: equivalent to your migration plan
- Tasks: equivalent to the actual Excel task checklist used during migration
Requirements
Requirements document contents:
- Introduction
- Glossary
- Requirements
  - MkDocs Project Configuration
  - Chinese-English Bilingual Support
  - Git Branch Management
  - Kubernetes Deployment Manifests
  - AWS ALB Ingress Configuration
  - MkDocs Static Files and Deployment Process
Here you can see it refined my vague description into specific, concrete requirements. And each requirement has: user stories and acceptance criteria. 👍️
Design
Technical Design Document
- Overview
- Architecture (here AI drew a flowchart 👍️)
- Component Design
- MkDocs Configuration — mkdocs.yml
- Kubernetes Deployment Manifests — planned to write a mkdocs Helm chart (including: Deployment, PVC, ConfigMap, Service, Ingress)
- ArgoCD Auto-discovery (AI analyzed and found that ArgoCD uses auto-discovery, so no additional ArgoCD configuration was needed)
- Build and Deployment
- i18n Bilingual Implementation
- Git Branch Strategy (feature branch for development, merged to main branch via PR for deployment)
- Implementation Tasks
Tasks
Implementation Plan
- Overview
- Task Checklist (lists 7 major tasks plus subtasks, notes the dependencies between tasks, and checks each item off as - [x] in real time when it completes)
- Notes
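Because the task checklist is plain Markdown with - [ ] and - [x] checkboxes, its progress is easy to verify independently of what AI reports. Here is a minimal sketch; the sample task titles are invented for illustration and are not the real contents of my Tasks document.

```python
import re

# Matches "- [ ] title" or "- [x] title", with optional indentation for subtasks.
CHECKBOX = re.compile(r"^\s*[-*] \[( |x|X)\] (.+)$")


def parse_tasks(markdown: str) -> list[tuple[bool, str]]:
    """Return (done, title) pairs for every checkbox line in the document."""
    tasks = []
    for line in markdown.splitlines():
        match = CHECKBOX.match(line)
        if match:
            tasks.append((match.group(1).lower() == "x", match.group(2).strip()))
    return tasks


def pending(markdown: str) -> list[str]:
    """Return the titles of tasks that are not checked off yet."""
    return [title for done, title in parse_tasks(markdown) if not done]


def summarize(markdown: str) -> str:
    """Return a one-line progress summary."""
    tasks = parse_tasks(markdown)
    done = sum(1 for finished, _ in tasks if finished)
    return f"{done}/{len(tasks)} tasks complete"


# Hypothetical checklist in the same style as the Tasks document.
SAMPLE = """\
- [x] 1. Configure mkdocs.yml
  - [x] 1.1 Add bilingual navigation
- [ ] 2. Write the Helm chart
"""

if __name__ == "__main__":
    print(summarize(SAMPLE))
    print(pending(SAMPLE))
```

A check like this only counts checkboxes. It says nothing about whether a checked task actually succeeded, so it complements a review of the results rather than replacing it.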
Execution
The coding phase doesn’t need much elaboration — this is AI’s forte. It wrote:
- mkdocs.yml
- Helm chart
It completed the task excellently with zero human intervention. 🎉🎉🎉
👍️ Strengths
- AI can complete 100% of the work with zero human intervention
- AI produced very thorough planning
- AI kept documentation updated in real-time before, during, and after the deployment
👎 Weaknesses
- None
Summary
This task was simpler, the repo had complete information, and the model used was stronger. AI perfectly 👐 completed the task. Zero weaknesses.
Answering the Question
Q: Can AI replace ops engineers?
A: Yes. (Not partially — fully, 100%.)
With only one prerequisite:
Your company doesn’t adopt a “defensive ops” strategy.
What Does “Defensive Ops” Mean?
Any ops anti-pattern counts, for example:
- Ops code is invisible (not in a Git repo, no CMDB, no change records)
- Configuration drift (your ops information is visible but inaccurate compared to the actual production environment)
- Silos (your ops is an island — a legacy system, a relic of a bygone era, an antique, a bizarre and strange existence)
- Architectural chaos (your ops has no architecture, no well-designed architecture, no stable and immutable architecture, no robust architecture)
- Unstructured
- Non-standardized (can only click through UIs, no standard API interfaces)
- Unobservable
- Pure manual craftsmanship — no automation, no IaC or GitOps, not even configuration management tools like Ansible
- No single source of truth for ops information
- Environment inconsistency (dev, test, UAT, performance, and production are inconsistent)
- No version control
- Ops information is unreadable and incomprehensible
- …
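Configuration drift, one of the anti-patterns listed above, is the easiest to illustrate in code: compare what is declared in Git with what is actually running. The sketch below works on plain dictionaries. How you obtain them (for example by parsing manifests and querying the cluster) is left out, and the function name find_drift and the sample values are my own assumptions.

```python
from typing import Any


def find_drift(declared: dict[str, Any], actual: dict[str, Any], prefix: str = "") -> list[str]:
    """Compare declared state (Git) with actual state (production) and list differences."""
    drift = []
    for key in sorted(set(declared) | set(actual)):
        path = f"{prefix}.{key}" if prefix else key
        if key not in actual:
            drift.append(f"{path}: declared but missing in production")
        elif key not in declared:
            drift.append(f"{path}: present in production but not declared")
        elif isinstance(declared[key], dict) and isinstance(actual[key], dict):
            # Recurse into nested sections such as env or resources.
            drift.extend(find_drift(declared[key], actual[key], path))
        elif declared[key] != actual[key]:
            drift.append(f"{path}: declared {declared[key]!r}, actual {actual[key]!r}")
    return drift


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    declared = {"image": "app:2.0", "replicas": 2, "env": {"AUTH": "better-auth"}}
    actual = {"image": "app:2.0", "replicas": 3, "env": {"AUTH": "better-auth", "DEBUG": "1"}}
    for item in find_drift(declared, actual):
        print(item)
```

An empty result means the repo can be trusted as the single source of truth. Every line it prints is a piece of production knowledge that AI cannot see by reading the repo.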
Conclusion
I believe, through the two cases above 👆, we can clearly arrive at this answer: AI can replace ops engineers.
- For complex migration work, AI spent one day and ¥6.50 to complete the job
- For moderately difficult new service deployment, AI completed the work 100% flawlessly with zero human intervention
- A Coding Plan subscription costs about $20 a month and can replace an ops colleague, whose labor cost to the company runs into the tens of thousands
- AI writes better documentation
- AI updates status more promptly
- AI considers things more comprehensively
- AI produces better upward management reports
- …
Finally, here’s a funny meme to lighten the mood~

That said, we ops engineers don’t need to feel anxious or worried.
I’m reminded of Liu Yuxi’s famous verse: “A thousand sails pass by the sunken ship; ten thousand trees thrive before the withered one.”
Technological iteration has never been about elimination — it’s about ecosystem renewal. AI is like those thousand sails, like those thriving trees. What it replaces is only the repetitive “operations,” but it can never replace the system intuition that ops engineers have honed through countless late-night incidents, the architectural wisdom grown in the cracks between business and technology, and the sense of responsibility that says “dare to restart, dare to take the blame” 🤨 when facing unknown risks.
Look at it from another angle: the incident database you’ve accumulated over the years, the topology diagrams you’ve drawn by hand, the “sixth sense” that lets you pinpoint root causes instantly amid a flood of alerts — these are experiential assets that data alone cannot fully replicate. AI will ultimately be your “super collaborator” — hand off the tedious work and let yourself focus more on creation and decision-making.
So, let’s encourage ourselves with another verse: “Heaven has endowed me with talents that must be put to use; gold scattered far and wide will all come back to me.”
Your “talents” have never been just commands and scripts — they are your ability to navigate complex systems with ease, and your soft skills in building bridges between technology and people. These are things AI cannot learn, nor take away.
The value of ops engineers lies not in the tools, but in the composure shown every time a crisis is averted.
Let us take heart together.