Failure Isn't Shameful, Hiding It Is: Lessons from Cloudflare's Crisis Communication
This article was last updated on: May 17, 2026
Honestly, when I first saw Cloudflare’s blog post “Code Orange: Fail Small is complete,” my initial reaction was 🤔 — is this about technical transparency or crisis PR? After reading it through, I was impressed. These two things are really just two sides of the same coin. Today let’s talk about how Cloudflare wins trust through “proactive disclosure,” and what we ops and cloud service folks can learn from it.
By the way, I’ve previously discussed SRE post-incident review culture on my blog (ewhisper.cn), and this Cloudflare case makes a perfect benchmark.
From “Covering Up” to “Proactive Disclosure” — Which Actually Works Better?
When traditional PR encounters an outage, what’s the first instinct? Cover it up and sweep it under the rug. The usual playbook: “Some services are experiencing anomalies,” “Our technical team is urgently working on a fix,” “We sincerely apologize for any inconvenience.” And then? Nothing. Users are left confused: What actually went wrong? When will it be fixed? Will it happen again? Trust shatters instantly.
But Cloudflare is different. Their “Code Orange: Fail Small” project has a simple core philosophy:
Proactively discover and fix small-scale failures before they escalate into large-scale incidents; after the project is complete, publicly release a full report laying out the technical details, improvement measures, and even lessons from failures for users to see.
How bold is this move? It’s essentially “self-detonating” PR, but the results are surprisingly effective — no hiding, no sugarcoating, and trust is actually strengthened.
Breaking Down the Three-Move Strategy: Transparency, Speed, Continuity
Cloudflare’s blog post looks like a technical report on the surface, but it’s actually a textbook-level crisis communication playbook. I’ve summarized it into three pillars — let’s look at each one:
1. Transparent Communication: Admitting Shortcomings Is the First Step to Regaining Trust
In their report, Cloudflare directly listed previous shortcomings in network resilience — for example, insufficient disaster recovery capabilities in certain regions and incomplete failure isolation mechanisms for certain components. They didn’t dodge the issues but candidly acknowledged:
“Yes, we didn’t do well enough before. Now we’re going to fix it.”
📝 Notes: This is the same as our daily post-incident reviews: don’t try to shift blame, don’t try to “control the narrative.” The more you try to cover things up, the more users suspect something’s wrong. Learning to sincerely say “we’re sorry” beats everything else.
2. Rapid Response: Not “Monday Morning Quarterbacking,” but “Getting Ahead of the Problem”
The core of “Code Orange” isn’t post-hoc remediation — it’s proactive offense. Cloudflare’s engineering team spent over two quarters specifically identifying and fixing potential “small failures” in their systems — things like individual nodes with insufficient capacity, configuration errors, and route flapping that could be caught and fixed early.
This approach is called “Fail Small”: rather than waiting for problems to accumulate into a massive outage, proactively surface small-scale failures, deliberately triggering them where necessary, and fix them quickly. It shares its philosophy with chaos engineering, which is widely practiced in the Kubernetes world but is not specific to it.
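To make the idea of deliberately triggering small failures concrete, here is a minimal fault-injection sketch in Python. It is only an illustration of the general chaos-engineering idea, not Cloudflare's tooling; the names `InjectedFault`, `with_fault_injection`, and `run_experiment` are hypothetical and invented for this example.

```python
import random


class InjectedFault(Exception):
    """Raised when the wrapper deliberately fails a call (hypothetical name)."""


def with_fault_injection(func, failure_rate, rng):
    """Wrap func so that roughly failure_rate of calls raise InjectedFault."""
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")

    def wrapper(*args, **kwargs):
        # Fail before calling the real function, simulating an unavailable dependency.
        if rng.random() < failure_rate:
            raise InjectedFault(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)

    return wrapper


def run_experiment(func, calls, failure_rate, seed=0):
    """Call func `calls` times under fault injection and count the outcomes."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    wrapped = with_fault_injection(func, failure_rate, rng)
    outcome = {"ok": 0, "failed": 0}
    for _ in range(calls):
        try:
            wrapped()
            outcome["ok"] += 1
        except InjectedFault:
            outcome["failed"] += 1
    return outcome


if __name__ == "__main__":
    # A stand-in for a real dependency call; fail about 5% of requests.
    print(run_experiment(lambda: "pong", calls=1000, failure_rate=0.05))
```

Keeping `failure_rate` low and the wrapped scope narrow is the point: the failure stays small, so whatever it exposes (missing retries, noisy alerts, slow failover) can be fixed before a real outage finds it.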
3. Continuous Improvement: Public Commitments That Force Execution
In their report, Cloudflare didn’t just list improvement measures — they made a commitment: “We will continuously monitor these improvements and regularly publish transparency reports.” This is essentially putting the tough talk upfront: if something goes wrong in the future, you have every right to hold us accountable.
This “public commitment” mechanism forces the team to actually follow through on improvements. Many companies do post-incident reviews where the reports are beautifully written, but the same incidents keep happening. Cloudflare’s public commitment approach is like putting a “binding spell” on themselves — if you don’t improve, users can point to the report and call you out.
Comparison: Traditional “Narrative Control” vs. Cloudflare’s “Proactive Disclosure”
| Dimension | Traditional PR | Cloudflare’s Proactive Disclosure |
|---|---|---|
| Attitude | Evasive, stonewalling | Candid, straightforward, proactively published |
| Timing | Post-hoc remediation, usually delayed | Pre-emptive prevention + real-time disclosure + post-event summary |
| User Perception | “This company is unreliable, hiding problems” | “This company is honest and trustworthy” |
| Long-term Effect | Trust erodes continuously, users churn | Trust solidifies, user loyalty increases |
| Examples | Most internet companies’ outage boilerplate statements | Cloudflare’s “Code Orange” report |
The conclusion is obvious: in an era of ever-increasing information transparency, “narrative control” — that old 2010-era playbook — no longer works. Cloudflare’s “proactive disclosure” strategy actually builds deeper trust.
What Does This Mean for Us?
Honestly, those of us in operations may not directly face end users, but we do face business stakeholders, product managers, and executives — and they’re “users” too. Their trust needs to be cultivated just the same.
So, what can we do?
- Establish an “Operations Transparency Report” mechanism: Every quarter or month, send business stakeholders an “Operations Health Report” (metrics-based), including:
  - SLO/SLI compliance rates for key services
  - Post-incident review summaries for recent failures (highlights, shortcomings, improvement measures)
  - Reliability improvement plans for the next phase
- Promote a “Post-Incident Review Culture”: Don’t be afraid to expose problems. Reviews are not about assigning blame; they’re about finding systemic root causes. Like Cloudflare, make review reports public (at least to business stakeholders), so everyone knows: we see the problems, and we’re fixing them.
- Establish open communication channels with business stakeholders: For example, a dedicated “Operations Monthly Report” internal newsletter / knowledge base column, or hold regular “Operations Open Houses.” Proactively tell them what we’re doing, what challenges we’re facing, and how we plan to solve them.
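The SLO/SLI compliance rates mentioned in the first item above can be computed from plain request counts. The following is a rough sketch under stated assumptions: the SLI is the fraction of "good" requests in a reporting window, and the error budget is the share of allowed bad requests not yet used up. `ServiceWindow`, `sli`, `error_budget_remaining`, and `report_line` are hypothetical names, and the numbers in the usage section are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class ServiceWindow:
    """Request counts for one service over one reporting window (hypothetical shape)."""
    name: str
    total_requests: int
    good_requests: int
    slo_target: float  # e.g. 0.999 for "99.9% of requests succeed"


def sli(window):
    """Measured fraction of good requests; an idle service counts as fully compliant."""
    if window.total_requests == 0:
        return 1.0
    return window.good_requests / window.total_requests


def error_budget_remaining(window):
    """Share of the error budget still unspent; negative means the budget is overspent."""
    allowed_bad = (1.0 - window.slo_target) * window.total_requests
    actual_bad = window.total_requests - window.good_requests
    if allowed_bad == 0:
        return 0.0 if actual_bad else 1.0
    return 1.0 - actual_bad / allowed_bad


def report_line(window):
    """One line for the health report: measured SLI, target, and budget left."""
    status = "met" if sli(window) >= window.slo_target else "missed"
    return (
        f"{window.name}: SLI {sli(window):.4%} vs SLO {window.slo_target:.2%} "
        f"({status}), error budget left {error_budget_remaining(window):.0%}"
    )


if __name__ == "__main__":
    # Illustrative numbers only, not real measurements.
    for w in (
        ServiceWindow("checkout-api", 1_000_000, 999_500, 0.999),
        ServiceWindow("search-api", 1_000_000, 998_000, 0.999),
    ):
        print(report_line(w))
```

Reporting the remaining error budget next to the raw compliance rate gives stakeholders a sense of how close a service is to missing its target, not just whether it did.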
Honestly, most teams practice “defensive operations”: they only fight fires when things break, then pretend nothing happened once the fire’s out. This model will inevitably lead to a “trust collapse.” What Cloudflare is doing is the opposite, a kind of “proactive PR”: trading transparency for trust, trading openness for credibility.
Final Thoughts
“Transparency is the new objectivity.” — David Weinberger
This applies to technology iteration, and it applies equally to rebuilding trust. Failure isn’t scary — what’s scary is failing and then being stubborn about it, still trying to cover it up. Cloudflare’s “Code Orange” tells us: failure is actually pretty cool (if you learn something from it).
I hope you too can learn to “Fail Small” like Cloudflare and win “Big Trust.”
Let’s encourage each other.
That’s all.