The Ultimate Guide to Engineering Design Docs
How to write RFCs and Design Docs that save weeks of coding time. Structure, examples, and best practices.
The most expensive mistake in software engineering is writing weeks of code for the wrong solution. The discipline of writing a Design Document (also known as an RFC, or “Request for Comments”) is the antidote to this common failure mode.
“Code is easy. It’s the thinking that’s hard.”
A Design Doc is not documentation; it is a tool for thinking.
Why Write Them?
- Async Consensus: Avoiding 2-hour meetings where nothing is decided.
- Historical Context: 6 months from now, you will ask, “Why on earth did we choose NoSQL?” The doc answers that.
- Force Multiplier: It allows senior engineers to scale their impact by reviewing architecture without reading every line of code.
The Anatomy of a Perfect Design Doc
A great design doc follows a predictable structure.
1. Context & Scope
- Objective: A 2-sentence summary of what we are building.
- Background: Why are we doing this? Link to product specs or tickets.
- Non-Goals: Explicitly state what you are not doing. This prevents scope creep.
- Success Criteria: How will we measure if this design worked? (e.g., “P99 latency stays under 200ms at 10k RPS”).
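A success criterion is most useful when it can be checked mechanically. The sketch below is one hedged way to test the example target above against a list of measured latencies. The function names, the nearest-rank percentile method, and the 200 ms default are illustrative assumptions, not part of any particular monitoring tool.

```python
def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100.

    Returns the smallest sample such that at least pct% of all samples
    are less than or equal to it.
    """
    if not samples:
        raise ValueError("need at least one sample")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, which avoids float rounding issues.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


def meets_latency_target(latencies_ms, target_ms=200.0, pct=99):
    """True if the chosen percentile stays under the target (hypothetical check)."""
    return percentile(latencies_ms, pct) < target_ms
```

Writing the criterion down in this form forces the doc to say which percentile, which threshold, and which load level count as success.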
2. The Proposed Design
This is the meat of the document.
- High-Level Architecture: A diagram (Mermaid.js or any other system-diagram tool) showing how components interact.
- API Design: Define the endpoints (GET /users/:id), payloads, and error codes.
- Data Model: Schema definitions, table relationships, and storage choices.
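As a rough illustration of what "endpoints, payloads, and error codes" can look like when written down precisely, here is a minimal sketch that describes the `GET /users/:id` endpoint as data and renders it as text for the doc. The response fields, error codes, and the `Endpoint`, `ErrorCode`, and `render` names are all hypothetical, chosen only for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorCode:
    status: int    # HTTP status code
    code: str      # machine-readable error identifier (hypothetical)
    meaning: str   # human-readable explanation


@dataclass
class Endpoint:
    method: str
    path: str
    response_fields: dict   # field name -> type description
    errors: tuple           # ErrorCode entries the endpoint can return


# Hypothetical contract for the endpoint mentioned in the text.
GET_USER = Endpoint(
    method="GET",
    path="/users/:id",
    response_fields={"id": "string", "email": "string", "created_at": "timestamp"},
    errors=(
        ErrorCode(401, "unauthenticated", "Missing or invalid credentials"),
        ErrorCode(404, "user_not_found", "No user exists with that id"),
    ),
)


def render(endpoint):
    """Render the contract as plain text suitable for pasting into a doc."""
    lines = [f"{endpoint.method} {endpoint.path}", "Response fields:"]
    lines += [f"  {name}: {kind}" for name, kind in endpoint.response_fields.items()]
    lines.append("Errors:")
    lines += [f"  {e.status} {e.code}: {e.meaning}" for e in endpoint.errors]
    return "\n".join(lines)
```

Listing the error cases next to the happy path tends to surface questions that reviewers would otherwise raise only after the code exists.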
3. Alternatives Considered (The Most Important Section)
This is where you prove you did your homework. Don’t just list your choice; list the other valid ways to do it and why you rejected them.
| Approach | Pros | Cons | Verdict |
|---|---|---|---|
| Option A: Redis | Fast, simple | Data loss risk | Rejected |
| Option B: Postgres | Reliable, relational | Slower writes | Selected |
4. Operational Excellence
Often ignored, but critical for keeping the system running.
- Rollback Plan: If we deploy this and the database CPU spikes to 100%, how do we undo it immediately?
- Scalability: How does this design handle a 10x traffic spike? Where is the first bottleneck?
- Cost Estimate: Will this increase our cloud bill significantly? (e.g., “Adding 5TB of SSD storage = +$300/mo”).
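The scalability and cost questions above reduce to back-of-the-envelope arithmetic that is worth showing in the doc. The sketch below assumes a price of $0.06 per GB-month, a figure chosen only because it reproduces the $300/mo example (it is not a quoted cloud price), and uses made-up component capacities to find which component saturates first under a 10x spike.

```python
def monthly_storage_cost(terabytes, usd_per_gb_month):
    """Monthly cost in USD, assuming 1 TB = 1000 GB."""
    return terabytes * 1000 * usd_per_gb_month


def first_bottleneck(capacity_rps, current_rps, multiplier=10):
    """Return (component, utilization) for the most loaded component.

    capacity_rps maps a component name to the requests per second it can
    sustain. A utilization above 1.0 means the component is over capacity
    at the projected load.
    """
    projected = current_rps * multiplier
    utilization = {name: projected / cap for name, cap in capacity_rps.items()}
    worst = max(utilization, key=utilization.get)
    return worst, utilization[worst]


# Hypothetical numbers for illustration only.
cost = monthly_storage_cost(terabytes=5, usd_per_gb_month=0.06)
component, load = first_bottleneck(
    {"api": 50_000, "db": 8_000, "cache": 200_000}, current_rps=1_000
)
```

With these assumed inputs, the database is the first component to exceed its capacity, which is the kind of finding the Scalability bullet asks the author to state explicitly.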
5. Cross-Cutting Concerns
Do not skip these.
- Security & Privacy: AuthZ/AuthN, PII handling. Vital in 2026: Include data residency, privacy regulations (e.g., GDPR, CCPA), and AI compliance.
- Observability: What metrics will we track? How do we know it’s broken?
- Migration: How do we move from the old system to the new one with zero downtime?
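The article does not prescribe a migration technique. One common approach to the zero-downtime question is dual writing: write to both systems, backfill the old data, then switch reads. The sketch below illustrates that idea with plain dictionaries standing in for the two stores; the `DualWriteStore` class and `backfill` function are hypothetical names, and a real migration would also need to handle failures and concurrent writes.

```python
class DualWriteStore:
    """Writes go to both stores; reads come from whichever is active."""

    def __init__(self, old, new, read_from_new=False):
        self.old = old
        self.new = new
        self.read_from_new = read_from_new

    def put(self, key, value):
        # Phase 1: keep both systems in sync for all new writes.
        self.old[key] = value
        self.new[key] = value

    def get(self, key):
        # Phase 3: flip read_from_new once the backfill is verified.
        source = self.new if self.read_from_new else self.old
        return source[key]


def backfill(old, new):
    """Phase 2: copy records that predate dual writing.

    Keys already present in the new store are skipped so that fresher
    dual-written values are never overwritten. Returns the number copied.
    """
    copied = 0
    for key, value in old.items():
        if key not in new:
            new[key] = value
            copied += 1
    return copied
```

Because reads switch with a single flag, flipping `read_from_new` back to `False` also serves as a simple rollback path for as long as dual writes stay enabled.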
A Template for You
Feel free to copy this markdown for your next RFC.
# [RFC] Title of Feature
## Summary
Brief explanation...
## Motivation
Why are we doing this?
## Detailed Design
### API
### Database
## Alternatives Considered
1. ...
2. ...
## Security & Privacy
...
The Review Process
Sending the doc is just the beginning.
- Rule of 24h: Reviewers should aim to provide feedback within 24 hours.
- Comment on Logic, Not Grammar: Focus on race conditions, scalability bottlenecks, and data integrity.
- Resolve Conflicts: If a comment thread gets too long, hop on a call to resolve it, then document the decision back in the doc.
Writing is the highest leverage skill for a software engineer. Master the design doc, and you master the ability to influence technical direction.