Bug Investigation Tutorial¶
Systematically investigate bugs using rp1's hypothesis-driven debugging workflow. This tutorial walks you through diagnosing a production issue from symptoms to root cause.
Time to complete: ~20-30 minutes
What You'll Learn¶
- How rp1's bug investigation command works
- The hypothesis-driven approach to debugging
- Evidence gathering and root cause analysis
- Using knowledge base context for codebase understanding
- Interpreting investigation reports
Prerequisites¶
Before You Begin
- rp1 installed (Quick Start)
- Knowledge base generated (
/knowledge-build) - A bug or issue you want to investigate
The Scenario¶
We'll investigate an API timeout issue in a web application. This example was chosen because:
- Timeout bugs are common in production systems
- They require understanding multiple layers (network, database, application)
- The root cause often isn't obvious from the symptoms
Symptoms: Users report that the /api/orders endpoint intermittently returns 504 Gateway Timeout errors, typically during peak hours.
The Investigation Workflow¶
flowchart LR
S[Symptoms] --> H[Hypotheses]
H --> E[Evidence]
E --> A[Analysis]
A --> R[Root Cause]
| Phase | Purpose | Output |
|---|---|---|
| Symptoms | Define what you're seeing | Problem description |
| Hypotheses | Generate possible causes | Ranked hypothesis list |
| Evidence | Gather supporting data | Code analysis, logs, patterns |
| Analysis | Evaluate each hypothesis | Confirmation or elimination |
| Root Cause | Identify the actual cause | Investigation report |
Step 1: Start the Investigation¶
Launch the bug investigator with a description of the issue:
What happens:
rp1's bug investigator agent:
- Loads your knowledge base context
- Analyzes the symptom description
- Searches your codebase for relevant code
- Generates initial hypotheses
What to expect:
๐ Bug Investigation Started
Problem: API timeout on /api/orders endpoint during peak hours
Loading knowledge base context...
โ Architecture understood
โ Relevant modules identified
Searching codebase for relevant code...
โ Found: src/api/orders.ts
โ Found: src/services/orderService.ts
โ Found: src/db/queries/orderQueries.ts
Generating hypotheses...
Checkpoint
The investigator should identify files related to the orders endpoint. If it's searching in wrong locations, ensure your knowledge base is up to date.
Step 2: Review Hypotheses¶
The investigator generates ranked hypotheses based on your codebase:
What to expect:
๐ Hypotheses Generated (4)
H1 [HIGH]: Database query N+1 problem
- Evidence: orderQueries.ts uses sequential queries in a loop
- Confidence: 75%
- Would explain: Load-dependent timeouts
H2 [MEDIUM]: Missing database index
- Evidence: orders table queried by customer_id without index check
- Confidence: 60%
- Would explain: Slow queries during high volume
H3 [MEDIUM]: Connection pool exhaustion
- Evidence: Pool size set to 10, no queue configuration
- Confidence: 55%
- Would explain: Intermittent failures under load
H4 [LOW]: External API call blocking
- Evidence: orderService calls inventory API synchronously
- Confidence: 40%
- Would explain: Variable response times
Understanding confidence scores:
- 75%+: Strong evidence in codebase, likely cause
- 50-74%: Supporting evidence, needs verification
- Below 50%: Possible but less likely given available evidence
Step 3: Evidence Gathering¶
The investigator gathers evidence for each hypothesis by examining your code:
What to expect:
๐ Gathering Evidence for H1: Database query N+1 problem
Analyzing: src/db/queries/orderQueries.ts
Found pattern at lines 45-52:
```typescript
async function getOrdersWithItems(customerIds: string[]) {
const orders = await db.query('SELECT * FROM orders WHERE customer_id IN (?)', [customerIds]);
// N+1 pattern detected: loop with individual queries
for (const order of orders) {
order.items = await db.query('SELECT * FROM order_items WHERE order_id = ?', [order.id]);
}
return orders;
}
Evidence strength: STRONG - Sequential queries inside loop (line 48-50) - No batch loading of order items - Query count = 1 + N (where N = number of orders)
!!! info "How Evidence is Gathered"
The investigator uses your knowledge base to understand your architecture, then performs targeted code searches. It examines actual code patterns, not just file names.
---
## Step 4: Hypothesis Testing
The investigator systematically tests each hypothesis:
**What to expect:**
H1 [Database N+1]: CONFIRMED โ - Code shows sequential queries in loop - 100 orders = 101 database queries - Peak hours have ~500 orders/minute - Estimated query load: 50,000 queries/minute at peak
H2 [Missing Index]: PARTIAL โ - Index exists on customer_id (found in migrations) - But compound index (customer_id, created_at) missing - May contribute but not primary cause
H3 [Connection Pool]: UNLIKELY โ - Pool size is 10, but connection time is fast - Pool would reject, not timeout - Error pattern doesn't match pool exhaustion
H4 [External API]: ELIMINATED โ - Inventory API call is after order fetch - Timeout occurs before this code path - Confirmed via error timestamps
---
## Step 5: Root Cause Report
The investigation concludes with a comprehensive report:
**What to expect:**
Problem: API timeout on /api/orders during peak hours
ROOT CAUSE: Database N+1 Query Pattern
Location: src/db/queries/orderQueries.ts:45-52
Summary: The getOrdersWithItems function executes 1 query to fetch orders, then N additional queries to fetch items for each order. During peak hours (~500 orders/minute), this generates ~50,000 queries/minute, overwhelming the database connection and causing timeouts.
Contributing Factor: Missing compound index on (customer_id, created_at) causes table scans on the initial orders query.
Recommended Fixes:
-
PRIMARY: Refactor to batch query
-
SECONDARY: Add compound index
Verification Steps: 1. Add query logging to confirm query count 2. Test with production-like data volume 3. Monitor response times after fix
Files to Modify: - src/db/queries/orderQueries.ts (primary fix) - migrations/xxx_add_compound_index.sql (secondary fix)
Output: .rp1/work/investigations/api-timeout-orders/report.md ```
Checkpoint
The investigation report should identify a clear root cause with specific file locations and line numbers. If the report is vague, provide more context about your symptoms.
Acting on Findings¶
After the investigation, you have several options:
Option A: Quick Fix¶
If the fix is straightforward, implement it directly:
bash
/code-quick-build "Fix the N+1 query in orderQueries.ts by batching order_items fetch"
bash
/rp1-dev/code-quick-build "Fix the N+1 query in orderQueries.ts by batching order_items fetch"
Option B: Formal Feature¶
For larger fixes, use the feature workflow:
bash
/feature-requirements fix-orders-performance "Based on investigation report, fix the N+1 query pattern and add missing index"
bash
/rp1-dev/feature-requirements fix-orders-performance "Based on investigation report, fix the N+1 query pattern and add missing index"
Option C: Further Investigation¶
If the root cause isn't clear, investigate specific areas:
bash
/code-investigate "Deep dive into database connection pool behavior under load"
bash
/rp1-dev/code-investigate "Deep dive into database connection pool behavior under load"
Summary¶
You've learned the bug investigation workflow:
| Phase | What Happens | Key Insight |
|---|---|---|
| Start | Describe symptoms | Be specific about when/where |
| Hypotheses | Agent generates theories | Ranked by confidence |
| Evidence | Code analysis | Actual patterns found |
| Testing | Evaluate each hypothesis | Confirm or eliminate |
| Report | Root cause identified | Actionable fix paths |
Key Benefits¶
- Systematic - No guessing, evidence-based analysis
- Context-aware - Uses your KB to understand architecture
- Documented - Full report for team review
- Actionable - Specific files and line numbers
Next Steps¶
- Fix the bug: Use code-quick-build for simple fixes
- Larger fixes: Use the Feature Development workflow
- Review the PR: After fixing, use PR Review to verify the change
- Reference: See code-investigate for full command options
Troubleshooting¶
Investigation isn't finding relevant files
Your knowledge base may be outdated. Regenerate it:
bash
/knowledge-build
Or provide more specific context in your problem description.
Hypotheses don't match my intuition
The investigator generates hypotheses based on code evidence. If you suspect something specific, include it:
bash
/code-investigate "API timeout - I suspect it's related to the Redis cache invalidation"
Report suggests multiple possible causes
This is common for complex bugs. The report ranks causes by confidence. Start with the highest-confidence fix and verify if it resolves the issue.
Can I investigate without a knowledge base?
Yes, but results will be less accurate. The KB provides architectural context that helps the investigator understand your codebase patterns.