Design Your Agent
Module 10: Learning Objectives
By the end of this module, you will:
- ✓ Design a complete autonomous software engineering agent
- ✓ Implement multi-agent orchestration with specialized roles
- ✓ Integrate all concepts from previous chapters
- ✓ Deploy a production-ready agent system
- ✓ Evaluate and iterate based on real-world testing
Capstone Project: Autonomous Software Engineering Agent
Welcome to the capstone project! You’ll build a sophisticated agent that can analyze codebases, identify issues, propose fixes, write tests, and refactor code autonomously.
Project Overview
What We’re Building
An Autonomous Software Engineering Agent that can:
- Analyze code quality and identify bugs
- Generate fixes with explanations
- Write comprehensive tests
- Refactor code for better maintainability
- Review pull requests
- Learn from feedback
Why This Project?
This capstone integrates nearly everything from the course:
- ReAct pattern (Module 2): Reasoning and acting on code
- Planning (Module 3): Breaking down complex refactoring tasks
- Memory (Module 3): Remembering codebase patterns and past fixes
- Code execution (Module 4): Running and validating code
- Production patterns (Module 5): Safety, testing, monitoring
- Specialized agents (Module 6): Coding agent capabilities
- Learning (Module 7): Adapting from feedback
- Enterprise scale (Module 8): Handling large codebases
- Frontier capabilities (Module 9): Self-improvement, tool creation
Requirements Gathering
Functional Requirements
Core Capabilities:
- Code Analysis: Parse and understand code structure
- Bug Detection: Identify potential issues
- Fix Generation: Propose and implement fixes
- Test Generation: Create comprehensive tests
- Refactoring: Improve code quality
- PR Review: Analyze changes and provide feedback
User Interactions:
- Natural language commands (“Fix the bug in auth.py”)
- File/directory targeting
- Interactive clarifications
- Progress reporting
- Explanation of changes
Non-Functional Requirements
Performance:
- Analyze files < 5 seconds
- Generate fixes < 30 seconds
- Handle codebases up to 100K lines
Reliability:
- Never break working code
- Validate all changes
- Rollback capability
- 95%+ test coverage for generated code
Safety:
- Sandbox code execution
- No destructive operations without confirmation
- Backup before modifications
- Security vulnerability checks
Usability:
- Clear explanations
- Confidence scores
- Alternative solutions
- Learning from user feedback
Architecture Design
High-Level Architecture
graph TB
UI[User Interface Layer]
UI --> ORC[Orchestration Layer]
subgraph Orchestration
ORC --> PLAN[Planner]
ORC --> ROUTE[Router]
ORC --> MON[Monitor]
end
subgraph Agents
ROUTE --> ANA[Analyzer Agent]
ROUTE --> FIX[Fixer Agent]
ROUTE --> TEST[Tester Agent]
ROUTE --> REF[Refactorer Agent]
ROUTE --> REV[Reviewer Agent]
end
subgraph Tools
ANA --> AST[AST Parser]
FIX --> EXEC[Code Executor]
TEST --> RUNNER[Test Runner]
AST --> LINT[Linter]
EXEC --> GIT[Git Ops]
end
subgraph Storage
MON --> VDB[(Vector DB)]
MON --> CACHE[(Code Cache)]
MON --> FB[(Feedback DB)]
end
style UI fill:#dbeafe
style ORC fill:#fef3c7
style ANA fill:#d1fae5
style FIX fill:#d1fae5
style TEST fill:#d1fae5
Component Design
1. Orchestration Layer
from typing import Dict, List, Optional
from dataclasses import dataclass
from enum import Enum
class TaskType(Enum):
ANALYZE = "analyze"
FIX_BUG = "fix_bug"
WRITE_TEST = "write_test"
REFACTOR = "refactor"
REVIEW_PR = "review_pr"
@dataclass
class Task:
type: TaskType
target: str # File or directory
description: str
priority: int
dependencies: List[str]
class Orchestrator:
"""Coordinates multiple specialized agents"""
def __init__(self):
self.planner = TaskPlanner()
self.router = AgentRouter()
self.monitor = ProgressMonitor()
def execute_request(self, request: str, context: Dict) -> Dict:
"""Main entry point"""
# Plan tasks
tasks = self.planner.create_plan(request, context)
# Execute tasks
results = []
for task in tasks:
# Route to appropriate agent
agent = self.router.get_agent(task.type)
# Execute
result = agent.execute(task)
results.append(result)
# Monitor progress
self.monitor.update(task, result)
# Synthesize results
return self.synthesize_results(results)
2. Agent Layer
class AnalyzerAgent:
"""Analyzes code quality and identifies issues"""
def execute(self, task: Task) -> Dict:
# Parse code
# Run static analysis
# Identify issues
# Prioritize findings
pass
class FixerAgent:
"""Generates and applies fixes"""
def execute(self, task: Task) -> Dict:
# Understand issue
# Generate fix
# Validate fix
# Apply changes
pass
class TesterAgent:
"""Writes tests for code"""
def execute(self, task: Task) -> Dict:
# Analyze code
# Identify test cases
# Generate tests
# Validate coverage
pass
class RefactorerAgent:
"""Refactors code for quality"""
def execute(self, task: Task) -> Dict:
# Identify code smells
# Plan refactoring
# Apply transformations
# Verify behavior preserved
pass
class ReviewerAgent:
"""Reviews code changes"""
def execute(self, task: Task) -> Dict:
# Analyze diff
# Check for issues
# Suggest improvements
# Approve or request changes
pass
3. Tool Layer
class CodeTools:
"""Low-level code manipulation tools"""
def parse_ast(self, code: str, language: str) -> Dict:
"""Parse code into AST"""
pass
def execute_code(self, code: str, test_input: any) -> any:
"""Execute code safely"""
pass
def run_linter(self, file_path: str) -> List[Dict]:
"""Run linter on code"""
pass
def format_code(self, code: str, language: str) -> str:
"""Format code"""
pass
def run_tests(self, test_file: str) -> Dict:
"""Run test suite"""
pass
def git_diff(self, file_path: str) -> str:
"""Get git diff"""
pass
Tool Selection
Required Tools
| Tool | Purpose | Integration |
|---|---|---|
| AST Parser | Code structure analysis | ast (Python), tree-sitter (multi-lang) |
| Static Analyzer | Bug detection | pylint, mypy, ruff |
| Code Executor | Validation | Docker sandbox |
| Test Framework | Test generation/running | pytest, unittest |
| Git Integration | Version control | GitPython |
| Vector DB | Code search | chromadb, pinecone |
| LLM API | Reasoning | OpenAI, Anthropic |
Tool Integration Strategy
class ToolRegistry:
"""Registry of available tools"""
def __init__(self):
self.tools = {
"parse_code": {
"function": self.parse_code,
"description": "Parse code into AST",
"parameters": {"code": "str", "language": "str"}
},
"run_linter": {
"function": self.run_linter,
"description": "Run static analysis",
"parameters": {"file_path": "str"}
},
"execute_code": {
"function": self.execute_code,
"description": "Execute code safely",
"parameters": {"code": "str", "timeout": "int"}
},
"run_tests": {
"function": self.run_tests,
"description": "Run test suite",
"parameters": {"test_path": "str"}
},
"search_similar_code": {
"function": self.search_similar_code,
"description": "Find similar code patterns",
"parameters": {"query": "str", "limit": "int"}
}
}
def get_tool_schemas(self) -> List[Dict]:
"""Get OpenAI function schemas"""
return [
{
"name": name,
"description": tool["description"],
"parameters": {
"type": "object",
"properties": {
param: {"type": ptype}
for param, ptype in tool["parameters"].items()
},
"required": list(tool["parameters"].keys())
}
}
for name, tool in self.tools.items()
]
Safety Considerations
Critical Safety Measures
1. Code Execution Sandbox
import docker
class SafeExecutor:
"""Execute code in isolated container"""
def __init__(self):
self.client = docker.from_env()
def execute(self, code: str, timeout: int = 30) -> Dict:
"""Execute with resource limits"""
container = self.client.containers.run(
"python:3.11-slim",
command=f"python -c '{code}'",
detach=True,
mem_limit="256m",
cpu_quota=50000,
network_disabled=True,
remove=True
)
try:
result = container.wait(timeout=timeout)
logs = container.logs().decode()
return {"success": True, "output": logs}
except:
container.kill()
return {"success": False, "error": "Timeout or error"}
2. Change Validation
class ChangeValidator:
"""Validate code changes before applying"""
def validate(self, original: str, modified: str) -> Dict:
"""Multi-level validation"""
checks = {
"syntax": self.check_syntax(modified),
"tests_pass": self.run_tests(modified),
"no_security_issues": self.check_security(modified),
"behavior_preserved": self.verify_behavior(original, modified)
}
return {
"valid": all(checks.values()),
"checks": checks
}
3. Human-in-the-Loop
class ApprovalGate:
"""Require human approval for critical changes"""
def requires_approval(self, change: Dict) -> bool:
"""Determine if change needs approval"""
critical_patterns = [
"delete", "drop", "remove",
"auth", "security", "password",
"production", "deploy"
]
return any(pattern in change["description"].lower()
for pattern in critical_patterns)
Success Metrics
Key Performance Indicators
Accuracy Metrics:
- Bug detection rate (precision/recall)
- Fix success rate (% that work)
- Test coverage achieved
- False positive rate
Efficiency Metrics:
- Time to analyze file
- Time to generate fix
- Lines of code processed per minute
- Token usage per task
Quality Metrics:
- Code quality improvement (linter score)
- Test pass rate
- User acceptance rate
- Regression rate (fixes that break things)
Measurement Strategy
class MetricsCollector:
"""Collect and track metrics"""
def __init__(self):
self.metrics = {
"bugs_detected": 0,
"fixes_applied": 0,
"fixes_successful": 0,
"tests_generated": 0,
"avg_analysis_time": [],
"user_approvals": 0,
"user_rejections": 0
}
def record_analysis(self, duration: float, bugs_found: int):
"""Record analysis metrics"""
self.metrics["avg_analysis_time"].append(duration)
self.metrics["bugs_detected"] += bugs_found
def record_fix(self, success: bool):
"""Record fix attempt"""
self.metrics["fixes_applied"] += 1
if success:
self.metrics["fixes_successful"] += 1
def get_success_rate(self) -> float:
"""Calculate fix success rate"""
if self.metrics["fixes_applied"] == 0:
return 0.0
return self.metrics["fixes_successful"] / self.metrics["fixes_applied"]
Data Flow Design
Request Processing Flow
User Request
↓
Parse Intent
↓
Create Plan (Task Decomposition)
↓
For each task:
↓
Route to Specialized Agent
↓
Execute with Tools
↓
Validate Results
↓
Store in Memory
↓
Synthesize Results
↓
Present to User
↓
Collect Feedback
↓
Update Models
State Management
from dataclasses import dataclass
from typing import Optional
import json
@dataclass
class AgentState:
"""Current state of the agent"""
current_task: Optional[Task]
task_history: List[Dict]
codebase_context: Dict
user_preferences: Dict
performance_metrics: Dict
class StateManager:
"""Manage agent state"""
def __init__(self, state_file: str = "agent_state.json"):
self.state_file = state_file
self.state = self.load_state()
def load_state(self) -> AgentState:
"""Load state from disk"""
try:
with open(self.state_file, 'r') as f:
data = json.load(f)
return AgentState(**data)
except:
return AgentState(
current_task=None,
task_history=[],
codebase_context={},
user_preferences={},
performance_metrics={}
)
def save_state(self):
"""Persist state to disk"""
with open(self.state_file, 'w') as f:
json.dump(self.state.__dict__, f, indent=2)
def update_context(self, file_path: str, analysis: Dict):
"""Update codebase context"""
self.state.codebase_context[file_path] = analysis
self.save_state()
Memory Architecture
Multi-Level Memory System
1. Working Memory: Current task context
class WorkingMemory:
"""Short-term task context"""
def __init__(self, max_size: int = 10):
self.max_size = max_size
self.items = []
def add(self, item: Dict):
"""Add to working memory"""
self.items.append(item)
if len(self.items) > self.max_size:
self.items.pop(0)
def get_context(self) -> str:
"""Get context for LLM"""
return "\n".join([
f"- {item['type']}: {item['content']}"
for item in self.items
])
2. Episodic Memory: Past tasks and solutions
class EpisodicMemory:
"""Remember past tasks"""
def __init__(self):
self.episodes = []
def store_episode(self, task: Task, solution: Dict, outcome: Dict):
"""Store completed task"""
self.episodes.append({
"task": task,
"solution": solution,
"outcome": outcome,
"timestamp": time.time()
})
def recall_similar(self, current_task: Task, limit: int = 5) -> List[Dict]:
"""Recall similar past tasks"""
# Use embedding similarity
return self.episodes[-limit:]
3. Semantic Memory: Codebase knowledge
import chromadb
class SemanticMemory:
"""Long-term codebase knowledge"""
def __init__(self):
self.client = chromadb.Client()
self.collection = self.client.create_collection("codebase")
def index_codebase(self, files: List[str]):
"""Index codebase for semantic search"""
for file_path in files:
with open(file_path, 'r') as f:
code = f.read()
self.collection.add(
documents=[code],
metadatas=[{"file_path": file_path}],
ids=[file_path]
)
def search(self, query: str, n_results: int = 5) -> List[Dict]:
"""Search for relevant code"""
results = self.collection.query(
query_texts=[query],
n_results=n_results
)
return results
Error Handling Strategy
Graceful Degradation
class RobustAgent:
"""Agent with comprehensive error handling"""
def execute_with_fallbacks(self, task: Task) -> Dict:
"""Execute with multiple fallback strategies"""
strategies = [
self.primary_strategy,
self.simplified_strategy,
self.conservative_strategy
]
for strategy in strategies:
try:
result = strategy(task)
if self.validate_result(result):
return result
except Exception as e:
self.log_error(strategy.__name__, e)
continue
return {
"success": False,
"error": "All strategies failed",
"recommendation": "Manual intervention required"
}
Design Decisions
Key Choices
1. Multi-Agent vs Single Agent
- Choice: Multi-agent with specialized roles
- Rationale: Better separation of concerns, easier to test, more maintainable
2. Synchronous vs Asynchronous
- Choice: Asynchronous for I/O operations
- Rationale: Better performance, can analyze multiple files in parallel
3. Local vs Cloud Execution
- Choice: Hybrid (local analysis, cloud LLM)
- Rationale: Security for code, power for reasoning
4. Automatic vs Interactive
- Choice: Interactive with automatic mode option
- Rationale: Safety for critical changes, speed for routine tasks
5. Learning Strategy
- Choice: Few-shot + feedback learning
- Rationale: Fast adaptation without full retraining
✅ Key Takeaways
- Design requires balancing functional and non-functional requirements
- Multi-agent architecture provides separation of concerns
- Safety mechanisms are critical for code-modifying agents
- Memory systems enable learning from past experiences
- Tool selection impacts capabilities and complexity
- Architecture decisions should align with use case constraints
Next Steps
Now that we have the design, let’s implement the Autonomous Software Engineering Agent!
In the next section, you’ll build:
- Complete working implementation
- All specialized agents
- Tool integrations
- Safety mechanisms
- Real-world examples