# Performance Benchmarking Strategy for Large Monorepos

Status: Investigating
Proposed: December 2026
Goal: Establish comprehensive performance benchmarks to validate Atomic's scalability with large monorepos and thousands of concurrent developers

## Problem Statement
Based on the mathematical analysis in Hunks: Edit and Replacement Calculations, Atomic faces several performance challenges with large-scale usage:
- Context Calculation: O(contexts × edges) complexity per hunk
- Write Amplification: 2-5× overhead from copy-on-write B-trees
- Transaction Commits: 1-10ms overhead per change
- Query Performance: O(changes) for file existence/content lookups
These bottlenecks become critical when scaling to:
- Large monorepos (10,000+ files, 100GB+ codebase)
- High change velocity (1000+ changes per day)
- Many concurrent developers (1000+ active users)
- AI agent swarms (100+ agents generating changes simultaneously)
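To put rough numbers on this (a back-of-envelope estimate from the figures above, not measured data): transaction overhead alone at the target change velocity is

$$
10{,}000\ \tfrac{\text{changes}}{\text{day}} \times (1\text{-}10\ \text{ms}) \approx 10\text{-}100\ \text{s/day},
$$

which is tolerable on its own, while 2-5× write amplification turns every logical gigabyte written into 2-5 GB of physical I/O. The real concern is how these costs compound under concurrency, which is what the scenarios below isolate.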
## Testing Philosophy

Don't just test file counts; test the actual bottlenecks.
Rather than simply creating X files with Y changes, we need to simulate real-world usage patterns that stress the specific algorithmic bottlenecks:
- Context calculation stress: Many small changes with deep dependency chains
- Write amplification stress: High-frequency small commits
- Query performance stress: Frequent file existence/content lookups
- Concurrency stress: Many developers working simultaneously
## Benchmark Scenarios

### Scenario 1: Context Calculation Stress Test

Goal: Measure O(contexts × edges) complexity impact
Setup:
- Repository: 1,000 files across 50 directories (initial baseline)
- Changes: 100,000 changes in sequential dependency chain (each depends on previous)
- Pattern: Each change modifies the same file sequentially, appending one line
- Dependencies: Each change creates context dependencies on previous changes
Metrics:
- Time to record each change (context calculation overhead)
- Throughput (changes per second)
- Performance degradation as dependency chain grows
- Memory usage during change recording
Results (Initial Benchmark - January 2026):
- Constant time per change: ~500µs regardless of chain length (10 to 100,000 changes)
- Stable throughput: ~1,680-1,690 changes/sec maintained throughout
- No degradation observed: Performance remains constant as dependency chain grows
- Total time: ~60 seconds for 100,000 changes
Why this matters: Tests the worst-case scenario for context calculation. The initial results show excellent scalability: context calculation overhead remains constant even with very long dependency chains, suggesting the implementation avoids quadratic complexity in practice.
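A minimal sketch of the measurement loop behind these numbers, assuming a hypothetical harness method `append_line_and_record` that appends one line to a file and records it as a change depending on the previous one (`Hash` is the change hash type):

```rust
use std::time::{Duration, Instant};

// Record a sequential dependency chain of `n` changes on one file and
// capture per-change latency. `append_line_and_record` is a hypothetical
// harness helper returning the hash of the recorded change.
fn measure_sequential_chain(repo: &mut BenchmarkRepo, n: u64) -> Vec<Duration> {
    let mut latencies = Vec::with_capacity(n as usize);
    let mut previous: Option<Hash> = None;
    for i in 0..n {
        let start = Instant::now();
        previous = Some(repo.append_line_and_record("src/module-1/file-0.rs", i, previous));
        latencies.push(start.elapsed());
    }
    latencies
}
```

Plotting these latencies against the change index is what shows whether the chain stays flat (as observed) or degrades.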
### Scenario 2: Write Amplification Stress Test

Goal: Measure COW B-tree overhead (2-5× write amplification)
Setup:
- Repository: 10,000 files
- Changes: 50,000 small changes (1-2 hunks each)
- Pattern: Each change commits immediately (no batching)
- Frequency: 100 changes per second (simulated burst)
Metrics:
- Disk writes per change (write amplification ratio)
- Time per commit (transaction overhead)
- Database size vs. logical data size
- I/O throughput (MB/s)
Why this matters: Tests Sanakirja's copy-on-write overhead, which can be 2-5× for small writes.
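One way to estimate the amplification ratio is from database file growth; this is a lower bound, since it misses pages rewritten in place. `db_path`, `record_and_commit`, and `serialized_len` are hypothetical harness helpers:

```rust
use std::fs;

// Estimate write amplification as database file growth divided by the
// logical bytes of the changes written. Treat the result as a lower bound.
fn write_amplification(repo: &mut BenchmarkRepo, changes: Vec<Change>) -> f64 {
    let before = fs::metadata(repo.db_path()).map(|m| m.len()).unwrap_or(0);
    let mut logical_bytes: u64 = 0;
    for change in changes {
        logical_bytes += change.serialized_len() as u64;
        repo.record_and_commit(change);
    }
    let after = fs::metadata(repo.db_path()).map(|m| m.len()).unwrap_or(0);
    after.saturating_sub(before) as f64 / logical_bytes.max(1) as f64
}
```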
### Scenario 3: Query Performance Stress Test
Goal: Measure O(changes) query degradation
Setup:
- Repository: 5,000 files across 100 directories
- Changes: 100,000 changes over time
- Queries: 10,000 random file existence/content lookups
- Pattern: Query files at different points in history (recent vs. old)
Metrics:
- Query latency by change count (1K, 10K, 50K, 100K changes)
- Time to query recent files vs. old files
- Cache hit rates (if applicable)
- Database page access patterns
Why this matters: Validates the need for Manifest nodes (O(1) queries vs. O(changes)).
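A sketch of the query probe, assuming hypothetical harness wrappers `random_path` and `file_exists`; running it at each history checkpoint (1K, 10K, 50K, 100K changes) yields the latency-vs-change-count curve:

```rust
use std::time::{Duration, Instant};

// Average file-existence query latency at the current history depth.
fn probe_query_latency(repo: &BenchmarkRepo, samples: u32) -> Duration {
    let mut total = Duration::ZERO;
    for _ in 0..samples {
        let path = repo.random_path();
        let start = Instant::now();
        let _exists = repo.file_exists(&path);
        total += start.elapsed();
    }
    total / samples
}
```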
### Scenario 4: Concurrent Developer Simulation
Goal: Measure contention and coordination overhead
Setup:
- Repository: Shared monorepo with 10,000 files
- Developers: 100 concurrent "developers" (processes/threads)
- Pattern: Each developer makes 100 changes over 1 hour
- Coordination: All developers work on different files (no conflicts)
Metrics:
- Throughput (changes per second across all developers)
- Latency per developer (p50, p95, p99)
- Database lock contention
- Memory usage under load
- Transaction abort rate
Why this matters: Tests real-world multi-developer scenarios with potential contention.
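To get the per-developer percentiles, each worker can report its latencies over a channel for central aggregation; a sketch, with `create_change`, `random_files`, and `record_change` as the hypothetical harness helpers used elsewhere in this document:

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Instant;

fn run_concurrent_developers() {
    let (tx, rx) = mpsc::channel();
    for dev_id in 0..100u32 {
        let tx = tx.clone();
        thread::spawn(move || {
            for _ in 0..100 {
                let start = Instant::now();
                record_change(create_change(dev_id, random_files(1..10)));
                tx.send((dev_id, start.elapsed())).unwrap();
            }
        });
    }
    drop(tx); // close the channel so `rx.iter()` terminates
    let samples: Vec<_> = rx.iter().collect(); // 10,000 (dev_id, latency) pairs
    // ...compute p50/p95/p99 per developer and overall from `samples`
}
```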
### Scenario 5: AI Agent Swarm Simulation
Goal: Measure AI agent parallel change generation
Setup:
- Repository: 5,000 files
- Agents: 50 concurrent agents
- Pattern: Each agent generates 200 changes (small, incremental)
- Dependency pattern: Agents create independent change stacks
Metrics:
- Total throughput (changes per second)
- Average change size (hunks per change)
- Memory usage per agent (virtual working copy efficiency)
- Change deduplication rate (content-addressed deduplication benefit)
Why this matters: Tests Atomic's key differentiator, parallel agent swarms with commutative operations.
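Because changes are content-addressed, the deduplication rate falls directly out of the recorded hashes; a minimal sketch, assuming the harness collects one `Hash` per recorded change:

```rust
use std::collections::HashSet;

// Fraction of recorded changes that were byte-identical to an earlier one:
// with content addressing, duplicates are exactly the repeated hashes.
fn deduplication_rate(recorded: &[Hash]) -> f64 {
    let mut seen: HashSet<&Hash> = HashSet::new();
    let duplicates = recorded.iter().filter(|h| !seen.insert(*h)).count();
    duplicates as f64 / recorded.len().max(1) as f64
}
```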
### Scenario 6: Real-World Monorepo Simulation
Goal: Simulate actual large company monorepo patterns
Setup: Based on Meta/Google monorepo characteristics:
- Files: 100,000 files across 500 directories
- Changes: 10,000 changes per day (realistic for large company)
- Change size: 50% small (1-5 files), 30% medium (5-20 files), 20% large (20-100 files)
- Dependencies: 20% have dependencies, 80% independent
- Tags: Create consolidating tags every 100 changes
- Duration: 30 days of changes (300,000 total changes)
Metrics:
- Daily change throughput
- Repository size growth (database size)
- Clone time (fresh clone of 30-day history)
- Common operations (log, diff, apply) latency
- Memory usage patterns
Why this matters: Most realistic test, simulates actual usage patterns from large companies.
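The 50/30/20 change-size mix is easy to generate with weighted sampling; a sketch using the `rand` crate (assumed as a dev-dependency):

```rust
use rand::Rng;

// Number of files a change touches, following the Scenario 6 mix:
// 50% small (1-5 files), 30% medium (5-20), 20% large (20-100).
fn sample_change_size(rng: &mut impl Rng) -> usize {
    match rng.gen_range(0..100) {
        0..=49 => rng.gen_range(1..=5),   // small
        50..=79 => rng.gen_range(5..=20), // medium
        _ => rng.gen_range(20..=100),     // large
    }
}
```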
## Benchmark Implementation Strategy

### Phase 1: Synthetic Load Generation
Create a benchmark harness that generates synthetic but realistic load:
```rust
// libatomic/tests/benchmarks/large_repo.rs
use std::time::Duration;

pub struct BenchmarkRepo {
    files: Vec<FileSpec>,
    change_pattern: ChangePattern,
    dependencies: DependencyPattern,
}

pub enum ChangePattern {
    Sequential,  // Each change depends on the previous one
    Independent, // All changes independent
    Stacked,     // Changes form dependency stacks
    Mixed,       // Combination of patterns
}

pub struct BenchmarkResult {
    change_count: u64,
    total_time: Duration,
    avg_time_per_change: Duration,
    p95_time: Duration,
    p99_time: Duration,
    database_size: u64,
    memory_peak: u64,
    write_amplification: f64,
}
```
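A sketch of how per-change latencies might be folded into a `BenchmarkResult` (nearest-rank percentiles); the database, memory, and amplification fields are assumed to be filled in by separate probes:

```rust
use std::time::Duration;

// Nearest-rank percentile over an ascending latency vector (non-empty).
fn percentile(sorted: &[Duration], p: f64) -> Duration {
    let rank = ((p / 100.0) * sorted.len() as f64).ceil() as usize;
    sorted[rank.clamp(1, sorted.len()) - 1]
}

fn summarize(mut latencies: Vec<Duration>) -> BenchmarkResult {
    latencies.sort();
    let total: Duration = latencies.iter().sum();
    BenchmarkResult {
        change_count: latencies.len() as u64,
        total_time: total,
        avg_time_per_change: total / latencies.len().max(1) as u32,
        p95_time: percentile(&latencies, 95.0),
        p99_time: percentile(&latencies, 99.0),
        database_size: 0,         // filled in by the I/O probe
        memory_peak: 0,           // filled in by the memory probe
        write_amplification: 0.0, // filled in by the write probe
    }
}
```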
### Phase 2: Real-World Pattern Simulation
Extract realistic change patterns from actual large codebases to inform benchmark design:
- Analyze large codebases: Study change patterns from existing large repositories (e.g., Meta, Google monorepos) to understand:
  - Change frequency and size distributions
  - File modification patterns (how many files per change)
  - Dependency patterns (how changes relate to each other)
  - Developer workflow patterns (feature development, hotfixes, refactoring)
- Pattern library: Build a library of common patterns based on real-world analysis:
  - Feature branch patterns (sequential, stacked changes)
  - Hotfix patterns (independent, fast changes)
  - Refactoring patterns (many files, deep dependencies)
  - AI agent patterns (many small, independent changes)
### Phase 3: Continuous Benchmarking
Integrate benchmarks into CI/CD:
```bash
# Run benchmarks on every PR
cargo bench --bench large_repo

# Compare against the baseline; fail if performance degrades >10%
```
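The comparison step can be a small function over stored baseline timings (the format is hypothetical; the 10% threshold is the one stated above):

```rust
// Fail CI if any benchmark regressed more than 10% against its baseline.
// Entries are (benchmark name, mean nanoseconds) pairs from prior runs.
fn check_regressions(
    baseline: &[(String, u128)],
    current: &[(String, u128)],
) -> Result<(), String> {
    for ((name, base), (_, now)) in baseline.iter().zip(current) {
        if *now as f64 > *base as f64 * 1.10 {
            return Err(format!("{name}: {now} ns vs baseline {base} ns (>10%)"));
        }
    }
    Ok(())
}
```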
## Specific Test Harness Proposal

Based on your suggestion, but enhanced to target bottlenecks:

### Test Repository Structure
```
large-monorepo-benchmark/
├── src/
│   ├── module-1/    (40 files)
│   ├── module-2/    (40 files)
│   ├── ...
│   └── module-25/   (40 files)   # 25 modules × 40 files = 1,000 files
├── tests/
│   ├── module-1/    (10 test files per module)
│   └── ...
└── docs/
    └── ...
```
Total: 1,000 source files + 250 test files = 1,250 files
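Generating this tree takes only a few lines; a sketch (the stub file contents are arbitrary placeholders):

```rust
use std::fs;
use std::path::Path;

// Lay out 25 modules x 40 source files plus 10 test files per module
// (1,250 files total). File contents are small stubs so diffs stay cheap.
fn generate_repo(root: &Path) -> std::io::Result<()> {
    for module in 1..=25 {
        let src = root.join(format!("src/module-{module}"));
        fs::create_dir_all(&src)?;
        for file in 0..40 {
            fs::write(src.join(format!("file-{file}.rs")), "pub fn stub() {}\n")?;
        }
        let tests = root.join(format!("tests/module-{module}"));
        fs::create_dir_all(&tests)?;
        for file in 0..10 {
            fs::write(tests.join(format!("test-{file}.rs")), "#[test]\nfn stub() {}\n")?;
        }
    }
    Ok(())
}
```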
### Change Generation Strategy

Not just 50 changes per file; instead, generate changes that stress specific bottlenecks:

#### Pattern A: Deep Dependency Chains
```rust
// Each change depends on the previous one, forming a deep dependency chain.
let mut previous_change_hash: Option<Hash> = None;
for _ in 0..1_000 {
    let change = create_change(ChangeSpec {
        files: random_files(1..5),
        dependencies: previous_change_hash.iter().cloned().collect(),
        ..ChangeSpec::default()
    });
    previous_change_hash = Some(record_change(change));
}
```
Tests: Context calculation with deep chains
#### Pattern B: High-Frequency Small Changes
```rust
// Many small, immediately committed changes (simulates AI agents).
for _ in 0..50_000 {
    let change = create_change(ChangeSpec {
        files: random_files(1..2),
        hunks_per_file: 1..3,
        dependencies: vec![], // Independent
    });
    record_and_commit(change); // Immediate commit, no batching
}
```
Tests: Write amplification and transaction overhead
#### Pattern C: Query Performance Regression
```rust
// Build up history, then query at regular intervals.
for i in 0..100_000 {
    record_change(create_random_change());
    if i % 1_000 == 0 {
        benchmark_query_performance(); // Query 100 random files
    }
}
```
Tests: O(changes) query degradation
#### Pattern D: Concurrent Developers
```rust
// 100 parallel "developers", each recording 100 changes; join all
// threads so every change lands before measurement ends.
let handles: Vec<_> = (0..100)
    .map(|dev_id| {
        thread::spawn(move || {
            for _ in 0..100 {
                let change = create_change(dev_id, random_files(1..10));
                record_change(change);
            }
        })
    })
    .collect();
handles.into_iter().for_each(|h| h.join().unwrap());
```
Tests: Concurrency and contention
## Metrics to Collect

### Primary Metrics
- Change Recording Time
  - Average, p50, p95, p99 latencies
  - Breakdown: context calculation, database write, commit
- Database Performance
  - Write amplification ratio (actual writes / logical writes)
  - Transaction commit time
  - B-tree depth and page splits
  - Cache hit rates
- Query Performance
  - File existence query latency (by change count)
  - File content query latency (by change count)
  - History traversal time (log, diff operations)
- Memory Usage
  - Peak memory during operations
  - Memory per change (virtual working copy efficiency)
  - Database memory mapping overhead
- Scalability Curves
  - Performance vs. change count (1K to 10K to 100K to 1M)
  - Performance vs. file count (100 to 1K to 10K to 100K)
  - Performance vs. dependency depth (1 to 10 to 100 to 1,000)
### Secondary Metrics
- Content-Addressed Deduplication
  - Deduplication rate (identical changes detected)
  - Storage savings from deduplication
- Tag Consolidation Impact
  - Dependency count before/after tags
  - Query performance with/without tags
- Concurrency Metrics
  - Throughput (changes/second) vs. concurrent users
  - Lock contention rate
  - Transaction abort rate
## Success Criteria

### Baseline Targets (Current Implementation)
| Scenario | Metric | Target | Status |
|---|---|---|---|
| Small Repo (1K files, 1K changes) | Change record | <100ms | TBD |
| Medium Repo (10K files, 10K changes) | Change record | <500ms | TBD |
| Large Repo (100K files, 100K changes) | Change record | <2s | TBD |
| Query Performance (100K changes) | File existence | <10ms | TBD |
| Concurrent (100 developers) | Throughput | >10 changes/s | TBD |
### Optimization Targets (With Manifest Nodes)
| Scenario | Metric | Target | Improvement |
|---|---|---|---|
| Query Performance | File existence | <1ms | 10× faster |
| Large Repo Query | File content | <5ms | 100× faster |
| Write Amplification | Write ratio | <2× | 50% reduction |
## Implementation Plan

### Phase 1: Basic Benchmark Framework (Week 1)
- Create benchmark harness
  - File generation utilities
  - Change generation patterns
  - Metrics collection system
- Implement Scenarios 1-2
  - Context calculation stress test
  - Write amplification stress test
- Baseline measurements
  - Run benchmarks on current implementation
  - Document baseline performance
### Phase 2: Comprehensive Testing (Week 2)
- Implement Scenarios 3-6
  - Query performance tests
  - Concurrency tests
  - Real-world simulation
- Continuous benchmarking
  - Integrate into CI/CD
  - Performance regression detection
### Phase 3: Optimization & Validation (Ongoing)
- Implement optimizations
  - Batch transactions
  - Manifest nodes (if needed)
  - Context caching
- Validate improvements
  - Re-run benchmarks
  - Compare against targets
## Comparison with Your Proposal

### Your Proposal (Good Starting Point)
- ✅ 1,000 files (tests file count scaling)
- ✅ 25 folders (tests directory structure)
- ✅ 50 changes per file (50,000 total changes)
### Enhanced Proposal (Targets Bottlenecks)
- ✅ Same file structure (realistic)
- ✅ Multiple change patterns (not just 50 per file):
  - Deep dependency chains (context stress)
  - High-frequency commits (write amplification)
  - Query performance regression (O(changes) degradation)
  - Concurrent developers (contention)
- ✅ Real-world simulation (actual usage patterns)
- ✅ Comprehensive metrics (target specific bottlenecks)
## Next Steps

- Create benchmark harness in `libatomic/tests/benchmarks/`
- Implement Scenario 1 (Context Calculation Stress Test)
- Run baseline measurements on current implementation
- Identify bottlenecks from actual data
- Prioritize optimizations based on benchmark results
## References
- Hunks: Edit and Replacement Calculations - Mathematical foundations and complexity analysis
- Manifest Nodes Proposal - Proposed optimization for query performance
- Virtual Working Copies - AI agent optimization strategy