Dev Log: January 26, 2026
A deep study day covering distributed consensus (Raft, Paxos, ZooKeeper) and GPU cluster scheduling (Gavel). Worked through the distinctions between 2PC and consensus protocols, Raft’s election and log replication mechanisms, and membership changes via joint consensus. On the Gavel side, ran simulation experiments across multiple load levels, profiled the simulator to identify bottlenecks, fixed a bug in job completion time aggregation, and implemented saturation detection for stuck experiments.
- 2PC and consensus solve different problems: 2PC is for atomic transactions, consensus is for replicated agreement
- ZooKeeper is a user of consensus, not an alternative to it
- The reason Raft matters: before it, implementing consensus meant wrestling with Paxos, which even Google’s Chubby team called “an unproven protocol” after trying to implement it
- The service layer (etcd, ZooKeeper) is what applications talk to
- The consensus layer (Raft, Zab, Paxos) is an implementation detail hidden inside
- You could theoretically swap Zab for Raft inside ZooKeeper and applications wouldn’t notice
The algorithm is deterministic, but the workload is not.
The random seed controls the synthetic trace generation, not the scheduler itself. Here’s what’s randomized:
| Component | What the seed controls |
|---|---|
| Job arrival times | Poisson process: when each job enters the system |
| Job types | Which of the 26 model types (ResNet-50, Transformer, etc.) each job is |
| Job durations | Sampled from 10^x distribution (how long each job needs to train) |
| Job scale (Fig 10-11) | How many GPUs each job requests (1, 2-4, or 8) |
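The table above can be made concrete with a minimal sketch of a seeded trace generator. This is a hypothetical illustration, not Gavel's actual code; the function name and distribution parameters are made up, but the structure (one RNG seeded once, every random draw flowing from it) is the point:

```python
import random

def generate_trace(seed, num_jobs=5, jobs_per_hr=4.0):
    """Generate a synthetic job trace; every random draw flows from the seed."""
    rng = random.Random(seed)
    lam = 3600.0 / jobs_per_hr                 # mean inter-arrival gap, seconds
    t = 0.0
    trace = []
    for _ in range(num_jobs):
        t += rng.expovariate(1.0 / lam)        # Poisson arrivals: exponential gaps
        model = rng.choice(["ResNet-50", "Transformer", "A3C"])  # job type
        duration = 10 ** rng.uniform(1.5, 4.0)  # durations from a 10^x distribution
        n_gpus = rng.choice([1, 2, 4, 8])        # job scale: GPUs requested
        trace.append((t, model, duration, n_gpus))
    return trace

# Same seed => identical workload, so FIFO vs. Gavel is a fair comparison
assert generate_trace(42) == generate_trace(42)
assert generate_trace(0) != generate_trace(1)
```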
Why this matters:
- Same seed = identical workload - If you run FIFO and Gavel with seed=42, they see the exact same jobs arriving at the exact same times. This makes the comparison fair.
- Different seeds = different workloads - Seed=0 might generate a burst of expensive 8-GPU jobs early, while seed=1 spreads them out. This changes the results.
- Error bars show robustness - If Gavel beats FIFO across all seeds, the result is robust. If it wins only on some seeds, the workload characteristics matter more than the algorithm.
The simulation has three main performance bottlenecks:
- Policy Allocation (cvxpy solver) - Called once per scheduling round to compute optimal job-to-GPU allocation
- Job Scheduling - Maps allocation fractions to actual worker assignments
- State Management - Tracking job progress, completions, arrivals
The cvxpy solver is almost certainly the dominant cost, since it solves a full convex optimization problem on every scheduling round.
cProfile is Python’s built-in deterministic profiler:
- How it works: Intercepts every function call/return and records timing
- Overhead: Typically 10-30% slowdown (varies by call frequency)
- Output: Function-level stats: total time, cumulative time, call count
The overhead is acceptable for understanding bottlenecks but not for final benchmarks.
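A minimal cProfile session looks like this; the `simulate()` body here is just a stand-in workload, not the real scheduler loop:

```python
import cProfile
import io
import pstats

def simulate():
    # Stand-in workload for the real simulator's round loop
    return sum(i * i for i in range(100_000))

profiler = cProfile.Profile()
profiler.enable()
simulate()
profiler.disable()

# Sort by cumulative time and show the top 10 functions
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(10)
print(stream.getvalue())
```

Sorting by cumulative time is usually the right first view for finding a bottleneck like the cvxpy call, since it attributes time to the caller that triggered the expensive work.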
- Election timeout randomization is what prevents infinite split votes - simple but crucial
- Terms are like version numbers that let servers detect when they have stale information
- Majority quorum means you can lose up to (N-1)/2 servers and keep running
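The quorum arithmetic in the last bullet is worth making concrete (a trivial helper, not from any real Raft implementation):

```python
def max_failures(n):
    """A majority quorum over n servers tolerates floor((n - 1) / 2) failures:
    the surviving n - f servers must still form a majority."""
    return (n - 1) // 2

assert max_failures(3) == 1   # classic 3-node cluster: survives 1 crash
assert max_failures(5) == 2
assert max_failures(6) == 2   # even cluster sizes buy no extra fault tolerance
assert max_failures(7) == 3
```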
The existing config has a subtle issue: lambda in the code is the mean inter-arrival time in seconds, not a rate in jobs/hr. The conversion is lambda = 3600 / jobs_per_hr, so 4 jobs/hr = 900 seconds between arrivals.
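A one-line helper makes the conversion hard to get wrong (hypothetical name, not from the Gavel codebase):

```python
def inter_arrival_seconds(jobs_per_hr):
    """Convert an arrival rate in jobs/hr to the simulator's lam
    (mean seconds between job arrivals)."""
    return 3600.0 / jobs_per_hr

assert inter_arrival_seconds(4) == 900.0     # 4 jobs/hr -> 900 s between arrivals
assert inter_arrival_seconds(1.5) == 2400.0
```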
I spotted an issue: time_per_iteration=lam is wrong. time_per_iteration is the scheduling round duration (paper uses 360 seconds = 6 minutes), not the inter-arrival time. The lam is passed separately to simulate().
The log files are massive (100+ MB each) due to verbose DEBUG logging. The simulations are actively processing jobs (currently around job 2000+). This verbose output will make cProfile results hard to read but won’t affect the timing data.
- Raft trades scalability for simplicity - it’s not meant for 100+ node clusters
- Real large-scale systems use many small Raft groups (sharding), not one large group
- The randomized timeout is “good enough” for small clusters, not a perfect solution
- The prevLogIndex/prevLogTerm check is like a hash chain - if entry N matches, all entries before N must also match
- Leaders never delete their own entries - they only append. Followers conform to the leader.
- Uncommitted entries can be lost - this is by design. Only committed entries are durable.
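The hash-chain analogy can be made concrete as the follower-side consistency check from Raft's AppendEntries RPC. This is a simplified sketch; a real implementation also handles snapshots and conflict back-off:

```python
def append_entries_ok(log, prev_log_index, prev_log_term):
    """Accept the leader's entries only if our entry at prev_log_index has
    term prev_log_term. By the Log Matching property, a match there implies
    every earlier entry matches the leader's log too.

    log: list of (term, command) pairs; Raft indexes entries from 1.
    """
    if prev_log_index == 0:
        return True                       # appending from the very start of the log
    if prev_log_index > len(log):
        return False                      # follower is missing entries: reject
    term, _command = log[prev_log_index - 1]
    return term == prev_log_term

follower_log = [(1, "x=1"), (1, "y=2"), (2, "x=3")]
assert append_entries_ok(follower_log, 3, 2)       # chain intact: accept
assert not append_entries_ok(follower_log, 3, 1)   # term mismatch: reject
assert not append_entries_ok(follower_log, 5, 2)   # gap in the log: reject
```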
At 1.5 jobs/hr (Fig 10), LAS shows 17.65 hr vs FIFO’s 23.66 hr - a 1.34x improvement. This aligns with the paper’s claims that Gavel outperforms FIFO at higher loads.
Key observations:
- At low load (0.5-1.0 jobs/hr), all policies perform similarly - the cluster isn’t congested
- At high load (2.5 jobs/hr), FIFO’s JCT explodes to 210 hours while LAS stays at 31 hours - this matches the paper’s claim of Gavel preventing starvation
- The utilization at 2.5 jobs/hr is ~96% for both policies - the difference is in how fairly time is allocated
- The election restriction is what makes Raft safe - candidates must have all committed entries to win
- Quorum intersection: If entry E is committed (on majority M1), and candidate C needs majority M2, then M1 ∩ M2 ≠ ∅. Someone in M2 has E and will deny C if C lacks E.
- This is why Raft needs a majority (not just “some” servers) - it guarantees overlap
- Just being on a majority isn’t enough - the entry must be “locked in” by a current-term commit
- This is subtle but prevents a real data loss scenario
- Practical implication: new leaders often write a no-op entry immediately to commit pending entries
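The quorum intersection argument can be checked exhaustively for a small cluster:

```python
from itertools import combinations

def majorities(n):
    """All majority-sized subsets of an n-server cluster."""
    need = n // 2 + 1
    return [set(c) for c in combinations(range(n), need)]

# Every commit quorum M1 overlaps every vote quorum M2, so at least one
# voter holds entry E and will deny a candidate whose log lacks it.
for m1 in majorities(5):
    for m2 in majorities(5):
        assert m1 & m2, "two majorities of the same cluster must intersect"
```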
- Joint consensus is the key: require agreement from both old AND new configs during transition
- This ensures there’s never a moment where two independent majorities can form
- Membership changes are treated as log entries - they use the same replication mechanism as data
The bug occurred because the _job_completion_times dictionary can contain None values for jobs that were marked as completed but whose completion time was never recorded. sum() fails as soon as it tries to add None to a float. Always guard against None values in list comprehensions when aggregating data.
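The fix, in the shape it took (hypothetical names; the real aggregation lives inside the simulator):

```python
def average_jct(job_completion_times):
    """Mean job completion time, skipping jobs whose completion time was
    never recorded. A bare sum(job_completion_times.values()) raises
    TypeError the moment it adds None to a float."""
    times = [t for t in job_completion_times.values() if t is not None]
    return sum(times) / len(times) if times else 0.0

jcts = {"job0": 12.5, "job1": None, "job2": 7.5}
assert average_jct(jcts) == 10.0    # the None entry is skipped, not summed
```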
Saturation detection vs JCT threshold:
- JCT threshold was reactive - waited until jobs took too long
- Saturation detection is proactive - detects when no progress is being made
- 50 rounds × 6 min/round = 5 hours simulated time with no progress triggers exit
- This catches experiments stuck at 998/1000 much faster
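The detection logic above can be sketched as a small counter. This is a hypothetical reconstruction of the idea, not the code as committed; the class name and API are illustrative:

```python
class SaturationDetector:
    """Flag an experiment as stuck when no job completes for `patience`
    consecutive rounds. With 50 rounds at a 6-minute round duration,
    that is 5 hours of simulated time with zero progress."""

    def __init__(self, patience=50):
        self.patience = patience
        self.rounds_without_progress = 0
        self.last_completed = 0

    def update(self, num_completed):
        """Call once per scheduling round; returns True when saturated."""
        if num_completed > self.last_completed:
            self.last_completed = num_completed
            self.rounds_without_progress = 0   # progress made: reset
        else:
            self.rounds_without_progress += 1
        return self.rounds_without_progress >= self.patience

det = SaturationDetector(patience=3)
assert not det.update(998)   # first sighting counts as progress
assert not det.update(998)   # stagnant round 1
assert not det.update(998)   # stagnant round 2
assert det.update(998)       # stagnant round 3: saturated, exit early
```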
- Raft’s value is in clarity, not power - a rare and valuable contribution
- The paper succeeds by being complete enough to implement without filling in gaps
- Real-world validation (etcd, Consul, etc.) is ultimately the strongest evidence