Dev Log: January 26, 2026
A deep study day covering distributed consensus (Raft, Paxos, ZooKeeper) and GPU cluster scheduling (Gavel). Worked through the distinctions between 2PC and consensus protocols, Raft’s election and log replication mechanisms, and membership changes via joint consensus. On the Gavel side, ran simulation experiments across multiple load levels, profiled the simulator to identify bottlenecks, fixed a bug in job completion time aggregation, and implemented saturation detection for stuck experiments.
- 2PC and consensus solve different problems: 2PC is for atomic transactions, consensus is for replicated agreement
- ZooKeeper is a user of consensus, not an alternative to it
- The reason Raft matters: before it, implementing consensus meant wrestling with Paxos, which even Google’s Chubby team called “an unproven protocol” after trying to implement it
- The service layer (etcd, ZooKeeper) is what applications talk to
- The consensus layer (Raft, Zab, Paxos) is an implementation detail hidden inside
- You could theoretically swap Zab for Raft inside ZooKeeper and applications wouldn’t notice
The algorithm is deterministic, but the workload is not.
The random seed controls the synthetic trace generation, not the scheduler itself. Here’s what’s randomized:
| Component | What the seed controls |
|---|---|
| Job arrival times | Poisson process: when each job enters the system |
| Job types | Which of the 26 model types (ResNet-50, Transformer, etc.) each job is |
| Job durations | Sampled from 10^x distribution (how long each job needs to train) |
| Job scale (Fig 10-11) | How many GPUs each job requests (1, 2-4, or 8) |
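The table above can be made concrete with a minimal sketch of a seeded trace generator. This is a hypothetical illustration, not Gavel's actual code; the function name and distribution parameters are made up, but the structure (one RNG seeded once, every random draw flowing from it) is the point:

```python
import random

def generate_trace(seed, num_jobs=5, jobs_per_hr=4.0):
    """Generate a synthetic job trace; every random draw flows from the seed."""
    rng = random.Random(seed)
    lam = 3600.0 / jobs_per_hr                 # mean inter-arrival gap, seconds
    t = 0.0
    trace = []
    for _ in range(num_jobs):
        t += rng.expovariate(1.0 / lam)        # Poisson arrivals: exponential gaps
        model = rng.choice(["ResNet-50", "Transformer", "A3C"])  # job type
        duration = 10 ** rng.uniform(1.5, 4.0)  # durations from a 10^x distribution
        n_gpus = rng.choice([1, 2, 4, 8])        # job scale: GPUs requested
        trace.append((t, model, duration, n_gpus))
    return trace

# Same seed => identical workload, so FIFO vs. Gavel is a fair comparison
assert generate_trace(42) == generate_trace(42)
assert generate_trace(0) != generate_trace(1)
```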
Why this matters:
- Same seed = identical workload - If you run FIFO and Gavel with seed=42, they see the exact same jobs arriving at the exact same times. This makes the comparison fair.
- Different seeds = different workloads - Seed=0 might generate a burst of expensive 8-GPU jobs early, while seed=1 spreads them out. This changes the results.
- Error bars show robustness - If Gavel beats FIFO across all seeds, the result is robust. If it wins only on some seeds, the workload characteristics matter more than the algorithm.
The simulation has three main performance bottlenecks:
- Policy Allocation (cvxpy solver) - Called once per scheduling round to compute optimal job-to-GPU allocation
- Job Scheduling - Maps allocation fractions to actual worker assignments
- State Management - Tracking job progress, completions, arrivals
The cvxpy solver is almost certainly the dominant cost, since it solves a full convex optimization problem on every scheduling round.
cProfile is Python’s built-in deterministic profiler:
- How it works: Intercepts every function call/return and records timing
- Overhead: Typically 10-30% slowdown (varies by call frequency)
- Output: Function-level stats: total time, cumulative time, call count
The overhead is acceptable for understanding bottlenecks but not for final benchmarks.
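A minimal cProfile session looks like this; the `simulate()` body here is just a stand-in workload, not the real scheduler loop:

```python
import cProfile
import io
import pstats

def simulate():
    # Stand-in workload for the real simulator's round loop
    return sum(i * i for i in range(100_000))

profiler = cProfile.Profile()
profiler.enable()
simulate()
profiler.disable()

# Sort by cumulative time and show the top 10 functions
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(10)
print(stream.getvalue())
```

Sorting by cumulative time is usually the right first view for finding a bottleneck like the cvxpy call, since it attributes time to the caller that triggered the expensive work.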
- Election timeout randomization is what prevents infinite split votes - simple but crucial
- Terms are like version numbers that let servers detect when they have stale information
- Majority quorum means you can lose up to (N-1)/2 servers and keep running
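The quorum arithmetic in the last bullet is worth making concrete (a trivial helper, not from any real Raft implementation):

```python
def max_failures(n):
    """A majority quorum over n servers tolerates floor((n - 1) / 2) failures:
    the surviving n - f servers must still form a majority."""
    return (n - 1) // 2

assert max_failures(3) == 1   # classic 3-node cluster: survives 1 crash
assert max_failures(5) == 2
assert max_failures(6) == 2   # even cluster sizes buy no extra fault tolerance
assert max_failures(7) == 3
```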
The existing config has a subtle issue: lambda in the code is the mean inter-arrival time in seconds, not a rate in jobs/hr. The conversion is lambda = 3600 / jobs_per_hr, so 4 jobs/hr = 900 seconds between arrivals.
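A one-line helper makes the conversion hard to get wrong (hypothetical name, not from the Gavel codebase):

```python
def inter_arrival_seconds(jobs_per_hr):
    """Convert an arrival rate in jobs/hr to the simulator's lam
    (mean seconds between job arrivals)."""
    return 3600.0 / jobs_per_hr

assert inter_arrival_seconds(4) == 900.0     # 4 jobs/hr -> 900 s between arrivals
assert inter_arrival_seconds(1.5) == 2400.0
```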
I spotted an issue: time_per_iteration=lam is wrong. time_per_iteration is the scheduling round duration (paper uses 360 seconds = 6 minutes), not the inter-arrival time. The lam is passed separately to simulate().
The log files are massive (100+ MB each) due to verbose DEBUG logging. The simulations are actively processing jobs (currently around job 2000+). This verbose output will make cProfile results hard to read but won’t affect the timing data.
- Raft trades scalability for simplicity - it’s not meant for 100+ node clusters
- Real large-scale systems use many small Raft groups (sharding), not one large group
- The randomized timeout is “good enough” for small clusters, not a perfect solution
- The prevLogIndex/prevLogTerm check is like a hash chain - if entry N matches, all entries before N must also match
- Leaders never delete their own entries - they only append. Followers conform to the leader.
- Uncommitted entries can be lost - this is by design. Only committed entries are durable.
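The hash-chain analogy can be made concrete as the follower-side consistency check from Raft's AppendEntries RPC. This is a simplified sketch; a real implementation also handles snapshots and conflict back-off:

```python
def append_entries_ok(log, prev_log_index, prev_log_term):
    """Accept the leader's entries only if our entry at prev_log_index has
    term prev_log_term. By the Log Matching property, a match there implies
    every earlier entry matches the leader's log too.

    log: list of (term, command) pairs; Raft indexes entries from 1.
    """
    if prev_log_index == 0:
        return True                       # appending from the very start of the log
    if prev_log_index > len(log):
        return False                      # follower is missing entries: reject
    term, _command = log[prev_log_index - 1]
    return term == prev_log_term

follower_log = [(1, "x=1"), (1, "y=2"), (2, "x=3")]
assert append_entries_ok(follower_log, 3, 2)       # chain intact: accept
assert not append_entries_ok(follower_log, 3, 1)   # term mismatch: reject
assert not append_entries_ok(follower_log, 5, 2)   # gap in the log: reject
```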
At 1.5 jobs/hr (Fig 10), LAS shows 17.65 hr vs FIFO’s 23.66 hr - a 1.34x improvement. This aligns with the paper’s claims that Gavel outperforms FIFO at higher loads.
Key observations:
- At low load (0.5-1.0 jobs/hr), all policies perform similarly - the cluster isn’t congested
- At high load (2.5 jobs/hr), FIFO’s JCT explodes to 210 hours while LAS stays at 31 hours - this matches the paper’s claim of Gavel preventing starvation
- The utilization at 2.5 jobs/hr is ~96% for both policies - the difference is in how fairly time is allocated
- The election restriction is what makes Raft safe - candidates must have all committed entries to win
- Quorum intersection: If entry E is committed (on majority M1), and candidate C needs majority M2, then M1 ∩ M2 ≠ ∅. Someone in M2 has E and will deny C if C lacks E.
- This is why Raft needs a majority (not just “some” servers) - it guarantees overlap
- Just being on a majority isn’t enough - the entry must be “locked in” by a current-term commit
- This is subtle but prevents a real data loss scenario
- Practical implication: new leaders often write a no-op entry immediately to commit pending entries
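The quorum intersection argument can be checked exhaustively for a small cluster:

```python
from itertools import combinations

def majorities(n):
    """All majority-sized subsets of an n-server cluster."""
    need = n // 2 + 1
    return [set(c) for c in combinations(range(n), need)]

# Every commit quorum M1 overlaps every vote quorum M2, so at least one
# voter holds entry E and will deny a candidate whose log lacks it.
for m1 in majorities(5):
    for m2 in majorities(5):
        assert m1 & m2, "two majorities of the same cluster must intersect"
```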
- Joint consensus is the key: require agreement from both old AND new configs during transition
- This ensures there’s never a moment where two independent majorities can form
- Membership changes are treated as log entries - they use the same replication mechanism as data
The bug occurred because the _job_completion_times dictionary can contain None values for jobs that were marked as completed but whose completion time was never recorded. sum() fails as soon as it tries to add None to a float. Always guard against None values in list comprehensions when aggregating data.
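The fix, in the shape it took (hypothetical names; the real aggregation lives inside the simulator):

```python
def average_jct(job_completion_times):
    """Mean job completion time, skipping jobs whose completion time was
    never recorded. A bare sum(job_completion_times.values()) raises
    TypeError the moment it adds None to a float."""
    times = [t for t in job_completion_times.values() if t is not None]
    return sum(times) / len(times) if times else 0.0

jcts = {"job0": 12.5, "job1": None, "job2": 7.5}
assert average_jct(jcts) == 10.0    # the None entry is skipped, not summed
```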
Saturation detection vs JCT threshold:
- JCT threshold was reactive - waited until jobs took too long
- Saturation detection is proactive - detects when no progress is being made
- 50 rounds × 6 min/round = 5 hours simulated time with no progress triggers exit
- This catches experiments stuck at 998/1000 much faster
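The detection logic above can be sketched as a small counter. This is a hypothetical reconstruction of the idea, not the code as committed; the class name and API are illustrative:

```python
class SaturationDetector:
    """Flag an experiment as stuck when no job completes for `patience`
    consecutive rounds. With 50 rounds at a 6-minute round duration,
    that is 5 hours of simulated time with zero progress."""

    def __init__(self, patience=50):
        self.patience = patience
        self.rounds_without_progress = 0
        self.last_completed = 0

    def update(self, num_completed):
        """Call once per scheduling round; returns True when saturated."""
        if num_completed > self.last_completed:
            self.last_completed = num_completed
            self.rounds_without_progress = 0   # progress made: reset
        else:
            self.rounds_without_progress += 1
        return self.rounds_without_progress >= self.patience

det = SaturationDetector(patience=3)
assert not det.update(998)   # first sighting counts as progress
assert not det.update(998)   # stagnant round 1
assert not det.update(998)   # stagnant round 2
assert det.update(998)       # stagnant round 3: saturated, exit early
```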
- Raft’s value is in clarity, not power - a rare and valuable contribution
- The paper succeeds by being complete enough to implement without filling in gaps
- Real-world validation (etcd, Consul, etc.) is ultimately the strongest evidence