Bhavana AI

Dev Log: January 27, 2026

courses

Spent the day deep in Gavel simulator performance optimization and saturation detection. Started by investigating why the saturation detector wasn’t triggering at 7 jobs/hr (it was correct: jobs were completing, just slowly). Profiled the simulator and discovered 73% of time was spent in cvxpy’s convex optimization solver, with 97% of that in problem construction rather than actual solving. Attempted to fix this with DPP-compliant parameter caching, but hit issues with low cache hit rates (31-41%) and DPP violations from multiplying two parameters. Tried alternative LP solvers (all slower than baseline). Pivoted to a convergence/saturation detection approach: analyzed JCT patterns across load levels, found that CV stays high (~190%) for all loads due to job heterogeneity, so convergence detection had to use JCT growth rate instead. Built a completion-rate-based saturation detector (~50x faster than JCT-based detection), gated behind a utilization threshold (tuned from 95% to 99%) to prevent false exits. Added a 5-minute minimum runtime for better JCT estimates at saturated loads. Also set up pilot experiments (18 configs) for Figures 9, 10, 11 replication, which validated that Gavel’s heterogeneity-aware scheduling beats the baseline at most load levels.

The saturation detector is working correctly. Here’s why it didn’t trigger:

Utilization: 99.5% (above 90% threshold ✓)

But jobs ARE still completing - just slowly. The detector tracks total_completed (all jobs), not just the measurement window. At high load:

  • Jobs 0-3999 already completed
  • Jobs 4000-4982 completed (measurement window progress)
  • Jobs 4983+ still running but slowly finishing

This is correct behavior - at 7 jobs/hr the system is near saturation but not stuck. JCT is just very high (the “knee” of the curve the paper shows).

Why track total instead of measurement window?

The measurement window (4000-5000) is for metrics only - to measure steady-state JCT avoiding warm-up/cool-down effects.

But saturation detection needs to know if the system is stuck, not just one window. During warm-up (jobs 0-3999 completing), measurement window shows 0 completions even though the system is healthy.

The tradeoff: At high load when only measurement window jobs remain, both metrics converge - and both show slow progress, which is correct (high JCT, not stuck).
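
The total-vs-window tradeoff can be sketched as a stuck check over the global completion counter (the function name and the 50-round stall horizon are hypothetical, not the scheduler's actual values):

```python
def system_is_stuck(total_completed_history, stall_rounds=50):
    """Return True if no job (from ANY cohort, not just the measurement
    window) has completed in the last `stall_rounds` scheduling rounds.
    Tracking total completions avoids false positives during warm-up,
    when the measurement window (e.g. jobs 4000-5000) legitimately
    shows zero completions even though the system is healthy."""
    if len(total_completed_history) < stall_rounds:
        return False  # not enough history yet
    recent = total_completed_history[-stall_rounds:]
    return recent[-1] == recent[0]  # counter never advanced: stuck

# Warm-up: the window shows 0 completions, but totals keep rising.
history = list(range(0, 200, 2))  # totals grow every round: healthy
flat = [100] * 60                 # totals frozen for 60 rounds: stuck
```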

THE BOTTLENECK IS CVXPY OPTIMIZATION

73% of simulation time is spent in cvxpy’s convex optimization solver. Each call takes ~0.7 seconds, and it’s called 8,357 times (once per round with active jobs).

Even worse - most of that time isn’t solving, it’s problem construction:

  • get_problem_data(): 5883s (97% of solve time)
  • Actual ECOS solve: ~200s (3% of solve time)
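
A breakdown like this can be reproduced with Python's cProfile; the two functions below are stand-ins for cvxpy's construction and solve phases, not Gavel code:

```python
import cProfile
import io
import pstats

def get_problem_data():
    # Stand-in for cvxpy's problem construction (the ~97% share).
    return sum(i * i for i in range(50_000))

def solve():
    # Stand-in for the actual ECOS solve (the ~3% share).
    return sum(range(1_000))

def scheduling_round():
    get_problem_data()
    solve()

profiler = cProfile.Profile()
profiler.enable()
for _ in range(20):
    scheduling_round()
profiler.disable()

# Sorting by cumulative time attributes each round's cost to the
# construction phase, exactly how the 97%/3% split above was read off.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(10)
report = stream.getvalue()
```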

ROOT CAUSE FOUND: cvxpy Problem Reconstruction

Every scheduling round (8,357 times), the code does:

x = cp.Variable(throughputs.shape)  # New variable
objective = cp.Minimize(cp.maximum(*expected_time_fractions))  # New objective
constraints = self.get_base_constraints(x, scale_factors_array)  # New constraints
cvxprob = cp.Problem(objective, constraints)  # NEW PROBLEM
result = cvxprob.solve(solver=self._solver)  # Solve

97% of solve time is problem construction, not solving!

cvxpy must:

  1. Parse problem structure (DCP checking)
  2. Canonicalize to conic form
  3. Build coefficient matrices
  4. Call solver (only 3% of time)

The authors made the right choice for their context:

  • Research code prioritizes correctness and clarity
  • Simulation performance wasn’t the bottleneck for their experiments
  • They likely used shorter measurement windows or fewer seeds

For your use case (extensive experiments), optimization is worth it:

  • You’re running many configurations
  • Iteration speed matters for research productivity
  • The complexity cost is acceptable

Summary:

  • 4 jobs/hr = 1 new ML job every 15 minutes (simulated time)
  • Measurement window = Which jobs we measure to avoid warm-up effects
  • Bottleneck = cvxpy rebuilds the optimization problem from scratch every 6-minute scheduling round
  • High load slow = More active jobs → larger optimization → more rounds until jobs complete

Analogy:

Current approach = Recompiling a program every time you run it with different input

Cached approach = Compile once, run many times with different input

The mathematical “program” (problem structure) doesn’t change between rounds - only the “input data” (throughputs, times) changes.

This benchmark establishes our baseline. After we implement cvxpy Parameter caching, we’ll run the same benchmark again and compare:

  • Runtime should drop from ~15 min to ~2-3 min (optimization win)
  • JCT should stay the same (correctness verification)

Your SSH config uses ControlPersist 4h which keeps connections alive for 4 hours. After that, Stanford’s Duo authentication is required again. The ControlMaster auto setting means subsequent SSH/rsync commands within that window reuse the authenticated connection without re-prompting.
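
The corresponding ~/.ssh/config stanza looks roughly like this (host name, user, and socket path are placeholders; the ControlPath directory must exist before the first connection):

```
Host cluster
    HostName login.example.edu
    User myuser
    ControlMaster auto
    ControlPath ~/.ssh/sockets/%r@%h-%p
    ControlPersist 4h
```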

DPP (Disciplined Parametrized Programming) is a cvxpy requirement for efficient parameter caching:

  • Parameters can only appear in the objective/constraints in ways that don’t change problem structure
  • Multiplying two parameters (param1 * param2) violates DPP
  • To fix: pre-compute throughput * scale_factor into a single parameter

Why the improvement is modest: The cached approach saves problem construction time (~97% of cvxpy time in profiling), but the 2.5 jobs/hr workload only has ~340-380 scheduling rounds. At this low load, the one-time construction cost is small compared to total runtime. The optimization should show bigger gains at high loads with many more scheduling rounds.

Key findings:

  1. DPP compliance is essential - Multiplying two cvxpy Parameters violates DPP and prevents efficient caching
  2. Problem size matters - Using max_jobs=100 instead of 300 reduced overhead significantly
  3. Benefits scale with load - More scheduling rounds = more cache reuse = bigger speedup
  4. Fallbacks are acceptable - Even with 60-70% fallback rate, the optimization provides real benefits because cached solves are much faster

Convergence Detection Strategy:

  • Uses coefficient of variation (CV = std/mean) to detect when JCT has stabilized
  • Requires min_convergence_samples (30) jobs before checking
  • For larger samples (60+), checks both overall and recent window CV
  • For smaller samples, requires tighter threshold (CV < 7.5%)
  • Allows early exit when JCT is statistically stable, avoiding waiting for full measurement window
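
A sketch of that logic (the function name and the 10% loose threshold are assumptions; the text only pins down the 30-sample minimum and the 7.5% tight threshold):

```python
import statistics

def jct_converged(jcts, min_samples=30, recent_window=30,
                  loose_cv=0.10, tight_cv=0.075):
    """Decide convergence from the coefficient of variation (std/mean).
    - fewer than min_samples jobs: never converged
    - 60+ samples: require BOTH overall and recent-window CV below
      the loose threshold
    - 30-59 samples: require the tighter overall CV (< 7.5%)
    """
    if len(jcts) < min_samples:
        return False
    cv = statistics.stdev(jcts) / statistics.mean(jcts)
    if len(jcts) >= 60:
        recent = jcts[-recent_window:]
        recent_cv = statistics.stdev(recent) / statistics.mean(recent)
        return cv < loose_cv and recent_cv < loose_cv
    return cv < tight_cv
```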

Key Finding: All three optimization approaches were slower than the baseline cvxpy+ECOS implementation. This confirms what the session summary stated - the LP solver itself is not the bottleneck at this load level.

At 4 jobs/hr with a 50-150 job measurement window, the simulation completes in ~2.5 minutes. The problem is at higher loads (7 jobs/hr) where:

  1. More active jobs = larger optimization problems
  2. Jobs take longer to complete = more scheduling rounds
  3. Total runtime grows dramatically

7 jobs/hr is a saturated workload - JCT is NOT converging, it’s diverging:

  • At 50 jobs completed: avg JCT = 2.35h
  • At 1000 jobs completed: avg JCT = 13.66h
  • At 4841 jobs completed: avg JCT = 42.60h

The coefficient of variation (CV) is increasing (74% → 190%), not decreasing. This means jobs are arriving faster than they can complete, causing the queue to grow unboundedly.

Key finding from the data analysis: CV stays high (~190%) for ALL loads due to job heterogeneity, so we cannot use CV for convergence detection. Instead, the key signal is the rate of JCT growth:

  • At undersaturated loads (0.5-4 jobs/hr): JCT stabilizes around 15-17 hours
  • At saturated loads (7 jobs/hr): JCT keeps growing unboundedly (42.6h and rising)
  • We therefore track whether JCT is stable (ratio of adjacent windows < 1.10)

The key distinction between convergence and saturation:

  • Converged: JCT fluctuates around a stable mean (ratios ~1.0)
  • Saturated: JCT keeps growing (ratios consistently > 1.2)
  • At 7 jobs/hr, ratios were 2.08, 1.32, 1.27, 1.92 - clearly saturated
  • At 4 jobs/hr, ratios eventually stabilize to ~1.00 - converged

The detection is “lazy” - it waits for enough jobs from the measurement window before checking. This is intentional: we need 3 rolling windows to detect a pattern (stable vs growing). For a 1000-job window with 100-job windows, that’s 300 jobs minimum.
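
The rolling-window ratio test can be sketched as follows (hypothetical function; thresholds from the text, with the 300-job minimum corresponding to three full 100-job windows):

```python
def classify_jct_trend(jcts, window=100, saturation_ratio=1.2,
                       stable_ratio=1.10, consecutive=3):
    """Compare mean JCT of adjacent rolling windows.
    Lazy: needs at least 3 full windows (300 jobs for window=100).
    Returns 'saturated' if the recent adjacent-window ratios all
    exceed saturation_ratio, 'converged' if they all stay under
    stable_ratio, else 'undecided'."""
    if len(jcts) < 3 * window:
        return "undecided"  # wait for enough completed window jobs
    means = [sum(jcts[i:i + window]) / window
             for i in range(0, len(jcts) - window + 1, window)]
    ratios = [later / earlier for earlier, later in zip(means, means[1:])]
    recent = ratios[-min(consecutive, len(ratios)):]
    if all(r > saturation_ratio for r in recent):
        return "saturated"   # JCT keeps growing
    if all(r < stable_ratio for r in recent):
        return "converged"   # JCT oscillates around a stable mean
    return "undecided"
```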

The right plot in Figure 2 shows exactly what the convergence/saturation detection measures:

  • Converged: Ratios mostly between 0.8-1.1 (oscillating around 1.0)
  • Saturated: More ratios consistently above 1.2

The 7 jobs/hr case shows the challenge: ratios spike high early but then oscillate. Our detection requires 3 consecutive windows > 1.2 to trigger, which may need tuning.

The completion rate detection is ~50x faster at detecting saturation than waiting for JCT patterns. It’s ideal for exploratory experiments where you want to quickly identify if a load is too high.

Limitation: Completion rate detection only works for full simulations (jobs 0-N), not measurement window experiments (jobs 4000-5000), because the rate calculation needs the full timeline.

The utilization plot explains everything:

  • At 7 jobs/hr, utilization is pegged at ~95-100% - the cluster has no spare capacity
  • When utilization hits 100%, jobs queue up faster than they complete, causing JCT to grow unboundedly
  • The 4 jobs/hr case maintains ~60% utilization, leaving enough headroom for burst handling

This confirms the completion rate detection approach: when completion rate << arrival rate, utilization is at 100% and the system is saturated.

The utilization threshold ensures early exit only happens when the cluster is truly saturated. At 0.5 jobs/hr (~15% utilization) or 4 jobs/hr (~60% utilization), the completion rate check won’t trigger even if the ratio temporarily dips, preventing false early exits on normal runs.

Changes made:

  1. Added _get_current_utilization() helper method that calculates cluster utilization during the simulation loop using the same formula as get_cluster_utilization()
  2. Added utilization_threshold=0.95 parameter to simulate()
  3. Modified saturation detection to only check completion rate when current_utilization >= utilization_threshold

Why this works:

  • At 0.5 jobs/hr (~15% util): completion rate check never triggers
  • At 4 jobs/hr (~60% util): completion rate check never triggers
  • At 7 jobs/hr (~95-100% util): triggers early exit when rate ratio < 0.5
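
The gate reduces to a few lines (hypothetical function name; thresholds as described above):

```python
def should_exit_early(utilization, completion_rate, arrival_rate,
                      utilization_threshold=0.95, rate_ratio=0.5):
    """Early-exit gate for saturated runs: only consider the
    completion-rate test once the cluster is busy enough, so rate
    dips at low load (~15% or ~60% utilization) can never trigger
    a false exit."""
    if utilization < utilization_threshold:
        return False  # cluster has headroom: not saturated
    return completion_rate < rate_ratio * arrival_rate

# 0.5 jobs/hr at ~15% util: gate stays closed regardless of rate.
# 7 jobs/hr at ~96% util completing 0.67 jobs/hr: exits early.
```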

The tests verify deterministic output for a 36:36:36 cluster with 50 jobs:

  • Agnostic JCT = 73063.45s (expected)
  • Gavel JCT = 57171.41s (expected)

The utilization threshold change doesn’t affect these tests because:

  1. Tests use lam=0.0 (all jobs added at start, not Poisson arrival)
  2. The saturation detection only triggers when lam > 0

The utilization threshold (95%) correctly distinguishes:

  • 4 jobs/hr (94.86% util) - Just below threshold, completes normally
  • 7 jobs/hr (96.07% util) - Above threshold, triggers early exit

The 7 jobs/hr experiment ran in 1.74 seconds vs what would have been hours without early exit. The 4 jobs/hr experiment completed fully since utilization stayed below 95%.

The 99% threshold gives a much more realistic JCT estimate:

  • At 95% threshold: JCT = 3.58h (exited too early)
  • At 99% threshold: JCT = 10.27h (better estimate)

The simulation ran 141 simulated hours before exiting, completing 95 jobs. The completion rate had dropped to just 10% of the arrival rate (0.67 vs 7.0 jobs/hr), confirming heavy saturation.

The 5-minute minimum runtime allowed the simulation to reach:

  • 868 simulated hours (vs 141h without min_runtime)
  • Completion rate: 0.11 jobs/hr (only 2% of arrival rate)
  • More realistic JCT: 15.87 hours

The JCT now closely matches the 4 jobs/hr result (16.67h), confirming the measurement window captures similar job behavior regardless of load level.

The viewer lets you correlate different metrics:

  • Completion Rate vs Hours: See how completion rate decays over time in saturated systems
  • Active Jobs vs Hours: Visualize job queue buildup
  • Jobs Completed vs Rounds: Check scheduling efficiency

What we committed:

  • scheduler.py: Saturation detection via _saturated, _partial_jct properties and _get_current_utilization() helper. The simulation loop now checks for saturation when utilization > 99% and completion rate < 0.1 jobs/hr.
  • Design doc: Documents the exploration journey - failed cvxpy caching (31-41% hit rate), failed alternative solvers (all slower), and the iterative saturation detection tuning.
  • Pilot experiments: 18 configs for Figures 9, 10, 11 replication.

The pilot validates Gavel’s heterogeneity-aware scheduling beats the baseline at most load levels. The _perf policies leverage GPU throughput differences to minimize JCT. At extreme load (saturated), the advantage shrinks as queueing dominates.


career

Explored a Project Manager role at Helion Energy, a fusion energy company. The role is unusual: a PM for experimental science rather than typical software TPM work, requiring an advanced science/engineering degree plus technical industry experience. Reviewed PhD research background in combustion science at UW-Madison and tailored a resume highlighting the experimental research lifecycle, cross-disciplinary collaboration, and technical program management experience.

This Helion role is unique - it’s a Project Manager for Experimental Science, not typical software TPM work. Key differentiators:

  1. Science + PM hybrid: Requires advanced science/engineering degree AND technical industry experience
  2. Research environment: Experimental development with discovery-driven timelines
  3. Cross-disciplinary: Plasma physics, computational work, diagnostics - similar to your cross-functional Azure work

Your 2016 CV is a goldmine for this role! Your PhD research in combustion/experimental science at UW-Madison directly addresses Helion’s requirements:

  1. Advanced degree in science/engineering - PhD Mechanical Engineering (thesis work matches “experimental development environments”)
  2. Technical industry experience - Research + Applied Materials design engineering
  3. Experimental research lifecycle - Ignition delay measurements, uncertainty quantification, validation

Your PhD is a key differentiator for this role. Most TPM/PM candidates won’t have experimental research backgrounds. Your combustion research (high-speed diagnostics, uncertainty quantification, model validation) directly maps to Helion’s plasma physics experimental environment.

Why this resume works for Helion:

  1. Lead with what they value - The summary immediately establishes PhD + experimental science + PM hybrid, which is rare
  2. Language alignment - JD phrases like “cross-disciplinary initiatives,” “proactive planning,” “dashboards and reports” appear verbatim
  3. PhD prominence - Education section includes thesis title and research focus, reinforcing scientific credibility