Bhavana AI

AI/ML insights

Dev Log: February 8, 2026

Personal Finance

A deep debugging session uncovered why several paycheck deductions (401k Roth, HSA, DCFSA) were silently returning null early in the year before W-2s are available. The root cause was a “W-2 first, paystub never” fallback pattern that only covered a subset of deduction types. The fix ensures every deduction type has a paystub lookup path, not just the ones that are paystub-only. From there, the work expanded into contribution limit enforcement, making the budget projection pipeline fully data-driven from the per-year tax config, and fixing several YNAB integration bugs around category group assumptions.

The root cause was that the deductions code had a W-2 first, paystub never pattern for 401k Roth, HSA, and DCFSA. These items appear on both documents (W-2 box 12 codes and paystub line items), but the fallback path only handled After-Tax 401k, ESPP, and Insurance. Early in the year, with no W-2 yet, those three deductions silently stayed null. The fix adds paystub lookup for every deduction type, not just the ones that are paystub-only.
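The fixed fallback can be sketched as follows. This is a minimal Python illustration of the pattern (the real code lives in the TypeScript pipeline); the function name, box 12 mapping, and data shapes are all hypothetical.

```python
# Hypothetical mapping from deduction type to W-2 box 12 code.
W2_BOX12_CODES = {"401k_roth": "AA", "hsa": "W"}

def deduction_ytd(dtype, w2=None, paystub=None):
    """Return the year-to-date amount for a deduction type.

    Prefer the W-2 when it exists; otherwise fall back to the paystub
    line item for *every* deduction type, so early-year lookups
    (before W-2s are available) no longer silently return None.
    """
    if w2 is not None:
        code = W2_BOX12_CODES.get(dtype)
        if code is not None and code in w2:
            return w2[code]
    if paystub is not None:
        return paystub.get(dtype)  # fallback now covers all types
    return None
```

The old bug corresponds to the fallback branch only existing for a subset of `dtype` values.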

The key architectural choice: limits live in the same per-year tax config JSON (tax_2026.json) that already stores brackets and SS wage base. When you run pf tax fetch at the start of a new year, you’d add the contribution limits at the same time. Every downstream script — budget-flow, budget-webapp, the CLI — just reads from that single config file. No hardcoded limits anywhere in the logic; the budget-flow.ts capping loop is fully data-driven from the config.
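A sketch of what tax_2026.json might look like with limits colocated next to brackets and the SS wage base. Only the two 401k limits ($24,500 elective deferral, $70,000 total) come from this post; the field names, bracket, and wage base figures are illustrative placeholders.

```json
{
  "year": 2026,
  "brackets": [
    { "rate": 0.10, "up_to": 12400 }
  ],
  "ss_wage_base": 184500,
  "contribution_limits": {
    "elective_deferral_401k": 24500,
    "total_annual_401k": 70000
  }
}
```

Every downstream consumer reads `contribution_limits` from this one file, so a new year only requires editing the config, never the code.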

The provenance chain now shows the cap too: Actual -> x 13.0 -> cap: 401k Roth limit (IRS 2026) -> $24,500. So when you expand a capped row in the dashboard, you see exactly why the projected value differs from a straight-line projection.

The 401k total annual limit ($70,000 for 2026) is a per-person aggregate of three sources: employee elective deferral (Roth or Traditional), employer match, and after-tax contributions. The after-tax space is whatever’s left after the other two. By pulling employer_match from the budget plan earner data and total_annual_401k + elective_deferral_401k from the tax config, the cap is computed dynamically — no hardcoded dollar amounts in the projection logic.
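The after-tax space computation is simple arithmetic over the three sources. A minimal sketch, with an assumed $10,000 employer match (the function name is hypothetical; the two limits are the 2026 values from the tax config):

```python
def after_tax_401k_space(total_annual, elective_deferral, employer_match):
    """Remaining after-tax 401k room: the per-person aggregate limit
    minus employee elective deferral and employer match."""
    return total_annual - elective_deferral - employer_match

# 2026 limits from the tax config, assumed $10,000 match:
space = after_tax_401k_space(70_000, 24_500, 10_000)  # -> 35500
```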

The 529 duplication happened because YNAB categorizes 529 contributions as transactions in a “Savings” category group (just like groceries or rent), while the budget plan models them separately as “Savings Goals” with target amounts and progress tracking. By keeping savings exclusively in SAVINGS_GOALS (where it has plan targets like $2000/mo), the dashboard shows the richer data. The Sankey diagram now draws a “Savings Goals” node from Available instead of the generic “Savings” spending group.

The bug here is a classic “hardcoded group name” problem. YNAB’s category grouping is user-configurable, so “529 Plan” could be in any group. The fix searches all groups dynamically rather than assuming a fixed group name, making it resilient to however the user organizes their YNAB budget.

The fix searches all YNAB category groups dynamically using Object.entries() instead of hardcoding a group name. This is more robust because YNAB users can organize categories into any groups they want — the “529 Plan” category happened to be in the “Other” group, not “Savings” as the code assumed. The provenance detail now also reflects the actual group name found at runtime.
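The dynamic search reads naturally as a loop over all groups. The real fix uses Object.entries() in TypeScript; this Python sketch mirrors the structure with hypothetical data shapes.

```python
def find_category(groups, name):
    """Search every YNAB category group for a category by name,
    returning (group_name, category) instead of assuming a fixed
    group; the actual group name found feeds the provenance detail."""
    for group_name, categories in groups.items():
        for cat in categories:
            if cat["name"] == name:
                return group_name, cat
    return None, None

# The bug: "529 Plan" happened to live in "Other", not "Savings".
groups = {
    "Savings": [{"name": "Emergency Fund", "activity": -500}],
    "Other": [{"name": "529 Plan", "activity": -2000}],
}
```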

The data pipeline already collects “Other” group transactions from YNAB (in budget-webapp.ts). The gap is only in budget-flow.ts where spending groups are hardcoded. The fix needs to also handle that “Other” contains both spending categories (Capital Expenses, Child Adoption, etc.) and savings categories (529 Plan, Emergency Fund) that are already tracked separately in SAVINGS_GOALS.

The key design decision here was making the “Other” group’s presence conditional — it only shows up if there’s actual data, prev data, or a plan. This prevents an empty “Other” row from cluttering the table when the user hasn’t assigned any transactions to capital expenses yet. But once they recategorize transactions into “Capital Expenses”, it will appear automatically. The savings categories (529 Plan, Emergency Fund) are excluded via a dynamic set built from the budget plan’s savings_goals config, keeping the deduplication logic data-driven.

The fix was a one-line change: adding Object.keys(prevCats) to the allCats set union. Previously, allCats was built from current actuals + plan only. By also including previous year’s categories, any category that had spending last year will appear this year as a row (with actual=0 if nothing’s been spent yet). This applies universally to all spending groups, not just “Other” — for example, “Refundable” in P2 Expenses also now shows up with prev=$6,650 despite having $0 actual this year.
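In Python terms, the one-line fix is just a third term in the set union. Category names and amounts here are illustrative (the $6,650 “Refundable” figure is from the example above):

```python
cur_cats = {"Groceries": 820.0}
prev_cats = {"Groceries": 9100.0, "Refundable": 6650.0}
plan_cats = {"Groceries": 800.0}

# Old: set(cur_cats) | set(plan_cats). New: also include prev-year keys,
# so a category spent last year shows up this year with actual=0.
all_cats = set(cur_cats) | set(prev_cats) | set(plan_cats)
rows = {c: cur_cats.get(c, 0.0) for c in all_cats}
```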

The sort change from Math.abs(b.amount) - Math.abs(a.amount) to a.amount - b.amount is subtle but important. YNAB represents outflows as negative numbers. Sorting ascending (most negative first) naturally puts the biggest spending at the top. A large refund (+$500) sorts to the bottom, only making the top-5 list if there are fewer than five outflows. This is more useful than the absolute-value sort, which would mix a $500 refund in among $500 purchases.
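The two orderings side by side, on a hypothetical transaction list (amounts in dollars rather than YNAB milliunits, for readability):

```python
txns = [-500, -120, 500, -35]  # outflows negative; +500 is a refund

# Old: absolute-value sort mixes the refund in with equal-sized purchases.
by_abs = sorted(txns, key=abs, reverse=True)

# New: ascending sort puts the biggest spending first, refund last.
ascending = sorted(txns)
```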


Courses

Extensive analysis of Gavel cluster simulator experiments at different job arrival rates, investigating straggler jobs, OOM issues, and the gap between LP optimization theory and real-world cluster scheduling. The work evolved into designing and implementing a migration penalty mechanism that accounts for per-job switching costs, making the scheduler’s allocation decisions more realistic and stable.

The key takeaway: the 60 jph experiments are fully usable for analysis (995-997/1000 window jobs). The 180 jph data is partially usable (90-98% complete) for relative policy comparisons. The 360 jph data shows the cluster is saturated — FGD’s extra scheduling overhead becomes visible as it completes the fewest rounds per wall-clock hour (~480 vs ~1,130 for strided placement). This “FGD is slower under saturation” finding is itself a meaningful result.

The steady-state methodology works like this: first 4000 jobs are “warm-up” to fill the cluster and reach steady state. Then jobs 4000-4999 are the measurement window — their JCT is what gets reported. The simulation continues until all 1000 window jobs complete (or it times out). This same window_start=4000, window_end=5000 is used across all configs — both Philly replication and Alibaba FGD experiments.
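The window filter itself is trivial; what matters is that the same bounds are applied to every config. A sketch, assuming jobs are keyed by submission ID with a JCT value (hypothetical data shape):

```python
WINDOW_START, WINDOW_END = 4000, 5000  # identical across all configs

def window_jcts(jobs):
    """Keep JCTs only for measurement-window jobs; the first 4000
    submissions are warm-up and excluded from reported results."""
    return {jid: jct for jid, jct in jobs.items()
            if WINDOW_START <= jid < WINDOW_END}
```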

The Gavel simulator uses a heavy-tailed job duration distribution (drawn from the Philly/Alibaba traces). Among 1000 measurement-window jobs, most complete quickly, but a few have extremely long training durations. After ~500 simulated hours and ~32,000 total completions, these 3-5 stragglers are still grinding through their training steps. The cluster has plenty of spare capacity (only 40% utilized) — it’s not a resource bottleneck, it’s just that these specific jobs are enormous.

This is exactly the “straggler problem” in distributed systems: overall completion time is dominated by the longest-running tasks in the tail of the distribution.

The completion order is not the same as the submission order. A job submitted at ID 4050 might need 200 hours of GPU training, while a job submitted at ID 20000 might only need 2 hours. The later job finishes first. Those 3 straggler window jobs just happen to have very long durations — they were submitted early but are still grinding through training while thousands of faster jobs submitted after them have come and gone.

None of the 4 early exit conditions fired for the 60 jph experiments. The simulation was progressing normally — just slowly on those last 3 straggler jobs — when the OOM killer terminated the process. The exit conditions are designed for detecting hopeless situations (saturated cluster, diverging JCT), not for “almost done but a few long jobs remain.” That’s a gap: there’s no wall-clock timeout exit built into the Python code itself, so the only wall-clock protection is the external SLURM limit.

This is a log-uniform (heavy-tailed) duration distribution. The ratio between the longest and shortest possible jobs is ~316x (10^4 / 10^1.5). In 1000 window jobs, you’d expect ~200 in the long bucket, and of those, a few will land near the top of 10^uniform(3, 4). The probability of at least one job exceeding 500,000 seconds (~139 hours) is roughly 1 - (1 - 0.2 * 0.3)^1000, which is essentially 100%. Those tail jobs are the ones that won’t finish — they literally need more simulated time than the experiment managed to accumulate before being OOM-killed.
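The back-of-envelope estimate above checks out numerically. Per-job tail probability is 0.2 (long bucket) times 0.3 (near the top of the bucket), and with 1000 independent draws:

```python
# Probability that at least one of 1000 window jobs is a tail job.
p_tail = 0.2 * 0.3                       # per-job tail probability
p_at_least_one = 1 - (1 - p_tail) ** 1000
# (0.94)^1000 ~ 1e-27, so p_at_least_one is 1 to machine precision.
```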

The most likely explanation is Python memory fragmentation compounded by FarmShare’s SLURM memory limit. Each round creates and destroys thousands of cvxpy objects. Over 5,000 rounds, Python’s allocator holds onto fragmented memory pools that grow monotonically even though the live set is roughly constant. On FarmShare, the default SLURM memory allocation is likely modest (4-8 GB). The process gradually creeps up until it hits the cgroup memory limit and the OOM killer fires. The Philly experiments never hit this because their per-round allocation problem was ~40x smaller.

The FGD placement is consolidating jobs onto G2 (the largest pool at 4,392 GPUs), leaving 1,808 GPUs across 5 other types completely idle after the initial flush. Despite using fewer GPUs overall, it achieves 5% better JCT. This suggests that at 60 jph, the cluster is over-provisioned and compact placement on the best GPU type beats spreading across heterogeneous types. The strided baseline wastes time running jobs on suboptimal GPU types where throughput is lower.

The scientific justification is clean: the Gavel paper’s 6-minute round was chosen for a 108-GPU deployment. The perturbation rate at that scale (0.2-0.8%) defines the “acceptable” operating regime. At Alibaba scale, matching this perturbation rate naturally leads to a 10-minute round. This isn’t an arbitrary speedup hack — it’s maintaining the same allocation responsiveness per unit of system change.

The stronger argument: at 0.65% perturbation, we’re solving a 12,000-variable LP to adjust assignments for ~12 jobs out of 1,847. The LP result will be nearly identical to the previous round’s. We’re spending computational effort for negligible allocation improvement.

Why real schedulers don’t do global re-optimization frequently: The LP formulation assumes instantaneous preemption and reallocation — in simulation, switching a job from GPU type A to GPU type B is free. In reality, it requires checkpointing, container teardown, container launch, and model reload. This “friction” means the LP’s optimal solution is only achievable if you give the system time to settle between re-solves. Frequent re-optimization can actually decrease throughput because jobs spend more time migrating than computing. This is why Kubernetes and Borg are event-driven (schedule when something changes) rather than periodic (re-solve every N minutes).

This is a classic gap between optimization theory and systems reality. The LP assumes a “fluid” model where allocation fractions map continuously to throughput. Real clusters are discrete — jobs run on specific GPUs, and moving them has real costs. The Gavel paper acknowledges this implicitly through lease extensions (Section 5), but doesn’t model it in the optimization. AlloX (allox.py:15,121-132) is the only policy that tracks _prev_allocation and prefers stability, but it uses a completely different algorithm (Hungarian method, not LP).

Option B’s auto-scaling is elegant because it unifies two concerns. The migration penalty migration_time / time_per_iteration acts as an implicit “switching tax” that the LP pays per job per round. At short rounds, the tax is high (discouraging switches) — which is correct because you’d waste most of the round migrating. At long rounds, the tax is low (allowing switches when beneficial) — also correct because migration is a small fraction of useful work. You don’t need separate parameters for “how sticky should allocations be” and “how long should rounds be” — one physical parameter (migration time in seconds) governs both.
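One illustrative way to see the scaling: treat the tax as the fraction of a round that a switching job loses to migration. This framing (migration time over round length) is a hedged simplification of the per-iteration penalty described above; the 90-second migration figure comes from the fixed-overhead breakdown later in this log.

```python
def switching_tax(migration_time_s, round_length_s):
    """Fraction of a scheduling round lost if the job migrates."""
    return migration_time_s / round_length_s

short = switching_tax(90, 360)  # 6-min rounds: 25% of the round lost
long_ = switching_tax(90, 600)  # 10-min rounds: 15% lost
```

The same physical parameter (migration seconds) makes allocations sticky at short rounds and fluid at long ones, with no separate tuning knob.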

Why per-job migration cost creates emergent “job classes” in the scheduler. The LP doesn’t explicitly categorize jobs, but the migration penalty creates implicit tiers. Light jobs (small models, single GPU) have low migration cost, so the LP treats them as “fluid” — moving them freely to fill gaps and balance fairness. Heavy jobs (large models, multi-GPU) have high migration cost, so the LP treats them as “anchored” — only moving them when there’s a substantial throughput gain. This mirrors how human cluster operators manage workloads: they’ll freely reschedule small experiments but think twice before migrating a multi-day LLM training run.

The fixed costs dominate for small models. Container restart (30s warm), DataLoader respawn (40s), cuDNN warmup (15s), and CUDA init (5s) add up to 90 seconds regardless of model size. For Gavel’s model zoo (all < 350M params), checkpoint I/O is a small fraction of migration time. But for modern LLMs (7B+), checkpoint I/O becomes the dominant cost — a 70B model’s 782 GB training checkpoint takes 6+ minutes just to read from shared Lustre. This means migration time should really be modeled as fixed_overhead + f(model_size, storage_bw).
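That decomposition is easy to model directly. A sketch of the proposed fixed_overhead + f(model_size, storage_bw) form; the 2 GB/s Lustre read bandwidth is an assumption, not a measurement:

```python
FIXED_OVERHEAD_S = 90  # container restart + DataLoader + cuDNN + CUDA init

def migration_time(checkpoint_gb, storage_gb_per_s):
    """Seconds to migrate a job: fixed costs dominate small models,
    checkpoint I/O dominates large ones."""
    return FIXED_OVERHEAD_S + checkpoint_gb / storage_gb_per_s

small = migration_time(1.4, 2.0)  # <350M params: I/O nearly negligible
big = migration_time(782, 2.0)    # 70B checkpoint at an assumed 2 GB/s
```

At the assumed bandwidth, the 782 GB read alone is ~6.5 minutes, dwarfing the 90-second fixed overhead.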

The 120B model reveals why per-job migration time matters so fundamentally. With a flat migration penalty, you’d either set it high enough to pin the 120B model (which makes small models unnecessarily sticky) or low enough to let small models move freely (which thrashes the 120B model). Per-job migration cost creates a natural spectrum: the LP aggressively optimizes placement of small, cheap-to-move jobs while treating the 120B model as nearly immovable. This mirrors exactly how real cluster operators behave — they’ll reschedule a quick experiment in seconds, but would never preempt a week-long LLM training run unless absolutely necessary.

The migration cost matrix creates a “gravity well” effect. Each job is naturally attracted to its best GPU types (high throughput + warm cache = low migration cost), and it takes increasingly large fairness pressure to pull it away. Small fairness imbalances won’t trigger migration, but large ones will — exactly the behavior you want. This is fundamentally different from a uniform switching penalty, which treats all moves equally. The matrix version says “moving Recommendation off P100 is expensive (losing both throughput and cache locality), but moving ResNet-18 from V100 to G3 is cheap (similar throughput, both warm).”

Why the penalty slightly improves JCT even on a small cluster: Without the penalty, the LP may oscillate between allocations that are equally optimal — e.g., job X gets 60% V100 / 40% P100 one round, then 40% V100 / 60% P100 the next. Both are equally fair, but the oscillation means the placement layer keeps reassigning physical GPUs. With the penalty, the LP commits to one allocation and sticks with it, reducing “churn” in the placement decisions. The lease extension mechanism (Phase 1 in _schedule_jobs_on_workers) then consistently extends leases, keeping jobs on the same physical GPUs. Less placement churn = slightly better progress per round.
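The tie-breaking effect can be shown without an LP at all. Given two allocations that score identically on the fairness objective, an L1 distance-from-previous penalty deterministically picks the one closest to last round (a toy sketch; GPU names and fractions are illustrative):

```python
# Two allocations that are equally optimal for the fairness objective;
# without a penalty the solver may alternate between them each round.
candidates = [
    {"V100": 0.6, "P100": 0.4},
    {"V100": 0.4, "P100": 0.6},
]
prev = {"V100": 0.6, "P100": 0.4}

def churn(alloc, prev):
    """L1 distance from the previous allocation: the switching
    penalty term that breaks ties toward stability."""
    return sum(abs(alloc[g] - prev[g]) for g in alloc)

stable = min(candidates, key=lambda a: churn(a, prev))
```

With the penalty in the objective, the solver commits to one of the tied optima and stays there, so the placement layer stops reassigning physical GPUs.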

The -q bug is a classic Python logging hierarchy issue. Python loggers form a tree: the root logger is the parent of all named loggers. When you call logging.getLogger().setLevel(WARNING), that sets the root logger level. But logging.getLogger(__name__) creates a named logger (e.g., scheduler), and when you explicitly call setLevel(DEBUG) on it, that overrides the inherited level from root. The fix: don’t set an explicit level on the named logger — let it inherit from root. Or, use the handler’s level to filter instead.

For warm-start caching: cvxpy’s warm_start=True tells the solver “I’ve solved a similar problem before, start from my previous solution.” ECOS is a direct (interior-point) solver that doesn’t truly warm-start — it always factorizes from scratch. SCS is an iterative (ADMM-based) solver that does genuinely warm-start. However, switching our primary solver is risky. The safer approach: keep ECOS primary but pass warm_start=True — cvxpy will cache variable values internally, and if ECOS benefits at all from the starting point, great. If not, there’s no cost.

Why propagate=False + root level check is the right pattern here. Python’s default behavior (propagate=True) sends messages to both the named logger’s handler AND the root logger’s handler, causing duplicate output. Setting propagate=False prevents this, but it also means the named logger no longer inherits root’s level. The fix: explicitly read logging.getLogger().level (which is NOTSET=0 by default, but WARNING=30 when -q is used) and apply it to the named logger. The key insight is that NOTSET (0) vs WARNING (30) gives us a clean signal to distinguish “user set -q” from “default.”
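The full pattern fits in a few lines. A minimal sketch (logger name and default level are illustrative, not the actual scheduler code):

```python
import logging

def make_logger(name):
    """Named logger with its own handler. propagate=False prevents
    duplicate output via root, so root's level (set by -q) must be
    copied over explicitly instead of inherited."""
    logger = logging.getLogger(name)
    logger.addHandler(logging.StreamHandler())
    logger.propagate = False
    root_level = logging.getLogger().level  # NOTSET=0 default, WARNING=30 with -q
    if root_level != logging.NOTSET:
        logger.setLevel(root_level)         # honor -q
    else:
        logger.setLevel(logging.DEBUG)      # default verbosity
    return logger
```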

Warm-start caching reuses the previous LP solution as the starting point for the next solve. This matters because:

  1. ECOS is a direct (interior-point) solver — it doesn’t iterate from an initial point, so warm_start=True mainly tells cvxpy to reuse the problem structure (compiled form, matrix factorizations) rather than rebuilding it from scratch each call.
  2. SCS is an iterative (ADMM-based) solver — genuine warm-start: the previous primal/dual variables seed the iteration, converging faster when the problem is similar round-to-round.
  3. The switching penalty we just added stabilizes allocations, meaning consecutive LP problems are very similar — exactly the scenario where warm-start helps most.

Why warm-start works here: cvxpy’s warm_start=True seeds the solver with x.value from a previous solution. Since the switching penalty stabilizes allocations, consecutive rounds have very similar optimal solutions. The key implementation requirement is: (1) seed x.value with the previous allocation before solving, and (2) always save the solved allocation, not just when migration penalty is on. Note: ECOS (interior-point) benefits mainly from problem recompilation, while SCS (ADMM) gets genuine iteration warm-start.

Warm-start and determinism: ECOS is a direct interior-point solver. For ECOS, warm_start=True in cvxpy tells it to try using the initial point, but ECOS computes the exact optimal regardless of starting point (it’s not iterative). So the solution should be identical. However, on the first round there’s no _prev_allocation yet, so use_warm_start=False — the first solve is unchanged. On subsequent rounds, the seeded value shouldn’t affect ECOS’s final answer since it converges to the same optimum. The risk: floating-point differences from different internal paths could shift results by epsilon.