Bhavana AI

AI/ML insights

Dev Log: February 9, 2026

Courses (GPU Cluster Scheduling Simulator)

A full day of tuning the LP-based GPU scheduler. The core challenge was making cvxpy’s DPP (Disciplined Parametrized Programming) work at scale with migration penalties, then discovering that DPP itself becomes the bottleneck at Alibaba-scale workloads. Also built out the visualization pipeline so heatmaps can show per-GPU allocations.

DPP (Disciplined Parametrized Programming): cvxpy compiles a problem into standard cone form. Without DPP, this compilation happens every call. With cp.Parameter, cvxpy compiles once and reuses the compiled form — only updating the numeric values. The key requirement: parameters must appear affinely in the canonicalized problem. Our switching penalty alpha * |x - x_prev| is DPP-safe because alpha multiplies a variable expression and x_prev appears in linear constraints (the abs is reformulated as linear constraints on auxiliary variables).

Why migration penalty improves cache hit rate: The cache key is the problem shape (m, n). Shape changes when jobs arrive or depart. With the switching penalty, jobs stick to their GPUs longer, so fewer scheduling changes per round, and the job set is more stable between consecutive rounds.

Why explicit auxiliary variables fix DPP: cvxpy’s DPP verifier works at the atom level. cp.abs(x - param) is a compound expression that cvxpy can’t decompose into parameter-affine form. But when we manually reformulate it as t >= x - param; t >= param - x; t >= 0, each constraint is visibly affine in the parameter. cvxpy can then compile the problem once and parametrically update the constraint matrices on subsequent solves.

The DPP “too many parameters” warning occurs because cvxpy’s DPP compilation traverses every cp.Parameter atom to verify they appear affinely. At Alibaba scale (3500+ jobs x 6 GPU types = 21,000+ parameter elements per matrix, ~5 matrices = 100k+ atoms), this compilation dwarfs the per-solve savings. The fix is a size threshold: below it, DPP caching helps (Philly-scale); above it, we build the LP fresh each round using concrete numpy values where cp.abs(x - numpy_array) works natively without the DPP reformulation trick.
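The threshold dispatch can be sketched as a pure-Python predicate; the cutoff value and function names here are hypothetical tuning knobs, not anything from cvxpy:

```python
# Hypothetical dispatch: cache DPP-compiled problems only below a size cutoff.
DPP_MAX_PARAM_ELEMS = 50_000  # tuning knob, not a cvxpy constant

def use_dpp(num_jobs: int, num_gpu_types: int, num_param_matrices: int = 5) -> bool:
    """DPP caching pays off only when compile-time parameter traversal is cheap."""
    return num_jobs * num_gpu_types * num_param_matrices < DPP_MAX_PARAM_ELEMS

# Philly-scale: small enough that one-time compilation is amortized.
assert use_dpp(500, 6)
# Alibaba-scale: 3500 x 6 x 5 = 105,000 parameter elements -> build fresh each round.
assert not use_dpp(3500, 6)
```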

Simulation throughput scales inversely with active job count. Exp 0 (1689 jobs) runs ~9x faster than exp 2 (4287 jobs), which is more than the ~2.5x job-count ratio alone would predict but consistent with ~O(m^2) LP scaling (~6.4x) plus event-loop overhead. This confirms both LP size and event-loop volume are bottlenecks at scale.

The bottleneck pattern is clear: at 360 jph, the active job count grows to 5000+ and each scheduling round involves an LP with 5000 x 6 = 30,000 variables. Even without DPP overhead, the raw LP solve time + simulation event processing at this scale dominates runtime. The original experiments had the same problem — they timed out at 4 hours at 180/360 jph.

The viz tool needs 4 types of log data, but they’re split across two log levels:

| Viz feature | Log data needed | Current level |
| --- | --- | --- |
| Timeseries (utilization, jobs) | TELEMETRY {...} | INFO |
| Job metadata (types, arrivals) | EVENT {"event":"job_arrival"...} | INFO |
| JCT / completions | EVENT {"event":"job_complete"...} | INFO |
| Heatmap (per-GPU allocations) | [Micro-task scheduled] ... Worker ID(s): ... | DEBUG |

The heatmap is the missing piece. We demoted [Micro-task scheduled] to DEBUG to reduce noise, but the viz tool’s parse_allocation() specifically parses Worker type and Worker ID(s) from those lines. Without them, there’s no per-GPU allocation data for heatmaps.
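A sketch of that parsing, assuming a plausible shape for the [Micro-task scheduled] lines; the real format in scheduler.py may differ in details:

```python
import re

# Assumed line format for [Micro-task scheduled] log lines; the actual
# scheduler.py output may differ.
LINE = "[Micro-task scheduled] Job 42 -> Worker type: V100, Worker ID(s): 3, 7"

PATTERN = re.compile(
    r"\[Micro-task scheduled\].*?Worker type: (?P<wtype>\w+).*?"
    r"Worker ID\(s\): (?P<ids>[\d, ]+)"
)

def parse_allocation(line: str):
    """Extract (worker_type, [worker_ids]) from a scheduler log line, or None."""
    m = PATTERN.search(line)
    if m is None:
        return None
    ids = [int(s) for s in m.group("ids").split(",")]
    return m.group("wtype"), ids
```

If the lines are demoted to DEBUG, this function simply never matches anything in an INFO-level log, which is exactly the missing-heatmap failure mode.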

Summary of what’s ready to go once SSH is back:

  • scheduler.py line 1828: [Micro-task scheduled] promoted back to INFO — provides Worker ID(s) for heatmaps
  • All 4 data channels now at INFO: TELEMETRY, EVENT job_arrival, EVENT job_complete, [Micro-task scheduled]
  • New SLURM script submit_alibaba_60jph_viz.sbatch runs index 0 (60 jph) without -q for full viz output
  • preprocess_viz.py can then convert the stderr log to .viz.bin with complete heatmap + timeseries data

The viz tool is a static site that uses fetch() to load .viz.bin files from data/. Browsers block fetch() on file:// URLs because the file:// origin is opaque and cross-origin checks fail, so a local HTTP server is required. python3 -m http.server is the simplest zero-dependency option.


Personal Finance

A big refactoring day. Extracted hardcoded tax/deduction logic into declarative source mappings, built new CLI commands for paystub and W-2 parsing with --save flags, updated the frontend Sankey visualization with dynamic earner dropdowns and color generation, and organized everything into logical commits.

DB query pattern: The codebase consistently uses bun:sqlite directly with new Database(DB_PATH, { readonly: true }) for reads and new Database(DB_PATH) for writes. All JSON is stored as TEXT columns and deserialized with JSON.parse(). This pattern keeps things simple — no ORM overhead.
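For illustration, the same JSON-as-TEXT pattern in Python's stdlib sqlite3 (the real app is TypeScript on bun:sqlite; table and column names here are made up):

```python
import json
import sqlite3

# Same JSON-as-TEXT pattern as the bun:sqlite code, shown with Python's
# stdlib sqlite3. Schema is illustrative only.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE budgets (year INTEGER, data TEXT)")

budget = {"version": 5, "taxes": {"items": []}}
db.execute("INSERT INTO budgets VALUES (?, ?)", (2026, json.dumps(budget)))

row = db.execute("SELECT data FROM budgets WHERE year = 2026").fetchone()
loaded = json.loads(row[0])  # deserialize on read, mirroring JSON.parse()
```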

CLI pattern: Each command exports a default async function(args: string[]) that receives post-subcommand arguments. Output goes to console.log (stdout, machine-readable) and diagnostics to console.error (stderr, human-readable).

YNAB spending extraction: The processYear() function in budget-webapp.ts already does exactly what ynab-spending needs — it fetches transactions, builds category group mappings, and computes spending totals. The key trick is that YNAB amounts are in “milliunits” (1000 = $1) and outflows are negative, so the conversion formula t.amount < 0 ? Math.abs(t.amount) / 1000 : -(t.amount / 1000) normalizes to positive = spending.
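The conversion expression, restated as a small Python function for clarity (the original is a TypeScript ternary in processYear()):

```python
def normalize_milliunits(amount: int) -> float:
    """YNAB amounts are milliunits (1000 = $1) and outflows are negative.
    Returns positive dollars for spending, negative for inflows, mirroring
    the t.amount < 0 ? Math.abs(t.amount) / 1000 : -(t.amount / 1000) ternary."""
    return abs(amount) / 1000 if amount < 0 else -(amount / 1000)

assert normalize_milliunits(-4_500) == 4.5   # $4.50 outflow -> +4.5 spending
assert normalize_milliunits(2_000) == -2.0   # $2.00 inflow  -> -2.0
```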

Inflows filter: YNAB “Inflow: Ready to Assign” transactions represent actual income. Transfers and housekeeping entries (starting balances, reconciliation) are filtered out to get genuine inflows from paychecks.

--save flag pattern: The parse commands currently just output JSON. Adding --save requires DB access, so the import of bun:sqlite is conditional — only loaded when the flag is present. This keeps the pure parsing function side-effect-free (testable), while the CLI wrapper handles persistence. The --person and --year flags provide metadata that can’t be inferred from the PDF itself.

Why source mappings matter: The current budget-flow.ts has ~15 hardcoded references like if (label.includes("Social Security") && label.includes("Varun")). Each tax/deduction type requires knowing: (1) what W-2 field to check, (2) what paystub field to fall back to, (3) how to label it. A lookup table captures this knowledge declaratively, making it trivial to add a new earner or tax type without touching flow-building logic.
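A Python sketch of the lookup-table idea; the field names below are illustrative, not the actual source-mappings.ts definitions:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical shape of a source-mappings entry (illustrative field names).
@dataclass
class TaxSource:
    label: str                            # how the flow line is labeled
    w2_field: str                         # (1) primary W-2 field
    paystub_field: Optional[str] = None   # (2) paystub fallback

TAX_SOURCES = [
    TaxSource("Social Security", "box4_ss_tax", "ss_tax_ytd"),
    TaxSource("Medicare", "box6_medicare_tax", "medicare_tax_ytd"),
]

def build_tax_lines(w2s: dict, earners: list) -> list:
    """Generic loop replacing per-earner if/else chains like
    label.includes("Social Security") && label.includes("Varun")."""
    lines = []
    for src in TAX_SOURCES:
        for earner in earners:
            value = w2s.get(earner, {}).get(src.w2_field)
            if value is not None:
                lines.append(f"{src.label} ({earner}): {value}")
    return lines
```

Adding a new earner or tax type then means appending one row of data rather than editing flow-building code.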

Backward compatibility: The existing budget format uses flat keys like ss_varun, medicare_kristin. The new taxes.items[] array format is cleaner for generic processing. A shim function convertLegacyTaxes() bridges the gap so existing budgets keep working.
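A hypothetical Python sketch of the shim; the real convertLegacyTaxes() is TypeScript and its exact key set and output shape may differ:

```python
# Illustrative mapping from legacy flat-key prefixes to labels.
LEGACY_PREFIXES = {"ss": "Social Security", "medicare": "Medicare"}

def convert_legacy_taxes(flat: dict) -> dict:
    """Map flat keys like ss_varun / medicare_kristin to a taxes.items[] array."""
    items = []
    for key, amount in flat.items():
        prefix, _, person = key.partition("_")
        label = LEGACY_PREFIXES.get(prefix)
        if label is None:
            continue  # not a recognized legacy tax key
        items.append({"label": label, "person": person.capitalize(), "amount": amount})
    return {"items": items}
```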

Refactoring strategy: Rather than rewriting the entire 1500-line file at once, I’ll surgically replace just the TAXES section (lines 242-412) and DEDUCTIONS section (lines 414-599) with generic loops. The key is that the output (FlowLine[]) must be byte-identical to the golden output. The source mappings encode the same logic as the if/else chains, just declaratively.

Tricky part — Federal Income Tax summing: Federal tax is special because it sums box2_federal_tax across ALL earners’ W-2s, while per-person taxes like SS/Medicare look up a specific person’s W-2. The sumAll: true flag in TaxSource handles this.
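The two lookup behaviors can be illustrated with a small helper (names and W-2 field keys here are illustrative, not the real TaxSource type):

```python
# Hypothetical illustration of the sumAll behavior: federal tax sums box2
# across all earners' W-2s; per-person taxes look up a specific person's W-2.
def resolve_tax(w2s: dict, field: str, sum_all: bool, person: str = "") -> float:
    if sum_all:
        return sum(w2.get(field, 0) for w2 in w2s.values())
    return w2s.get(person, {}).get(field, 0)

w2s = {"Varun": {"box2_federal_tax": 8000, "box4_ss_tax": 5000},
       "Kristin": {"box2_federal_tax": 6000, "box4_ss_tax": 4000}}
assert resolve_tax(w2s, "box2_federal_tax", sum_all=True) == 14000
assert resolve_tax(w2s, "box4_ss_tax", sum_all=False, person="Kristin") == 4000
```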

Analyzing the diffs: The changes fall into three categories:

  1. Expected improvements: “Additional Medicare (Varun)” now shows up as its own line (it was silently dropped before), and budget v2 (computed) now correctly says budget v5 (computed).
  2. Minor provenance label changes: retirement["After Tax"] -> retirement["401K After Tax"] (more specific), sum(non-ESPP/DCFSA/HSA) -> sum(non-ESPP/ESPP Refund/DCFSA/HSA) (more accurate).
  3. Ordering change: “Social Security (Kristin)” moved after “Additional Medicare (Varun)” because convertLegacyTaxes() processes fields in order (federal, then per-person SS, Medicare, Additional Medicare, WA PFML).

These are all improvements, not regressions. The numeric values (plan/actual/projected) are unchanged.

The pattern here is a classic module extraction refactor: logic that was previously inline in budget-flow.ts and cli.ts has been split into dedicated command files and a shared source-mappings.ts library. The net change (+353/-268) shows growth, but the 268 deleted lines from budget-flow.ts likely moved into the new files rather than being truly removed. This kind of decomposition improves testability and makes each CLI subcommand independently maintainable.

Frontend changes needed:

  1. Dynamic earner dropdown: The <select id="upload-person"> is hardcoded with Varun/Kristin. We need to populate it from the API response, which means the backend needs to expose earner names.
  2. Dynamic Sankey colors: The COLORS record has hardcoded group names like “P0 Expenses”. We need to generate colors for any group name that appears in the data, using a consistent color palette.
  3. API change: Add earner names to the /api/flow response so the frontend can build the dropdown dynamically.
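The dynamic-color idea from item 2 can be sketched as a deterministic hash-to-palette mapping; the real frontend is TypeScript, and the palette below is an arbitrary example:

```python
import hashlib

# Hypothetical deterministic palette assignment for Sankey group names.
PALETTE = ["#4e79a7", "#f28e2b", "#e15759", "#76b7b2", "#59a14f",
           "#edc948", "#b07aa1", "#ff9da7", "#9c755f", "#bab0ac"]

def color_for_group(name: str) -> str:
    """Stable color per group name: the same input always maps to the same color,
    so new groups in the data get colors without editing a hardcoded COLORS record."""
    digest = hashlib.md5(name.encode("utf-8")).digest()
    return PALETTE[digest[0] % len(PALETTE)]
```

Hashing (rather than assigning colors in encounter order) keeps a group's color stable across reloads even when the set of groups changes.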

Organizing commits by logical change rather than by file makes the history much more useful. Each commit here is independently reviewable and tells a coherent story: new commands, new parse flags, a refactor, and a frontend update. If a bug surfaces later, git bisect can pinpoint exactly which logical change caused it.

The biggest commit by diff size is the source-mappings refactor (+400/-226), but the net growth is only ~174 lines because it replaced ~226 lines of repetitive if/else chains with lookup-table-driven logic. This is a classic example where a refactor’s raw diff looks large, but the actual complexity decreased — the lookup tables in source-mappings.ts are declarative data, not branching logic.

Bun.serve() binds to 0.0.0.0 by default (all interfaces), unlike some frameworks that default to 127.0.0.1 (localhost only). This means it’s already LAN-accessible without config changes. For internet exposure, a reverse tunnel like cloudflared is preferred over port-forwarding because it doesn’t require touching your router and provides HTTPS automatically.

The main wrinkle is SQLite + containers. Containers have ephemeral filesystems, so pf.db would be lost on every restart without a volume mount. Azure Files provides a POSIX-compatible SMB mount that works for low-concurrency SQLite (fine for a 2-person app). For higher scale, you’d migrate to Azure PostgreSQL (Bun has built-in Bun.sql for Postgres). Easy Auth is the real win here — it adds Microsoft login in front of your app with zero code changes, and you just whitelist two email addresses in the Azure portal.

The paystub upload handler (lines 238-289) is a good template for adding more admin endpoints. It follows a clean pattern: accept form data, parse/validate, write to SQLite in a transaction, return a summary. A W-2 upload would be nearly identical. For YNAB sync, you already have the sync logic in ynab-sync.ts — exposing it behind a POST /api/ynab-sync button would be straightforward. The budget plan is the trickiest since it’s a full JSON document, but a simple textarea + JSON validation would work for a 2-person app.

The new skill for the finance CLI follows the “CSO” (Claude Search Optimization) principle from the writing-skills guide. The description contains only triggering conditions (“Use when…”) and avoids summarizing the workflow. This ensures future Claude instances will load the full skill and follow its workflows rather than taking a shortcut from the description. The skill is ~1400 words — larger than typical because it’s a reference skill (like the pptx/xlsx skills) that serves as the complete operational manual for 18 CLI commands across 3 distinct workflows.


Azure Cleanup

Cleaned up unused Azure AI Foundry resources, deleting accounts with no active deployments and removing a resource group.

  • All your active model deployments are in the East US 2 region under a single AI Services account. The other two accounts in Sweden Central have no deployments, so they’re not incurring compute costs (though you may still have base resource costs).
  • Azure AI Foundry uses the AIServices kind (rather than the older “OpenAI” kind), which is the unified resource type that supports models from multiple providers (OpenAI, Anthropic, DeepSeek, etc.) under one account.
  • Azure AI Foundry accounts often have nested project resources (used by the Foundry portal for workspace organization). These must be deleted before the parent account can be removed — the CLI gives a clear error with the nested resource ID, which you can pass directly to az resource delete.
  • Deleted Cognitive Services accounts enter a soft-delete state for 48 hours. If you need to recover them, you can use az cognitiveservices account recover. After 48 hours they’re permanently purged.
  • Resource group deletions in Azure are asynchronous and can take 5-10+ minutes, particularly when the group contains resources in a soft-delete state (like the Cognitive Services account we deleted earlier).
  • You can check the status anytime with az group show --name "oews-rg" --query "properties.provisioningState". Once it disappears, the deletion is complete.

OpenClaw

Investigated why the heartbeat agent stopped sending scheduled messages after a gateway restart.

OpenClaw uses a “heartbeat” pattern where the gateway periodically pings the LLM agent with a configurable prompt. The agent reads HEARTBEAT.md for scheduled tasks (morning mindfulness, cost report) and tracks last-run timestamps in a state file. This is essentially cron-via-LLM — flexible but dependent on the gateway staying healthy.
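The cron-via-LLM bookkeeping can be sketched as a last-run check; task names, intervals, and the state-file shape below are illustrative, not OpenClaw's actual format:

```python
import time

# Hypothetical schedule: task name -> minimum interval between runs (seconds).
SCHEDULE = {"morning_mindfulness": 24 * 3600, "cost_report": 7 * 24 * 3600}

def due_tasks(state: dict, now: float) -> list:
    """Return tasks whose interval has elapsed since their last-run timestamp,
    mirroring the HEARTBEAT.md + state-file pattern described above."""
    return [task for task, interval in SCHEDULE.items()
            if now - state.get(task, 0) >= interval]

state = {"morning_mindfulness": time.time() - 25 * 3600,  # ran 25h ago -> due
         "cost_report": time.time() - 3600}               # ran 1h ago  -> not due
assert due_tasks(state, time.time()) == ["morning_mindfulness"]
```

A missing state entry defaults to 0, so a brand-new task is immediately due on the next heartbeat.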

The heartbeat agent runs on the heartbeat message channel (internal), not the Signal channel. Its text output needs to be explicitly routed to Signal. Before the Feb 8 restart, the gateway was logging heartbeat text output (e.g., HEARTBEAT_OK in journalctl). After the restart, no heartbeat text appears in journalctl at all — suggesting the output routing is broken.