Bhavana AI

AI/ML insights

Dev Log: February 10, 2026

Courses (GPU Cluster Scheduling Simulator)

Tackled a major performance investigation on the cluster scheduling visualizer and simulator. Started by diagnosing why the scrubber was freezing at Alibaba scale, found and fixed a quadratic lookup bug for a 10,000x improvement, then added profiling instrumentation to the simulator itself. Profiled at both Philly and Alibaba scale, revealing that the LP solver dominates at large scale (48% of runtime vs 9% at small scale). Designed and implemented a single-type allocation strategy that eliminates the LP entirely, using a sort-based waterfill algorithm. Discovered that removing the LP shifted the bottleneck to worker assignment (88%), then explored caching and skip optimizations. Also found that demoting verbose logging gave a 1.35x speedup.

The issue is likely memory. decodeRounds() creates 4540 JS arrays of 6200 elements each — that’s 28M JS numbers (~225MB) just for allocations. With 49,160 job objects on top, the browser tab may be OOM-ing or freezing during decode, preventing the scrubber from working.

Found the performance bug. In _onRoundChanged (line 600-603), the metrics section uses sim.jobs.find() (linear search) for every GPU allocation. At Alibaba scale with 6200 GPUs and ~1800 unique jobs, this is ~3.2 million comparisons per scrubber move. Meanwhile, _updateAllocBars (line 731) already lazily builds a Map for O(1) lookups — the metrics code just doesn’t use it.

Performance analysis of the fix:

  • Before: Each scrubber move triggered ~77M comparisons (.find() on 45K jobs, called ~1700 times)
  • After: Each scrubber move does ~6200 Map lookups (O(1) each) — roughly a 10,000x improvement
  • The chart precomputation at load time is O(rounds * GPUs) ~370M iterations, which takes a few seconds but only happens once
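The before/after can be sketched generically in Python (the real code is JS; the data shapes and sizes here are illustrative and scaled down):

```python
# Scaled-down sketch of the lookup fix (illustrative shapes, not the
# visualizer's actual data structures).
jobs = [{"id": i, "name": f"job-{i}"} for i in range(5_000)]
allocations = [i % 5_000 for i in range(600)]  # job id assigned to each GPU

# Before: a linear .find()-style scan per allocation, O(jobs * gpus) total.
def metrics_slow():
    return [next(j for j in jobs if j["id"] == a) for a in allocations]

# After: build the index once, then every lookup is O(1).
jobs_by_id = {j["id"]: j for j in jobs}

def metrics_fast():
    return [jobs_by_id[a] for a in allocations]

assert metrics_fast() == metrics_slow()
```

The map-build cost is paid once and amortized over every subsequent scrubber move, which is exactly why _updateAllocBars already used this pattern.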

Why time.perf_counter() instead of time.time()? perf_counter() uses the highest-resolution monotonic clock available (typically nanosecond precision). time.time() can be affected by NTP adjustments and has lower resolution on some platforms. For profiling sub-second code sections, perf_counter() is the correct choice. The overhead is negligible — a single call takes ~50ns on modern CPUs.
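A minimal sketch of the timing pattern (the measured section is a stand-in for a scheduler phase):

```python
import time

# Bracket a section with time.perf_counter(): monotonic and high resolution,
# immune to the NTP adjustments that can move time.time().
t0 = time.perf_counter()
total = sum(i * i for i in range(100_000))  # stand-in for a scheduler phase
elapsed = time.perf_counter() - t0

assert elapsed >= 0.0  # monotonic clocks never run backwards
```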

Why use self._logger.info() for the PROFILE line? The scheduler’s logging infrastructure already handles both stderr output and optional file logging. By emitting the profile as an INFO-level log line with a PROFILE prefix, it follows the same pattern as TELEMETRY lines — making it easy to grep from SLURM stderr logs. The -q flag sets logging to WARNING, which would suppress profiling. That’s why the plan specifies running the profile job without -q.
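The pattern can be sketched as follows (logger name, format string, and phase names are illustrative, not the scheduler's actual code):

```python
import logging
import sys

logging.basicConfig(stream=sys.stderr, level=logging.INFO,
                    format="%(levelname)s %(message)s")
logger = logging.getLogger("scheduler")  # logger name is illustrative

def emit_profile(phase_seconds):
    # One greppable INFO line per round, mirroring the TELEMETRY convention.
    # Running with -q (WARNING level) suppresses it, as noted above.
    line = "PROFILE " + " ".join(
        f"{k}={v:.3f}s" for k, v in phase_seconds.items())
    logger.info(line)
    return line

line = emit_profile({"lp_solve": 3.012, "worker_assignment": 0.841})
```

Grepping SLURM stderr for `^INFO PROFILE` then recovers one record per round.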

Philly-scale profile baseline (108 GPUs, 50 jobs):

  • scheduling = 60% (of which: worker_assignment 14%, priorities 13%, helper 3%, lp_solve 9%)
  • event_jump_and_completion = 26%
  • telemetry = 13%
  • exit_checks + job_arrivals = ~1%

At Alibaba scale (6200 GPUs, ~1800 active jobs), these proportions could shift dramatically. The worker_assignment phase does O(jobs * worker_types) work with deep-copy of worker ID lists, and telemetry iterates all available workers per GPU type — both scale linearly with cluster size. The LP (lp_solve) scales with active job count, so its share could grow from 9% to dominate.

The profile at Alibaba scale is completely different from Philly scale. At Philly (108 GPUs), LP was 9% of runtime. At Alibaba (6200 GPUs, ~1700 active jobs), LP is 48% of runtime — each solve takes 3.0s (vs 2.7ms at Philly scale — a 1100x increase). The LP fires every ~4 rounds (192 solves / 763 rounds), so rounds with an LP solve take ~4.5s while rounds without take ~0.5s.

The three biggest targets for optimization:

  1. LP solve (48%) — 3s per solve with ~1700 active jobs across 6 GPU types. The problem has ~1700x6 = 10,200 variables.
  2. Worker assignment (17.5%) — Phase 1+2 placement iterating over 6200 GPUs per round.
  3. Event/completion processing (15.8%) — _done_callback() called for every completing job each round.

The DPP caching we implemented does help — it avoids recompiling the cone form each round. But the compilation cost is amortized over hundreds of rounds anyway. The dominant cost is ECOS solving a 20K-variable interior-point problem each time. Interior-point methods scale roughly as O(n^2.5) in variable count, so doubling variables gives roughly 5-6x slowdown, which matches the observed 10x (with some overhead from the extra constraints).

Why we need solver_kwargs not just solver: Different solvers accept different parameters. SCS needs eps_abs, eps_rel, max_iters. HiGHS may need time_limit. ECOS has its own options. Rather than special-casing each solver, a generic solver_kwargs dict flows through the entire stack and gets unpacked as **kwargs in cvxprob.solve(). This keeps the policy code solver-agnostic.
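The passthrough pattern, sketched with a stub standing in for the real cvxpy problem object (names and signatures below are illustrative):

```python
# Stub problem object standing in for cvxprob; records what it was called with.
class StubProblem:
    def solve(self, solver=None, **kwargs):
        self.last_call = {"solver": solver, **kwargs}
        return 0.0

def run_policy(problem, solver="ECOS", solver_kwargs=None):
    # The policy stays solver-agnostic: whatever dict the caller supplies
    # is unpacked only at the single point where the solver is invoked.
    return problem.solve(solver=solver, **(solver_kwargs or {}))

prob = StubProblem()
run_policy(prob, solver="SCS",
           solver_kwargs={"eps_abs": 1e-6, "max_iters": 500})
assert prob.last_call == {"solver": "SCS", "eps_abs": 1e-6, "max_iters": 500}
```

Adding a new solver then requires no changes anywhere in the policy code, only a different dict at the call site.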

Why single-type eliminates the LP: With single-type assignment, the problem decomposes into two phases: (1) assign each job to a GPU type (combinatorial), (2) allocate time fractions within each type (analytical). Phase 2 has a closed-form solution: equalize throughput T across all jobs, where y[i] = T / coeff[i, assigned_type], and T is bounded by capacity constraints. No LP needed — just arithmetic.

Migration penalty becomes trivial: Instead of L1 auxiliary variables in the LP, migration stickiness is built into the assignment heuristic. If a job would lose 15% of a round to migration, the new type must offer 15%+ more throughput to justify the switch.

The sort-based waterfill trick: By sorting jobs within each type by coefficient (ascending), we know that job i will hit the time budget (y=1) before all jobs with higher coefficients. This means we process from lowest to highest: if T_cap > c[i], job i gets y=1 and we subtract its capacity. Otherwise, all remaining jobs get the same T. This is a single O(k) pass after the O(k log k) sort — no iteration loop needed.
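A minimal Python sketch of this waterfill (assuming positive per-job coefficients and a one-GPU-slot-per-job capacity model; not the actual simulator code):

```python
def waterfill(coeffs, capacity):
    # Allocate time fractions y[i] in [0, 1] within one GPU type so that
    # throughput coeffs[i] * y[i] is equalized at T, subject to
    # sum(y) <= capacity. One GPU-slot per job is an assumption.
    order = sorted(range(len(coeffs)), key=lambda i: coeffs[i])  # ascending
    y = [0.0] * len(coeffs)
    inv_sum = sum(1.0 / coeffs[i] for i in order)  # sum of 1/c over remaining
    cap = float(capacity)
    for k, i in enumerate(order):
        T = cap / inv_sum  # equalized throughput if remaining jobs share cap
        if T >= coeffs[i]:
            # Job i would need y > 1 to reach T: cap it at the full budget.
            y[i] = 1.0
            cap -= 1.0
            inv_sum -= 1.0 / coeffs[i]
        else:
            # All remaining (higher-coefficient) jobs fit under the budget.
            for j in order[k:]:
                y[j] = T / coeffs[j]
            break
    return y

y = waterfill([1.0, 2.0, 4.0], capacity=2.0)
assert abs(sum(y) - 2.0) < 1e-9             # capacity fully used
assert y[0] == 1.0                           # slowest job hits the budget
assert abs(2.0 * y[1] - 4.0 * y[2]) < 1e-9  # remaining throughputs equalized
```

The single post-sort pass works because the ascending order guarantees that once a job's equalized share would exceed its time budget, so would every lower-coefficient job processed before it.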

Performance decomposition: The sort-based waterfill processes each GPU type independently in O(k log k), where k is the number of jobs on that type. At Alibaba scale (1700 jobs across 6 types), the largest type gets ~960 jobs, so the total work amounts to a handful of sorts of at most ~1,000 elements each. Compare this to ECOS solving a 20K-variable interior-point problem, roughly 20000^2.5 ≈ 5.7e10 operations under the O(n^2.5) scaling.

The utilization gap matters: The ECOS LP spreads jobs across all 6 GPU types to maximize cluster utilization (99.9%). The single-type assignment concentrates on G2+T4+G3 (the highest-throughput types) and leaves P100, V100 idle in some rounds. This means single-type is faster per round but might need more rounds (and sim time) to complete the same work, since 30% of GPU capacity could be wasted.

Why worker_assignment dominates for single-type: The single-type allocations concentrate jobs on fewer GPU types (fragmentation = 6.1 vs 43.6), which sounds good for fragmentation but actually means more jobs compete for the same servers. With ~1,600 active jobs mostly on G2, the O(jobs * servers) FGD placement algorithm has to work harder to pack them onto G2 servers. The multi-type ECOS allocation spreads jobs across all types, reducing per-type placement pressure.

The real bottleneck has shifted again: At Philly scale, LP was 9%. At Alibaba scale with penalty, LP was 48%. Now with single-type (no LP), worker_assignment is 88%. Each optimization reveals the next bottleneck.

The current code already has Phase 1 lease extensions, but it rebuilds FGD Node objects from scratch every round even when only a handful of jobs need placement. At 60 jph with ~58 rounds/minute, most rounds have identical allocation — only ~1 new job arrives every 58 rounds. The key optimization is detecting unchanged allocation and reusing previous worker assignments wholesale.

The frozenset comparison is O(total_jobs) per round (~1,700 operations) compared to FGD placement which is O(jobs * servers * gpus) (~63M operations). The comparison overhead is negligible, and when it hits (allocation unchanged), we skip the entire worker assignment pipeline. This is a classic “amortized short-circuit” pattern — do a cheap check to avoid expensive work.
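A sketch of the short-circuit (class, field names, and the allocation shape are hypothetical):

```python
# "Amortized short-circuit": hash the current allocation into a frozenset
# and skip the expensive placement pipeline when it is unchanged.
class WorkerAssigner:
    def __init__(self):
        self._prev_key = None
        self._prev_assignments = None
        self.placements_run = 0  # counts how often full placement actually ran

    def assign(self, allocation):  # allocation: {job_id: gpu_type}
        key = frozenset(allocation.items())  # O(jobs) to build and compare
        if key == self._prev_key:
            return self._prev_assignments    # skip the O(jobs*servers) pass
        self.placements_run += 1
        # Stand-in for the real FGD placement work.
        assignments = {j: f"server-for-{t}" for j, t in allocation.items()}
        self._prev_key, self._prev_assignments = key, assignments
        return assignments

wa = WorkerAssigner()
wa.assign({"a": "G2", "b": "T4"})
wa.assign({"a": "G2", "b": "T4"})  # unchanged allocation: placement skipped
assert wa.placements_run == 1
wa.assign({"a": "G2", "c": "T4"})  # job set changed: placement reruns
assert wa.placements_run == 2
```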

The skip optimization never triggered because at 60 jph with 600s rounds, ~10 jobs arrive and ~10 complete every round. The job set changes every single round, so the frozenset comparison always finds differences. This is a fundamental mismatch: the optimization works when job churn is low relative to round frequency, but our Alibaba workload has very high churn.

The logging demotion gave a 1.35x speedup (0.867 -> 0.644 s/round, 1,383 -> 1,863 rounds). The event processing dropped from 345s to 298s (14% reduction from demoting [Micro-task succeeded] to DEBUG). The scheduling loop also benefits: the [Micro-task scheduled] demotion saves string formatting for ~1,700 jobs/round. More rounds mean more LP solves (467 vs 347), so absolute LP and worker_assignment times went UP — but per-round rates stayed the same.


OpenClaw

Worked on plugin installation and configuration for the OpenClaw ecosystem, discovering how the plugin discovery system works and debugging installation issues with local/linked plugins.

Non-interactive SSH sessions don’t source .bashrc because bash only reads it for interactive shells. The Bun installer added ~/.bun/bin to .bashrc, but SSH commands run in non-interactive mode. We need to either use the full path or ensure it’s in .profile (which is read for login shells).

The OpenClaw plugin manifest (openclaw.plugin.json) is the primary discovery mechanism. It declares the plugin ID, config schema (validated at load time), and which skills to expose. For this plugin, the config is minimal — just the YNAB token — since the rest is self-contained in the SQLite database and local config files.

The openclaw field in package.json tells the OpenClaw package manager how to install this plugin. The peerDependencies entry for openclaw ensures version compatibility is checked at install time. We merged it into the existing peerDependencies rather than creating a separate block.

OpenClaw’s plugin system has two related structures in openclaw.json:

  1. plugins.entries — declares plugin enablement and config (like feature flags)
  2. plugins.installs — tracks where plugins are installed (paths, versions, sources)

Both need entries for a plugin to be discovered and loaded. The installPath in installs is how OpenClaw locates the plugin’s openclaw.plugin.json manifest on disk.

The plugins.installs[].source field has a validation enum. The existing plugins use "npm" as the source. "local" isn’t a recognized value. I need to check what values are accepted — likely "npm", "git", or "path".

OpenClaw has a plugins install --link flag that creates a symlink to a local directory instead of copying. This is ideal for development — it means we can point at our deployed extension directory. The --link option likely sets a different source value in the installs record (probably "link" instead of "npm"), which explains why "local" wasn’t valid.

The OpenClaw plugin install schema uses a Zod validation union for the source field with exactly three valid literals: "npm", "archive", and "path". Our local linked plugin needs "path" as the source type. This is likely what --link would have set if the install had completed.

  • OpenClaw plugins need three things to load: (1) a plugins.entries record in openclaw.json with enabled: true, (2) a plugins.installs record pointing to the install path, and (3) the openclaw.extensions field in package.json pointing to the TypeScript entry point
  • The source field in install records must be one of "npm", "archive", or "path" — anything else fails Zod schema validation
  • Plugin config schemas in openclaw.plugin.json are validated at load time, so marking fields as required means the gateway won’t start until they’re configured
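Putting the three requirements together, a hedged sketch of how the two structures might look in openclaw.json (the record shape is inferred from the fields named above; the plugin id, path, and version are placeholders):

```json
{
  "plugins": {
    "entries": [
      { "id": "example-plugin", "enabled": true, "config": { "ynabToken": "..." } }
    ],
    "installs": [
      {
        "id": "example-plugin",
        "source": "path",
        "installPath": "/path/to/example-plugin",
        "version": "0.1.0"
      }
    ]
  }
}
```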

This is a common pattern with AI agents — they default to generating novel solutions rather than discovering existing ones. The fix is a pre-action check baked into the agent’s instructions. The 4-step checklist (check skill, check CLI, use existing, only then build new) creates a deliberate pause before the agent starts coding. The concrete example (finance report) anchors the rule to a real failure case, which makes LLM instructions much more effective than abstract rules alone.

MoltBot auto-discovers skills from ~/clawd/skills/<name>/SKILL.md. The SKILL.md frontmatter description field tells MoltBot when to use it — similar to how Claude Code plugin skills work. The AGENTS.md already instructs MoltBot to “check installed skills first” before building anything ad-hoc.

MoltBot picks up skills from two layers: the SKILL.md frontmatter description field (tells the agent when to activate the skill based on user phrasing), and TOOLS.md (tells the agent user-specific preferences). By adding the scheduling preference to TOOLS.md, MoltBot will use when2meet even for generic “schedule a meeting” requests, not just explicit “create a when2meet” ones.

The confusion happened because both service files coexist. The clawdbot-gateway.service was left behind from the pre-migration era but was disabled. When I re-enabled and started it earlier, it created a conflicting second gateway. The proper service is openclaw-gateway.service (v2026.1.30), which was already running. The old clawdbot-gateway.service file could be cleaned up entirely to avoid this in the future.


Tools (When2Meet Integration)

Built a When2Meet API integration, reverse-engineering the undocumented PHP API. Went through several iterations to get availability saving working correctly after discovering the positional slot mapping behavior.

When2Meet’s architecture is interesting: it’s a classic PHP app with no formal API. All interactions happen via form POSTs, and availability is stored as Unix timestamps for 15-minute slots. The key challenge here is correctly computing slot timestamps while accounting for Pacific Time (UTC-8) offsets. Each slot ID is essentially midnight UTC of the date plus the time-of-day offset in seconds.

Key API findings from the repos:

  1. Event creation POSTs to /SaveNewEvent.php with multipart form data and pipe-delimited YYYY-MM-DD dates
  2. Login response is plain text containing just the person ID — no JSON, no cookies needed
  3. Slot timestamps are fetched from the event page HTML (TimeOfSlot[n]=timestamp; in JS) — this is the safest way to get correct timestamps rather than computing them manually
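Finding (3) can be sketched as a small parser (the regex and its whitespace handling are assumptions; the sample timestamps are illustrative 15-minute slots):

```python
import re

def parse_slot_times(html):
    # Extract slot Unix timestamps from a When2Meet event page, which
    # defines them in inline JS as TimeOfSlot[n]=timestamp; lines.
    slots = {}
    for n, ts in re.findall(r"TimeOfSlot\[(\d+)\]\s*=\s*(\d+);", html):
        slots[int(n)] = int(ts)
    return [slots[i] for i in sorted(slots)]  # timestamps in slot order

sample = "TimeOfSlot[0]=1760367600;TimeOfSlot[1]=1760368500;"
assert parse_slot_times(sample) == [1760367600, 1760368500]
```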

The bug: All saved availability ended up on Feb 13 only, suggesting SaveTimes.php doesn’t handle a partial slot list correctly. The iCal-when2meet reference implementation always sends all event slots with a 0/1 availability mask. Sending only available slots (all 1s) likely causes the server to misinterpret positions. I need to send the complete slot array with proper 0/1 mapping.

Why the first attempt failed: When2Meet’s SaveTimes.php expects ALL event slots in the slots field, with a positional 0/1 availability string. When I sent only available slots (all 1s), the server misaligned the availability string against the full slot list, causing all marks to land on the wrong slots. The reference implementation (iCal-when2meet) correctly sends the complete slot array every time — the server uses position-based mapping, not value-based matching.

1. SaveTimes.php uses positional mapping. The availability string is mapped by position against the full slot array. If you only send 19 available slots with "1111111111111111111", the server reads positions 0-18 of the global slot list and marks those as available, ignoring the actual timestamps you provided. You must always send every slot with the corresponding 0 or 1.

2. Event creation returns a relative JS redirect, not a 302. The response is HTML with window.location='./?SLUG'. The ./ prefix must be stripped before building the absolute URL, otherwise you get an invalid domain like when2meet.com..

3. Slot timestamps come from the server, not from math. Rather than computing Unix timestamps (which requires handling timezone offsets, DST, and When2Meet’s internal conventions), fetch the event page and parse TimeOfSlot[n]=<ts>; from the JavaScript. This is authoritative and timezone-safe.

Converting a one-off script into a reusable skill is a great pattern. The key learnings from this When2Meet integration were: (1) the API is undocumented but stable, (2) event creation uses multipart form data with pipe-delimited dates, (3) slot timestamps must be fetched from the event page HTML rather than computed, and (4) SaveTimes.php requires ALL slots with positional 0/1 availability — not just the available ones.
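Learning (4), the positional 0/1 rule, can be sketched as follows (the returned field names are assumptions about the SaveTimes.php form, not confirmed wire format):

```python
def build_availability_fields(all_slots, available):
    # Every event slot must be listed, with a "0" or "1" at the same
    # position in the availability string; the server maps by position,
    # not by timestamp value.
    available = set(available)
    slots = ",".join(str(ts) for ts in all_slots)
    mask = "".join("1" if ts in available else "0" for ts in all_slots)
    return {"slots": slots, "availability": mask}

all_slots = [1760367600, 1760368500, 1760369400]
fields = build_availability_fields(all_slots, [1760368500])
assert fields["availability"] == "010"
assert fields["slots"].count(",") == 2
```

Sending only the available slots with an all-1s mask is exactly the failure mode described above: the server reads the mask against positions 0..n-1 of its own full slot list.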


Short Projects (GitHub Outages Analysis)

Scraped GitHub’s status page incident history, analyzed outage frequency trends from 2013 to 2025, and wrote a blog post exploring whether GitHub’s increasing outages correlate with their Azure migration or with organic growth. Also reorganized the short_projects repo structure.

Looking at the existing branches, two distinct project patterns have emerged:

  1. Notebook-only projects (levelsdotfyi_scraper) — flat files in root, no structure
  2. Structured projects (india_air_quality, king-county-housing-heatmap) — each with data/, charts/ or frontend/, and notebooks/ subdirectories

The new structure should formalize the pattern from the more mature projects while adding a consistent blog post output location. This way, every project follows the same layout regardless of complexity.

Key design decisions in this structure:

  1. post/draft.md as the primary deliverable — This enforces the blog-post-first mindset. The analysis exists to support the article, not the other way around.
  2. Per-project requirements.txt instead of a repo-level one — Since projects are independent and may use very different libraries, isolating deps prevents conflicts and makes each project reproducible on its own.
  3. Branch naming convention project/<name> — Distinguishes project work from the ad-hoc claude/ branches that already exist, and makes it clear what each branch is for in git branch -a.

Why this structure works for a multi-project analysis repo:

  1. Self-contained projects — Each directory has everything needed to reproduce the analysis independently. Someone can cd into any project, install its requirements.txt, and run the notebooks without touching the rest of the repo.
  2. post/draft.md as the forcing function — By making the blog post a required artifact in the structure, it prevents the common anti-pattern of doing analysis that never gets written up. The draft is the deliverable, not the code.
  3. Git submodules were cleaned up — The india-air-quality project had 4 git submodule references for external datasets. These were removed since they add friction; the project README should document download steps instead.

Why Playwright over requests+BeautifulSoup here: The GitHub Status page (powered by Atlassian Statuspage) renders incident history entirely via JavaScript. The raw HTML contains only the page shell and CSS. This is a common pattern with Statuspage — the API only returns recent incidents, but the history UI loads all data client-side. Playwright gives us a real browser engine to execute that JS and then query the resulting DOM.

The raw counts already hint at a story: Incidents were relatively stable at 40-90/year from 2013-2021, then jumped to 136 in 2022, 165 in 2023, and 207 in 2025 (with 2026 on pace for even more at 41 in just 6 weeks). This aligns with the user’s hypothesis about growth-driven load rather than Azure migration being the cause.

Data quality finding: Pre-2019 incidents have identical start and end times (e.g., “05:57 - 05:57”), meaning the status page didn’t record resolution times for older incidents. The duration analysis is only reliable from ~2019 onward. This is a limitation worth noting in the blog post — we can analyze frequency trends across the full 2013-2025 range, but duration trends only from 2019+.
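A minimal sketch of the resulting filter (the record shape is illustrative; the scraped schema may differ):

```python
def usable_for_duration(incidents):
    # Keep only incidents with a real resolution time: pre-2019 rows like
    # "05:57 - 05:57" have end == start and carry no duration information.
    return [i for i in incidents if i["end"] != i["start"]]

incidents = [
    {"year": 2016, "start": "05:57", "end": "05:57"},  # no resolution recorded
    {"year": 2022, "start": "11:02", "end": "12:40"},
]
assert [i["year"] for i in usable_for_duration(incidents)] == [2022]
```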

Key research finding: GitHub’s Azure migration history is more nuanced than “they recently moved to Azure.” Some services (Actions runners, Copilot) were born on Azure or migrated earlier, while the core platform (git operations, database, web app) has been on GitHub’s own Virginia data centers the entire time. The “big migration” announced in October 2025 is specifically about moving this core platform. This matters for the analysis because the outage increase started in 2021-2022 — well before the core migration.

Astro static asset handling: In Astro, files in public/ are served as-is at the site root. Since there’s no base path configured (just a site URL), images at public/images/blog/github-outages/01_chart.png will be accessible at /images/blog/github-outages/01_chart.png. The markdown content rendered via <Content /> will resolve standard markdown image syntax against the site root.

End-to-end flow of this project: We went from hypothesis to published blog post in a single session: scrape raw data (Playwright) -> clean & parse (Python) -> analyze & visualize (matplotlib/seaborn) -> research context (web search) -> write draft (Markdown) -> publish to Astro blog (GitHub Pages auto-deploy). The short_projects repo holds the reproducible analysis, and the blog repo holds the published artifact — clean separation of concerns.

What changed in the rewrite:

  • Replaced 7 H2 headers with 3 smaller H3 headers, letting the prose carry the structure instead
  • Eliminated all em-dashes (replaced with commas, periods, “to”, or restructured sentences)
  • Removed the bullet-point lists and data table, weaving those facts into flowing paragraphs
  • Cut the formulaic “The Trend Is Clear” / “So What’s Actually Happening?” header style that signals AI writing
  • Kept the same data points and argument structure, just delivered as narrative

The root cause: Astro scopes <style> blocks by adding data-astro-cid-* attributes to both the CSS selectors and the HTML elements in the component. But content rendered by <Content /> (from markdown) doesn’t get those attributes. So any CSS from global.css (imported inside a scoped <style> block) becomes img[data-astro-cid-xyz] { max-width: 100% }, which never matches the <img> tags inside the markdown output. The fix is <style is:global> or using :global() selectors for rules that need to reach into rendered content.


Bhavanaai Blog (Theme Toggle Fix)

Fixed a light mode bug where prose text stayed white after toggling from dark to light mode.

Issue #4 root cause hypothesis: The light theme text color (--text: #333333) is fine; the likely culprit is Tailwind’s prose typography plugin applying its own colors. In dark mode, dark:prose-invert flips prose colors to white, and when toggling back to light those colors either fail to reset or the typography defaults override the CSS variable. The fix needs to ensure the html element (and its children) responds to the theme toggle and that the typography plugin respects the CSS variables.

Found the light mode bug. In the compiled CSS, dark:prose-invert compiles to @media(prefers-color-scheme:dark) instead of .dark .prose-invert. This means the prose colors follow the OS dark mode preference, not the manual toggle. When a user on a dark OS clicks “Light”, the background switches (via CSS variables on .dark class) but the prose text stays white because the OS media query still matches. The fix is to tell Tailwind v4 to use class-based dark mode instead of the media query default.
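In Tailwind v4 this override lives in the stylesheet rather than a JS config; a sketch of the fix, assuming the stylesheet already imports Tailwind:

```css
/* Redefine the dark variant to key off a .dark class on <html>
   instead of the prefers-color-scheme media query, so the manual
   toggle controls dark:prose-invert and friends. */
@import "tailwindcss";

@custom-variant dark (&:where(.dark, .dark *));
```

With this in place, dark:prose-invert compiles against the .dark class, so clicking "Light" on a dark-OS machine actually resets the prose colors.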