Dev Log: January 30, 2026
Courses
Fixed a subtle GPU allocation bug in the Gavel cluster simulator visualization. The simulator emits a full set of allocation lines each round for all running jobs, but the preprocessing script was treating these as cumulative, never clearing the allocation array. Once a GPU was assigned, it stayed “allocated” forever, even after the job completed. The result was a GPU grid showing a fully packed cluster while real utilization was only 20%.
After fixing the allocation parsing, extended the binary data format to support per-job completion metrics. Adding completion_round and duration to the job metadata section (rather than per-round records) was the key design choice: it keeps round records unchanged while enabling both a 100-job moving window JCT chart and a CDF probability distribution on the JS side. The 24-byte job metadata (padded from 20 for 8-byte alignment) adds only about 8 bytes per job, negligible compared to per-round allocation data.
Built out several new chart panels: GPU occupancy vs. Gavel’s throughput-weighted utilization, a queue depth and arrival rate overlay, and a JCT empirical CDF. The CDF replaced an initial KDE approach after realizing Gaussian kernels create density at impossible values (near zero or negative JCT), and that the empirical CDF is what systems papers actually use. Added crosshair tooltips with pixel-to-data coordinate mapping, two playback view modes (“all” with a growing x-axis and “rolling” with a fixed sliding window), and a help modal.
The allocation bug is subtle: the simulator re-emits a complete set of [Micro-task scheduled] lines each round for ALL currently running jobs, so the lines for a single round fully describe that round's allocations and must not be accumulated across rounds.
The GPU allocation count doesn’t exactly match gpu_used from telemetry because gpu_used counts individual GPUs of each type in use, while allocations tracks which job owns each GPU slot. The small discrepancy (e.g., round 5: 98 allocated vs 106 used=[36,36,34]) happens because allocation lines and telemetry capture slightly different moments in the scheduling cycle. But they’re now in the right ballpark, unlike the broken 108/108 from before.
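The corrected per-round parsing can be sketched as follows. This is an illustration in JavaScript rather than the actual preprocess script; the function name and the exact log-line format are assumptions.

```javascript
// Sketch of the corrected allocation parsing. The line format and names are
// illustrative, not the real preprocessor's.
function parseRound(lines, numGpus) {
  // The fix: start each round with a cleared allocation array instead of
  // carrying the previous round's assignments forward.
  const allocations = new Array(numGpus).fill(null);
  for (const line of lines) {
    // Each round re-emits one scheduled line per running job, so these lines
    // alone fully describe the round's allocations.
    const m = line.match(/\[Micro-task scheduled\] job=(\d+) gpu=(\d+)/);
    if (m) allocations[Number(m[2])] = Number(m[1]);
  }
  return allocations;
}
```

Because the array is rebuilt every round, a job that stops appearing in the log automatically frees its GPU slots, which is exactly what the cumulative version got wrong.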
The sim.jobs.find(j => j.jobId === jobId) calls inside the allocation loop are O(n) per lookup. With 1,268 jobs and up to 108 GPUs, that’s ~137K comparisons per round update. For real-time playback this is fine, but if the dataset scaled to 10K+ jobs, you’d want to build a Map<jobId, job> at load time for O(1) lookups.
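The O(1) alternative would look like this (the job object shape is illustrative):

```javascript
// One-time index built at load, so the allocation loop can do O(1) lookups
// instead of scanning sim.jobs for every GPU slot.
function buildJobIndex(jobs) {
  const byId = new Map();
  for (const job of jobs) byId.set(job.jobId, job);
  return byId;
}

// In the loop, jobIndex.get(jobId) then replaces
// sim.jobs.find(j => j.jobId === jobId).
```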
Gavel’s utilization metric is throughput-weighted, not GPU occupancy. It accounts for heterogeneous GPU performance — 28 V100s doing 53.5% of the cluster’s theoretical throughput makes sense because V100s are faster than K80/P100s. But for a visual tool, GPU occupancy (GPUs used / total) is more intuitive. Showing both lets you see the difference: high occupancy + low effective utilization means jobs are running on suboptimal GPU types.
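The distinction between the two metrics can be made concrete. The per-type throughput weights below are illustrative placeholders, not Gavel's actual numbers:

```javascript
// GPU occupancy: fraction of slots in use, ignoring GPU speed differences.
// gpuUsed and totals are per-type counts, e.g. [36, 36, 34] and [36, 36, 36].
function occupancy(gpuUsed, totals) {
  const used = gpuUsed.reduce((a, b) => a + b, 0);
  const total = totals.reduce((a, b) => a + b, 0);
  return used / total;
}

// Throughput-weighted utilization, sketched: weight each GPU type by its
// relative throughput so faster types (e.g. V100) count for more.
function weightedUtilization(gpuUsed, totals, weights) {
  let used = 0, total = 0;
  for (let i = 0; i < totals.length; i++) {
    used += gpuUsed[i] * weights[i];
    total += totals[i] * weights[i];
  }
  return used / total;
}
```

With uneven weights, the two numbers diverge exactly in the "high occupancy, low effective utilization" case described above.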
The round data already contains all three metrics we need: gpuUsed[] for occupancy, utilization for Gavel’s effective metric, and avgJct for JCT. The avgJct in the binary is a running average over all completed jobs, so we’ll need to compute a 100-job sliding window separately from the job completion events recorded by the preprocessor. However, for the first pass, we can use the running avg JCT already in the data and add the windowed variant later if needed.
The binary format extension is the foundation change. By adding completion_round and duration to each job’s metadata, we get per-job JCT data without changing the round records at all. This is cleaner than adding fields to the per-round structure because: (1) job completions are sparse events, not per-round data, (2) it enables both the 100-job moving window and the PDF computation on the JS side, and (3) the metadata section is read once at load time, so the extra 8 bytes/job has negligible performance impact.
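Reading the extended metadata section might look like this. Only `completion_round`, `duration`, and the 24-byte padded record size come from the notes above; the other field names and offsets are hypothetical:

```javascript
// Read the extended job metadata section with a DataView (little-endian).
// Field layout is illustrative except completionRound/duration and the
// 24-byte record size (padded from 20 for 8-byte alignment).
const JOB_RECORD_SIZE = 24;

function readJobMetadata(buffer, offset, count) {
  const view = new DataView(buffer);
  const jobs = [];
  for (let i = 0; i < count; i++) {
    const base = offset + i * JOB_RECORD_SIZE;
    jobs.push({
      jobId: view.getUint32(base, true),                // illustrative field
      arrivalRound: view.getUint32(base + 4, true),     // illustrative field
      numGpus: view.getUint32(base + 8, true),          // illustrative field
      completionRound: view.getUint32(base + 12, true), // new field
      duration: view.getFloat32(base + 16, true),       // new field (hours)
      // bytes 20-23: alignment padding
    });
  }
  return jobs;
}
```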
KDE for the PDF chart: Silverman’s bandwidth rule h = 1.06 * sigma * n^(-1/5) gives a reasonable automatic bandwidth for Gaussian kernels. The PDF is evaluated at 200 evenly-spaced points, which is enough for smooth curves without being expensive. The optimization of only recomputing when completedJobsCount changes avoids O(n*200) work on every frame during playback.
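A self-contained sketch of that KDE, including the 3-bandwidth range extension discussed later (it assumes at least two distinct values, since Silverman's rule degenerates when sigma is zero):

```javascript
// Gaussian KDE over completed-job JCTs. Bandwidth follows Silverman's rule
// h = 1.06 * sigma * n^(-1/5); the evaluation range extends 3 bandwidths past
// the data, clamped at 0 since JCT cannot be negative.
function kde(values, numPoints = 200) {
  const n = values.length;
  const mean = values.reduce((a, b) => a + b, 0) / n;
  const sigma = Math.sqrt(values.reduce((a, v) => a + (v - mean) ** 2, 0) / n);
  const h = 1.06 * sigma * Math.pow(n, -1 / 5);
  const lo = Math.max(0, Math.min(...values) - 3 * h);
  const hi = Math.max(...values) + 3 * h;
  const points = [];
  for (let i = 0; i < numPoints; i++) {
    const x = lo + (i / (numPoints - 1)) * (hi - lo);
    let density = 0;
    for (const v of values) {
      const u = (x - v) / h;
      density += Math.exp(-0.5 * u * u); // Gaussian kernel, unnormalized
    }
    density /= n * h * Math.sqrt(2 * Math.PI);
    points.push({ x, y: density });
  }
  return points;
}
```

Recomputing this only when the completed-job count changes, as noted above, keeps the O(n * 200) inner loop off the per-frame path.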
The change is to draw lines only up to currentRound instead of the full series. The x-axis stays fixed at 0 to maxRound so the chart doesn’t rescale/jump as you scrub. The line endpoint naturally becomes the “playhead” — no separate playhead marker needed. The Y-axis range should also only consider data up to the current round, so the scale adapts as new data is revealed.
The two view modes solve different problems:
- All mode: X-axis grows from 0 to the current round. Early rounds fill the chart so you can see detail. As the simulation progresses the x-axis stretches and you see the full history.
- Rolling mode: Shows a fixed-width sliding window (default 100 rounds). The chart always shows recent data at full resolution regardless of how far into the simulation you are. The window slides forward as the scrubber moves.
Both modes compute Y-axis range only from the visible data, so the scale adapts naturally. The x-axis labels show simulated hours (from simTime stored per round) so you can correlate with real experiment time.
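The two modes reduce to a small range computation (the 100-round default window matches the note above; the function name is illustrative):

```javascript
// Visible round range for the two playback view modes.
function visibleRange(mode, currentRound, windowSize = 100) {
  if (mode === "all") {
    // X-axis grows from round 0 to wherever the scrubber is.
    return { start: 0, end: currentRound };
  }
  // "rolling": fixed-width window ending at the current round.
  return { start: Math.max(0, currentRound - windowSize + 1), end: currentRound };
}
```

The Y-axis scale then only needs to consider values inside `[start, end]`, which is what makes both modes adapt naturally.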
Raw arrival counts are typically 0 or 1 per round (Poisson-like), making them nearly invisible on the same scale as queue length. A 10-round moving average smooths the signal into a readable “arrival rate” curve. This shows the load being placed on the scheduler (arrivals) vs the backlog (queue depth) — when arrivals consistently exceed the service rate, the queue grows. The relationship between these two curves tells you whether the scheduler is keeping up.
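The smoothing itself is a trailing moving average; a minimal sketch:

```javascript
// Trailing moving average over per-round arrival counts. Early rounds average
// over however many rounds exist so far, avoiding a misleading ramp-up.
function movingAverage(values, window = 10) {
  const out = new Array(values.length);
  let sum = 0;
  for (let i = 0; i < values.length; i++) {
    sum += values[i];
    if (i >= window) sum -= values[i - window]; // drop the value leaving the window
    out[i] = sum / Math.min(i + 1, window);
  }
  return out;
}
```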
Adding crosshair tooltips to canvas charts requires careful layering: you render the base chart first, then overlay the crosshair and tooltip on top. The key challenge is converting pixel coordinates back to data coordinates (inverse of what _drawSeries does). We also need to handle mousemove/mouseleave events and trigger re-renders efficiently without disrupting the main animation loop.
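The forward and inverse mappings are plain linear transforms; a sketch with an illustrative plot-rectangle shape (not the actual `_drawSeries` signature):

```javascript
// Map between data space and pixel space for a chart drawn into the rectangle
// {left, top, width, height}. Canvas y grows downward, hence the 1 - ... flip.
function dataToPixel(x, y, xMin, xMax, yMin, yMax, rect) {
  const px = rect.left + ((x - xMin) / (xMax - xMin)) * rect.width;
  const py = rect.top + (1 - (y - yMin) / (yMax - yMin)) * rect.height;
  return { px, py };
}

// Inverse mapping used by the crosshair: mouse pixel -> data coordinates.
function pixelToData(px, py, xMin, xMax, yMin, yMax, rect) {
  const x = xMin + ((px - rect.left) / rect.width) * (xMax - xMin);
  const y = yMin + (1 - (py - rect.top) / rect.height) * (yMax - yMin);
  return { x, y };
}
```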
CanvasRenderingContext2D.roundRect() was added in Chrome 99 and Safari 15.4 (2022). Since this is a local dev tool, it should be fine. But if it were a production app, you’d want a polyfill drawing 4 arcs + 4 lines.
The getImageData/putImageData approach for snapshotting is efficient here because these charts are small (570x200). For larger canvases, you’d want to use a second offscreen canvas instead, since putImageData bypasses compositing and can be faster than drawImage but doesn’t handle transforms.
The PDF chart’s data model is fundamentally different from the time-series charts. Instead of values[roundIndex], it has KDE curves with {x, y} points where x = JCT in hours and y = density. The tooltip needs to map pixel position to the x-axis (hours), then interpolate the KDE curve to find the density at that point.
Linear interpolation between KDE evaluation points works well here because the KDE is evaluated at 200 evenly-spaced points, giving fine enough resolution. For the time-series charts, we snap to the nearest integer round index since the data is discrete per-round. The PDF chart keeps the continuous x-value since JCT is a continuous quantity.
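The curve interpolation can be sketched as a binary search plus a linear blend (the function name is illustrative):

```javascript
// Interpolate a curve of {x, y} points (x strictly ascending) at an arbitrary
// x from the tooltip position. Clamps to the endpoint values outside the range.
function interpolateCurve(points, x) {
  if (x <= points[0].x) return points[0].y;
  if (x >= points[points.length - 1].x) return points[points.length - 1].y;
  // Binary search for the segment containing x.
  let lo = 0, hi = points.length - 1;
  while (hi - lo > 1) {
    const mid = (lo + hi) >> 1;
    if (points[mid].x <= x) lo = mid; else hi = mid;
  }
  const a = points[lo], b = points[hi];
  const t = (x - a.x) / (b.x - a.x);
  return a.y + t * (b.y - a.y);
}
```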
Using the last N completed jobs (sorted by completion round) rather than all jobs makes the PDF chart act as a “recent performance” indicator. Early in the simulation when fewer than N jobs have completed, it naturally shows all available jobs. As the simulation progresses, the distribution shifts to reflect current scheduling behavior, which is much more useful for comparing policies in real time.
This is likely a data issue in how job durations are recorded, not a visualization bug. If many jobs have duration = 0.0 in the binary data, the KDE will show a spike near zero. Let’s check the preprocessing pipeline to see how durations are computed.
KDE with Gaussian kernels always produces density outside the data range because the kernel has infinite support. For non-negative data like JCT, this is problematic since it creates density at impossible values (near zero or negative). Truncating the evaluation range at minVal is a pragmatic fix. A more rigorous approach would be boundary correction (e.g., reflection method), but that’s overkill for a visualization tool.
The Gaussian kernel has ~99.7% of its mass within 3 standard deviations. Extending the evaluation range by 3 * bandwidth in each direction ensures the curve tails off smoothly to near-zero density at the edges, giving the natural bell-curve shape you’d expect. The Math.max(0, ...) still prevents evaluating at negative JCT values, but since bandwidth is typically much smaller than minVal, this won’t clip the curve in practice.
The empirical CDF is a much better fit for this data. It’s the standard way systems papers present JCT distributions (including the Gavel paper itself). The step function directly represents the data: each horizontal segment means “no new jobs completed between these JCT values,” and each vertical jump represents a job completion. The tooltip now shows percentiles: hovering at any x shows “X% of jobs completed within Y hours,” which is immediately actionable for comparing policies. No smoothing parameters to tune, no artifacts.
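The empirical CDF and its tooltip lookup are only a few lines (names illustrative):

```javascript
// Empirical CDF of job completion times: after sorting, F(x) is the fraction
// of jobs with JCT <= x. Returns the step-function sample points.
function empiricalCdf(durations) {
  const sorted = [...durations].sort((a, b) => a - b);
  return sorted.map((d, i) => ({ x: d, y: (i + 1) / sorted.length }));
}

// Tooltip lookup: "what fraction of jobs completed within `hours`?"
function fractionWithin(cdf, hours) {
  let frac = 0;
  for (const p of cdf) {
    if (p.x <= hours) frac = p.y; else break;
  }
  return frac;
}
```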
The modal uses hidden attribute toggling rather than CSS class toggling. The .modal-overlay[hidden] { display: none } rule overrides the flexbox display, making hidden work as a simple boolean toggle. The overlay click handler checks e.target === helpModal to distinguish clicks on the backdrop from clicks inside the modal content — without this, clicking any text inside the modal would close it.
Tools
Set up Google Antigravity as a multi-model API gateway for Clawdbot’s fallback system, ran latency benchmarks across providers, and reordered fallback chains based on the results. Copilot turned out to have the most consistent latency for Claude Opus (3.1-4.3s), while Antigravity showed a cold-start penalty (9.9s on round 1 dropping to 2.7s by round 3). Reordered the chains to use Copilot as primary with Antigravity as fallback for standard Claude, keeping Antigravity primary for the “thinking” variant of Opus 4.5 since it is the only provider offering extended reasoning.
Also integrated the gog CLI for Google Calendar access via MoltBot on WhatsApp, enabling natural language calendar queries and event creation through OAuth with a GCP project.
- Google Antigravity is Google’s multi-model API gateway — it gives you access to not just Gemini but also Claude and GPT models, all through a single Google auth flow
- The available models include `claude-opus-4-5`, `claude-sonnet-4-5`, `gemini-3-flash`, `gemini-3-pro`, and even `gpt-oss-120b` — all via one provider
- The OAuth flow will open a browser for Google sign-in, which is why it needs a TTY (and why RDP is handy here)
- Antigravity is uniquely valuable because it’s the only free option that gives you cross-provider models (Claude + Gemini + GPT) through one auth flow
- Stacking the same model across providers (e.g., Gemini 3 Flash via Antigravity, then via API key, then via CLI auth) effectively multiplies your rate limit since each provider has independent quotas
- Clawdbot’s fallback system is provider-aware, so a 429 from one provider triggers the next in the chain rather than failing the request
- Per-model fallback chains are not natively supported in Clawdbot — the fallback list is global. The `switch-model.sh` script works around this by atomically swapping both the primary model and the fallback list together.
- Provider stacking (same model across Antigravity, Copilot, and API key) effectively multiplies your rate limit since each provider has independent quotas.
- The skill triggers on natural language, so saying “use 5.2” in WhatsApp causes the LLM to recognize the intent and execute the script — no slash command needed.
- Removing Antigravity from the GPT chain makes sense since it doesn’t have GPT 5.2 — it only has `gpt-oss-120b`, which is a different model entirely
- The Codex -> Copilot chain keeps you within the same model family (GPT 5.2) across both providers
- Google’s Antigravity API enforces minimum client versions via user-agent strings — older versions get a hard rejection
- This is a common pattern with Google APIs to force clients to update for security/compatibility
- The fix (`sed` on a `node_modules` file) will get overwritten on `npm update` — keep this in mind for future Clawdbot updates
- The fallback mechanism is provider-aware — it tries each provider in sequence and stops at the first success, never wasting calls on already-failed providers
- OAuth tokens for Antigravity and Codex expire (~1 hour) and need periodic re-auth. The doctor command flags this: `clawdbot doctor`
- The user-agent fix (`1.11.5` -> `1.15.8`) will be overwritten on Clawdbot updates — worth checking after any `clawdbot update`
- Antigravity was the fastest despite routing through Google’s gateway to Anthropic’s Claude — impressive for a proxy layer
- Codex CLI’s 22s is misleading — the model decided to call the `session_status` tool before responding, adding a round-trip. A pure text reply would likely be closer to Copilot’s 10s
- Gemini API at 19s is surprisingly slow for a Google-direct connection — Gemini 3 Pro may just be a heavier model with longer inference times compared to Flash
- These are single-sample measurements, so treat them as directional, not definitive. Real-world latency varies with load, prompt length, and response complexity
- Cold-start effect: Antigravity’s Round 1 was 9.9s but Round 3 was 2.7s — the Google proxy appears to cache/warm connections. For sustained use, Antigravity becomes competitive
- Copilot is surprisingly fast for Claude — GitHub’s infrastructure likely has a direct peering arrangement with Anthropic, minimizing hops
- Codex CLI and Copilot for GPT are nearly identical in speed (3.6s vs 4.1s) — OpenAI likely serves both from the same inference cluster
- Rate limit resilience matters more than raw speed — a 2s latency difference is negligible for WhatsApp, but having fallbacks when a provider is rate-limited is critical
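The provider-aware, stop-at-first-success behavior described in these notes can be sketched generically. This is an illustration of the pattern, not Clawdbot's actual code; the provider objects and their `complete` method are hypothetical:

```javascript
// Generic provider fallback chain: try each provider in order, record the
// failure (e.g. an HTTP 429) and move on, stop at the first success.
async function completeWithFallback(providers, prompt) {
  const errors = [];
  for (const provider of providers) {
    try {
      return await provider.complete(prompt); // first success wins
    } catch (err) {
      errors.push(`${provider.name}: ${err.message}`); // note failure, continue
    }
  }
  throw new Error(`all providers failed: ${errors.join("; ")}`);
}
```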
The benchmark showed Antigravity has a cold-start penalty (9.9s on R1 vs 2.7s on R3) but even warmed up, Copilot was consistently faster for Opus. Making Copilot primary with Antigravity as fallback gives you the best of both worlds — fast primary with a reliable backup.
- Copilot’s consistent latency (3.1-4.3s) makes it a better primary than Antigravity, which showed a cold-start penalty (9.9s on R1 dropping to 2.7s by R3). Antigravity as fallback still gives you access to its “thinking” variant if Copilot ever hits rate limits.
- The fallback chains for Gemini profiles still lead with Antigravity since its warm latency (3.6-3.9s) was competitive, and the cold-start cost is acceptable for a less frequently used profile.
- Antigravity is the only provider offering the `thinking` variant of Opus 4.5, which enables extended reasoning/chain-of-thought. Copilot’s `claude-opus-4.5` is the standard variant. So keeping Antigravity as primary is the right call when you want thinking mode — the latency trade-off (5.5s vs 3.6s) buys you the reasoning capability.
- For Gemini, Antigravity’s `gemini-3-pro-high` already maps to the highest-capability tier, which is their equivalent of “thinking” mode.
The gog CLI uses Google OAuth with a client_secret.json from a Google Cloud project. This means you need a GCP project with the Calendar API (and optionally Gmail, Drive, etc.) enabled, and an OAuth 2.0 Client ID configured. The gog CLI handles the OAuth flow itself — you just need to provide the credentials file.
- The `gog` CLI stores OAuth refresh tokens locally at `~/.config/gogcli/`, so auth persists across reboots. No re-auth needed unless you revoke the token.
- Your GCP project is in “Testing” mode, which means tokens expire after 7 days. To make them permanent, you’d need to publish the app or submit for Google verification. For personal use, you can just re-auth when it expires, or switch to “Internal” if you have a Workspace account.
- MoltBot can now respond to WhatsApp messages like “what’s on my calendar today” or “create a meeting with X tomorrow at 3pm” using the `gog` skill.
OpenClaw
Configured Clawdbot’s multi-provider fallback system with Google Antigravity, GitHub Copilot, and Codex CLI. Benchmarked latency across providers and optimized the fallback chain ordering based on results. Also investigated compaction strategies for long-running WhatsApp conversations and the impact of system prompt size on token usage.
- “Safeguard” compaction waits until context is nearly full, leading to large token counts for long-running WhatsApp conversations
- “Default” compaction proactively summarizes older messages, keeping each turn’s token cost manageable
- System prompt size is a hidden multiplier — it’s re-sent with every API call, so 19 skill definitions add up fast