Bhavana AI

AI/ML insights

Dev Log: January 31, 2026

Tools

Debugged systemd environment issues for the Vertex gateway service, fixed timezone handling for Google Calendar event creation via gog, and resolved corrupted session history in MoltBot caused by truncated tool_use blocks from benchmark tests. Built a session reset skill that truncates .jsonl files without deleting them, preserving session ID mappings. Also patched a node_modules issue in Clawdbot that will need to be reapplied on future updates. Explored Claude Code’s teleporting and remote session features.

  • The systemd user service runs in a minimal environment that doesn’t source .bashrc or .profile. Environment variables for the gateway must be set in the systemd drop-in file (vertex.conf), not in shell profiles.
  • The timezone error happened because gog calendar create needs ISO datetimes with timezone info (e.g., 2026-01-31T10:00:00-08:00), but the LLM was likely generating naive datetimes. Setting TZ=America/Los_Angeles helps gog default to Pacific time.
  • The error messages.117.content.1.tool_use.input: Field required means the Google Antigravity API rejected the conversation history because one of the assistant messages had a tool_use block without an input field. This can happen when a tool call fails mid-stream or the response gets truncated.
  • The session name benchmark-135419 was carried over from the benchmark tests we ran earlier. Those rapid-fire test messages accumulated history that eventually got corrupted.
  • Clearing the session forces a fresh start. MoltBot’s memory (long-term) is stored separately, so no knowledge is lost — just the immediate conversation context.
  • The script truncates the .jsonl session file rather than deleting it. This preserves the session ID mapping in sessions.json so the next message reuses the same session slot instead of creating a new one.
  • This error typically happens when the LLM’s response gets interrupted mid-tool-call (e.g., timeout, rate limit). The partial tool_use block stays in history and poisons all future requests. The reset skill is the escape hatch for that.
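
The core file operation of the reset skill can be sketched in a few lines of Python. The layout assumed here (a sessions.json file mapping session names to IDs, plus one transcript .jsonl per session) illustrates the mechanism and is not MoltBot’s actual schema:

```python
import json
from pathlib import Path

def reset_session(session_dir: Path, session_name: str) -> str:
    """Truncate a session's .jsonl transcript without deleting the file,
    so the session-ID mapping in sessions.json stays valid and the next
    message reuses the same slot. Paths and schema are illustrative."""
    mapping = json.loads((session_dir / "sessions.json").read_text())
    session_id = mapping[session_name]                    # keep the existing slot
    (session_dir / f"{session_id}.jsonl").write_text("")  # truncate, don't unlink
    return session_id
```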

This fix lives in node_modules so it gets wiped on every npm install. If you update clawdbot again, you’ll need to re-apply this sed command.

  • Teleporting is one-directional (web-to-CLI). To go the other way, use the & prefix (e.g., & fix the bug in auth.ts) or claude --remote "task" to spin up a new web session from the CLI.
  • /tasks is useful for monitoring all your remote/background sessions in one place.

OpenClaw

Worked through the same systemd environment and session corruption issues as on the tools side, since OpenClaw shares the Clawdbot infrastructure. Fixed the Antigravity API rejection caused by corrupted tool_use blocks in conversation history, and reapplied the node_modules user-agent patch.

Courses

Major progress on the FGD (Fragmentation-Gradient Descent) GPU scheduling project. Built a standalone event-driven simulator to validate FGD correctness independently from Gavel, then integrated FGD into Gavel’s placement layer. The integration targets Gavel’s stage 3 (placement), replacing the default strided fill with fragmentation-aware node selection while preserving stages 1 (allocation LP) and 2 (job scheduling).

Analyzed Alibaba GPU cluster traces and found that 79% of pods request exactly 1 GPU, with fractional GPU requests also common. Implemented three additional baselines (DotProd, GpuPacking, GpuClustering) alongside FGD, Random, and BestFit. Ran inflation experiments pushing demand to 130% of cluster capacity and validated results against the ATC’23 paper, finding that the policy ordering matches expectations: FGD achieves the lowest fragmentation (3.7% at d=0.8), a 16% reduction over BestFit and 57% over Random. Extended the simulator with detailed fragmentation breakdown metrics (non_gpu, stranded, deficient) and comparison plotting against paper reference curves.

FGD is a scheduling heuristic that addresses GPU fragmentation — where free GPU resources are scattered across nodes in unusable fragments. Unlike Gavel (which optimizes for fairness or performance via LP-based allocation), FGD focuses on placement: given that a task needs resources, which node minimizes future fragmentation? The two are complementary — Gavel decides what share each job gets, while FGD could decide where to place it.

Gavel’s architecture has a natural seam for FGD integration. Gavel operates in three decoupled stages:

  1. Allocation (policy LP) — decides what fraction of each GPU type a job gets
  2. Scheduling (_schedule_jobs_on_workers_helper) — greedy selection of which jobs run this round
  3. Placement (_assign_workers_to_job) — assigns specific GPU worker IDs

FGD’s fragmentation-aware logic targets stage 3: placement. Currently Gavel uses a simple strided fill (largest jobs first, fill servers sequentially). FGD would replace this with a fragmentation-minimizing placement that considers the expected workload distribution.
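
As a sketch of the difference: where the strided fill just walks servers in order, a fragmentation-aware stage 3 scores each candidate node by how much a hypothetical assignment would increase expected fragmentation. The Node class, the frag_fn callback, and the greedy min-delta loop below are illustrative assumptions, not Gavel’s or the paper’s actual code:

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Node:
    free_gpus: float  # unallocated GPU capacity on one server

def fgd_place(task_gpus, nodes, frag_fn):
    """Pick the node whose hypothetical assignment raises expected
    fragmentation the least. frag_fn(node) stands in for the expected
    fragmentation of a node under the workload distribution."""
    best, best_delta = None, None
    for i, node in enumerate(nodes):
        if node.free_gpus < task_gpus:
            continue  # node can't fit the task at all
        after = replace(node, free_gpus=node.free_gpus - task_gpus)
        delta = frag_fn(after) - frag_fn(node)
        if best_delta is None or delta < best_delta:
            best, best_delta = i, delta
    return best  # index of the chosen node, or None if unplaceable
```

Under a frag_fn that penalizes small stranded fractions, this can prefer a roomier node that a plain best-fit would skip.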

However, there’s a mismatch: Gavel models resources as a flat pool per GPU type (e.g., 36 V100s), while FGD reasons about per-node GPU vectors (e.g., node with 4 GPUs at varying availability). Bridging this requires giving Gavel’s simulation awareness of node topology.

Phase A Architecture: The standalone FGD simulator uses an event-driven design (tasks arrive, get placed, run for a duration, then depart) rather than Gavel’s round-based approach. This matters because the FGD paper (ATC’23) evaluated with event-driven simulation on Alibaba traces. By replicating this first, we validate FGD correctness independently before combining it with Gavel’s round-based loop.

Monte-Carlo workload inflation: Instead of replaying the exact trace, we repeatedly sample tasks from the trace distribution until the cluster fills. This isolates fragmentation effects from arrival-pattern effects — we measure what fraction of GPUs become stranded at different load levels.

Structural fragmentation: Even an empty 4-GPU node has fragmentation for 8-GPU workload types. This is a key FGD concept — fragmentation isn’t just about wasted space, it’s about the mismatch between available resource shapes and requested resource shapes. The fragmentation formula F_n(m) counts GPU capacity that exists but can’t serve task type m.
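
A simplified reading of that idea in code; this is a hedged sketch of F_n(m) for a single node, not the paper’s exact definition or the reference implementation:

```python
def node_fragmentation(free_gpus, task_gpu_req):
    """Free GPU capacity on one node that cannot serve a task type
    requesting task_gpu_req GPUs. free_gpus lists each physical GPU's
    free fraction (1.0 = fully free). Simplified sketch of F_n(m)."""
    if task_gpu_req >= 1.0:
        fully_free = [g for g in free_gpus if g == 1.0]
        if len(fully_free) < task_gpu_req:
            # structural fragmentation: even an empty 4-GPU node is
            # entirely fragmented for an 8-GPU task type
            return sum(free_gpus)
        # whole GPUs can serve the task; partially-used GPUs cannot
        return sum(g for g in free_gpus if g < 1.0)
    # fractional request: a GPU serves it only if its free share suffices
    return sum(g for g in free_gpus if g < task_gpu_req)
```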

Phase B integration strategy: The key is that Gavel’s _worker_type_to_worker_id_mapping[type] is a list-of-lists where each inner list is one server’s worker IDs. This naturally maps to FGD Nodes — each inner list becomes one Node, and we track which worker_ids are free/assigned to determine GPU availability per node.

Lease extension ordering matters: The FGD integration preserves Gavel’s lease extension mechanism (Phase 1 of placement). Jobs that were running in the previous round keep their exact worker assignments, and FGD only runs on newly scheduled or preempted jobs. This is critical because: (1) it minimizes preemption overhead, and (2) it means FGD sees the remaining resources after lease extensions, preventing conflicts.

Node bridge design: Rather than modifying FGD’s Node class, we build ephemeral Node objects each round from Gavel’s worker_type_to_worker_id_mapping. GPU capacity is binary (1.0 free / 0.0 assigned) since Gavel doesn’t currently support partial GPU sharing. CPU/memory are set to large defaults because Gavel doesn’t model CPU constraints — the LP solver handles resource allocation, FGD only controls spatial placement.
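
A sketch of that bridge, with field names assumed for illustration (the real code reads Gavel’s _worker_type_to_worker_id_mapping[type] directly):

```python
from dataclasses import dataclass

@dataclass
class Node:
    gpus: list          # per-GPU capacity: 1.0 free / 0.0 assigned (binary)
    cpu: float = 1e9    # large default: Gavel doesn't model CPU constraints
    mem: float = 1e9    # ditto for memory; the LP owns resource allocation

def build_nodes(worker_ids_per_server, assigned_worker_ids):
    """Build ephemeral Nodes each round from Gavel's list-of-lists worker
    mapping: each inner list (one server's worker IDs) becomes one Node."""
    assigned = set(assigned_worker_ids)
    return [Node(gpus=[0.0 if wid in assigned else 1.0 for wid in server])
            for server in worker_ids_per_server]
```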

FGD placement is per-round, per-worker-type: In each scheduling round, FGD runs independently for each worker type (v100, p100, k80). This is because Gavel’s allocation policy (LP solver) decides which worker type each job gets, and then FGD decides which server within that type. The separation of allocation (Gavel) and placement (FGD) means neither system needs to know the other’s internals — a clean separation of concerns.

Fragmentation history as diagnostic: The _fgd_fragmentation_history list stores (timestamp, worker_type, fragmentation) tuples each round. This enables post-hoc analysis of how fragmentation evolves over the simulation — exactly what’s needed for the Phase D/E/F comparison graphs.

Key findings from the trace:

  1. Only 192 “Succeeded” pods out of 8,152 total. The plan says to filter to Succeeded only, but that’s very few tasks. The reference impl uses all pods (Running, Succeeded, etc.) for the inflation experiment since tasks run forever anyway.
  2. gpu_spec is empty for ALL pods - no GPU type constraints in the default workload variant. This simplifies type matching.
  3. Most pods are 1-GPU (6,989/7,064 GPU pods) with fractional gpu_milli values being very common - this is a key feature of the Alibaba trace that FGD exploits.

The distribution format changed from 3-tuples to 5-tuples (gpu_req, cpu_req, mem_req, gpu_type, popularity). This propagates to build_workload_from_distribution() in run_standalone.py and run_inflation() in simulator.py. The generate_synthetic_trace() uses d[-1] for weights to stay backward-compatible with either format.
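
The backward-compatibility trick is simply that popularity sits last in both formats, so weights can be read position-independently. A sketch (the helper name is illustrative):

```python
import random

def sample_task(distribution, rng=random):
    """Sample one task from a distribution whose entries may be old-style
    3-tuples or new-style 5-tuples (gpu_req, cpu_req, mem_req, gpu_type,
    popularity): d[-1] is the popularity weight in either format."""
    weights = [d[-1] for d in distribution]
    return rng.choices(distribution, weights=weights, k=1)[0]
```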

The three new baselines differ only in their scoring function:

  • DotProd: Cosine similarity between [cpu, gpu, mem] vectors - encourages balanced consumption of all resource dimensions
  • GpuPacking: Lowest sum(available_gpus) - packs tasks onto already-busy nodes, keeping some nodes completely free
  • GpuClustering: Highest sum(available_gpus) - opposite of packing, spreads tasks to preserve contiguous GPU blocks on each node
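
The three scoring functions can be sketched as follows, assuming a pick-the-highest-score placement loop (a convention chosen here for illustration):

```python
import math

def dotprod_score(node_free, task_req):
    """Cosine similarity between the node's free [cpu, gpu, mem] vector
    and the request vector: rewards balanced consumption of dimensions."""
    dot = sum(a * b for a, b in zip(node_free, task_req))
    norm = math.sqrt(sum(a * a for a in node_free)) * \
           math.sqrt(sum(b * b for b in task_req))
    return dot / norm if norm else 0.0

def gpu_packing_score(available_gpus):
    """Packing: prefer the busiest node, so negate the free-GPU total."""
    return -sum(available_gpus)

def gpu_clustering_score(available_gpus):
    """Clustering: prefer the emptiest node, preserving contiguous blocks."""
    return sum(available_gpus)
```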

The key methodological change in run_inflation_from_tasks():

  1. Demand-fraction stopping instead of utilization-based: cumulative_demand / total_gpus >= 1.3. This matches the reference Go impl where they inflate to 130% of capacity, meaning some tasks are inevitably rejected.
  2. Per-task recording: Every task submission generates a curve point (rather than per-batch), giving smooth curves for plotting.
  3. Pre-shuffled task list: The caller shuffles with a seed, matching shuffle-pod=true in the reference.
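
The three changes combine into a loop of roughly this shape. Key names (task["gpus"], the try_place callback) are placeholders, and the list repetition covers the case where raw trace demand falls short of the target:

```python
def inflate_to_target(tasks, total_gpus, try_place, target=1.3):
    """Sketch of the inflation loop: submit pre-shuffled tasks one by one
    until cumulative demand reaches `target` (130%) of cluster capacity,
    recording one curve point per task. try_place(task) is an assumed
    placement callback returning True on success."""
    cumulative_demand, rejected, curve = 0.0, 0, []
    i = 0
    while cumulative_demand / total_gpus < target:
        task = tasks[i % len(tasks)]  # repeat the list if demand runs short
        i += 1
        cumulative_demand += task["gpus"]
        if not try_place(task):
            rejected += 1
        curve.append((cumulative_demand / total_gpus, rejected))
    return curve
```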

Key observations from the test run:

  1. FGD is 25x slower than the baselines (220s vs 8s) because it computes a fragmentation delta for every candidate node on every task placement. This is expected given the O(N*M) cost per task, where N = 1213 nodes and M = the number of workload types.
  2. Random has 12 rejected tasks, BestFit/FGD have 0 - FGD’s fragmentation-aware placement avoids creating situations where tasks can’t be placed.
  3. 79.4% of tasks need exactly 1 GPU - the workload is dominated by single-GPU jobs, with fractional GPU (0.5 GPU) being the second largest bucket at 15.4%.

The reference Go implementation with tune_ratio=1.3 works by scaling the task list so total demand reaches 130% of capacity. Since the raw trace only gives ~88% demand, we need to repeat the task list (with different shuffling) to reach the target. This is how the reference impl generates enough demand - it doesn’t stop at the end of the task list.

Results validation against the paper:

  1. Policy ordering (Fig 7b frag ratio) - The ordering at peak (~0.8 demand) is: GpuClustering (0.105) > Random (0.086) > DotProd (0.077) > GpuPacking (0.046) ~ BestFit (0.044) > FGD (0.037). This matches the paper’s expected ordering except GpuClustering is worse than Random - which actually makes sense since spreading tasks across nodes preserves large blocks but creates many partially-used nodes that fragment under mixed workloads.

  2. FGD improvement - At d=0.8: FGD (0.037) vs BestFit (0.044) = 16% reduction; FGD vs Random (0.086) = 57% reduction. The paper claims 30-49% vs baselines. Our FGD-vs-BestFit gap is smaller than expected, possibly because the workload is dominated by 1-GPU jobs (79%) which create less fragmentation opportunity.

  3. Curve shape - Monotonic increase then decrease after ~0.8-0.9 demand. The decrease happens because once the cluster is nearly full, the fragmentation metric drops (fewer unallocated GPUs to be “fragmented”). This matches the paper’s Fig 7b shape.

  4. Allocation ratio (Fig 9a) - FGD achieves the highest allocation at saturation (95.6% vs 93.6% for Random), meaning 2 percentage points more of the cluster’s GPUs are usefully allocated rather than stranded as fragments.

Key differences between the paper’s metrics and our current implementation:

  1. Frag Rate (Fig 7a) = fragmented_gpus / unallocated_gpus * 100. This is a ratio of fragmented to unallocated — not fragmented to total. As the cluster fills, almost all remaining GPUs become fragmented, so this hits ~100%. We don’t compute this metric yet.

  2. Frag/Total (Fig 7b) = fragmented_gpus / total_gpus * 100. This is what our code calls frag_ratio. But the paper’s values range 5-16%, while ours range 2-10% — roughly half the magnitude. This suggests our fragmentation calculator may be underestimating, likely because our workload distribution differs from the paper’s.

  3. Unalloc GPU (Fig 9a) = (total_gpus - allocated_gpus) / total_gpus * 100 = (1 - alloc_ratio) * 100. We compute alloc_ratio so this is straightforward to convert.
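
All three are cheap post-processing of quantities the snapshot already has; a sketch of the conversions (function names illustrative):

```python
def frag_rate(fragmented_gpus, allocated_gpus, total_gpus):
    """Fig 7a: fragmented share of *unallocated* GPUs, in percent.
    Tends toward ~100% as the cluster fills."""
    unallocated = total_gpus - allocated_gpus
    return fragmented_gpus / unallocated * 100 if unallocated else 100.0

def frag_over_total(fragmented_gpus, total_gpus):
    """Fig 7b: fragmented share of *total* GPUs, in percent (frag_ratio)."""
    return fragmented_gpus / total_gpus * 100

def unalloc_gpu(alloc_ratio):
    """Fig 9a: unallocated share, derived from the existing alloc_ratio."""
    return (1 - alloc_ratio) * 100
```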

The simulator uses an “inflation” methodology: tasks are submitted one-by-one with infinite duration (no departures), steadily filling the cluster. At each recording interval, we snapshot cluster state. The new metrics need to be computed from the same node state that already provides fragmentation, allocated_gpus, etc. — just different views of the same cluster snapshot.

The fragmentation breakdown classifies free GPU capacity into three categories per the paper’s taxonomy: non_gpu (CPU/memory bottleneck), stranded (topology mismatch — GPU scalar too low despite having CPU/mem), and deficient (individual GPU slots too small for workload tasks). This three-way split is key to understanding WHY fragmentation occurs, not just how much.
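
One way to express that taxonomy as a per-task classifier; this is a hedged sketch of the three categories, not the reference implementation’s logic:

```python
def classify_fragment(free_cpu, free_mem, free_gpu_slots, task):
    """Why can't this node's free GPU capacity serve `task` (a dict of
    cpu/mem/gpu requests)? Returns one of the paper's three
    fragmentation categories, or "servable"."""
    if free_cpu < task["cpu"] or free_mem < task["mem"]:
        return "non_gpu"    # GPUs are free but CPU/memory is the bottleneck
    if sum(free_gpu_slots) < task["gpu"]:
        return "stranded"   # GPU total too low even though CPU/mem fit
    if max(free_gpu_slots, default=0.0) < min(task["gpu"], 1.0):
        return "deficient"  # enough GPU overall, but each slot is too small
    return "servable"
```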

The plotting approach overlays paper reference curves (dashed lines) with our simulation results (solid lines). The reference JSON uses demand_pct (0-130) while our simulator stores demand_fraction (0-1.3), so we multiply by 100 when interpolating. The bar charts (9c, 9d) compare at a single demand point (96%) rather than across the full range.
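
The unit alignment plus interpolation amounts to a few lines (stdlib sketch; the actual plotting code presumably uses numpy):

```python
import bisect

def value_at_demand(demand_fractions, values, demand_pct):
    """Interpolate a simulator curve (x in demand_fraction, 0-1.3) at a
    paper-style x in demand_pct (0-130), e.g. the single 96% point used
    for the Fig 9c/9d bar charts."""
    xs = [d * 100 for d in demand_fractions]  # align units with demand_pct
    i = bisect.bisect_left(xs, demand_pct)
    if i == 0:
        return values[0]
    if i == len(xs):
        return values[-1]
    x0, x1 = xs[i - 1], xs[i]
    y0, y1 = values[i - 1], values[i]
    return y0 + (y1 - y0) * (demand_pct - x0) / (x1 - x0)
```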

Key observations from the results:

  1. Policy ordering matches the paper — FGD consistently has the lowest fragmentation, followed by BestFit/GpuPacking, then GpuClustering/DotProd, with Random worst.
  2. Frag Rate (7a) shows our curves track the paper shape well, but our FGD’s frag rate climbs faster at high demand (98.9% vs the paper’s 55% at 96%). This may be due to the paper averaging over 10 seeds (42-51) vs our 3, and possibly to different fragmentation measurement details at saturation.
  3. Frag/Total (7b) — our values are systematically lower (e.g., FGD: 3.7% vs 7.0% at d=80%). This suggests our fragmentation calculator may be slightly less conservative than the paper’s Go implementation.
  4. Unalloc GPU (9a) matches well — FGD achieves 4.6% unallocated at 96% demand (paper: 5.5%), confirming the allocation efficiency tracks closely.
  5. Occupied Nodes (9b) — Random fills all nodes very early (1186 at 50% demand), while packing-style policies (GpuPacking, BestFit, FGD) consolidate workloads to fewer nodes. Our GpuPacking/BestFit spread to more nodes earlier than the paper suggests, likely due to heterogeneous GPU types forcing placement across node types.

Personal Finance

Started scoping out a personal finance tool by analyzing an existing Excel-based budgeting system. The spreadsheet uses a top-down methodology (gross income, then payroll deductions, taxes via actual federal brackets for both filing separately and jointly, savings goals, and prioritized expenses at P0/P1/P2 tiers). Also explored AI-driven paystub extraction from PDFs as a data ingestion path.

Before diving into building anything, it’s important to understand the existing system thoroughly. The Excel-based workflow is essentially a top-down budgeting methodology (income, taxes, savings, expenses, discretionary), which is a well-structured approach that many financial planning tools don’t natively support.

The spreadsheet reveals a sophisticated personal finance system with several distinct components:

  1. Top-down budgeting — starts from gross income, subtracts payroll deductions (401k, HSA, after-tax contributions), computes taxes using actual federal brackets (both filing separately and jointly), then allocates remaining income to savings goals and prioritized expenses
  2. Priority-based expense allocation — expenses are categorized as P0 (must-pay: home, bills, pets, groceries), P1 (important: auto, shopping, eating out, discretionary), and P2 (nice-to-have: trips, donations, home improvement)
  3. Quarterly cash flow tracking — the 2022/2023 sheets add quarterly actual spending columns and projected year-end totals, comparing against budget
  4. Asset tracking — separate snapshots of all accounts (liquid, investment, debts) with metadata like taxability and accessibility
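
The top-down flow in components 1-2 can be sketched end to end. The flat tax function, tier totals, and category names here are placeholders for illustration, not the spreadsheet’s actual brackets or categories:

```python
def top_down_budget(gross, payroll_deductions, tax_fn, savings_goal, expenses):
    """Top-down allocation: gross income, minus payroll deductions
    (401k/HSA), minus taxes on the remainder, minus the savings goal,
    then fund expense tiers in priority order (P0 before P1 before P2)."""
    taxable = gross - payroll_deductions
    after_tax = taxable - tax_fn(taxable)
    remaining = after_tax - savings_goal
    funded = {}
    for tier in ("P0", "P1", "P2"):  # must-pay first, nice-to-have last
        tier_total = sum(expenses.get(tier, {}).values())
        funded[tier] = min(tier_total, max(remaining, 0.0))
        remaining -= funded[tier]
    return {"after_tax": after_tax,
            "funded": funded,
            "discretionary": max(remaining, 0.0)}
```

Whatever survives all three tiers is discretionary; a shortfall shows up as a partially funded tier rather than a negative balance.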

PDF paystubs are actually one of the best cases for AI-driven extraction. Unlike OCR-based approaches that need template matching per provider, Claude can semantically understand a paystub — finding “gross pay”, “federal tax withheld”, “401k contribution” etc. by meaning rather than position. This makes the solution provider-agnostic by default.