Bhavana AI

AI/ML insights

Dev Log: January 29, 2026

courses

Built the complete GPU cluster visualization pipeline for the Gavel scheduler replication. Started by analyzing why multi-GPU jobs saturate at lower arrival rates than single-GPU jobs (fragmentation caused by contiguous GPU allocation requirements), then designed and implemented a custom binary file format for efficient visualization of simulation data. The binary format uses fixed-size records with 8-byte alignment for O(1) random access to any simulation round, plus a separate queue index section for variable-length data. Implemented the full Python preprocessing pipeline (log parser, binary encoder, and preprocessor with 33 passing tests), then built the seven JavaScript modules for the web visualizer, including DataSource with LRU caching, a Decoder matching the Python format byte-for-byte, a Web Worker for background decoding, a Model with observer pattern, a Canvas Renderer with dirty-region tracking, and a Controller with playback and keyboard shortcuts. Cross-language verification confirmed exact compatibility between Python’s struct.pack and JavaScript’s DataView. Validated the pipeline against real simulation data (20,003 rounds, 1,268 jobs, 108 GPUs) and confirmed correct utilization curves and allocation patterns.

Why multi-GPU saturates at lower rates than single-GPU:

  • Multi-GPU jobs require contiguous GPU allocations (e.g., 4 GPUs on same node)
  • This creates fragmentation - even with free GPUs, jobs may not fit
  • Single-GPU jobs can use any available GPU, packing more efficiently
  • Expect the saturation spike around 3.5-4.0 jobs/hour (jph) for multi-GPU vs 5-6 jph for single-GPU
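The fragmentation effect in the first two bullets can be sketched with a toy example (the node count and free-GPU layout below are made up for illustration):

```python
# Toy cluster: 3 nodes, 4 GPUs each. Six GPUs are free cluster-wide,
# but no single node has the 4 contiguous free GPUs a multi-GPU job needs.
free_per_node = [2, 2, 2]

def fits(job_gpus, free_per_node):
    """A job fits only if a single node can supply all of its GPUs."""
    return any(free >= job_gpus for free in free_per_node)

print(fits(1, free_per_node))   # True: a single-GPU job packs anywhere
print(fits(4, free_per_node))   # False: blocked by fragmentation
```

Even though half the cluster is idle, the 4-GPU job waits, which is why multi-GPU arrival curves saturate first.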

Your telemetry data is well-structured for visualization:

  • Telemetry files: Per-round snapshots with GPU utilization (V100/P100/K80), job counts, and timing
  • Completions files: Job lifecycle events with duration/JCT
  • The 36:36:36 config means 108 total GPUs across 3 types, not a 36x36 grid

The simulation logs contain rich allocation data I can use:

  • Job arrivals: job_type, scale_factor (GPU count), total_steps
  • Micro-task scheduling: Which job → which worker_type → which worker_id(s)
  • Worker IDs 72-107 are V100s, meaning IDs encode GPU type (0-35=K80, 36-71=P100, 72-107=V100)
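The ID-to-type encoding above fits in a few lines (the function name is mine; the ranges come from the log analysis):

```python
# Map a worker ID to its GPU type, per the 36:36:36 layout:
# IDs 0-35 = K80, 36-71 = P100, 72-107 = V100.
GPU_TYPES = ["K80", "P100", "V100"]
GPUS_PER_TYPE = 36

def gpu_type(worker_id):
    if not 0 <= worker_id < GPUS_PER_TYPE * len(GPU_TYPES):
        raise ValueError(f"worker_id {worker_id} out of range")
    return GPU_TYPES[worker_id // GPUS_PER_TYPE]

print(gpu_type(0), gpu_type(40), gpu_type(72))   # K80 P100 V100
```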

Saturation comparison:

  • Single-GPU (Fig 9): Saturates at ~5 jph, spike from 5-8 jph
  • Multi-GPU Max-Min (Fig 10): Saturates at ~3 jph, spike from 3-4 jph
  • Multi-GPU Finish-Time (Fig 11): Saturates at ~3 jph, spike from 3-4.4 jph

Why multi-GPU saturates earlier: Multi-GPU jobs need contiguous GPU allocations (e.g., 4 GPUs on same node). Even with free GPUs, jobs may not fit due to fragmentation - this is exactly what FGD’s fragmentation-aware scheduling addresses.

These tools were created during the ECOS solver failure investigation to visualize what was happening inside simulations.

  1. extract_telemetry.py - Parses simulation logs to extract:

    • Per-round telemetry (utilization, active jobs, completion rate)
    • Job completion events with timestamps
    • Outputs JSON files for visualization
  2. telemetry_viewer.html - Interactive browser-based dashboard:

    • Plots time series of any telemetry metric
    • Visualizes JCT over time, utilization curves
    • Compares multiple experiments side-by-side
    • Shows simulation phases (warm-up, measurement, saturation)
  3. telemetry_data/ - Extracted data from sample experiments used during debugging

Key architecture improvements from Codex review:

  • Fixed-size records are critical for efficient range requests - variable data must be in separate sections
  • Web Workers for parsing prevent UI jank during data loading
  • Modular architecture (DataSource/Decoder/Model/Renderer/Controller) makes the code testable and maintainable

Task 1.1 complete - Created foundation binary format module with:

  • Magic number GPUVIZ01 for file format identification
  • align_to_8() uses bitwise (x + 7) & ~7 - efficient 8-byte alignment without division
  • This pattern ensures binary sections can be read directly into memory on aligned boundaries
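The two pieces above can be sketched together (align_to_8 is the module's named helper; check_magic is a name I'm using for illustration):

```python
MAGIC = b"GPUVIZ01"   # 8-byte magic number identifying .viz.bin files

def check_magic(buf):
    """Reject anything that doesn't start with the GPUVIZ01 magic."""
    if buf[:8] != MAGIC:
        raise ValueError("not a GPUVIZ01 file")
    return True

def align_to_8(x):
    """(x + 7) & ~7 rounds x up to the next multiple of 8, no division."""
    return (x + 7) & ~7

print(align_to_8(5), align_to_8(8), align_to_8(466))   # 8 8 472
```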

Tasks 1.2b, 1.3, 1.4 complete - Binary format now supports:

  • Job metadata: 16-byte records with uint32 job_id (handles >65K jobs)
  • Round data: Fixed-size records with gpu_used[] (uint16) and allocations[] (uint32)
  • The fixed-size formula 28 + 2*num_gpu_types + 4*total_gpus enables direct offset calculation for streaming
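A sketch of that record-size formula and the offset arithmetic it enables (the function name is illustrative; the constants are from the format above):

```python
def round_record_size(num_gpu_types, total_gpus):
    # 28-byte fixed part + uint16 gpu_used per type + uint32 per-GPU allocation
    raw = 28 + 2 * num_gpu_types + 4 * total_gpus
    return (raw + 7) & ~7   # 8-byte aligned

size = round_record_size(3, 108)
print(size)   # 472 bytes for the 36:36:36 cluster

# O(1) seek to any round while streaming:
#   offset = rounds_base + round_index * size
```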

Binary format complete - All 6 binary format tasks done with 24 passing tests:

  • Header: 256 bytes with section offsets for streaming
  • Jobs: 16-byte fixed records (uint32 IDs handle >65K jobs)
  • Rounds: Fixed-size with formula-based offset calculation
  • Queue: Variable-length with index for random access
  • File writer creates complete .viz.bin files
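The queue section's index-plus-variable-data idea can be sketched with struct (the exact on-disk field layout here is illustrative, not the real format):

```python
import struct

queues = [[5, 9], [5, 9, 12], []]   # queued job IDs per round (made-up data)

body = b""
offsets = []
for q in queues:
    offsets.append(len(body))
    body += struct.pack(f"<I{len(q)}I", len(q), *q)   # uint32 count + IDs

index = struct.pack(f"<{len(offsets)}Q", *offsets)    # uint64 offset per round

# Random access to round 1's queue without scanning earlier rounds:
off = struct.unpack_from("<Q", index, 1 * 8)[0]
count = struct.unpack_from("<I", body, off)[0]
ids = struct.unpack_from(f"<{count}I", body, off + 4)
print(list(ids))   # [5, 9, 12]
```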

Phase 1 Complete - Python preprocessing pipeline:

  • binary_format.py: Complete binary file format with 24 tests
  • log_parser.py: Parses job arrivals, allocations, telemetry from simulation logs
  • preprocess_viz.py: Converts simulation.log → .viz.bin for web visualization
  • Total: 33 tests passing

Phase 2 Complete - All 7 JS modules implemented:

  • DataSource: LRU cache + AbortController for range requests
  • Decoder: Binary parsing matching Python format exactly (little-endian, 8-byte aligned)
  • Worker: Offloads decoding to background thread
  • Model: Observer pattern for state management
  • Renderer: Canvas with dirty-region tracking (only repaints changed cells)
  • Controller: Full playback, keyboard shortcuts, dual-simulation support
  • Key design: Local files loaded via file.arrayBuffer() (range requests only for HTTP)

Architecture highlights:

  • Binary format: Fixed-size round records enable O(1) offset calculation for any round
  • Dirty-region rendering: Only repaints GPU cells that changed between frames
  • Local file loading: Uses file.arrayBuffer() (not range requests) since file:// doesn’t support Range headers
  • Queue index: Separate section with uint64 offsets allows random access to variable-length queue data
  • Cross-language compatibility: Python struct format <IHBxIxxxx matches JS DataView byte-by-byte
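The <IHBxIxxxx format quoted above packs to exactly 16 bytes with explicit padding, so there is no native alignment for the two languages to disagree on. A quick round trip (the field names are my guesses; only the format string is given):

```python
import struct

JOB_FMT = "<IHBxIxxxx"            # little-endian, explicit pad bytes
print(struct.calcsize(JOB_FMT))   # 16: matches the 16-byte job records

record = struct.pack(JOB_FMT, 70000, 4, 1, 123)   # job_id > 65535 needs uint32
job_id, scale_factor, flags, total_steps = struct.unpack(JOB_FMT, record)
print(job_id)                     # 70000 survives the round trip
```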

Cross-language verification complete - Key validations:

  • Header fields match exactly between Python struct.pack and JS DataView
  • Round size formula Math.ceil((28 + 2*types + 4*gpus) / 8) * 8 matches Python’s (x + 7) & ~7
  • uint32 values >65535 decode correctly in both languages (job_id=70000)
  • Little-endian byte order consistent via struct.pack("<...") and DataView.getXxx(offset, true)
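The equivalence of the two alignment formulas can be checked directly (here both sides are written in Python for the comparison):

```python
import math

def align_js(x):
    """The JS side: Math.ceil(x / 8) * 8."""
    return math.ceil(x / 8) * 8

def align_py(x):
    """The Python side: (x + 7) & ~7."""
    return (x + 7) & ~7

# The two formulas agree on every plausible record size:
assert all(align_js(x) == align_py(x) for x in range(100_000))
print(align_py(466))   # 472
```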

Architecture Summary - The visualizer uses a clean pipeline design:

  1. Python pipeline: simulation.log → log_parser.py → preprocess_viz.py → .viz.bin
  2. JS pipeline: .viz.bin → DataSource (fetch) → Decoder (parse) → Model (state) → Renderer (canvas)
  3. Binary format: Fixed-size records enable O(1) round access via offset = base + round * size
  4. Dirty tracking: Only changed GPU cells repaint, critical for 108-cell grids at 10fps

Both files contain 20,003 rounds and 1,268 jobs across 108 GPUs (36 K80 + 36 P100 + 36 V100). Each round record is 28 + 2*3 + 4*108 = 466 bytes, aligned to 472 bytes. The rounds section alone is 472 * 20003 ≈ 9MB. The queue section takes up most of the file (~50MB) since each round stores a variable-length list of queued job IDs, and with 1,268 jobs, many are queued at any given time.
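A quick recomputation of the sizes quoted above:

```python
rounds, gpu_types, gpus = 20_003, 3, 108

raw = 28 + 2 * gpu_types + 4 * gpus    # 466 bytes of payload per round
rec = (raw + 7) & ~7                   # 472 bytes after 8-byte alignment
section = rec * rounds                 # rounds section total

print(raw, rec, section)               # 466 472 9441416 (~9 MB)
```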

The data looks correct: Round 0 has 58 jobs queued with GPUs already allocated (the allocations snapshot captures state between telemetry entries). By round 1, 57 jobs are running. By round 5, utilization is 93.5% with 100/108 GPUs in use. At round 100, utilization drops to 64.5% as jobs complete. The allocation count (90-100 GPUs) vs the total (108) shows the cluster isn’t fully packed, which is expected since some jobs need specific GPU types.


tools

Migrated MoltBot from Vertex AI to the Gemini API for a simpler, cheaper setup, then explored a wide range of new tool integrations. Switched from separate Vertex AI credentials to a single Gemini API key that powers both chat (Gemini 3 Flash for speed) and image generation (Gemini 3 Pro Image via nano-banana-pro). Investigated Apple ecosystem integration options (locked to macOS), set up Microsoft Remote Desktop (now rebranded as “Windows App”), configured Obsidian CLI with snap-compatible symlinks, and explored several media tools including summarize (yt-dlp-based), nano-pdf (Gemini-powered), and video-frames (ffmpeg-based).

  • Vertex AI vs Gemini API: Vertex AI is Google Cloud’s enterprise offering (what MoltBot previously used for chat). The Gemini API (ai.google.dev) is the consumer API with a generous free tier - 60 requests/minute for Gemini 1.5 - and includes image generation.
  • nano-banana-pro: Uses Gemini 3 Pro Image model for AI image generation/editing with natural language prompts.
  • Simpler setup: One API key (Gemini API) now powers both the chat LLM and image generation, instead of separate Vertex AI credentials.
  • Cost: Gemini API has a generous free tier (60 RPM for most models). Image generation with Gemini 3 Pro Image is also included.
  • nano-banana-pro: You can now ask MoltBot to “generate an image of…” or “edit this image to…” via WhatsApp!
  • Gemini 3 Flash is optimized for speed while maintaining strong reasoning. It’s ideal for quick Q&A, everyday tasks, and when low latency matters.
  • Cost difference: Flash models typically cost 5-10x less than Pro models per token.
  • When to use Pro: Deep reasoning, complex coding, nuanced analysis. Flash handles most daily tasks well.
  • Apple’s ecosystem is intentionally locked to macOS/iOS. These CLIs work by accessing local Apple databases and APIs that only exist on macOS.
  • If Apple integration is important, a Mac mini makes a great always-on MoltBot server (~$600, low power, quiet).
  • summarize uses yt-dlp under the hood for YouTube and can handle podcasts, articles, and local files
  • nano-pdf uses Gemini for understanding PDF content and making edits - needs your GEMINI_API_KEY (already configured)
  • video-frames leverages ffmpeg for precise frame extraction at specific timestamps
  • macOS removed the built-in Screen Sharing support for RDP years ago
  • Microsoft Remote Desktop is free and the most reliable option for Mac-to-Windows/Linux RDP
  • The open rdp:// URL scheme only works if an app registers to handle it
  • Microsoft rebranded “Remote Desktop” to “Windows App” in 2024
  • App Store installs don’t require sudo, making them easier for automated workflows
  • Homebrew casks for GUI apps often need elevated privileges for /Applications
  • Snap apps store configs in ~/snap/<app>/<version>/.config/ instead of ~/.config/
  • obsidian-cli expects the standard location, so the symlink bridges the gap
  • This is a common pattern when mixing snap-installed apps with CLI tools

openclaw

Parallel work on MoltBot under the openclaw project, sharing the same Gemini API migration and tool integrations as the tools project.