Bhavana AI

AI/ML insights

Dev Log: February 12, 2026

short-projects

Continued building out the housing map with driving time enrichment and hex-based aggregation. Started by integrating OSRM for click-triggered drive time calculations, then pre-computed drive times for all 32K sales using batched OSRM requests (which took about 30 minutes on the free demo server). Added H3 hexagonal aggregation for neighborhood-level price summaries, switched to Google Maps with traffic-aware commute times, and deployed the final build with proper relative asset paths and map bounds.

Using OSRM’s public demo server (router.project-osrm.org) for driving time calculations. The /table endpoint computes a duration matrix in one request — we send the clicked home as source and all POIs as destinations. This avoids N separate route requests. OSRM returns durations in seconds; we divide by 60 for minutes. The public server has no API key requirement but has rate limits, which is fine for click-triggered queries.
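A minimal sketch of this single-request pattern (function names are illustrative; the query shape assumes OSRM's standard /table parameters, with coordinates in lon,lat order):

```python
def table_url(source, destinations,
              server="http://router.project-osrm.org"):
    """Build an OSRM /table URL: one source (the clicked home) vs. all POIs.

    OSRM expects lon,lat order, semicolon-separated; sources=0 marks the
    first coordinate as the only source and the rest as destinations.
    """
    coords = ";".join(f"{lng},{lat}" for lat, lng in [source] + destinations)
    dests = ";".join(str(i) for i in range(1, len(destinations) + 1))
    return f"{server}/table/v1/driving/{coords}?sources=0&destinations={dests}"

def drive_minutes(source, destinations):
    """One request for all POI durations; OSRM returns seconds."""
    import requests  # deferred so the URL helper stays dependency-free
    resp = requests.get(table_url(source, destinations), timeout=10).json()
    return [d / 60 for d in resp["durations"][0]]  # seconds -> minutes
```

One /table call replaces N /route calls, which matters on a rate-limited public server.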

The SalePopup component uses a useRef guard (fetched.current) to ensure the OSRM API call happens exactly once when the popup opens, not on every render. This avoids duplicate API calls from React Strict Mode's double-rendering in development. The OSRM /table endpoint is used instead of multiple /route calls: it sends one request with the clicked home as source and all 9 POIs (8 gyms + Building 43) as destinations, returning all driving durations in a single response.

The Tooltip was simplified to a compact one-liner (price, county, beds) since the full details now show in the click Popup alongside the driving times.

To make drive time filtering instant (not per-click), we need to pre-compute drive times for all 32K sales during data processing. The OSRM /table endpoint accepts many sources with few destinations efficiently. With 9 POI destinations, we can batch ~200 sales per request (200 sources x 9 destinations = 1800 cells per call). That’s ~163 batches at 0.3s delay, roughly 1 minute total. The result is two new fields per sale: driveGym and driveOffice (minutes), baked into the JSON.
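The batching plan can be sketched as follows (precompute, fetch_table, and the batching constants are illustrative; fetch_table is assumed to wrap one /table call and return a durations matrix in seconds, one row per source):

```python
import time

BATCH = 200   # sources per request (200 x 9 destinations = 1800 matrix cells)
DELAY = 0.3   # polite pause between calls to the public demo server

def batches(items, size):
    for i in range(0, len(items), size):
        yield items[i:i + size]

def precompute(sales, pois, fetch_table):
    """Annotate each sale dict with driveGym / driveOffice in minutes.

    fetch_table(sources, pois) wraps one OSRM /table call and returns a
    durations matrix in seconds (one row per source, one column per POI).
    """
    for chunk in batches(sales, BATCH):
        matrix = fetch_table([(s["lat"], s["lng"]) for s in chunk], pois)
        for sale, row in zip(chunk, matrix):
            sale["driveGym"] = round(min(row[:8]) / 60)   # nearest of the 8 gyms
            sale["driveOffice"] = round(row[8] / 60)      # Building 43
        time.sleep(DELAY)
```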

Pre-computing drive times for 32K sales took ~30 minutes using the free OSRM demo server (325 batches of 100 at ~6s each). The key tradeoffs:

  • Python 3.9 + LibreSSL can’t do HTTPS to OSRM — had to use HTTP. This is a known issue with older Python + LibreSSL combinations where the server requires TLS 1.3.
  • Batch size of 100 — each batch has 100 source coords + 9 POI destination coords = 109 coordinates in the URL. Larger batches hit URL length limits or timeouts.
  • The median drive to nearest gym is 18 min and to Building 43 is 32 min — these are realistic drive times that validate the OSRM results.

The filter works as a “max minutes” input rather than a range, since you only care about “show me homes within X minutes” — you never want to filter OUT homes that are too close.

H3 divides the earth into hexagons at 16 resolutions (0-15). Each step up is roughly 7x smaller in area. The key insight: you can always go from a finer hex to its coarser parent (res-8 hex “belongs to” a res-7 hex), which lets you do analysis at different scales cheaply.

This is a classic “compute on read vs. compute on write” trade-off. Pre-computing (write-time) is efficient when queries are predictable. But with combinatorial filters, computing at read-time (in the browser) is simpler and always correct. Modern browsers can group and aggregate 32K records in milliseconds.

H3 v4 renamed everything from v3. The key functions: h3.latlng_to_cell(lat, lng, res) (was geo_to_h3), h3.cell_to_latlng(cell) (was h3_to_geo), and h3.cell_to_parent(cell, res) (was h3_to_parent). The parent relationship is deterministic — every res-8 hex belongs to exactly one res-7 hex. This is what makes the two-tier routing approach work.

h3-js’s cellToBoundary() returns [lat, lng] pairs (6 vertices for a hexagon). Conveniently, Leaflet’s Polygon also expects [lat, lng] arrays — so the output can be passed directly without coordinate swapping. This is a nice API alignment that avoids a common bug.

The fillOpacity formula Math.min(0.85, 0.4 + count * 0.05) creates a visual encoding where hexes with more sales (higher confidence) appear more opaque. A hex with 1 sale is faint (0.45), while one with 9+ sales is nearly solid (0.85). This gives a subtle visual cue about data density — important because a hex with 1 sale is much less reliable than one with 20.

We estimated ~1,000-1,200 hexes in planning but got 969. The real-world count is lower because the data isn’t uniformly distributed — dense urban areas collapse into fewer hexes, and some parts of the geo bounds have no sales at all. This means even lower API costs than projected.

The script uses duration_in_traffic instead of duration — this is the key field that only appears when you set departure_time. Without it, Google Maps returns the same free-flow estimates as OSRM. The departure_time must be in the future; we use next Tuesday 8 AM Pacific to model a typical morning commute. The cache-after-each-batch pattern means if the script crashes mid-run, you only lose one batch of work.
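A hedged sketch of the departure-time computation (helper names are illustrative; the request parameters follow the Distance Matrix API's documented origins/destinations/departure_time shape):

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo

PACIFIC = ZoneInfo("America/Los_Angeles")

def next_tuesday_8am_pacific(now=None):
    """Unix timestamp for the next Tuesday 08:00 Pacific (always in the future)."""
    now = now or datetime.now(PACIFIC)
    days_ahead = (1 - now.weekday()) % 7 or 7   # weekday(): Mon=0, Tue=1
    target = (now + timedelta(days=days_ahead)).replace(
        hour=8, minute=0, second=0, microsecond=0)
    return int(target.timestamp())

def traffic_params(origin, destination, key):
    """Distance Matrix query params. duration_in_traffic only appears in the
    response when departure_time is set (and the mode is driving)."""
    return {
        "origins": origin,
        "destinations": destination,
        "mode": "driving",
        "departure_time": next_tuesday_8am_pacific(),
        "key": key,
    }
```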

  • maxBounds defines a bounding box the user can’t pan outside of. Without maxBoundsViscosity, the map “bounces back” when you try to pan past the edge. Setting it to 1.0 makes the boundary completely rigid — the map stops immediately at the edge instead of rubber-banding.
  • minZoom: 9 is chosen to match the DEFAULT_ZOOM: at zoom level 9, King and Snohomish Counties fill the viewport nicely. Users can zoom in but never zoom out past this level.

Vite’s base: './' makes all asset paths relative (e.g., ./assets/index.js instead of /assets/index.js). This is essential when deploying to a subfolder — without it, the browser would look for assets at the domain root. The same applies to runtime fetch() calls in application code — those aren’t handled by Vite’s build, so we must manually change /sales_data.json to ./sales_data.json.


courses

Focused on GPU sharing visualization and experiment infrastructure for the Gavel scheduler. Built a compact allocation logging format to replace verbose per-job lines, fixed a GPU sharing bug with dict vs. set operations, implemented a sparse overlay binary format for visualizing shared GPUs as 2x2 sub-cells, and launched SLURM experiment sweeps across multiple load levels.

The viz pipeline has a clever accumulation pattern: [Micro-task scheduled] lines are collected between TELEMETRY lines, so each round’s heatmap snapshot is the sum of all microtask allocations since the previous TELEMETRY marker. Missing either line type breaks the heatmap. Without -q, we get full INFO-level output at the cost of ~200ms/round overhead from microtask logging at Alibaba scale (~1700 active jobs).

lam is inter-arrival time in seconds, not jobs-per-hour. The conversion lam = 3600 / jph explains why 60 jph = lam 60.0, 180 jph = lam 20.0, 360 jph = lam 10.0. This is a Poisson process parameter where smaller lam means higher load.
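A one-line conversion makes the mapping explicit:

```python
def jph_to_lam(jobs_per_hour):
    """Inter-arrival time in seconds for a Poisson arrival process.

    Smaller lam = arrivals closer together = higher load.
    """
    return 3600 / jobs_per_hour
```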

Telemetry requires running without the -q flag. The viz pipeline needs three INFO-level log line types: TELEMETRY (round metrics), EVENT (job arrival/completion), and [Micro-task scheduled] (per-job GPU allocation). These are accumulated between rounds to build heatmap snapshots. The trade-off is ~200ms/round overhead at Alibaba scale, but that’s necessary for full graph population.

The performance vs. observability trade-off: Emitting 1700+ individual [Micro-task scheduled] INFO lines per round costs ~200ms/round. The solution is to emit a single compact ALLOCATION JSON line per round at INFO level — all the same data in one line instead of 1700+. This preserves the performance gain while giving the viz tool what it needs for heatmap rendering.

How the viz pipeline accumulates allocations: Between two TELEMETRY lines, allocation lines are accumulated into current_allocations[worker_id] = job_id. When the next TELEMETRY arrives, the accumulated snapshot is saved as that round’s state, then reset. A single compact line works identically.

Why a compact ALLOCATION line beats re-promoting individual logs: The original [Micro-task scheduled] lines emitted one line per job per round (~1700 lines at Alibaba scale). That was demoted to DEBUG because it added ~200ms/round in string formatting + I/O overhead. The new compact ALLOCATION JSON line contains the same {job_id: [worker_ids]} mapping in a single line with separators=(',',':') for minimal size. One json.dumps() call + one I/O write replaces 1700+ format() calls + 1700+ writes.
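A minimal sketch of the emit/parse pair (names are illustrative, not the actual Gavel functions):

```python
import json

def emit_allocation_line(allocations):
    """One compact INFO-level line replacing ~1700 per-job lines.

    allocations maps job_id -> [worker_ids]; separators=(",", ":") strips
    the spaces json.dumps would otherwise insert.
    """
    return "ALLOCATION " + json.dumps(allocations, separators=(",", ":"))

def parse_allocation_line(line):
    """Inverse of emit_allocation_line, for the viz preprocessor."""
    return json.loads(line[len("ALLOCATION "):])
```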

The preprocess pipeline works identically: parse_allocation_bulk() returns a list of {job_id, worker_ids} dicts in the same shape as parse_allocation(). The accumulation logic in preprocess_viz.py sets current_allocations[worker_id] = job_id the same way, just from a single line instead of many. Backward compatibility is preserved — old [Micro-task scheduled] lines at DEBUG still work too.

The bug: GPU sharing uses a dict for assigned_worker_ids (tracking fractional capacity per GPU), but _assign_workers_to_job() at lines 858-860 uses set operations (a not in membership check plus .add()). The not in check happens to work on dicts (it checks keys), but .add() is set-only. The fix must handle both container types: on the dict path, check the remaining fractional capacity and update the usage rather than doing a simple membership test and add.
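A sketch of a fix that handles both container types (assign_worker and the 1.0-per-GPU capacity cap are illustrative assumptions, not the actual Gavel code):

```python
def assign_worker(assigned, worker_id, frac=1.0):
    """Assign frac of a GPU to a job, tolerating both container types.

    With GPU sharing, assigned is a dict of worker_id -> fraction in use;
    without sharing it is a plain set of worker_ids.
    Returns True if the assignment fit.
    """
    if isinstance(assigned, dict):
        used = assigned.get(worker_id, 0.0)
        if used + frac <= 1.0:            # capacity check, not membership
            assigned[worker_id] = used + frac
            return True
        return False
    if worker_id not in assigned:         # set path: membership + add
        assigned.add(worker_id)
        return True
    return False
```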

Important limitation to be aware of: The current heatmap encoding stores one job ID per GPU cell (current_allocations[worker_id] = job_id). With GPU sharing enabled, if 2-4 jobs share a single physical GPU, only the last job written will appear in the heatmap cell. The timeseries charts (utilization, jobs running, etc.) are unaffected since they come from telemetry counters. If you want the heatmap to visually distinguish shared GPUs, the binary format and renderer would need to support multiple job IDs per cell — that would be a separate enhancement.

Architecture overview: This implementation adds GPU sharing visualization using a sparse overlay pattern. Rather than changing the existing v1 format (which stores 1 job per GPU), we add a new optional section that only stores data for GPUs with multiple tenants. This means:

  1. Zero overhead for non-sharing experiments — the sharing section is empty/absent
  2. Backward compatibility — v1 files render identically since the renderer fills all 4 quadrants with the primary job when no sharing data exists
  3. Sparse storage — only ~7.6% of GPUs need entries (at Alibaba scale), keeping file size manageable

Binary format versioning strategy: The v1 header is 64 bytes packed into a 256-byte buffer (192 bytes of zero-padding). We can add new fields at byte offsets 64+ without breaking v1 readers — they simply ignore those bytes. v2 readers check version >= 2 and read the new offsets. This is the same pattern used by many binary formats (PNG chunks, ELF sections) for forward-compatible extensibility.
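The pattern can be sketched with struct (the field layout and names are assumptions for illustration; the point is only that new fields live inside the old zero padding):

```python
import struct

HEADER_BUF = 256   # v1 packs its meaningful fields up front, rest is zeros

def write_header(version, rounds, gpus, sharing_offset=0):
    """v1 fields at bytes 0-11; the v2 sharing-section offset at byte 64,
    inside the region v1 readers treat as padding and ignore."""
    buf = bytearray(HEADER_BUF)
    struct.pack_into("<III", buf, 0, version, rounds, gpus)
    if version >= 2:
        struct.pack_into("<Q", buf, 64, sharing_offset)
    return bytes(buf)

def read_header(buf):
    """v2-aware reader: only looks at byte 64 when version >= 2."""
    version, rounds, gpus = struct.unpack_from("<III", buf, 0)
    sharing = struct.unpack_from("<Q", buf, 64)[0] if version >= 2 else 0
    return version, rounds, gpus, sharing
```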

Quarter-slot allocation model: The key design choice here is representing each GPU as 4 quarter-slots ([0,0,0,0]). When a job with gpu_request=0.25 is placed, it fills 1 slot. A gpu_request=0.50 job fills 2 slots. This is a “bin-packing within a GPU” model. The slots are filled greedily from the first empty position, which means the order is deterministic and consistent across rounds. For the primary allocations array (v1 compat), we use the first non-zero slot — so a GPU with jobs A and B still reports A as the “primary” job.
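A minimal sketch of the quarter-slot model (function names are illustrative):

```python
def place(slots, job_id, gpu_request):
    """Greedy quarter-slot fill: slots is a 4-list, 0 = free, else a job id.

    0.25 -> 1 slot, 0.50 -> 2 slots; fills from the first empty position so
    the layout is deterministic across rounds.
    """
    need = int(round(gpu_request * 4))
    free = [i for i, s in enumerate(slots) if s == 0]
    if len(free) < need:
        return False
    for i in free[:need]:
        slots[i] = job_id
    return True

def primary_job(slots):
    """v1-compat primary allocation: the first non-zero slot."""
    return next((s for s in slots if s != 0), 0)
```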

Rendering strategy for sub-cell grids: At typical zoom levels (cell size 6-28px), each cell is split into a 2x2 grid of “quarter-cells” with a 1px gap between them. This creates a clear visual distinction between shared and non-shared GPUs. At very small cell sizes (2-3px), we fall back to solid color since sub-pixel rendering wouldn’t be meaningful. The key insight is that for non-shared GPUs (no sharing map entry), all 4 quarters get the same color, which visually appears as a solid cell with internal grid lines — so even with sharing data loaded, non-shared GPUs look nearly identical to the old rendering.

Summary of the GPU sharing visualization implementation:

  1. Data pipeline flow: Scheduler emits gpu_request in EVENT -> log parser extracts it -> preprocessor tracks quarter-slots -> binary writer encodes sparse sharing section -> JS decoder reads it -> renderer draws 2x2 sub-cells

  2. Backward compatibility is achieved through three mechanisms:

    • The v2 header extension lives in the previously-unused zero-padded region (bytes 64-79), so v1 files are structurally compatible
    • The JS decoder gates v2 reading behind version >= 2 and defaults sharing offsets to 0
    • The heatmap renderer only draws 2x2 quadrants when sharingMap is non-null AND cell size >= 4px
  3. Storage efficiency: Only GPUs with mixed occupancy get sharing entries (typically ~7.6% at Alibaba scale). The sparse overlay adds ~38MB over 5,000 rounds at that scale — manageable alongside the ~124MB base.

Alibaba GPU sharing distribution: 10% of jobs request 0.25 GPU, 16% request 0.50, and the remaining 74% request whole GPUs (1-8). This means ~26% of jobs use fractional GPUs — exactly the scenario our new visualization handles.
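A sampler matching that mix might look like this (the specific whole-GPU sizes drawn from {1, 2, 4, 8} are an assumption; the source only says 1-8):

```python
import random

def sample_gpu_request(rng=random):
    """Alibaba-style mix: 10% request 0.25 GPU, 16% request 0.50,
    the remaining 74% request whole GPUs."""
    r = rng.random()
    if r < 0.10:
        return 0.25
    if r < 0.26:
        return 0.50
    return rng.choice([1, 2, 4, 8])   # illustrative whole-GPU sizes
```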

The enable_gpu_sharing flag is critical: Even with the Alibaba workload generator producing gpu_request=0.25/0.50 values, the scheduler only uses gpu_request for allocation decisions when enable_gpu_sharing=True. Without it, all jobs are allocated full GPUs regardless of their gpu_request. This is a deliberate design choice — GPU sharing changes the scheduling dynamics significantly, so it’s opt-in.

The workload_mode controls which throughputs file and scale factor generator are used. Alibaba mode uses simulation_throughputs_alibaba.json which maps job types to Alibaba GPU types (G2, T4, etc.), not k80/p100/v100. The fgd_workload_mode is a separate axis that controls FGD placement behavior. To get fractional GPU requests, we need workload_mode: "alibaba" which means the cluster must also use Alibaba GPU types.

The sparse overlay approach pays off here: only ~1.4 GPUs per round need sharing entries (out of 100 total GPUs), so the sharing section is just ~184KB on top of a 3.4MB file. For the Alibaba-scale cluster (6200 GPUs), this would still be manageable — the sharing data grows linearly with shared GPU count, not total GPU count.

The existing SLURM script needs two additions: --save-logs (so we get simulation logs for viz preprocessing) and --max-wall-time (for graceful exit before SLURM kills the job). Previous 60 jph experiments hit OOM/timeout before our optimizations (0.64s/round). With those fixes, the experiments should complete, but a safety margin is still wise.

The launch script uses sbatch --dependency=afterany:$PREV_JOB_ID to chain load levels. afterany (not afterok) means the next load level starts even if some experiments in the previous level failed or timed out — we don’t want a single failure at 60 jph to block all higher loads. Each load level is a SLURM array job with 9 tasks that run in parallel.


bhavanaai

Set up the automated devlog generation pipeline, including insight extraction and the launchd scheduling agent.

The insight regex needs to handle the markdown backtick-wrapped delimiters. In the JSONL, insights appear as `★ Insight ───...` with backticks around the delimiter lines. The regex accounts for this by matching the content between the opening and closing delimiter patterns.
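A sketch of such a regex (the exact delimiter strings are assumptions based on the ★ Insight style described above):

```python
import re

# Match a backtick-wrapped insight block: an opening `★ Insight ───...`
# line, the verbatim content, then a closing `───...` line.
INSIGHT_RE = re.compile(r"`★ Insight ─+`\n(.*?)\n`─+`", re.DOTALL)

def extract_insights(text):
    """Return insight bodies verbatim, with surrounding whitespace trimmed."""
    return [m.strip() for m in INSIGHT_RE.findall(text)]
```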

The launchd StartCalendarInterval with just Hour and Minute keys fires daily at that time. Unlike cron, launchd will run missed jobs when the machine wakes up if it was asleep at midnight, so you won’t lose days when your laptop is closed overnight.

The split between deterministic extraction (Python script for verbatim insights) and LLM generation (Claude CLI for stitching the post together) is a good pattern for any pipeline where some content must be exact and other content needs synthesis. The LLM handles the parts that benefit from intelligence (summaries, descriptions, formatting decisions) while the script guarantees fidelity for the parts that must not be altered.