Bhavana AI

AI/ML insights

Dev Log: January 25, 2026

courses

Debugged a string manipulation bug in GPU type parsing, resubmitted high-load Gavel experiments that had timed out on Slurm, and studied the Gavel paper's scheduling architecture, including its effective throughput matrix and round-based time-sharing design.

The original bug was a classic string manipulation mistake: "36:36:36".replace(':', '.p100=') produces "36.p100=36.p100=36" - all colons are replaced in one call, leaving none for the second .k80= replacement. Always split structured strings into parts first, then reassemble.
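A minimal reproduction of the bug and two fixes (the `.p100=`/`.k80=` field names follow the log's example; everything else here is illustrative):

```python
spec = "36:36:36"

# Bug: str.replace() replaces EVERY occurrence, so the first call
# consumes all the colons and the second replacement finds none.
buggy = spec.replace(":", ".p100=")
assert buggy == "36.p100=36.p100=36"

# Fix 1: cap each replacement at one occurrence via the count argument.
fixed = spec.replace(":", ".p100=", 1).replace(":", ".k80=", 1)
assert fixed == "36.p100=36.k80=36"

# Fix 2 (more robust): split into fields first, then reassemble explicitly.
first, second, third = spec.split(":")
assert f"{first}.p100={second}.k80={third}" == "36.p100=36.k80=36"
```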

The 15 incomplete experiments are all at the highest load points (lambda=1200 for fig9 = 3 jobs/hr, lambda=600 for fig8 = 6 jobs/hr; lambda is a mean inter-arrival time in seconds, so smaller lambda means higher load). These are the most demanding simulations, and they likely exceeded the 4-hour Slurm time limit. The figure plots should still be meaningful since we have the full range of lower loads.

These high-load experiments (lambda=600 to 1200, i.e. 3 to 6 jobs/hour) have to simulate more jobs arriving faster, causing longer simulation times. At 6 jobs/hour over a 5,000-job window, the simulator processes significantly more scheduling decisions. The 4-hour limit was insufficient; 8 hours should be enough for convergence.

At high load (lambda=1200 = 3 jobs/hour), job arrivals exceed completion capacity. The measurement window requires jobs 4000-5000 to complete, but under severe queueing, the simulator must process thousands of additional jobs before the window jobs finish. This is an inherent property of the experiment - at these load levels, the system is unstable.

Gavel’s Core Innovation: The key insight is the effective throughput matrix - a 2D array of (job_type × worker_type) throughputs that enables heterogeneity-aware scheduling. The ThroughputEstimator uses matrix completion to fill in missing entries, allowing Gavel to make informed decisions even for new job types.
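As a sketch of how the matrix drives allocation: a job's effective throughput is its allocation-weighted combination of per-worker-type throughputs. The numbers and job labels below are made up for illustration, not taken from the paper:

```python
import numpy as np

# Hypothetical throughputs (e.g. steps/sec) for three job types on three
# worker types; rows are job types, columns are [V100, P100, K80].
T = np.array([
    [100.0, 60.0, 20.0],
    [ 80.0, 55.0, 25.0],
    [ 40.0, 30.0, 15.0],
])

# A policy outputs a fractional allocation X of the same shape: each row
# gives the fraction of time that job spends on each worker type.
X = np.array([
    [0.5, 0.5, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0],
])

# Effective throughput per job: allocation-weighted sum over worker types.
effective = (T * X).sum(axis=1)
```

The policy optimizes over X using these effective throughputs, which is what makes the scheduling heterogeneity-aware.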

Round-based Time-sharing: Jobs get leases (not GPU ownership). A 6-minute round lets the scheduler recompute allocations frequently, enabling fair sharing without requiring job-level coordination.

Key Architectural Decisions in the Code Path:

  1. Two-phase scheduling: First compute fractional allocations (policy), then discretize to workers (scheduler). This separates the optimization problem from bin-packing.

  2. Deficit-based priority: _priorities and _deficits track how much each job is “owed” vs “received,” enabling fair time-sharing across rounds.

  3. Mid-round recomputation: At SCHEDULE_RECOMPUTE_FRACTION=0.5 (3 minutes into a 6-minute round), the scheduler recalculates allocations. This lets it react to job completions without waiting for full round boundaries.
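The deficit mechanism can be sketched as a toy model (hypothetical names and structure, not Gavel's actual code): each round, every job accrues the service it is "owed" under the fractional allocation, and scheduled jobs pay that deficit back.

```python
from collections import Counter

def run_round(jobs, allocation, num_workers, round_len=360):
    # Accrue: each job is owed its allocation share of the round.
    for j in jobs:
        j["deficit"] += allocation[j["id"]] * round_len
    # Schedule the jobs with the largest deficits this round.
    chosen = sorted(jobs, key=lambda j: j["deficit"], reverse=True)[:num_workers]
    # Pay back: a scheduled job consumes a full round of service.
    for j in chosen:
        j["deficit"] -= round_len
    return [j["id"] for j in chosen]

jobs = [{"id": i, "deficit": 0.0} for i in range(3)]
alloc = {0: 0.5, 1: 0.25, 2: 0.25}   # fractional allocation from the policy
counts = Counter()
for _ in range(8):                   # one worker, eight 6-minute rounds
    counts.update(run_round(jobs, alloc, num_workers=1))
# Scheduled fractions track the allocation: job 0 runs half the rounds.
```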

The Gavel evaluation has very specific experimental parameters that must be matched exactly for replication. The paper uses synthetic traces (not raw Philly data) with particular distributions.


claude-transcript-viewer

A big day for the transcript viewer. Diagnosed why semantic search was failing (the 0.6B embedding model was too small to capture semantic relationships), migrated to an 8B model with true GPU-batched re-embedding, fixed server startup conflicts, and added date range filtering and sort controls to both the landing page and project index pages.

Key Finding: Only 40% of chunks (278,379 out of 699,145) have embeddings! This explains why semantic search may not find related content - if the target conversation was indexed before embeddings were enabled, it won’t have vectors.

Root Cause Identified: The vector search IS working, but the semantic similarity scores are poor. For “travel planning”, the Milan conversation’s best chunk has distance 0.5886 vs the top result at 0.4956. For “italy”, Milan’s chunks are at 0.6240 vs top at 0.5256. The embeddings don’t capture the semantic relationship between “italy” and “Milan” (a city in Italy) or between “travel planning” and flight optimization.

Critical Finding: The Milan chunk “I’m trying to plan a trip to Milan and India” has these distances:

  • “travel planning” → 0.6454 (ranks 16,181st out of 278k chunks!)
  • “italy” → 0.8597 (extremely poor)
  • “milan” → 0.8012

The Qwen3 0.6B embedding model is NOT capturing semantic relationships. The chunk literally says “plan a trip” but “travel planning” query doesn’t match it well. This is a model quality issue.

Model Quality Problem Confirmed:

  • “italy trip” query prefers an UNRELATED project planning doc over a doc literally about “trip to Milan and India”!
  • Geographic relationships are nearly non-existent: “milan” ↔ “italy” similarity is only 0.26
  • “trip” ↔ “vacation” similarity is 0.14 - almost no semantic relationship

The Qwen3-Embedding-0.6B model is too small/undertrained for semantic search. This isn’t an implementation bug.
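For reference, the similarity numbers quoted above are (assumed to be) cosine similarities between embedding vectors; a sketch of the metric with toy vectors, not real embeddings:

```python
import numpy as np

def cosine_similarity(a, b):
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Identical directions score 1.0; orthogonal directions score 0.0.
# A model that "knows" Milan is in Italy should place those two embeddings
# far closer than the 0.26 measured here.
assert abs(cosine_similarity([1.0, 0.0], [1.0, 0.0]) - 1.0) < 1e-9
assert abs(cosine_similarity([1.0, 2.0], [-2.0, 1.0])) < 1e-9
```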

Second Problem Found - FTS Issues:

  1. “travel planning” and “italy” return NO FTS results
  2. The Milan conversation says “Milan and India” - the word “Italy” never appears!
  3. “planning” has 0 matches even though “plan a trip” is in the text - trigram tokenizer doesn’t do stemming

The 0.6B model is fundamentally too small for semantic search. It lacks the capacity to encode geographic relationships (Milan→Italy) and synonyms (trip→vacation). The 8B model has this knowledge baked in.
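The stemming gap can be reproduced directly with SQLite's FTS5 trigram tokenizer (requires SQLite 3.34+; the table name here is illustrative):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE VIRTUAL TABLE docs USING fts5(body, tokenize='trigram')")
db.execute("INSERT INTO docs VALUES (?)",
           ("I'm trying to plan a trip to Milan and India",))

def hits(query):
    row = db.execute("SELECT count(*) FROM docs WHERE docs MATCH ?", (query,))
    return row.fetchone()[0]

# Trigram matching is pure substring matching -- no stemming, no synonyms:
assert hits('"plan a trip"') == 1   # literal substring: found
assert hits('"planning"') == 0      # morphological variant: not found
assert hits('"italy"') == 0         # the word "Italy" never appears
```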

The re-embedding is a one-time cost. Once complete, your semantic search will correctly find:

  • “travel planning” → Milan trip conversation
  • “italy” → Content mentioning Milan, Rome, etc.
  • Synonyms like “trip” ↔ “vacation” will work properly

The vec0 virtual table extension requires integer rowids, but the chunks table’s IDs don’t match what we’re inserting. The reindex function needs to be fixed to handle this properly.

The problem: When chunks_vec is dropped and recreated with new dimensions, the existing UPDATE triggers try to insert into the new table. But the trigger’s INSERT might be passing values that sqlite-vec doesn’t like (possibly BigInt from better-sqlite3).

The current code relies on triggers to sync embeddings to chunks_vec, but this breaks when we drop and recreate the table mid-stream.

Batch size 200 is a balance between:

  • Memory usage - larger batches need more GPU memory
  • Throughput - larger batches are more efficient for MLX
  • Error recovery - if a batch fails, we lose less progress

For the 8B model which uses more memory than 0.6B, you might want to reduce this if you see OOM errors.

The current “batch” implementation provides zero GPU parallelism. True batching would:

  • Pad sequences to same length
  • Process all texts in one matrix operation
  • Get 5-10x speedup on GPU

The biggest win is true batching. Previously, batch size 200 meant 200 sequential GPU calls. Now it’s 1 GPU call processing 200 sequences in parallel. Combined with JIT compilation, expect 5-15x overall speedup on batch operations.
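What "one GPU call" means in pooling terms, sketched in NumPy (MLX's array API is similar; the mean-pooling choice and shapes are assumptions, not the viewer's actual code):

```python
import numpy as np

def masked_mean_pool(hidden, mask):
    """hidden: (batch, seq, dim) token states; mask: (batch, seq), 1 = real token.

    One vectorized operation pools the whole padded batch, instead of
    looping over 200 sequences with a separate GPU call for each.
    """
    m = mask[:, :, None].astype(hidden.dtype)
    return (hidden * m).sum(axis=1) / m.sum(axis=1)

# Three sequences of real lengths 3, 5, and 1, padded to length 5:
hidden = np.ones((3, 5, 4))
mask = np.array([[1, 1, 1, 0, 0],
                 [1, 1, 1, 1, 1],
                 [1, 0, 0, 0, 0]])
pooled = masked_mean_pool(hidden, mask)   # shape (3, 4)
```

The mask keeps padding tokens from polluting the average, which is what makes padding to a common length safe.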

Codex identified that our "optimized" batch path still performed batch_size GPU synchronizations per batch instead of one, and that the OFFSET-based SQLite pagination was O(n²), so we were leaving massive performance on the table. The new vectorized pooling path does exactly one sync per batch regardless of size (the pooling change alone could be a 10-50x GPU-side speedup), and combined with keyset pagination the 700k-chunk re-embedding should be dramatically faster.
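Keyset pagination seeks past the last id seen rather than rescanning skipped rows; a sketch against an assumed `chunks(id, text)` schema:

```python
import sqlite3

def iter_chunk_batches(db, batch_size=200):
    """Yield batches by seeking past the last seen id: O(batch) per page,
    where OFFSET-based paging rescans all earlier rows (O(n^2) overall)."""
    last_id = 0
    while True:
        rows = db.execute(
            "SELECT id, text FROM chunks WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            return
        yield rows
        last_id = rows[-1][0]

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE chunks (id INTEGER PRIMARY KEY, text TEXT)")
db.executemany("INSERT INTO chunks VALUES (?, ?)",
               [(i, f"chunk {i}") for i in range(1, 6)])
sizes = [len(batch) for batch in iter_chunk_batches(db, batch_size=2)]
```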

The npx command failed silently to start its own server because port 3000 was already in use by the old server process. The indexing ran successfully (writing to ~/projects/claude-archive/.search.db), but you’re viewing the old server’s stale data.

Single Instance Detection Pattern:

  1. Before binding to the port, we create a temporary test server on that port
  2. If binding fails with EADDRINUSE, we know another process is using it
  3. We use lsof and ps (via spawnSync for security) to identify the blocking process
  4. Clear error messages tell the user exactly how to resolve the conflict
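The pre-bind check translates roughly to the following (a Python stand-in for the Node code; the lsof/ps process-identification step is omitted):

```python
import errno
import socket

def port_in_use(port, host="127.0.0.1"):
    """Try a throwaway bind; EADDRINUSE means another process holds the port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            s.bind((host, port))
        except OSError as e:
            if e.errno == errno.EADDRINUSE:
                return True
            raise
    return False
```

SO_REUSEADDR keeps the probe from false-positiving on ports stuck in TIME_WAIT, while an actively listening socket still triggers EADDRINUSE.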

Auto-replacement logic:

  1. Detects if port is in use
  2. Checks if the blocking process is a transcript-viewer instance (via command string matching)
  3. If yes: sends SIGTERM, waits for graceful shutdown, falls back to SIGKILL if needed
  4. If no (some other app): shows error with manual kill instructions
  5. This prevents accidentally killing unrelated processes while making restarts seamless

Current State Analysis:

  • Debouncing already exists (200ms) but may need to be increased for better UX
  • Backend already supports after/before date parameters - just needs UI
  • Sort order requires changes to both API and UI - currently hardcoded to descending

Implementation Summary:

  1. Search on Enter only - Removed debounce/auto-search entirely. Both the landing page and conversation page search bars now only trigger search when Enter is pressed. This is cleaner and avoids unnecessary API calls.

  2. Date range filtering - Added after and before query params to the landing page. The SQL query now filters conversations by created_at date. The “before” date includes the entire day (appends T23:59:59).

  3. Sort order - Added sort param (asc/desc) to control chronological ordering. Default is desc (newest first). The UI shows “Newest first” / “Oldest first” dropdown.
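The filtering and sorting logic boils down to something like this (table and column names are assumed; the real server is Node, this is a Python sketch):

```python
def build_conversation_query(after=None, before=None, sort="desc"):
    sql = "SELECT id, title, created_at FROM conversations"
    where, params = [], []
    if after:
        where.append("created_at >= ?")
        params.append(after)
    if before:
        # "before" should include the entire day, not cut off at midnight
        where.append("created_at <= ?")
        params.append(before + "T23:59:59")
    if where:
        sql += " WHERE " + " AND ".join(where)
    sql += " ORDER BY created_at " + ("ASC" if sort == "asc" else "DESC")
    return sql, params

sql, params = build_conversation_query(after="2026-01-01", before="2026-01-20")
```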

Looking at the code structure, I now understand the issue. The user wants date filtering on the project index pages (like /tools/) that list all conversations within a project. These pages currently serve static HTML from the archive. I need to make them dynamic with filtering/sorting capabilities, similar to the landing page.

Key changes:

  1. The /:project/ route is now dynamic instead of serving static HTML
  2. It queries the database for conversations matching the project
  3. Supports after, before, and sort query parameters
  4. Falls back to static HTML if no database entries exist for a project

What was implemented:

  1. Updated parser to extract timestamp from each JSONL message
  2. Added timestamp column to the chunks database table
  3. Created /api/conversations/:id/timestamps endpoint
  4. Added date filter UI (From/To datetime pickers + Apply/Clear buttons) to conversation pages
  5. Client-side JS fetches timestamps and filters messages based on date range

podcast-summarizer-v2

Designed and implemented batch processing for podcast episode summaries using Azure OpenAI, with an eager fallback pattern for failed batches and atomic claim logic to prevent duplicate submissions.

Why eager fallback fits your architecture:

  • Your design constraint is “one summary per episode” - so a failed batch attempt just means the summary doesn’t exist yet
  • CPU processor already has the sync LLM path (for validation) - reusing it for failures is zero additional code
  • The failure tracking happens at the episode level (summary_batch_job_id gets cleared), making retries automatic

Why episode-centric batching is the right choice:

  • Summaries are naturally per-episode (not per-delivery), so tracking batch status there avoids duplication
  • The eager fallback (batch_failed flag) creates a clean escape hatch without complicating delivery state
  • Testing with old episodes (past TTL) is a clever way to validate the pipeline in complete isolation

Why these fixes matter:

  • Atomic claim prevents wasted Azure batch submissions (and money) from duplicate episodes
  • Idempotent processing means crashes are safe - just rerun, no manual cleanup
  • Order-of-writes prevents delivery workers from sending emails with missing summaries
  • Tiered fallback avoids the “one transient Azure hiccup = full-price sync for everyone” problem
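The atomic claim reduces to a single conditional UPDATE; a SQLite sketch with an assumed schema (the production tables will differ):

```python
import sqlite3

def try_claim(db, episode_id, batch_job_id):
    """Atomically claim an episode for batch submission.

    Only one worker's UPDATE can match the WHERE clause, so duplicate
    Azure batch submissions (and their cost) are impossible.
    """
    cur = db.execute(
        "UPDATE episodes SET summary_batch_job_id = ? "
        "WHERE id = ? AND summary_batch_job_id IS NULL AND summary IS NULL",
        (batch_job_id, episode_id),
    )
    db.commit()
    return cur.rowcount == 1

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE episodes (id INTEGER PRIMARY KEY, "
           "summary TEXT, summary_batch_job_id TEXT)")
db.execute("INSERT INTO episodes (id) VALUES (1)")
first = try_claim(db, 1, "batch-abc")    # this worker wins the claim
second = try_claim(db, 1, "batch-xyz")   # already claimed, no-op
```

Clearing summary_batch_job_id on batch failure makes the episode claimable again, which is what keeps retries automatic.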