Dev Log: January 24, 2026
courses
Continued the Gavel paper annotation project, tracing key concepts from paper sections to their code implementations. Reviewed the base constraint definitions in policy.py, the max-min fairness formulation in max_min_fairness.py, and validated annotations against the paper using Codex. Ran sanity checks confirming the CVXPY solver produces correct allocations, then launched large-scale experiment runs for Figures 8, 9, and 12 on the Farmshare cluster, with 270 total experiments running in parallel.
The policy.py file contains the base constraint definitions for all Gavel policies. The key paper concepts are in get_base_constraints:
- `x >= 0` maps to Constraint (1): allocation values must be non-negative
- `cp.sum(x, axis=1) <= 1` maps to Constraint (2): each job’s total time share <= 1
- The capacity constraint maps to Constraint (3): worker capacity limits
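A minimal numpy sketch of those three constraints as a feasibility check (function and argument names are illustrative, not Gavel's API):

```python
import numpy as np

def satisfies_base_constraints(x, scale_factors, num_workers):
    """Check the three base constraints on an allocation matrix x
    (jobs x worker_types). Illustrative names, not Gavel's API."""
    # Constraint (1): allocation values must be non-negative.
    nonneg = np.all(x >= 0)
    # Constraint (2): each job's total time share is at most 1.
    job_share = np.all(x.sum(axis=1) <= 1 + 1e-9)
    # Constraint (3): scaled demand on each worker type cannot
    # exceed the number of workers of that type.
    capacity = np.all((scale_factors[:, None] * x).sum(axis=0)
                      <= num_workers + 1e-9)
    return bool(nonneg and job_share and capacity)
```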
max_min_fairness.py implements the LAS (Least Attained Service) policy from §4.1. Key observations:
- The objective `cp.Maximize(cp.min(...))` is the max-min fairness formulation
- Throughputs are normalized by `proportional_throughputs`, which serves as the X^equal baseline
- Priority weights `1/w_m` are applied to implement weighted fairness
- Scale factors adjust for multi-GPU jobs as described in the paper
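As a hedged sketch, the objective value for a fixed allocation can be computed like this (names are illustrative; Gavel actually solves for x with CVXPY rather than evaluating a fixed matrix):

```python
import numpy as np

def las_objective(x, throughputs, proportional, weights):
    """Value of the LAS max-min objective for a fixed allocation x.
    The solver maximizes the minimum, over jobs, of the normalized
    effective throughput scaled by 1/w_m. Illustrative names only."""
    effective = (x * throughputs).sum(axis=1)   # throughput(m, X)
    normalized = effective / proportional       # relative to the X^equal baseline
    return float(np.min(normalized / weights))  # apply 1/w_m priority weights
```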
The annotation project traces Gavel’s key concepts from paper to implementation:
- §3.1 defines the allocation matrix X and constraints (policy.py)
- §4.x policies solve different optimization objectives
- §5 Algorithm 1 is implemented across scheduler.py (priority computation, job placement)
- §6 throughput estimation enables heterogeneity awareness
The Codex review caught subtle but important distinctions:
- LAS vs max-min: LAS tracks cumulative service over time; static max-min doesn’t
- X^equal: The paper defines X^equal as an allocation; throughput(m, X^equal) is what normalizes the objective
- Themis: Minimizes max ratio ρ, which is different from minimizing max completion time
The sanity check confirms:
- CVXPY solver (ECOS) is now working
- Allocation matrix is being computed correctly (jobs get heterogeneous allocations)
- Throughput-based scheduling works (v100 preferred for compute-intensive jobs)
- Average JCT: 933s for 20 jobs on a 36:36:36 cluster
Figure 12 Results - Policy Runtime Scaling:
| Jobs | max_min_fairness_perf | water_filling |
|---|---|---|
| 32 | 3.5ms | 4.9ms |
| 64 | 3.4ms | 4.6ms |
| 128 | 4.8ms | 6.2ms |
| 256 | 8.1ms | 8.6ms |
This confirms the paper’s claim that policy runtime scales efficiently with job count.
Current Figure Reproduction Status:
| Figure | Description | Status | Experiments |
|---|---|---|---|
| Fig 8 | JCT vs Load (single-GPU) | Running | 180 (4 policies × 3 seeds × 15 λ points) |
| Fig 9 | JCT vs Load (multi-GPU) | Running | 90 (3 policies × 3 seeds × 10 λ points) |
| Fig 12 | Policy runtime scaling | ✅ Done | Confirms O(n) scaling |
The key figure for validation is Figure 9 - it shows that Gavel’s heterogeneity-aware policies (max_min_fairness_perf) achieve lower JCT than baselines (gandiva) at high cluster load.
Full Figure 9 requires: 90 experiments (3 policies × 3 seeds × 10 λ values)
Full Figure 8 requires: 180 experiments (4 policies × 3 seeds × 15 λ values)
Each experiment is independent - perfect for parallel execution.
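The fan-out pattern can be sketched as follows; `run_experiment` is a placeholder for the real simulator entry point, and threads stand in for the Slurm array jobs on the actual cluster:

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

def run_experiment(policy, seed, lam):
    """Placeholder for one simulator run; the real Gavel entry point
    takes many more flags. Each call is fully independent."""
    return (policy, seed, lam)

def launch_all():
    # Every (policy, seed, lambda) combination is independent, so the
    # sweep can be fanned out to a pool (threads here; Slurm jobs on
    # Farmshare).
    configs = list(product(["max_min_fairness_perf", "gandiva"],
                           range(3), [1.0, 2.0]))
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(lambda c: run_experiment(*c), configs))
```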
Current Experiment Status:
| Figure | Workers | Running | Progress |
|---|---|---|---|
| Fig 9 | 4 processes × ~4min CPU each | ✓ | Simulating multi-GPU jobs |
| Fig 8 | 4 processes × ~2min CPU each | ✓ | Simulating single-GPU jobs |
Log Format: The Gavel plotting infrastructure already exists in notebooks/figures/evaluation/. The logs follow the pattern:
logs/fig9/raw_logs/v100=36.p100=36.k80=36/{policy}/seed={s}/lambda={λ}.log
The utils.py file can parse these logs to extract:
- Average JCT: `average_jct_fn()`
- Makespan: `makespan_fn()`
- JCT distribution: `get_jcts()`
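Assuming the path layout above is stable, a small parser can recover the experiment parameters from each log path (`parse_log_path` is a hypothetical helper, not part of utils.py):

```python
import re

# Pattern matching the log layout quoted above: the cluster shape,
# policy, seed, and lambda are all embedded in the path.
LOG_PATTERN = re.compile(
    r"v100=(?P<v100>\d+)\.p100=(?P<p100>\d+)\.k80=(?P<k80>\d+)/"
    r"(?P<policy>[^/]+)/seed=(?P<seed>\d+)/lambda=(?P<lam>[\d.]+)\.log$"
)

def parse_log_path(path):
    """Extract (policy, seed, lambda) plus cluster shape from a log path."""
    m = LOG_PATTERN.search(path)
    if m is None:
        return None
    d = m.groupdict()
    return {"policy": d["policy"], "seed": int(d["seed"]),
            "lam": float(d["lam"]),
            "cluster": (int(d["v100"]), int(d["p100"]), int(d["k80"]))}
```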
Current Status:
- Job ID: 1345550
- Running: ~128 experiments in parallel on `barley-01`
- Pending: 142 more (waiting due to per-user job limit)
- Total: 270 experiments (90 for Fig 9 + 180 for Fig 8)
Estimated completion: ~4 hours (each job takes ~2-3 hours, running in batches)
Monitor with:
ssh farmshare "squeue -u vramesh3 | wc -l" # Count pending+running
ssh farmshare "ls ~/gavel/cluster/results/*/*/*/*.log | wc -l" # Count completed
When done, aggregate results:
ssh farmshare "cd ~/gavel/cluster && python3 aggregate_results.py"
podcast-summarizer-v2
A deep debugging session on the KQL-backed observability features. Traced the missing historical data from the dashboard through multiple layers: first discovered the Log Analytics workspace ID wasn’t configured in the Bicep deployment, then found the KQL query filters were wrong (log entries have a Python logger prefix before [METRIC]), and finally fixed KQL syntax issues in join/project clauses and JSON parsing. Also removed the misleading System Health section and deployed Azure Workbook infrastructure for proper KQL-native monitoring.
The code shows that kql_available is set to False when BOTH historical_raw and queue_raw return empty dicts. This happens when:
- `log_analytics_workspace_id` is not configured (lines 44-46)
- The `azure-monitor-query` SDK isn’t installed (lines 54-55)
- `DefaultAzureCredential` fails to authenticate (lines 57-59)
- The KQL query itself fails (lines 111-113)
This is a common deployment gap: the feature code was added (historical.py), but the infrastructure wasn’t updated to provide the required configuration. The code gracefully handles this by setting kql_available=False, which triggers the warning banner you’re seeing.
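A sketch of that guard chain, reusing the function names described above but with invented details (not the production code):

```python
import os

def get_log_analytics_client():
    """Return a Logs query client, or None if any prerequisite is
    missing. Mirrors the guard chain described above; details assumed."""
    workspace_id = os.environ.get("LOG_ANALYTICS_WORKSPACE_ID", "")
    if not workspace_id:
        return None  # env var never set by the Bicep deployment
    try:
        from azure.monitor.query import LogsQueryClient  # SDK may be absent
        from azure.identity import DefaultAzureCredential
        return LogsQueryClient(DefaultAzureCredential())
    except Exception:
        return None  # SDK missing or credential failure

def fetch_with_fallback(run_query):
    """Return query results, or an empty dict so the dashboard can set
    kql_available=False and show a warning banner instead of crashing."""
    client = get_log_analytics_client()
    if client is None:
        return {}
    try:
        return run_query(client)
    except Exception:
        return {}  # the KQL query itself failed
```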
Why did systematic debugging help here?
- Instead of guessing (SDK issue? Auth issue? Network issue?), we traced the data flow: Dashboard → `kql_available=false` → `_fetch_with_fallback` returns `{}` → `get_log_analytics_client()` returns `None` → `log_analytics_workspace_id` is an empty string → env var not configured in Bicep
- The fix was a 2-line infrastructure change, not a code change; guessing would have led us down the wrong path
The logs reveal TWO different issues:
- The first query has invalid properties (likely the KQL query syntax or timespan format)
- The second query has a Python parsing error (the SDK response format differs from what we expected)
Root cause identified: The KQL queries use startswith "[METRIC]" but the actual logs have a Python logger prefix before [METRIC]. This means ZERO rows match the filter, causing the query to return empty results, which then causes the parsing error ('str' object has no attribute 'name') when trying to process an empty/error response.
The debugging journey revealed three layers of issues:
- Infrastructure: Missing env var (config not passed to container)
- Query filter: Wrong assumption about log format (Python logger adds prefix)
- Query syntax: Invalid KQL join/project syntax (not valid in Azure KQL)
Each fix only revealed the next issue - classic “peeling the onion” debugging.
The KQL logs show the truth:
- GPU transcriber last ran: ~15 hours ago (2026-01-24T00:12:54Z)
- CPU processor last ran: ~15 hours ago (2026-01-24T00:15:17Z)
But the dashboard shows “28 days ago” because that’s when someone last clicked “Run Now”, not when the job actually ran.
The System Health section relies on last_*_run timestamps stored in BatchState, but these only update when manually triggering jobs via “Run Now” buttons. The cron-triggered jobs don’t update these timestamps - they emit metrics to Log Analytics instead. The KQL-backed 7-day success rates give a more accurate picture of system health.
Why we removed System Health instead of fixing it:
- The timestamps came from `BatchState`, which only updated on manual “Run Now” clicks
- Cron-triggered jobs emit metrics to Log Analytics, not to BatchState
- Fixing this properly would require either:
- Jobs writing timestamps on every cron run (adds latency + DB writes)
- KQL queries for last job timestamps (complex, different data source)
- The 7-day success rates from KQL already tell you if jobs are healthy
Azure Workbooks vs Dashboards:
- Workbooks (`Microsoft.Insights/workbooks`) are Log Analytics-native, support KQL queries, and are linked to a workspace
- Dashboards (`Microsoft.Portal/dashboards`) are Azure Portal dashboards with pinned tiles from various sources
- Workbooks are better for KQL-heavy observability since they can reference the workspace directly and support interactive parameters
The log format issue:
Expected: "[METRIC] {"event": ...}" → substring(Log_s, 9) works
Actual: "2026-01-24 15:05:27 - [METRIC] {"event": ...}" → need indexof
Using indexof(Log_s, "{") finds the JSON start dynamically regardless of timestamp length. This is the same fix we made to src/api/admin/kql/queries.py earlier.
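The same fix mirrored in Python, showing why a prefix breaks a startswith filter and why searching for the first `{` does not (hypothetical helper, not the production code):

```python
import json

def extract_metric(log_line):
    """Pull the JSON payload out of a [METRIC] log line by locating
    the first '{' (same idea as KQL's indexof(Log_s, "{")), instead of
    assuming '[METRIC]' starts the line."""
    if "[METRIC]" not in log_line:
        return None  # not a metric line; a startswith filter also skips these
    start = log_line.index("{")  # works regardless of logger prefix length
    return json.loads(log_line[start:])
```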
JSON escape sequences:
- `\\n` in JSON = literal backslash + n (two characters: `\` and `n`)
- `\n` in JSON = actual newline character
In KQL embedded in JSON, you need \n for line breaks within the query string.
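A quick Python check of the two escape forms (the string literals below are the raw JSON source text):

```python
import json

# JSON source text "\\n" decodes to two characters: a backslash and n.
two_chars = json.loads('"\\\\n"')
# JSON source text "\n" decodes to a single newline character, which is
# what a multi-line KQL query embedded in JSON needs.
newline = json.loads('"\\n"')
```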
KQL bin() column naming:
// Auto-named column (hard to reference)
summarize count() by bin(timestamp, 1d)
// Explicitly named column (easy to reference)
summarize count() by ts = bin(timestamp, 1d)
| project ts, ... // works!
Always name bin() columns explicitly when you need to reference them later.
claude-transcript-viewer
Set up CI/CD publishing via GitHub Actions with npm trusted publishing, and configured the embedding server architecture for semantic search. Worked through npm 2FA blockers for automated publishing, switching to granular access tokens and then OIDC-based trusted publishing. Also explored the embedding dimension trade-offs and ONNX model conversion for the search backend.
Version-gated publishing: The workflow checks if the version in package.json differs from npm before publishing. This prevents duplicate publish attempts and lets you control releases by bumping the version.
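The gate's logic, sketched in Python as a stand-in for the workflow step (`npm view <pkg> version` is the real command such a check would shell out to; function names are invented):

```python
import json
import subprocess

def local_version(pkg_json_path="package.json"):
    """Version declared in the working tree's package.json."""
    with open(pkg_json_path) as f:
        return json.load(f)["version"]

def published_version(pkg_name):
    """Latest version npm knows about, or None if never published."""
    out = subprocess.run(["npm", "view", pkg_name, "version"],
                         capture_output=True, text=True)
    return out.stdout.strip() or None

def should_publish(local, published):
    # Publish only when the version has been bumped past npm's copy
    # (or the package has never been published at all).
    return published is None or local != published
```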
Registry URL matters: The registry-url option in setup-node configures npm to authenticate with that registry, enabling the NODE_AUTH_TOKEN env var to work.
npm 2FA and CI/CD: Your npm account has 2FA enabled, which blocks automated publishes with regular tokens. You need an automation token (also called “granular access token” or “publish token”) that bypasses OTP for CI workflows.
Granular npm tokens: Unlike classic tokens, granular access tokens can bypass 2FA/passkey requirements for automation. They’re scoped to specific packages and permissions, making them ideal for CI/CD.
npm pkg fix: Automatically corrects common package.json issues like path normalization (./dist → dist) and URL formats.
Trusted Publishing uses OpenID Connect (OIDC) to let GitHub Actions authenticate directly with npm - no stored tokens needed. GitHub proves the workflow’s identity to npm, and npm grants temporary publish permissions. This is more secure because:
- No long-lived secrets to leak
- Permissions are scoped to specific repos/workflows
- Automatic token rotation on every run
Two-path usage: The viewer takes two paths:
- Archive path (required): where the HTML transcripts live (generated by `claude-code-transcripts`)
- Source path (optional): where the raw JSONL files are (`~/.claude/projects`); enables search indexing
The viewer uses a client-server architecture for embeddings:
- Viewer (what you installed) - handles indexing, search, and the web UI
- Embedding server (separate) - converts text to vectors for semantic search
This separation allows using different embedding backends (local MLX models, remote APIs, etc.) without changing the viewer.
Embedding dimension isn’t a quality measure - it’s a trade-off. Smaller dimensions (384) are faster to compare and store less data. Larger dimensions (768-1024) can capture more nuance but have diminishing returns. For transcript search, 384-768 is plenty.
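Both costs scale linearly with dimension, which a back-of-the-envelope helper makes concrete (numbers are illustrative):

```python
def search_cost(dim, n_docs=10_000):
    """Storage (float32 bytes) and flop count for one brute-force
    similarity query over n_docs embeddings at a given dimension."""
    storage_bytes = n_docs * dim * 4      # 4 bytes per float32 component
    flops_per_query = 2 * n_docs * dim    # one dot product per document
    return storage_bytes, flops_per_query
```

Going from 384 to 1024 dimensions multiplies both storage and per-query work by roughly 2.7x, while retrieval quality typically improves much less than that.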
ONNX is a standardized format for ML models. Converting from MLX/PyTorch to ONNX preserves the model’s learned weights - the math is identical, just executed by a different engine. Minor floating-point differences (1e-6) may occur but don’t affect search quality.
The implementation uses a process manager pattern - Node.js spawns Python as a child process and communicates via HTTP. This is common for polyglot applications where you want to leverage the best tool for each job (MLX for ML, Node.js for web serving).
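A toy stand-in for that child process, assuming a JSON-over-HTTP contract (the real server runs an actual MLX/ONNX model; the hashing below is only a placeholder, and the request shape is an assumption):

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class EmbedHandler(BaseHTTPRequestHandler):
    """Accepts POST {"texts": [...]} and returns {"embeddings": [...]}."""
    def do_POST(self):
        body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        dim = 8  # real servers use 384-1024 dimensions
        # Placeholder "embeddings": deterministic per process, fixed size.
        vectors = [[(hash((t, i)) % 1000) / 1000 for i in range(dim)]
                   for t in body["texts"]]
        payload = json.dumps({"embeddings": vectors}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):
        pass  # keep the child process quiet on Node's stderr

def serve(port=0):
    """Bind the server (port 0 picks a free port); the parent process
    (Node.js in the viewer's case) spawns this and talks HTTP to it."""
    return HTTPServer(("127.0.0.1", port), EmbedHandler)
```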