Dev Log: January 24, 2026
courses
Continued the Gavel paper annotation project, tracing key concepts from paper sections to their code implementations. Reviewed the base constraint definitions in policy.py, the max-min fairness formulation in max_min_fairness.py, and validated annotations against the paper using Codex. Ran sanity checks confirming the CVXPY solver produces correct allocations, then launched large-scale experiment runs for Figures 8, 9, and 12 on the Farmshare cluster, with 270 total experiments running in parallel.
The policy.py file contains the base constraint definitions for all Gavel policies. The key paper concepts are in get_base_constraints:
- `x >= 0` maps to Constraint (1): allocation values must be non-negative
- `cp.sum(x, axis=1) <= 1` maps to Constraint (2): each job’s total time share <= 1
- The capacity constraint maps to Constraint (3): worker capacity limits
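A minimal numpy sketch of those three constraints as a feasibility check (function and argument names are illustrative, not Gavel's API):

```python
import numpy as np

def satisfies_base_constraints(x, scale_factors, num_workers):
    """Check the three base constraints on an allocation matrix x
    (jobs x worker_types). Illustrative names, not Gavel's API."""
    # Constraint (1): allocation values must be non-negative.
    nonneg = np.all(x >= 0)
    # Constraint (2): each job's total time share is at most 1.
    job_share = np.all(x.sum(axis=1) <= 1 + 1e-9)
    # Constraint (3): scaled demand on each worker type cannot
    # exceed the number of workers of that type.
    capacity = np.all((scale_factors[:, None] * x).sum(axis=0)
                      <= num_workers + 1e-9)
    return bool(nonneg and job_share and capacity)
```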
max_min_fairness.py implements the LAS (Least Attained Service) policy from §4.1. Key observations:
- The objective `cp.Maximize(cp.min(...))` is the max-min fairness formulation
- Throughputs are normalized by `proportional_throughputs`, which serves as the X^equal baseline
- Priority weights `1/w_m` are applied to implement weighted fairness
- Scale factors adjust for multi-GPU jobs as described in the paper
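As a hedged sketch, the objective value for a fixed allocation can be computed like this (names are illustrative; Gavel actually solves for x with CVXPY rather than evaluating a fixed matrix):

```python
import numpy as np

def las_objective(x, throughputs, proportional, weights):
    """Value of the LAS max-min objective for a fixed allocation x.
    The solver maximizes the minimum, over jobs, of the normalized
    effective throughput scaled by 1/w_m. Illustrative names only."""
    effective = (x * throughputs).sum(axis=1)   # throughput(m, X)
    normalized = effective / proportional       # relative to the X^equal baseline
    return float(np.min(normalized / weights))  # apply 1/w_m priority weights
```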
The annotation project traces Gavel’s key concepts from paper to implementation:
- §3.1 defines the allocation matrix X and constraints (policy.py)
- §4.x policies solve different optimization objectives
- §5 Algorithm 1 is implemented across scheduler.py (priority computation, job placement)
- §6 throughput estimation enables heterogeneity awareness
The Codex review caught subtle but important distinctions:
- LAS vs max-min: LAS tracks cumulative service over time; static max-min doesn’t
- X^equal: The paper defines X^equal as an allocation; throughput(m, X^equal) is what normalizes the objective
- Themis: Minimizes max ratio ρ, which is different from minimizing max completion time
The sanity check confirms:
- CVXPY solver (ECOS) is now working
- Allocation matrix is being computed correctly (jobs get heterogeneous allocations)
- Throughput-based scheduling works (v100 preferred for compute-intensive jobs)
- Average JCT: 933s for 20 jobs on a 36:36:36 cluster
Figure 12 Results - Policy Runtime Scaling:
| Jobs | max_min_fairness_perf | water_filling |
|---|---|---|
| 32 | 3.5ms | 4.9ms |
| 64 | 3.4ms | 4.6ms |
| 128 | 4.8ms | 6.2ms |
| 256 | 8.1ms | 8.6ms |
This confirms the paper’s claim that policy runtime scales efficiently with job count.
Current Figure Reproduction Status:
| Figure | Description | Status | Experiments |
|---|---|---|---|
| Fig 8 | JCT vs Load (single-GPU) | Running | 180 (4 policies × 3 seeds × 15 λ points) |
| Fig 9 | JCT vs Load (multi-GPU) | Running | 90 (3 policies × 3 seeds × 10 λ points) |
| Fig 12 | Policy runtime scaling | ✅ Done | Confirms O(n) scaling |
The key figure for validation is Figure 9 - it shows that Gavel’s heterogeneity-aware policies (max_min_fairness_perf) achieve lower JCT than baselines (gandiva) at high cluster load.
Full Figure 9 requires: 90 experiments (3 policies × 3 seeds × 10 λ values)
Full Figure 8 requires: 180 experiments (4 policies × 3 seeds × 15 λ values)
Each experiment is independent - perfect for parallel execution.
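The fan-out pattern can be sketched as follows; `run_experiment` is a placeholder for the real simulator entry point, and threads stand in for the Slurm array jobs on the actual cluster:

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

def run_experiment(policy, seed, lam):
    """Placeholder for one simulator run; the real Gavel entry point
    takes many more flags. Each call is fully independent."""
    return (policy, seed, lam)

def launch_all():
    # Every (policy, seed, lambda) combination is independent, so the
    # sweep can be fanned out to a pool (threads here; Slurm jobs on
    # Farmshare).
    configs = list(product(["max_min_fairness_perf", "gandiva"],
                           range(3), [1.0, 2.0]))
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(lambda c: run_experiment(*c), configs))
```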
Current Experiment Status:
| Figure | Workers | Running | Progress |
|---|---|---|---|
| Fig 9 | 4 processes × ~4min CPU each | ✓ | Simulating multi-GPU jobs |
| Fig 8 | 4 processes × ~2min CPU each | ✓ | Simulating single-GPU jobs |
Log Format: The Gavel plotting infrastructure already exists in notebooks/figures/evaluation/. The logs follow the pattern:
logs/fig9/raw_logs/v100=36.p100=36.k80=36/{policy}/seed={s}/lambda={λ}.log
The utils.py file can parse these logs to extract:
- Average JCT: `average_jct_fn()`
- Makespan: `makespan_fn()`
- JCT distribution: `get_jcts()`
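Assuming the path layout above is stable, a small parser can recover the experiment parameters from each log path (`parse_log_path` is a hypothetical helper, not part of utils.py):

```python
import re

# Pattern matching the log layout quoted above: the cluster shape,
# policy, seed, and lambda are all embedded in the path.
LOG_PATTERN = re.compile(
    r"v100=(?P<v100>\d+)\.p100=(?P<p100>\d+)\.k80=(?P<k80>\d+)/"
    r"(?P<policy>[^/]+)/seed=(?P<seed>\d+)/lambda=(?P<lam>[\d.]+)\.log$"
)

def parse_log_path(path):
    """Extract (policy, seed, lambda) plus cluster shape from a log path."""
    m = LOG_PATTERN.search(path)
    if m is None:
        return None
    d = m.groupdict()
    return {"policy": d["policy"], "seed": int(d["seed"]),
            "lam": float(d["lam"]),
            "cluster": (int(d["v100"]), int(d["p100"]), int(d["k80"]))}
```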
Current Status:
- Job ID: 1345550
- Running: ~128 experiments in parallel on `barley-01`
- Pending: 142 more (waiting due to per-user job limit)
- Total: 270 experiments (90 for Fig 9 + 180 for Fig 8)
Estimated completion: ~4 hours (each job takes ~2-3 hours, running in batches)
Monitor with:
ssh farmshare "squeue -u vramesh3 | wc -l" # Count pending+running
ssh farmshare "ls ~/gavel/cluster/results/*/*/*/*.log | wc -l" # Count completed
When done, aggregate results:
ssh farmshare "cd ~/gavel/cluster && python3 aggregate_results.py"
podcast-summarizer-v2
A deep debugging session on the KQL-backed observability features. Traced the missing historical data from the dashboard through multiple layers: first discovered the Log Analytics workspace ID wasn’t configured in the Bicep deployment, then found the KQL query filters were wrong (log entries have a Python logger prefix before [METRIC]), and finally fixed KQL syntax issues in join/project clauses and JSON parsing. Also removed the misleading System Health section and deployed Azure Workbook infrastructure for proper KQL-native monitoring.
The code shows that kql_available is set to False when BOTH historical_raw and queue_raw return empty dicts. This happens when:
- `log_analytics_workspace_id` is not configured (lines 44-46)
- The `azure-monitor-query` SDK isn’t installed (lines 54-55)
- `DefaultAzureCredential` fails to authenticate (lines 57-59)
- The KQL query itself fails (lines 111-113)
This is a common deployment gap: the feature code was added (historical.py), but the infrastructure wasn’t updated to provide the required configuration. The code gracefully handles this by setting kql_available=False, which triggers the warning banner you’re seeing.
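A sketch of that guard chain, reusing the function names described above but with invented details (not the production code):

```python
import os

def get_log_analytics_client():
    """Return a Logs query client, or None if any prerequisite is
    missing. Mirrors the guard chain described above; details assumed."""
    workspace_id = os.environ.get("LOG_ANALYTICS_WORKSPACE_ID", "")
    if not workspace_id:
        return None  # env var never set by the Bicep deployment
    try:
        from azure.monitor.query import LogsQueryClient  # SDK may be absent
        from azure.identity import DefaultAzureCredential
        return LogsQueryClient(DefaultAzureCredential())
    except Exception:
        return None  # SDK missing or credential failure

def fetch_with_fallback(run_query):
    """Return query results, or an empty dict so the dashboard can set
    kql_available=False and show a warning banner instead of crashing."""
    client = get_log_analytics_client()
    if client is None:
        return {}
    try:
        return run_query(client)
    except Exception:
        return {}  # the KQL query itself failed
```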
Why did systematic debugging help here?
- Instead of guessing (SDK issue? Auth issue? Network issue?), we traced the data flow: Dashboard → `kql_available=false` → `_fetch_with_fallback` returns `{}` → `get_log_analytics_client()` returns `None` → `log_analytics_workspace_id` is an empty string → env var not configured in Bicep
- The fix was a 2-line infrastructure change, not a code change; guessing would have led us down the wrong path
The logs reveal TWO different issues:
- The first query has invalid properties (likely the KQL query syntax or timespan format)
- The second query has a Python parsing error (the SDK response format differs from what we expected)
Root cause identified: The KQL queries use startswith "[METRIC]" but the actual logs have a Python logger prefix before [METRIC]. This means ZERO rows match the filter, causing the query to return empty results, which then causes the parsing error ('str' object has no attribute 'name') when trying to process an empty/error response.
The debugging journey revealed three layers of issues:
- Infrastructure: Missing env var (config not passed to container)
- Query filter: Wrong assumption about log format (Python logger adds prefix)
- Query syntax: Invalid KQL join/project syntax (not valid in Azure KQL)
Each fix only revealed the next issue - classic “peeling the onion” debugging.
The KQL logs show the truth:
- GPU transcriber last ran: ~15 hours ago (2026-01-24T00:12:54Z)
- CPU processor last ran: ~15 hours ago (2026-01-24T00:15:17Z)
But the dashboard shows “28 days ago” because that’s when someone last clicked “Run Now”, not when the job actually ran.
The System Health section relies on last_*_run timestamps stored in BatchState, but these only update when manually triggering jobs via “Run Now” buttons. The cron-triggered jobs don’t update these timestamps - they emit metrics to Log Analytics instead. The KQL-backed 7-day success rates give a more accurate picture of system health.
Why we removed System Health instead of fixing it:
- The timestamps came from `BatchState`, which only updated on manual “Run Now” clicks
- Cron-triggered jobs emit metrics to Log Analytics, not to BatchState
- Fixing this properly would require either:
- Jobs writing timestamps on every cron run (adds latency + DB writes)
- KQL queries for last job timestamps (complex, different data source)
- The 7-day success rates from KQL already tell you if jobs are healthy
Azure Workbooks vs Dashboards:
- Workbooks (`Microsoft.Insights/workbooks`) are Log Analytics-native, support KQL queries, and are linked to a workspace
- Dashboards (`Microsoft.Portal/dashboards`) are Azure Portal dashboards with pinned tiles from various sources
- Workbooks are better for KQL-heavy observability since they can reference the workspace directly and support interactive parameters
The log format issue:
Expected: "[METRIC] {"event": ...}" → substring(Log_s, 9) works
Actual: "2026-01-24 15:05:27 - [METRIC] {"event": ...}" → need indexof
Using indexof(Log_s, "{") finds the JSON start dynamically regardless of timestamp length. This is the same fix we made to src/api/admin/kql/queries.py earlier.
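The same fix mirrored in Python, showing why a prefix breaks a startswith filter and why searching for the first `{` does not (hypothetical helper, not the production code):

```python
import json

def extract_metric(log_line):
    """Pull the JSON payload out of a [METRIC] log line by locating
    the first '{' (same idea as KQL's indexof(Log_s, "{")), instead of
    assuming '[METRIC]' starts the line."""
    if "[METRIC]" not in log_line:
        return None  # not a metric line; a startswith filter also skips these
    start = log_line.index("{")  # works regardless of logger prefix length
    return json.loads(log_line[start:])
```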
JSON escape sequences:
- `\\n` in JSON = literal backslash + n (two characters: `\` and `n`)
- `\n` in JSON = actual newline character
In KQL embedded in JSON, you need \n for line breaks within the query string.
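A quick Python check of the two escape forms (the string literals below are the raw JSON source text):

```python
import json

# JSON source text "\\n" decodes to two characters: a backslash and n.
two_chars = json.loads('"\\\\n"')
# JSON source text "\n" decodes to a single newline character, which is
# what a multi-line KQL query embedded in JSON needs.
newline = json.loads('"\\n"')
```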
KQL bin() column naming:
// Auto-named column (hard to reference)
summarize count() by bin(timestamp, 1d)
// Explicitly named column (easy to reference)
summarize count() by ts = bin(timestamp, 1d)
| project ts, ... // works!
Always name bin() columns explicitly when you need to reference them later.
claude-transcript-viewer
Set up CI/CD publishing via GitHub Actions with npm trusted publishing, and configured the embedding server architecture for semantic search. Worked through npm 2FA blockers for automated publishing, switching to granular access tokens and then OIDC-based trusted publishing. Also explored the embedding dimension trade-offs and ONNX model conversion for the search backend.
Version-gated publishing: The workflow checks if the version in package.json differs from npm before publishing. This prevents duplicate publish attempts and lets you control releases by bumping the version.
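The gate's logic, sketched in Python as a stand-in for the workflow step (`npm view <pkg> version` is the real command such a check would shell out to; function names are invented):

```python
import json
import subprocess

def local_version(pkg_json_path="package.json"):
    """Version declared in the working tree's package.json."""
    with open(pkg_json_path) as f:
        return json.load(f)["version"]

def published_version(pkg_name):
    """Latest version npm knows about, or None if never published."""
    out = subprocess.run(["npm", "view", pkg_name, "version"],
                         capture_output=True, text=True)
    return out.stdout.strip() or None

def should_publish(local, published):
    # Publish only when the version has been bumped past npm's copy
    # (or the package has never been published at all).
    return published is None or local != published
```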
Registry URL matters: The registry-url option in setup-node configures npm to authenticate with that registry, enabling the NODE_AUTH_TOKEN env var to work.
npm 2FA and CI/CD: Your npm account has 2FA enabled, which blocks automated publishes with regular tokens. You need an automation token (also called “granular access token” or “publish token”) that bypasses OTP for CI workflows.
Granular npm tokens: Unlike classic tokens, granular access tokens can bypass 2FA/passkey requirements for automation. They’re scoped to specific packages and permissions, making them ideal for CI/CD.
npm pkg fix: Automatically corrects common package.json issues like path normalization (./dist → dist) and URL formats.
Trusted Publishing uses OpenID Connect (OIDC) to let GitHub Actions authenticate directly with npm - no stored tokens needed. GitHub proves the workflow’s identity to npm, and npm grants temporary publish permissions. This is more secure because:
- No long-lived secrets to leak
- Permissions are scoped to specific repos/workflows
- Automatic token rotation on every run
Two-path usage: The viewer takes two paths:
- Archive path (required): where the HTML transcripts live (generated by `claude-code-transcripts`)
- Source path (optional): where the raw JSONL files are (`~/.claude/projects`); enables search indexing
The viewer uses a client-server architecture for embeddings:
- Viewer (what you installed) - handles indexing, search, and the web UI
- Embedding server (separate) - converts text to vectors for semantic search
This separation allows using different embedding backends (local MLX models, remote APIs, etc.) without changing the viewer.
Embedding dimension isn’t a quality measure - it’s a trade-off. Smaller dimensions (384) are faster to compare and store less data. Larger dimensions (768-1024) can capture more nuance but have diminishing returns. For transcript search, 384-768 is plenty.
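Both costs scale linearly with dimension, which a back-of-the-envelope helper makes concrete (numbers are illustrative):

```python
def search_cost(dim, n_docs=10_000):
    """Storage (float32 bytes) and flop count for one brute-force
    similarity query over n_docs embeddings at a given dimension."""
    storage_bytes = n_docs * dim * 4      # 4 bytes per float32 component
    flops_per_query = 2 * n_docs * dim    # one dot product per document
    return storage_bytes, flops_per_query
```

Going from 384 to 1024 dimensions multiplies both storage and per-query work by roughly 2.7x, while retrieval quality typically improves much less than that.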
ONNX is a standardized format for ML models. Converting from MLX/PyTorch to ONNX preserves the model’s learned weights - the math is identical, just executed by a different engine. Minor floating-point differences (1e-6) may occur but don’t affect search quality.
The implementation uses a process manager pattern - Node.js spawns Python as a child process and communicates via HTTP. This is common for polyglot applications where you want to leverage the best tool for each job (MLX for ML, Node.js for web serving).
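A toy stand-in for that child process, assuming a JSON-over-HTTP contract (the real server runs an actual MLX/ONNX model; the hashing below is only a placeholder, and the request shape is an assumption):

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class EmbedHandler(BaseHTTPRequestHandler):
    """Accepts POST {"texts": [...]} and returns {"embeddings": [...]}."""
    def do_POST(self):
        body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        dim = 8  # real servers use 384-1024 dimensions
        # Placeholder "embeddings": deterministic per process, fixed size.
        vectors = [[(hash((t, i)) % 1000) / 1000 for i in range(dim)]
                   for t in body["texts"]]
        payload = json.dumps({"embeddings": vectors}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):
        pass  # keep the child process quiet on Node's stderr

def serve(port=0):
    """Bind the server (port 0 picks a free port); the parent process
    (Node.js in the viewer's case) spawns this and talks HTTP to it."""
    return HTTPServer(("127.0.0.1", port), EmbedHandler)
```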