Dev Log: February 14, 2026
Courses
Continued analysis of FGD vs. Gavel scheduling experiment results. Dug into the comparison data and discovered that FGD-only’s lower average JCT is misleading because it saturates at every load level, completing far fewer jobs than Gavel. Built a data pipeline and canvas-based comparison chart system to visualize the full results across 117 raw result JSONs, with error bands and per-seed markers for variance visibility.
Key findings:
- FGD-only wins on avg JCT at every load level — but this is misleading because it saturated at every load and only completed 15-73K jobs vs Gavel’s 44-358K. Lower JCT + fewer completions means it’s only finishing “easy” jobs.
- Gavel stays unsaturated up to ~160 jph (completing 120-180K window jobs), while FGD-only saturates everywhere — confirming heterogeneity-aware allocation is critical.
- Gavel+FGD converges toward FGD at high loads (both saturated), but handles medium loads (135-210 jph) much better than FGD-only.
- The num_completed_jobs metric is essential for correct interpretation — a lower JCT with dramatically fewer completions is worse, not better.
Architecture of the comparison charts:
- Data pipeline separation — The Python script (generate_comparison.py) aggregates the 117 raw result JSONs into a single full_comparison.json with pre-computed means, min/max, and per-seed values. This keeps the JS renderer simple and avoids fetching 117 files at load time.
- Canvas rendering for consistency — The comparison charts use the same custom Canvas 2D approach as the existing time-series charts, matching the dark theme and hover tooltip patterns. No external chart library needed.
- Error bands + seed markers — Each line shows the mean across 3 seeds with a shaded min/max band (10% opacity fill) and small dots for individual seed values. This gives both the trend and the variance at a glance.
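The per-cell statistics behind the error bands can be sketched as one small pure function. This is an illustrative TypeScript version of the aggregation idea, not the actual generate_comparison.py code; the names (CellStats, aggregateSeeds) are hypothetical.

```typescript
// Illustrative sketch: collapse the per-seed values for one (policy, load)
// cell into the stats the renderer plots — mean line, min/max band, seed dots.
interface CellStats {
  mean: number;    // line position
  min: number;     // bottom of the shaded error band
  max: number;     // top of the shaded error band
  seeds: number[]; // individual per-seed markers
}

function aggregateSeeds(seedValues: number[]): CellStats {
  if (seedValues.length === 0) throw new Error("no seed values");
  const sum = seedValues.reduce((a, b) => a + b, 0);
  return {
    mean: sum / seedValues.length,
    min: Math.min(...seedValues),
    max: Math.max(...seedValues),
    seeds: [...seedValues],
  };
}
```

Doing this pre-computation once, server-side, is what lets the browser fetch a single full_comparison.json instead of 117 raw files.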
OpenClaw
Investigated the recurring 7-day token expiry issue with Google OAuth, which turned out not to be specific to the project but a platform-level constraint.
The 7-day token expiry isn’t a gog-specific problem — it’s a Google OAuth limitation for any app with a consent screen in “Testing” mode. Tools avoid this in two ways: (1) they ship their own published/verified OAuth client ID, or (2) they use a different auth mechanism entirely (like service accounts or CalDAV).
Short Projects
Worked on the AI productivity gap blog post, building the argument that AI’s impact follows a power law distribution rather than a normal one. Generated comparison charts in parallel, got code review feedback on the draft, and started building image-level commenting support for the blog’s feedback system.
This is a really interesting thesis: AI’s productivity impact follows a power law distribution, not a normal one. The people who benefit most are already in the top percentile, so the aggregate organizational lift is small. This is a classic “integral under the curve” argument.
- The scaling gap is real: 88% of orgs say they “use AI,” but only 7% have fully scaled it. That’s the headline lie — “adoption” is being measured at the org level, not the employee level.
- Anthropic’s own data shows a power law: Gini coefficients of 0.84-0.86 for usage concentration. Frontier workers send 6x more messages than median users. That’s your integral problem in hard numbers.
- 74% of companies show no tangible value from AI investments — despite all the adoption headlines.
Running the two chart implementations in parallel is safe here because they write to completely separate files (01_productivity_curve.py vs 02_capability_adoption_gap.py). This is the ideal subagent pattern: independent tasks with no shared state.
The draft is strong. A few things that make it work well as an argument piece:
- The Gini coefficient data point (0.84-0.86) is devastatingly effective because it’s a single number that proves the power law thesis
- The “tactical tornado” attribution to Ousterhout gives academic weight to the quality argument
- The closing line (“The integral only changes when the whole curve moves. Not just the tail.”) calls back to the mathematical framing in a way that sticks
The existing commenting system is built around text selection (<mark> wrapping). Images need a fundamentally different approach: click-to-comment instead of select-to-comment, CSS outline instead of <mark> tags, and an imgSrc anchor instead of paragraphContext/charOffset for restoring highlights from localStorage.
- Text comments anchor to a specific character offset within a paragraph (for precise highlight restoration). Image comments anchor to the src attribute instead, since images are unique elements.
- The data-comment-ids attribute on images uses a space-separated ID list to support multiple comments per chart, unlike text <mark> elements, which are 1:1 with comments.
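A minimal sketch of the two anchor shapes and the space-separated ID handling. The type and function names here are illustrative, not the blog’s actual code.

```typescript
// Illustrative anchor types: text comments restore from paragraph context
// plus character offset; image comments restore from the image's src alone.
type CommentAnchor =
  | { kind: "text"; paragraphContext: string; charOffset: number }
  | { kind: "image"; imgSrc: string };

// data-comment-ids holds a space-separated list so one chart can carry
// multiple comments (unlike the 1:1 <mark> elements used for text).
function addCommentId(existing: string | null, id: string): string {
  const ids = (existing ?? "").split(" ").filter(Boolean);
  if (!ids.includes(id)) ids.push(id); // de-duplicate on repeat clicks
  return ids.join(" ");
}
```

The click handler would call addCommentId with the image’s current attribute value and write the result back, so every comment on a chart stays addressable.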
Personal Projects
Major planning day for the Nanny Timesheet app. Started with a brainstorming session to explore requirements, then designed the data model around a transaction ledger pattern (the same approach banks use for account balances). Chose SQLite for the database, magic links for auth, and front-loaded all core business logic with test coverage before any UI work. Ran a Codex review on the implementation plan, which caught a timezone date parsing bug and a missing email allowlist, both of which would have been painful to debug in production. Also scoped out E2E testing strategy with a test-only login endpoint to bypass magic link auth in CI.
Before diving into building anything, let’s use the brainstorming skill to properly explore your needs, requirements, and design options. This ensures we build the right thing rather than just building something quickly.
Your spreadsheet is essentially a double-entry ledger — accruals on one side, usage on the other, with a running balance. This is a very common pattern that maps well to a simple database with “transactions” (accruals and usages) against “accounts” (sick, vacation).
SQLite sometimes gets a reputation as “local-only” because it’s often used in mobile apps (e.g., every iPhone app uses SQLite locally). But it works perfectly well as a server-side database too — it’s just a file on the server’s disk. For a 2-user app, it handles concurrent reads/writes fine with WAL mode enabled. Sites like Pieter Levels’ Nomad List famously run SQLite in production serving millions of requests.
Using a transaction ledger pattern (instead of storing a mutable “balance” field) means the balance is always SUM(hours) WHERE user_id = X AND type = 'sick'. This makes the system auditable — you can always see exactly why the balance is what it is, and you never get into a state where the balance disagrees with the history. It’s the same pattern banks use for account balances.
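In code, the ledger pattern means the balance is always derived, never stored. A sketch under assumed names (the real schema may differ):

```typescript
// Hypothetical transaction row: positive hours for accruals, negative for usage.
interface LedgerTx {
  userId: string;
  type: "sick" | "vacation";
  hours: number; // accrual > 0, usage < 0
}

// In-memory equivalent of:
//   SELECT SUM(hours) FROM transactions WHERE user_id = ? AND type = ?
function balance(txs: LedgerTx[], userId: string, type: LedgerTx["type"]): number {
  return txs
    .filter((t) => t.userId === userId && t.type === type)
    .reduce((sum, t) => sum + t.hours, 0);
}
```

Because the balance is just a sum over the history, it can never disagree with the history, and every number is explainable by the rows beneath it.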
Magic links are ideal here because: (a) Jessa doesn’t need to remember a password, (b) for a 2-user app the security threat model is minimal, and (c) it’s significantly less code to build than a full auth system. NextAuth.js has built-in magic link support that handles token generation, expiry, and session management.
The plan front-loads all the core business logic (Phase 2) with pure functions and full test coverage before touching any UI or API code. This means the tricky parts — balance computation, accrual math, schedule-aware hour counting — are proven correct before they get wired into the app. It also means these functions can be reused across API routes, cron jobs, and email builders without duplication.
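The accrual math is the kind of pure function Phase 2 front-loads. The rate and rounding rule below are placeholders, since the actual accrual policy isn’t spelled out in this log:

```typescript
// Placeholder accrual rule (the real policy isn't specified here).
// Pure function: trivially unit-testable, and reusable from API routes,
// cron jobs, and email builders without duplication.
function accruedHours(hoursWorked: number, hoursPerHourWorked: number): number {
  if (hoursWorked < 0 || hoursPerHourWorked < 0) {
    throw new Error("negative input");
  }
  // round to 2 decimals so ledger rows stay tidy
  return Math.round(hoursWorked * hoursPerHourWorked * 100) / 100;
}
```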
Codex found several real issues that would have caused bugs during implementation. The two biggest wins: (1) the timezone date parsing bug would have caused tests to fail or produce wrong day-of-week calculations in US timezones, and (2) the missing email allowlist would have left the app open to anyone who discovered the URL. Both are easy to fix in the plan but painful to debug in production.
What the Codex review caught:
The two highest-impact issues were (1) a security gap where any email could authenticate and access nanny data, and (2) a timezone date parsing bug where new Date("2026-02-16") creates UTC midnight, which in US timezones shifts .getDay() to the previous day — meaning Monday becomes Sunday. Both would have been difficult to diagnose in production.
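The bug reproduces in any JS runtime west of UTC, and the fix is to construct the date from its components rather than an ISO date-only string. The localDate helper name is mine, not the plan’s:

```typescript
// Buggy: a date-only ISO string is parsed as *UTC* midnight, so in US
// timezones .getDay() reports the previous local day (Monday -> Sunday).
const buggy = new Date("2026-02-16"); // 2026-02-16T00:00:00Z

// Fix: split out the components and build a local-time date instead.
function localDate(iso: string): Date {
  const [y, m, d] = iso.split("-").map(Number);
  return new Date(y, m - 1, d); // month is 0-based; local midnight
}

const fixed = localDate("2026-02-16");
// fixed.getDay() is 1 (Monday) in every timezone;
// buggy.getDay() is 0 (Sunday) anywhere west of UTC.
```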
Pattern worth noting: The transaction ledger approach is fundamentally sound, but Codex correctly identified that the implementation missed key guards: idempotency constraints, input validation, and HTML escaping. These are the kinds of “boring but important” details that often get skipped in first implementations and cause issues later.
For a small app like this, E2E tests serve a different purpose than in a large codebase. Rather than catching regressions across a big team, they act as a smoke test suite you can run after any change to confirm the core flows still work: “can Jessa log in, see her balance, and submit a request?” This is especially valuable when you’re the sole developer and may come back to this code months later.
The trickiest part of E2E testing with magic link auth is the login step — you can’t click a magic link in a test email inbox. The solution is a test-only login endpoint (/api/auth/test-login) that directly creates a session in the database and sets the cookie, bypassing email entirely. This endpoint is gated behind NODE_ENV !== "production" so it can never be used in prod. This is a common pattern in apps that use passwordless auth.
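A sketch of that gate. The route path comes from the log; the handler body is hypothetical and depends on the session layer, but the environment check at the top is the non-negotiable part:

```typescript
// The environment gate, factored out so it's trivially testable: the
// test-login route must refuse to exist in production.
function testLoginAllowed(nodeEnv: string | undefined): boolean {
  return nodeEnv !== "production";
}

// Hypothetical Next.js route handler for /api/auth/test-login; the session
// creation is sketched, not the app's real code.
export async function POST(req: Request): Promise<Response> {
  if (!testLoginAllowed(process.env.NODE_ENV)) {
    return new Response("Not Found", { status: 404 }); // hide the route in prod
  }
  const { email } = (await req.json()) as { email: string };
  // ...create a session row for `email` in the database here...
  return new Response(JSON.stringify({ ok: true }), {
    status: 200,
    headers: { "Set-Cookie": `session=test-session-for-${email}; HttpOnly; Path=/` },
  });
}
```

Returning 404 (rather than 403) means production gives no hint the endpoint exists at all.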