Dev Log: February 2, 2026
openclaw
Deep dive into the OpenClaw gateway’s security enforcement model. Discovered that the exec tool has three host modes (sandbox, gateway, node) and that the allowlist/approval enforcement only runs in the gateway code path, meaning commands were executing with zero security checks under the default sandbox mode. Built out a defense-in-depth architecture with three enforcement layers: a gateway allowlist for safe commands, approval forwarding routed to the admin’s Signal (not the chat sender), and dynamic allowlist manipulation tied to grant activation and revocation. Also investigated the exec-approval forwarder’s delivery pipeline and its RPC failures when sending via Signal.
The gateway has two allowlist mechanisms:
- Path-based allowlist (`matchAllowlist`) — matches patterns containing `/` or `~` against resolved executable paths (e.g., `/usr/bin/himalaya`)
- Safe bins (`isSafeBinUsage`) — a hardcoded set of command names that are always allowed: `jq`, `grep`, `cut`, `sort`, `uniq`, `head`, `tail`, `tr`, `wc`. These are checked by executable name, not path.
Our patterns like `ls *` were path-less, so the path-based matcher ignored them. And `himalaya` isn't in the safe-bins list. But the bot still ran it, which means `security: "allowlist"` with `ask: "on-miss"` prompts the user (the WhatsApp/Signal sender) for approval rather than blocking outright.
The OpenClaw gateway's allowlist operates inside the exec system and has quirks (path-based pattern matching, sender-approves-own-request). Claude Code plugins offer a fundamentally different enforcement layer: a PreToolUse hook registered via `hooks/hooks.json` runs before the tool executes — Claude Code itself calls the hook script, passes the tool call as JSON on stdin, reads the exit code, and blocks the call if the script exits 2. The agent never gets a chance to run the command. This is a much more reliable enforcement point: it's deterministic, it runs before execution, and we control the logic entirely. It's also defense-in-depth — even if the gateway's allowlist fails, the hook catches it.
The three-hook approach provides defense at every tool boundary:
- Bash hook: catches `himalaya send`, `npm install`, `curl`, `systemctl`, etc. by matching command patterns
- Write/Edit hook: protects plugin configs, skills, and gateway settings from modification
- Read hook: prevents access to credential files (`.env`, `.pem`, `.key`, `.ssh/*`)
Each hook independently reads `exec-approvals.json` for grant state, and the full grant type (or `security: "full"`) acts as an override that allows everything. The hooks are stateless shell scripts — no daemon, no state to corrupt. The agent can't bypass them because Claude Code itself enforces the hook before the tool runs.
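The decision logic is simple enough to sketch. The actual hooks are shell scripts; the Python below is just for readability, and the JSON field names (`tool_input.command`) and the `exec-approvals.json` schema are assumptions for illustration, not confirmed details:

```python
import json
import re
import sys
from pathlib import Path

# Patterns the Bash hook denies (taken from the hook description above).
DENY_PATTERNS = [r"\bhimalaya\s+send\b", r"\bnpm\s+install\b",
                 r"\bcurl\b", r"\bsystemctl\b"]

def full_grant_active(approvals_path: Path) -> bool:
    """True if exec-approvals.json records a full grant (schema assumed)."""
    try:
        state = json.loads(approvals_path.read_text())
        return state.get("grant") == "full"
    except (OSError, json.JSONDecodeError):
        return False  # no state file -> no override

def decide(event: dict, full_grant: bool) -> int:
    """Return the hook's exit code: 0 allows the tool call, 2 denies it."""
    if full_grant:
        return 0  # the full grant overrides every check
    command = event.get("tool_input", {}).get("command", "")
    for pattern in DENY_PATTERNS:
        if re.search(pattern, command):
            print(f"blocked: command matches {pattern}", file=sys.stderr)
            return 2  # exit code 2 tells Claude Code to block the tool call
    return 0

def main() -> int:
    # Claude Code pipes the tool call as JSON on stdin.
    event = json.load(sys.stdin)
    return decide(event, full_grant_active(Path("exec-approvals.json")))
```

In the real hook this would be driven by `sys.exit(main())` with the event piped on stdin; keeping the decision in a pure function makes the deny rules trivial to test.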
Key discovery: OpenClaw's exec-approval-forwarder has a configurable `mode` setting. By default it's `"session"`, which sends approval requests to the chat sender (who trivially approves themselves). Setting `mode: "targets"` routes approvals to specific recipients — like the admin's Signal. Combined with path-based allowlist patterns, this gives us the exact enforcement model we need: safe commands pass through, sensitive commands require admin approval via Signal.
The final architecture has three enforcement layers:
- Gateway allowlist (path-based patterns) — safe commands like `/usr/bin/ls` and `/usr/bin/git` pass through immediately
- Approval forwarding (`mode: "targets"`) — commands not in the allowlist trigger an approval request to the admin via Signal, not the chat sender
- Dynamic allowlist manipulation — when grants are activated, `setGrant()` adds the relevant paths (e.g., `~/.local/bin/himalaya *` for the email grant) to the allowlist; `revokeGrant()` removes them. The systemd timer still auto-revokes.
This is defense-in-depth: the gateway does the enforcement, the plugin manages the allowlist lifecycle, and the admin approves via out-of-band Signal.
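For reference, a sketch of what the combined configuration could look like. Only `tools.exec.host`, `security`, `ask`, the allowlist patterns, and the forwarder `mode` are taken from the log; the surrounding JSON structure is a guess, not OpenClaw's actual schema:

```json
{
  "tools": {
    "exec": {
      "host": "gateway",
      "security": "allowlist",
      "ask": "on-miss",
      "allowlist": ["/usr/bin/ls *", "/usr/bin/git *"]
    }
  },
  "exec-approval-forwarder": {
    "mode": "targets",
    "targets": ["signal:uuid:d4e31a04-..."]
  }
}
```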
Root cause found: the gateway's exec tool has three host modes: `sandbox` (default), `gateway`, and `node`. The allowlist/approval enforcement only runs in the gateway code path. When `host === "sandbox"` with no Docker container configured, the command executes directly with zero security checks — it falls through the `if (host === "node")` and `if (host === "gateway")` blocks straight to `runExecProcess()`. Setting `tools.exec.host = "gateway"` routes commands through the code path at `bash-tools.exec.js:915`, where `resolveExecApprovals()`, `evaluateShellAllowlist()`, and `requiresExecApproval()` are actually called.
The exec-approval forwarder builds a target key `signal:uuid:d4e31a04-...` and uses `deliverOutboundPayloads` — a generic outbound delivery pipeline. But regular Signal replies use the conversation's own reply path (signal provider -> `sendMessageSignal` directly). The RPC `-1` error means signal-cli is rejecting the send, possibly because the delivery pipeline constructs the request differently than the reply path.
courses
Worked on Gavel scheduler replication, focusing on GPU sharing visualization and comparison with published paper results. Investigated a phase-transition artifact in the GpuClustering policy caused by heterogeneous node sizes, digitized reference curves from the paper for visual comparison, and validated that the replication shows the same qualitative patterns as the published figures.
- This is actually correct behavior for GpuClustering — it IS a spreading strategy, but it spreads within the "best" node tier first. The plateau reveals heterogeneity in node sizes.
- The bump is an artifact of the heterogeneous cluster combined with GpuClustering's greedy "most free GPUs" heuristic: it creates a phase transition where the policy exhausts its preferred node tier and suddenly floods into smaller nodes.
- The paper's reference curves are likely smoother because the Go reference implementation may break ties or handle node selection differently, or the paper's cluster may have a more uniform node size distribution.
- The key question is whether the paper's plots show the same plateau. If their GpuClustering line is smooth and linear, it suggests either a different cluster setup or a different implementation of the policy.
- This same phase transition explains the occupied-nodes plateau we saw earlier in Fig 9b — they're the same phenomenon viewed through two different metrics.
The FGD comparison approach is a strong pattern: digitize paper values into a JSON “ground truth” file, then plot with dashed lines. This lets you visually validate how close your replication is to the published results without needing to re-run the paper authors’ code. The JSON also serves as documentation of what the paper claims.
The key figures to digitize are:
- Fig 9a (page 11): Single-GPU LAS, shows 5 curves (LAS, LAS w/ Gandiva SS, AlloX, Gavel, Gavel w/ SS), x: 0-8 jobs/hr, y: 0-100 hours
- Fig 10a (page 12): Multi-GPU LAS, shows 4 curves (LAS, LAS w/ Gandiva SS, Gavel, Gavel w/ SS), x: 0-3 jobs/hr, y: 0-100 hours
- Fig 11a (page 13): Multi-GPU FTF, shows 2 curves (Minimize FTF, Gavel), x: 0-3.5 jobs/hr, y: 0-100 hours
For the replication, we only compare Gavel (heterogeneity-aware) vs its baseline (heterogeneity-agnostic), so we need the LAS and Gavel curves from Figs 9a/10a, and Minimize FTF and Gavel from Fig 11a.
The FGD pattern uses: solid lines for “ours”, dashed lines with alpha=0.5 for “paper reference”. The key design decisions:
- Auto-detect `paper_reference_curves.json` next to the script (same as FGD)
- Plot reference curves first (behind) with dashed lines so our data overlays on top
- Use matching colors but distinguish by linestyle (solid=ours, dashed=paper)
- The `_perf` suffix in our policies corresponds to "Gavel" (heterogeneity-aware), and the non-`_perf` variant corresponds to "Baseline" (heterogeneity-agnostic, equivalent to LAS/FTF)
Key observations from the comparison:
- Our replication shows the same qualitative pattern as the paper — the heterogeneity-aware Gavel policy (green) consistently outperforms the baseline (red), and both exhibit the “hockey stick” saturation curve.
- Our absolute JCT values are higher at low load (~15-20h vs paper’s ~4-8h). This is expected because our simulation uses a different cluster config (36:36:36 GPUs replicated locally) while the paper’s figures aggregate over many seeds on long traces. The relative improvement trend matches.
- The `zorder` parameter in matplotlib ensures our solid curves render on top of the dashed reference, making the comparison easy to read without visual clutter.
personal-finance
Built out a YNAB integration for budget tracking, including an API client layer, SQLite-backed transaction syncing, and a budget status command with smart categorization thresholds.
Task 1 established the YNAB API client layer. The `millunitsToDollars` formula `Math.round(milliunits / 10) / 100` converts in stages: divide by 10 (milliunits to centiunits), round to kill floating-point noise, then divide by 100 to get dollars. This avoids the classic `0.1 + 0.2 !== 0.3` JavaScript floating point trap.
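A Python mirror of the same conversion, for illustration. One caveat: Python's built-in `round()` uses banker's rounding at exact halves while JS `Math.round` rounds half up, so the sketch rounds half away from zero explicitly to stay closer to the JS behavior:

```python
def milliunits_to_dollars(milliunits: int) -> float:
    """Mirror of Math.round(milliunits / 10) / 100 from the JS client."""
    centiunits = milliunits / 10  # e.g. 123456 -> 12345.6
    # Round half away from zero (Math.round-like for positive inputs).
    rounded = int(centiunits + (0.5 if centiunits >= 0 else -0.5))
    return rounded / 100          # 12346 -> 123.46
```

Rounding at the centiunit step means the final division by 100 is exact to the cent, sidestepping accumulated float error.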
The UNIQUE constraint on `ynab_id` in `ynab_transactions` enables idempotent syncing. Using `INSERT OR REPLACE` (SQLite's upsert), re-syncing the same transactions just overwrites them rather than creating duplicates. This is critical for a sync-from-API pattern where you fetch "all transactions since date X" and may overlap with previously synced data.
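A minimal demonstration of the idempotency, using Python's `sqlite3`. The column set beyond `ynab_id` is illustrative, not the actual table schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE ynab_transactions (
        ynab_id TEXT UNIQUE,       -- UNIQUE makes re-sync idempotent
        amount_milliunits INTEGER,
        payee TEXT
    )
""")

def sync(transactions):
    # INSERT OR REPLACE: rows with an existing ynab_id are overwritten,
    # so overlapping fetches never create duplicates.
    conn.executemany(
        "INSERT OR REPLACE INTO ynab_transactions VALUES (?, ?, ?)",
        transactions,
    )

sync([("txn-1", -4500, "Coffee"), ("txn-2", -120000, "Groceries")])
sync([("txn-2", -120000, "Groceries"), ("txn-3", -9990, "Lunch")])  # overlap

count = conn.execute("SELECT COUNT(*) FROM ynab_transactions").fetchone()[0]
```

After both syncs the table holds three rows, not four: the overlapping `txn-2` was replaced in place.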
The budget status command uses a 5% over / 20% under threshold for categorization. This is a pragmatic approach: categories slightly over budget (within 5%) are “on_track” to avoid false alarms from timing differences. Categories significantly under budget (below 80% of expected YTD) are flagged as “under” which could indicate either savings or under-spending that might need reallocation.
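The categorization described above reduces to a small pure function. The threshold values come from the log; the status labels and the comparison against the budgeted amount (rather than a prorated YTD figure) are assumptions about the command's internals:

```python
def categorize(spent: float, budgeted: float) -> str:
    """Bucket a category by spend vs budget with asymmetric tolerances."""
    if budgeted <= 0:
        return "no_budget"       # nothing allocated for this category
    if spent > budgeted * 1.05:  # more than 5% over -> genuinely over budget
        return "over"
    if spent < budgeted * 0.80:  # below 80% -> savings or under-spending
        return "under"
    return "on_track"            # within tolerance, incl. slightly over
```

The asymmetry is the point: a 4% overage is usually a timing artifact and stays "on_track", while a 20% shortfall is big enough to warrant a reallocation look.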
The zero-budget result from the live test is an important architectural observation: YNAB’s /months/current endpoint returns the current month’s budget, which may be empty if the user hasn’t set February 2026 budgets yet. For a more robust allocation, you could query a month that has actual budgets (like January 2026) or query /months to find the latest month with non-zero budgets.