Bhavana AI

AI/ML insights

Dev Log: January 6, 2026

podcast-summarizer-v2

Built out the distributed job processing infrastructure for the podcast summarizer. Started by discovering that the existing BatchState table and Delivery model already provided most of the schema foundations needed, then initialized alembic and created migrations for lease columns, an expired status, and batch-state seeding. Implemented a LeaseManager with atomic acquire/heartbeat/release operations, a CAS-based orchestrator wakeup with throttling, a sweeper for stuck deliveries, and the orchestrator router itself. Later adapted the codebase to the actual user tier values ("premium" rather than the planned "paid") and extended the existing multi-stage Dockerfile rather than creating separate files.

Key architectural context discovered:

  1. BatchState table already exists with key, value_int, value_text columns - perfect for lease storage without schema changes
  2. Delivery model has validation_job_id for priority signaling - the foundation for validation priority is already in place
  3. No alembic migrations directory exists - will need to initialize alembic first

Phase 1 Complete! Schema foundation laid:

  • 3 migrations for lease columns, expired status, and batch_state seeding
  • Delivery model now has processing_owner_id, processing_started_at, processing_lease_expires_at, processing_attempts
  • batch_state rows seeded for orchestrator and transcriber leases
  • Total: 3/33 tasks done (9%)
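A minimal sketch of the seeded state described above, using SQLite for illustration (the real migrations run through alembic). The column roles here are an assumption: `value_text` holds a lease owner and `value_int` a unix-timestamp expiry; the lease key names are hypothetical.

```python
import sqlite3

# Sketch of the batch_state schema the log describes: key, value_int,
# value_text. SQLite in-memory stands in for the real database.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE batch_state (
        key        TEXT PRIMARY KEY,  -- lease name, e.g. 'orchestrator_lease'
        value_int  INTEGER,           -- lease expiry as a unix timestamp
        value_text TEXT               -- current owner_id, NULL when unheld
    )"""
)
# Seed one row per lease so later atomic UPDATEs never need an INSERT path.
for key in ("orchestrator_lease", "transcriber_lease"):
    conn.execute("INSERT INTO batch_state (key) VALUES (?)", (key,))
conn.commit()
print(conn.execute("SELECT count(*) FROM batch_state").fetchone()[0])  # 2
```

Pre-seeding the rows is what lets every later lease operation be a single guarded `UPDATE` rather than an upsert.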

Clock-skew tradeoff noted:

  • Plan specified SYSUTCDATETIME() for production DB time
  • Implementation uses Python time.time() for SQLite compatibility
  • Acceptable because: 30-min TTL >> typical Azure clock drift (~ms)
  • Could be enhanced later with dialect detection if needed
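The possible dialect-detection enhancement could look like the sketch below. The helper name and the mixed return type are hypothetical; the idea is simply to prefer the server clock (`SYSUTCDATETIME()` on SQL Server) and fall back to the app clock on SQLite, where the 30-minute TTL dwarfs any realistic drift.

```python
import time

def db_now(dialect_name):
    """Hypothetical sketch: pick a 'now' source per SQL dialect.

    Returns the SQL expression for server-side UTC time on SQL Server,
    or a client-side unix timestamp (int) for SQLite and others.
    """
    if dialect_name == "mssql":
        return "SYSUTCDATETIME()"  # database's own clock, no app drift
    return int(time.time())        # app clock; fine when TTL >> drift
```

The branch would be chosen once from the SQLAlchemy engine's dialect name, so the lease code itself stays dialect-agnostic.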

Phase 2 Complete! LeaseManager API:

  • acquire(key, owner_id) - Atomic acquisition with single-holder guard (M1, M4)
  • heartbeat(key, owner_id) - Extend lease during long operations
  • release(key, owner_id) - Explicit lease release
  • All support concurrent access with atomic UPDATE (M27)
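The LeaseManager semantics can be sketched with raw SQLite as below. The column use is an assumption carried over from the schema notes (`value_text` = owner, `value_int` = expiry); a lease is free when it has no owner or its expiry has passed, and each operation is a single guarded `UPDATE` whose rowcount says whether the caller won.

```python
import sqlite3
import time

LEASE_TTL = 30 * 60  # seconds; matches the 30-min TTL noted above

def acquire(conn, key, owner_id, now=None):
    now = int(now if now is not None else time.time())
    cur = conn.execute(
        "UPDATE batch_state SET value_text = ?, value_int = ? "
        "WHERE key = ? AND (value_text IS NULL OR value_int < ?)",
        (owner_id, now + LEASE_TTL, key, now),
    )
    conn.commit()
    return cur.rowcount == 1  # single-holder guard: only one UPDATE can win

def heartbeat(conn, key, owner_id, now=None):
    now = int(now if now is not None else time.time())
    cur = conn.execute(
        "UPDATE batch_state SET value_int = ? WHERE key = ? AND value_text = ?",
        (now + LEASE_TTL, key, owner_id),
    )
    conn.commit()
    return cur.rowcount == 1  # fails if someone else holds the lease

def release(conn, key, owner_id):
    cur = conn.execute(
        "UPDATE batch_state SET value_text = NULL, value_int = NULL "
        "WHERE key = ? AND value_text = ?",
        (key, owner_id),
    )
    conn.commit()
    return cur.rowcount == 1

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE batch_state (key TEXT PRIMARY KEY, value_int INTEGER, value_text TEXT)")
conn.execute("INSERT INTO batch_state (key) VALUES ('orchestrator_lease')")

assert acquire(conn, "orchestrator_lease", "worker-a")
assert not acquire(conn, "orchestrator_lease", "worker-b")  # already held
assert heartbeat(conn, "orchestrator_lease", "worker-a")
assert release(conn, "orchestrator_lease", "worker-a")
assert acquire(conn, "orchestrator_lease", "worker-b")      # free again
```

Because the guard lives in the `WHERE` clause, concurrent callers serialize on the row lock and exactly one sees `rowcount == 1`; no read-then-write race is possible.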

Phase 3 Complete! Wakeup CAS API:

  • trigger_orchestrator_if_idle(db, request_id) - Atomic CAS trigger
  • 60-second throttle window prevents rapid re-triggers
  • Rollback on failure enables retry (M5, M6, M7, M28)
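The CAS trigger above can be sketched the same way: one `batch_state` row (the key name `orchestrator_wakeup` is hypothetical) stores the last trigger time, and an `UPDATE` guarded by "older than the throttle window" is the compare-and-swap.

```python
import sqlite3
import time

THROTTLE_S = 60  # throttle window from the log

def trigger_orchestrator_if_idle(conn, request_id, now=None):
    now = int(now if now is not None else time.time())
    try:
        cur = conn.execute(
            "UPDATE batch_state SET value_int = ?, value_text = ? "
            "WHERE key = 'orchestrator_wakeup' "
            "AND (value_int IS NULL OR value_int <= ?)",
            (now, request_id, now - THROTTLE_S),
        )
        conn.commit()
        return cur.rowcount == 1  # True: we won the CAS, wake the orchestrator
    except sqlite3.Error:
        conn.rollback()  # rollback on failure so a later call can retry
        return False

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE batch_state (key TEXT PRIMARY KEY, value_int INTEGER, value_text TEXT)")
conn.execute("INSERT INTO batch_state (key) VALUES ('orchestrator_wakeup')")

t0 = 1_000_000
assert trigger_orchestrator_if_idle(conn, "req-1", now=t0)           # fires
assert not trigger_orchestrator_if_idle(conn, "req-2", now=t0 + 5)   # throttled
assert trigger_orchestrator_if_idle(conn, "req-3", now=t0 + 61)      # window passed
```

Storing the request id alongside the timestamp gives a cheap audit trail of which request last woke the orchestrator.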

Phase 4 Complete! Sweeper API:

  • sweep_stuck_deliveries(db) - Requeues expired deliveries, or fails them once max attempts are reached
  • Hard timeout (60min) catches misconfigured heartbeats
  • Sweeper runs at orchestrator start for self-healing (M19, M20)
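A sketch of the sweep semantics over the Phase 1 lease columns. The status names (`processing`, `queued`, `failed`) and the attempt cap are assumptions for illustration; the 60-minute hard timeout on `processing_started_at` is the backstop for workers whose heartbeats are misconfigured.

```python
import sqlite3

MAX_ATTEMPTS = 3           # assumption; the real cap isn't stated in the log
HARD_TIMEOUT_S = 60 * 60   # 60-min hard timeout from the log

def sweep_stuck_deliveries(conn, now):
    stuck = ("AND (processing_lease_expires_at < ? "
             "OR processing_started_at < ?)")
    # Requeue stuck deliveries that still have attempts left.
    requeued = conn.execute(
        "UPDATE deliveries SET status = 'queued', processing_owner_id = NULL "
        "WHERE status = 'processing' AND processing_attempts < ? " + stuck,
        (MAX_ATTEMPTS, now, now - HARD_TIMEOUT_S),
    ).rowcount
    # Permanently fail stuck deliveries that are out of attempts.
    failed = conn.execute(
        "UPDATE deliveries SET status = 'failed' "
        "WHERE status = 'processing' AND processing_attempts >= ? " + stuck,
        (MAX_ATTEMPTS, now, now - HARD_TIMEOUT_S),
    ).rowcount
    conn.commit()
    return requeued, failed

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE deliveries (
        id INTEGER PRIMARY KEY, status TEXT,
        processing_owner_id TEXT, processing_started_at INTEGER,
        processing_lease_expires_at INTEGER, processing_attempts INTEGER
    )"""
)
now = 10_000
conn.executemany("INSERT INTO deliveries VALUES (?, ?, ?, ?, ?, ?)", [
    (1, "processing", "w1", now - 100, now - 10, 1),   # expired -> requeue
    (2, "processing", "w2", now - 100, now + 500, 1),  # healthy -> untouched
    (3, "processing", "w3", now - 100, now - 10, 3),   # expired, no attempts left
])
print(sweep_stuck_deliveries(conn, now))  # (1, 1)
```

Running this once at orchestrator start (as the log notes) means a crashed worker's deliveries are reclaimed no later than the next orchestrator wakeup.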

Phase 5 Complete! Orchestrator API:

  • run_orchestrator(db) - Router-only: doesn’t claim, just triggers workers
  • count_gpu_work() / count_cpu_work() - Check for work needing GPU/CPU
  • expire_stale_deliveries() - TTL cleanup for 7-day-old deliveries
  • Tests: M8, M9, M10, M23 all passing
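The router-only shape can be sketched as below: the orchestrator never claims deliveries itself, it only counts pending work and pokes the matching worker. The classification rule (a queued delivery with no transcript needs the GPU transcriber, otherwise the CPU summarizer) is an assumption for illustration, not the project's actual query.

```python
import sqlite3

def count_gpu_work(conn):
    # Assumed rule: no transcript yet means GPU transcription is needed.
    return conn.execute(
        "SELECT count(*) FROM deliveries "
        "WHERE status = 'queued' AND transcript IS NULL"
    ).fetchone()[0]

def count_cpu_work(conn):
    return conn.execute(
        "SELECT count(*) FROM deliveries "
        "WHERE status = 'queued' AND transcript IS NOT NULL"
    ).fetchone()[0]

def run_orchestrator(conn, trigger_gpu, trigger_cpu):
    """Router only: checks for work and triggers workers, claims nothing."""
    triggered = []
    if count_gpu_work(conn):
        trigger_gpu()
        triggered.append("gpu")
    if count_cpu_work(conn):
        trigger_cpu()
        triggered.append("cpu")
    return triggered

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE deliveries (id INTEGER PRIMARY KEY, status TEXT, transcript TEXT)")
conn.executemany(
    "INSERT INTO deliveries (status, transcript) VALUES (?, ?)",
    [("queued", None), ("queued", "hello world"), ("done", "x")],
)
print(run_orchestrator(conn, lambda: None, lambda: None))  # ['gpu', 'cpu']
```

Because the router mutates nothing, running it twice is harmless; the workers themselves claim deliveries under their leases.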

User model has a tier field - but it uses "free" and "premium" values (not "paid" as in the plan), so the tests need to use "premium" instead of "paid".

Existing multi-stage Dockerfile - The project already has a sophisticated Dockerfile with api, jobs, and gpu-base stages. The plan suggests separate files, but extending the existing pattern is cleaner.