Bhavana AI

AI/ML insights

Dev Log: January 6, 2026

podcast-summarizer-v2

Built out the distributed job processing infrastructure for the podcast summarizer. Started by discovering that the existing BatchState table and Delivery model already provided most of the schema foundations needed, then initialized alembic and created migrations for lease columns, an expired status, and batch-state seeding. Implemented a LeaseManager with atomic acquire/heartbeat/release operations, a CAS-based orchestrator wakeup with throttling, a sweeper for stuck deliveries, and the orchestrator router itself. Later adapted the codebase to the actual user tier values ("premium" rather than the planned "paid") and extended the existing multi-stage Dockerfile rather than creating separate files.

Key architectural context discovered:

  1. BatchState table already exists with key, value_int, value_text columns - perfect for lease storage without schema changes
  2. Delivery model has validation_job_id for priority signaling - the foundation for validation priority is already in place
  3. No alembic migrations directory exists - will need to initialize alembic first

Phase 1 Complete! Schema foundation laid:

  • 3 migrations for lease columns, expired status, and batch_state seeding
  • Delivery model now has processing_owner_id, processing_started_at, processing_lease_expires_at, processing_attempts
  • batch_state rows seeded for orchestrator and transcriber leases
  • Total: 3/33 tasks done (9%)
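A minimal sketch of the seeded state described above, using SQLite for illustration (the real migrations run through alembic). The column roles here are an assumption: `value_text` holds a lease owner and `value_int` a unix-timestamp expiry; the lease key names are hypothetical.

```python
import sqlite3

# Sketch of the batch_state schema the log describes: key, value_int,
# value_text. SQLite in-memory stands in for the real database.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE batch_state (
        key        TEXT PRIMARY KEY,  -- lease name, e.g. 'orchestrator_lease'
        value_int  INTEGER,           -- lease expiry as a unix timestamp
        value_text TEXT               -- current owner_id, NULL when unheld
    )"""
)
# Seed one row per lease so later atomic UPDATEs never need an INSERT path.
for key in ("orchestrator_lease", "transcriber_lease"):
    conn.execute("INSERT INTO batch_state (key) VALUES (?)", (key,))
conn.commit()
print(conn.execute("SELECT count(*) FROM batch_state").fetchone()[0])  # 2
```

Pre-seeding the rows is what lets every later lease operation be a single guarded `UPDATE` rather than an upsert.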

Clock-skew tradeoff noted:

  • Plan specified SYSUTCDATETIME() for production DB time
  • Implementation uses Python time.time() for SQLite compatibility
  • Acceptable because: 30-min TTL >> typical Azure clock drift (~ms)
  • Could be enhanced later with dialect detection if needed
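The possible dialect-detection enhancement could look like the sketch below. The helper name and the mixed return type are hypothetical; the idea is simply to prefer the server clock (`SYSUTCDATETIME()` on SQL Server) and fall back to the app clock on SQLite, where the 30-minute TTL dwarfs any realistic drift.

```python
import time

def db_now(dialect_name):
    """Hypothetical sketch: pick a 'now' source per SQL dialect.

    Returns the SQL expression for server-side UTC time on SQL Server,
    or a client-side unix timestamp (int) for SQLite and others.
    """
    if dialect_name == "mssql":
        return "SYSUTCDATETIME()"  # database's own clock, no app drift
    return int(time.time())        # app clock; fine when TTL >> drift
```

The branch would be chosen once from the SQLAlchemy engine's dialect name, so the lease code itself stays dialect-agnostic.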

Phase 2 Complete! LeaseManager API:

  • acquire(key, owner_id) - Atomic acquisition with single-holder guard (M1, M4)
  • heartbeat(key, owner_id) - Extend lease during long operations
  • release(key, owner_id) - Explicit lease release
  • All support concurrent access with atomic UPDATE (M27)
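The LeaseManager semantics can be sketched with raw SQLite as below. The column use is an assumption carried over from the schema notes (`value_text` = owner, `value_int` = expiry); a lease is free when it has no owner or its expiry has passed, and each operation is a single guarded `UPDATE` whose rowcount says whether the caller won.

```python
import sqlite3
import time

LEASE_TTL = 30 * 60  # seconds; matches the 30-min TTL noted above

def acquire(conn, key, owner_id, now=None):
    now = int(now if now is not None else time.time())
    cur = conn.execute(
        "UPDATE batch_state SET value_text = ?, value_int = ? "
        "WHERE key = ? AND (value_text IS NULL OR value_int < ?)",
        (owner_id, now + LEASE_TTL, key, now),
    )
    conn.commit()
    return cur.rowcount == 1  # single-holder guard: only one UPDATE can win

def heartbeat(conn, key, owner_id, now=None):
    now = int(now if now is not None else time.time())
    cur = conn.execute(
        "UPDATE batch_state SET value_int = ? WHERE key = ? AND value_text = ?",
        (now + LEASE_TTL, key, owner_id),
    )
    conn.commit()
    return cur.rowcount == 1  # fails if someone else holds the lease

def release(conn, key, owner_id):
    cur = conn.execute(
        "UPDATE batch_state SET value_text = NULL, value_int = NULL "
        "WHERE key = ? AND value_text = ?",
        (key, owner_id),
    )
    conn.commit()
    return cur.rowcount == 1

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE batch_state (key TEXT PRIMARY KEY, value_int INTEGER, value_text TEXT)")
conn.execute("INSERT INTO batch_state (key) VALUES ('orchestrator_lease')")

assert acquire(conn, "orchestrator_lease", "worker-a")
assert not acquire(conn, "orchestrator_lease", "worker-b")  # already held
assert heartbeat(conn, "orchestrator_lease", "worker-a")
assert release(conn, "orchestrator_lease", "worker-a")
assert acquire(conn, "orchestrator_lease", "worker-b")      # free again
```

Because the guard lives in the `WHERE` clause, concurrent callers serialize on the row lock and exactly one sees `rowcount == 1`; no read-then-write race is possible.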

Phase 3 Complete! Wakeup CAS API:

  • trigger_orchestrator_if_idle(db, request_id) - Atomic CAS trigger
  • 60-second throttle window prevents rapid re-triggers
  • Rollback on failure enables retry (M5, M6, M7, M28)
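The CAS trigger above can be sketched the same way: one `batch_state` row (the key name `orchestrator_wakeup` is hypothetical) stores the last trigger time, and an `UPDATE` guarded by "older than the throttle window" is the compare-and-swap.

```python
import sqlite3
import time

THROTTLE_S = 60  # throttle window from the log

def trigger_orchestrator_if_idle(conn, request_id, now=None):
    now = int(now if now is not None else time.time())
    try:
        cur = conn.execute(
            "UPDATE batch_state SET value_int = ?, value_text = ? "
            "WHERE key = 'orchestrator_wakeup' "
            "AND (value_int IS NULL OR value_int <= ?)",
            (now, request_id, now - THROTTLE_S),
        )
        conn.commit()
        return cur.rowcount == 1  # True: we won the CAS, wake the orchestrator
    except sqlite3.Error:
        conn.rollback()  # rollback on failure so a later call can retry
        return False

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE batch_state (key TEXT PRIMARY KEY, value_int INTEGER, value_text TEXT)")
conn.execute("INSERT INTO batch_state (key) VALUES ('orchestrator_wakeup')")

t0 = 1_000_000
assert trigger_orchestrator_if_idle(conn, "req-1", now=t0)           # fires
assert not trigger_orchestrator_if_idle(conn, "req-2", now=t0 + 5)   # throttled
assert trigger_orchestrator_if_idle(conn, "req-3", now=t0 + 61)      # window passed
```

Storing the request id alongside the timestamp gives a cheap audit trail of which request last woke the orchestrator.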

Phase 4 Complete! Sweeper API:

  • sweep_stuck_deliveries(db) - Requeues expired deliveries, or fails them once max attempts are reached
  • Hard timeout (60min) catches misconfigured heartbeats
  • Sweeper runs at orchestrator start for self-healing (M19, M20)
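A sketch of the sweep semantics over the Phase 1 lease columns. The status names (`processing`, `queued`, `failed`) and the attempt cap are assumptions for illustration; the 60-minute hard timeout on `processing_started_at` is the backstop for workers whose heartbeats are misconfigured.

```python
import sqlite3

MAX_ATTEMPTS = 3           # assumption; the real cap isn't stated in the log
HARD_TIMEOUT_S = 60 * 60   # 60-min hard timeout from the log

def sweep_stuck_deliveries(conn, now):
    stuck = ("AND (processing_lease_expires_at < ? "
             "OR processing_started_at < ?)")
    # Requeue stuck deliveries that still have attempts left.
    requeued = conn.execute(
        "UPDATE deliveries SET status = 'queued', processing_owner_id = NULL "
        "WHERE status = 'processing' AND processing_attempts < ? " + stuck,
        (MAX_ATTEMPTS, now, now - HARD_TIMEOUT_S),
    ).rowcount
    # Permanently fail stuck deliveries that are out of attempts.
    failed = conn.execute(
        "UPDATE deliveries SET status = 'failed' "
        "WHERE status = 'processing' AND processing_attempts >= ? " + stuck,
        (MAX_ATTEMPTS, now, now - HARD_TIMEOUT_S),
    ).rowcount
    conn.commit()
    return requeued, failed

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE deliveries (
        id INTEGER PRIMARY KEY, status TEXT,
        processing_owner_id TEXT, processing_started_at INTEGER,
        processing_lease_expires_at INTEGER, processing_attempts INTEGER
    )"""
)
now = 10_000
conn.executemany("INSERT INTO deliveries VALUES (?, ?, ?, ?, ?, ?)", [
    (1, "processing", "w1", now - 100, now - 10, 1),   # expired -> requeue
    (2, "processing", "w2", now - 100, now + 500, 1),  # healthy -> untouched
    (3, "processing", "w3", now - 100, now - 10, 3),   # expired, no attempts left
])
print(sweep_stuck_deliveries(conn, now))  # (1, 1)
```

Running this once at orchestrator start (as the log notes) means a crashed worker's deliveries are reclaimed no later than the next orchestrator wakeup.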

Phase 5 Complete! Orchestrator API:

  • run_orchestrator(db) - Router-only: doesn’t claim, just triggers workers
  • count_gpu_work() / count_cpu_work() - Check for work needing GPU/CPU
  • expire_stale_deliveries() - TTL cleanup for 7-day-old deliveries
  • Tests: M8, M9, M10, M23 all passing
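The router-only shape can be sketched as below: the orchestrator never claims deliveries itself, it only counts pending work and pokes the matching worker. The classification rule (a queued delivery with no transcript needs the GPU transcriber, otherwise the CPU summarizer) is an assumption for illustration, not the project's actual query.

```python
import sqlite3

def count_gpu_work(conn):
    # Assumed rule: no transcript yet means GPU transcription is needed.
    return conn.execute(
        "SELECT count(*) FROM deliveries "
        "WHERE status = 'queued' AND transcript IS NULL"
    ).fetchone()[0]

def count_cpu_work(conn):
    return conn.execute(
        "SELECT count(*) FROM deliveries "
        "WHERE status = 'queued' AND transcript IS NOT NULL"
    ).fetchone()[0]

def run_orchestrator(conn, trigger_gpu, trigger_cpu):
    """Router only: checks for work and triggers workers, claims nothing."""
    triggered = []
    if count_gpu_work(conn):
        trigger_gpu()
        triggered.append("gpu")
    if count_cpu_work(conn):
        trigger_cpu()
        triggered.append("cpu")
    return triggered

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE deliveries (id INTEGER PRIMARY KEY, status TEXT, transcript TEXT)")
conn.executemany(
    "INSERT INTO deliveries (status, transcript) VALUES (?, ?)",
    [("queued", None), ("queued", "hello world"), ("done", "x")],
)
print(run_orchestrator(conn, lambda: None, lambda: None))  # ['gpu', 'cpu']
```

Because the router mutates nothing, running it twice is harmless; the workers themselves claim deliveries under their leases.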

User model has a tier field - but it uses "free" and "premium" values (not "paid" as in the plan), so the tests need to use "premium" instead of "paid".

Existing multi-stage Dockerfile - The project already has a sophisticated Dockerfile with api, jobs, and gpu-base stages. The plan suggests separate files, but extending the existing pattern is cleaner.