Dev Log: January 3, 2026
podcast-summarizer-v2
Shipped a significant reliability overhaul today. The core change was moving audio transcription validation from the API container (where it caused CPU-blocking timeouts) to the GPU-enabled batch processor via an event-driven pattern. Also added a stuck job reaper that runs every 6 hours in the batch controller, failing any jobs stuck for more than an hour. The new stages_json field on Delivery now tracks which stage (transcription, summarization, or email) failed, making debugging much easier.
What this PR accomplished:
- Event-driven validation - Instead of CPU-blocking transcription in the API container (which caused timeouts), validation now triggers the GPU-enabled batch processor and returns immediately
- Stuck job recovery - Added a reaper that runs every 6h in the batch controller to fail jobs stuck >1 hour
- Stage tracking - stages_json on Delivery enables debugging which stage (transcription/summarization/email) failed
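To make the reaper and stage tracking concrete, here is a minimal sketch of how the two could fit together. Only the one-hour threshold, the three stage names, and the stages_json field come from this log. The Delivery dataclass, the status values, the stage-to-state shape of stages_json, and the helper names are assumptions for illustration, not the project's actual code.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

STUCK_AFTER = timedelta(hours=1)  # threshold from the log: stuck > 1 hour
STAGES = ("transcription", "summarization", "email")  # pipeline order


@dataclass
class Delivery:
    """Hypothetical stand-in for the real Delivery model."""
    id: int
    status: str              # assumed values: "processing", "done", "failed"
    updated_at: datetime
    stages_json: str = "{}"  # assumed shape: {"<stage>": "done" | "failed"}


def mark_stage(delivery, stage, state, now):
    """Record one stage's state inside stages_json."""
    stages = json.loads(delivery.stages_json)
    stages[stage] = state
    delivery.stages_json = json.dumps(stages)
    delivery.updated_at = now


def failed_stage(delivery):
    """Return the first stage marked failed, or None if none failed."""
    stages = json.loads(delivery.stages_json)
    return next((s for s in STAGES if stages.get(s) == "failed"), None)


def reap_stuck(deliveries, now):
    """Fail deliveries that have been processing longer than STUCK_AFTER."""
    reaped = []
    for d in deliveries:
        if d.status != "processing" or now - d.updated_at <= STUCK_AFTER:
            continue
        stages = json.loads(d.stages_json)
        # The stage the job was stuck in is the first one not marked done.
        stuck = next((s for s in STAGES if stages.get(s) != "done"), None)
        if stuck is not None:
            mark_stage(d, stuck, "failed", now)
        d.status = "failed"
        d.updated_at = now
        reaped.append(d.id)
    return reaped
```

In this sketch the batch controller would call reap_stuck on its 6-hour schedule, and failed_stage is what a debugging view would read to see where a delivery died.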
The deployment pipeline for this PR took about 5 minutes end to end, including Docker image builds and updating all four Azure Container Apps resources.
Why this deployment took ~5 minutes:
- Build Images (~3 min) - Docker builds the API and processor images with all dependencies (faster-whisper, torch, etc.)
- Deploy to Production (~2 min) - Azure Container Apps updates 4 resources: API service + 3 scheduled jobs (discovery poller, batch controller, batch processor)
- This PR also included 2 new database migrations that run on API startup
Hit a post-deploy issue where the API service couldn’t trigger batch processor jobs because it was missing the Azure SDK environment variables. The batch controller already had them, but the API service was newly calling trigger_batch_processor_immediate() as of this PR.
What went wrong: PR #42 introduced code in the API that calls trigger_batch_processor_immediate(), which uses Azure’s Container Apps SDK. The SDK requires AZURE_SUBSCRIPTION_ID and AZURE_RESOURCE_GROUP to construct the resource path for the job trigger API. These were already configured on the batch controller (which also triggers jobs), but we missed adding them to the API service.
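Pattern to remember: When a service needs to interact with Azure resources programmatically (not just receive/return data), it needs the full SDK configuration triplet: AZURE_SUBSCRIPTION_ID and AZURE_RESOURCE_GROUP to address the resource, plus AZURE_CLIENT_ID, which the Azure Identity library reads to select the identity it authenticates as.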
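One way to catch this class of mistake earlier is a fail-fast check at service startup. The sketch below is a hypothetical addition, not code from the PR. The three variable names come from this log, and the resource path follows the standard ARM layout for Container Apps jobs. The function names and the "batch-processor" job name are made up for illustration.

```python
import os

# Variable names are from the log; the check itself is an assumed addition.
REQUIRED_AZURE_VARS = (
    "AZURE_SUBSCRIPTION_ID",
    "AZURE_RESOURCE_GROUP",
    "AZURE_CLIENT_ID",
)


def missing_azure_vars(env=None):
    """Return the required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_AZURE_VARS if not env.get(name)]


def job_resource_id(job_name, env=None):
    """Build the ARM resource ID for a Container Apps job, or fail loudly."""
    env = os.environ if env is None else env
    missing = missing_azure_vars(env)
    if missing:
        raise RuntimeError("Missing Azure settings: " + ", ".join(missing))
    return (
        f"/subscriptions/{env['AZURE_SUBSCRIPTION_ID']}"
        f"/resourceGroups/{env['AZURE_RESOURCE_GROUP']}"
        f"/providers/Microsoft.App/jobs/{job_name}"
    )


if __name__ == "__main__":
    # "batch-processor" is a placeholder job name.
    print(job_resource_id("batch-processor"))
```

Calling missing_azure_vars() during API startup would turn a missing variable into an immediate, readable startup error instead of a failure the first time a job trigger is attempted.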