Dev Log: December 27, 2025

podcast-summarizer-v2

Major refactor day for the podcast summarizer’s job orchestration. Replaced the controller’s subprocess-based az CLI calls with the native Azure SDK (azure-mgmt-appcontainers), eliminating a brittle dependency and trimming roughly 500MB from the container image. The work followed a strict TDD approach: wrote provider tests first, implemented the provider, then wrote controller integration tests, then refactored the controller to use the new provider. Finished with Bicep infrastructure changes to pass subscription and resource group config to the controller job and grant its managed identity permission to start Container Apps jobs.

The GPU failover system uses separate Container App jobs because Azure requires different workload profiles for GPU vs CPU. A single job can’t dynamically switch between GPU and CPU resources - so we create variants and the controller picks which one to start.

Controller vs Processor Jobs:

Controller (job-controller-podsum-prod) - Checks batch thresholds, picks GPU/CPU mode, starts the appropriate processor job
Processor variants - Do the actual transcription work on specific hardware

The controller is the “brain” that decides which processor to invoke. Running the processor directly bypasses GPU selection logic.
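A minimal sketch of the kind of decision the controller makes, to show why skipping it matters. The processor job names, the threshold value, and the exact rule are hypothetical; the log only says that the controller checks batch thresholds and picks GPU or CPU mode.

```python
from typing import Optional

# Hypothetical processor job names; only the controller's name appears in this log.
GPU_JOB = "job-processor-gpu-podsum-prod"
CPU_JOB = "job-processor-cpu-podsum-prod"


def pick_processor_job(
    pending_episodes: int,
    gpu_available: bool,
    gpu_batch_threshold: int = 5,  # assumed value, for illustration only
) -> Optional[str]:
    """Return the processor job to start, or None when there is nothing to do."""
    if pending_episodes <= 0:
        return None
    # Assumed rule: use the GPU only when the batch is big enough to justify it
    # and a GPU is actually available; otherwise fall back to the CPU variant.
    if gpu_available and pending_episodes >= gpu_batch_threshold:
        return GPU_JOB
    return CPU_JOB
```

Starting a processor variant by hand skips this function entirely, which is what "bypasses GPU selection logic" means in practice.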

Why this happened: The controller was designed to shell out to the az CLI, but the Docker image was built without it, so the subprocess call had nothing to execute. This is a common trade-off: containers should be self-contained, but bundling CLI tools adds ~500MB+ to the image size.

Better approaches:

Azure SDK (azure-mgmt-appcontainers) - Python native, uses managed identity
Azure REST API - Direct HTTP calls with managed identity token
Add az to Dockerfile - Quick fix but bloats image

Why Azure SDK > CLI subprocess:

No CLI dependency - Container image stays slim (~500MB saved)
Managed Identity native - DefaultAzureCredential handles auth automatically
Better error handling - Structured exceptions vs parsing stderr
Type safety - SDK has typed models vs string manipulation
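A rough sketch of what the SDK path can look like, not the project's actual provider. The function names and the optional client parameter are assumptions for illustration; ContainerAppsAPIClient, DefaultAzureCredential, and jobs.begin_start() are the real SDK entry points.

```python
from typing import Any, Optional


def build_client(subscription_id: str) -> Any:
    """Create the SDK client; DefaultAzureCredential picks up the managed identity."""
    # Imported lazily so the sketch can be loaded without the Azure packages installed.
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.appcontainers import ContainerAppsAPIClient

    return ContainerAppsAPIClient(DefaultAzureCredential(), subscription_id)


def start_job(
    resource_group: str,
    job_name: str,
    subscription_id: str,
    client: Optional[Any] = None,  # injectable for tests (assumed design choice)
) -> Any:
    """Start a Container Apps job and wait for the start operation to complete."""
    if client is None:
        client = build_client(subscription_id)
    # begin_start returns a poller for the long-running start operation.
    poller = client.jobs.begin_start(
        resource_group_name=resource_group,
        job_name=job_name,
    )
    return poller.result()
```

Failures surface as typed exceptions from the SDK rather than as a non-zero exit code plus stderr text, and accepting the client as a parameter lets tests substitute a fake without touching Azure.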

The code reviewer raised a valid point about the version constraint on azure-mgmt-appcontainers. However, the open-ended >=3.1.0 is intentional here:

Azure SDKs generally maintain backward compatibility for core operations like jobs.begin_start()
The import and functionality work with v4.0.0
Overly restrictive constraints cause dependency resolution issues

For this project, the permissive constraint is acceptable. Moving forward.

The reviewer flagged weak assertions in test_start_job_with_env_vars. However, looking at the implementation plan (Task 4), env var overrides are intentionally deferred - the code will just log them with a TODO. The test is appropriate for the current scope.

For TDD, tests define the interface. The actual env var pass-through can be enhanced later when the feature is fully implemented.
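To make the deferred scope concrete, here is a sketch of how a log-only env var path and its matching test might look. The function signature, logger name, and argument values are assumptions; only the test name and the "log with a TODO" behaviour come from the plan.

```python
import logging
from typing import Any, Dict, Optional
from unittest.mock import MagicMock

logger = logging.getLogger("provider_sketch")  # hypothetical logger name


def start_job(
    client: Any,
    resource_group: str,
    job_name: str,
    env_vars: Optional[Dict[str, str]] = None,
) -> Any:
    """Start the job; env var overrides are accepted but not applied yet."""
    if env_vars:
        # TODO: pass overrides through once the feature is implemented.
        # Only the keys are logged, since values may be sensitive.
        logger.info("Env var overrides requested but not applied: %s", sorted(env_vars))
    return client.jobs.begin_start(resource_group, job_name)


def test_start_job_with_env_vars() -> None:
    client = MagicMock()
    start_job(client, "rg-demo", "job-demo", env_vars={"MODE": "gpu"})
    # The assertion is deliberately narrow: the job is started, and nothing
    # is claimed about overrides reaching the execution.
    client.jobs.begin_start.assert_called_once_with("rg-demo", "job-demo")
```

When pass-through is implemented, the assertion can be tightened to check the override payload without changing the interface the test already pins down.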

Progress so far:

Tasks 1-4 complete: SDK dependency, config, provider tests, provider implementation
Provider is ready to be used by the controller
Now we write tests for the controller refactor (TDD), then implement

The reviewer raised concerns, but these are expected for TDD:

Tests patch src.jobs.controller.get_container_apps_provider - this is correct because unittest.mock.patch must target the namespace where the name is looked up (the controller module, after its import), not the module where the function is defined
Tests fail because implementation isn’t done - that’s the TDD point
Task 6 will make these tests pass by adding the import and refactoring

The patch path is correct for the intended implementation.
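The "patch where it is looked up" rule can be shown with a small self-contained demo. The two in-memory modules below (demo_provider and demo_controller) are hypothetical stand-ins for the real provider and controller modules, built at runtime so the snippet runs on its own.

```python
import sys
import types
from unittest.mock import patch

# Stand-in for the module that defines the factory function.
provider_mod = types.ModuleType("demo_provider")
exec(
    "def get_container_apps_provider():\n"
    "    return 'real provider'\n",
    provider_mod.__dict__,
)
sys.modules["demo_provider"] = provider_mod

# Stand-in for the controller: it imports the name into its own namespace.
controller_mod = types.ModuleType("demo_controller")
exec(
    "from demo_provider import get_container_apps_provider\n"
    "def run():\n"
    "    return get_container_apps_provider()\n",
    controller_mod.__dict__,
)
sys.modules["demo_controller"] = controller_mod


def call_with_patch(target: str) -> str:
    """Run the controller while `target` is patched to return a fake."""
    with patch(target, return_value="fake provider"):
        return controller_mod.run()


# Patching the defining module leaves the controller's own reference untouched.
print(call_with_patch("demo_provider.get_container_apps_provider"))
# Patching the controller's namespace replaces the name it actually calls.
print(call_with_patch("demo_controller.get_container_apps_provider"))
```

The first call still reaches the real function because the controller bound its own reference at import time; only the second patch target affects what run() calls.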

Excellent progress! Tasks 1-6 complete:

SDK dependency added
Configuration ready
Provider implemented with tests
Controller refactored - no more subprocess calls

Now we need infrastructure changes (Bicep) to:

Pass AZURE_SUBSCRIPTION_ID and AZURE_RESOURCE_GROUP env vars to controller job
Grant managed identity permission to start Container Apps jobs
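On the Python side, the controller has to read those two variables at startup. A sketch of how that could look, with the dataclass and function names being hypothetical; only the two environment variable names come from the plan.

```python
import os
from dataclasses import dataclass
from typing import Mapping

REQUIRED_VARS = ("AZURE_SUBSCRIPTION_ID", "AZURE_RESOURCE_GROUP")


@dataclass(frozen=True)
class AzureJobConfig:
    subscription_id: str
    resource_group: str


def load_config(env: Mapping[str, str] = os.environ) -> AzureJobConfig:
    """Read the values the Bicep template injects; fail fast if any is missing."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(
            "Missing required environment variables: " + ", ".join(missing)
        )
    return AzureJobConfig(
        subscription_id=env["AZURE_SUBSCRIPTION_ID"],
        resource_group=env["AZURE_RESOURCE_GROUP"],
    )
```

Failing at startup with the variable names in the message makes a misconfigured deployment obvious in the job logs, instead of surfacing later as an opaque SDK error.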

The reviewer raised a valid security point about Contributor role being overly permissive. This was acknowledged in the plan as a known trade-off:

“Note: Using Contributor is broad. For production, consider a custom role with only Microsoft.App/jobs/start/action”

Pragmatic decision:

Contributor at resource group scope is acceptable for initial deployment
Custom role refinement can be a follow-up task
The managed identity's role assignment is scoped only to this resource group (not subscription-wide)

Proceeding - this is a documented trade-off, not an oversight.