Dev Log: January 8, 2026
CS244 Project Research
Spent the evening surveying past CS244 student projects and the current research landscape for GPU scheduling and LLM serving systems. The goal was to find a project idea that sits at the intersection of my Azure Storage background and the emerging AI infrastructure space, which is surprisingly underexplored in student work despite being all over OSDI and NSDI in recent years. Dug into the Tiresias simulator ecosystem and public GPU cluster traces from Microsoft (Philly) and Alibaba (PAI) to see what’s feasible to build on top of.
CS244 Project Landscape:
- Past student projects lean heavily toward networking primitives (congestion control, data center topologies, consensus protocols)
- The AI/ML systems space is exploding in top venues - OSDI/NSDI 2024-2025 have numerous papers on LLM serving, GPU scheduling, and multi-tenant inference
- This creates a real opportunity: the intersection of my Azure Storage expertise and emerging AI infrastructure problems is underexplored in student projects
CS244 Project Pattern: Past projects follow a consistent formula: (1) pick a well-regarded paper, (2) reproduce its key claims using emulation or simulation (Mininet, trace replay), (3) sometimes extend it with a new scenario. My project should follow this template; it's what the course expects and what tends to grade well.
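To make step (2) concrete, here is a minimal sketch of what a trace-driven reproduction harness looks like. Everything here is illustrative: the `Job` fields are my own toy schema, not the actual Philly or PAI trace format, and the scheduler is plain FIFO with no preemption.

```python
import heapq
from dataclasses import dataclass

@dataclass
class Job:
    job_id: str
    arrival: float   # submission time (seconds)
    duration: float  # runtime once scheduled (seconds)
    gpus: int        # GPUs requested (assumed <= cluster size)

def replay_fifo(jobs, cluster_gpus):
    """Replay a job trace through a FIFO scheduler; return mean job
    completion time (JCT). Simplified event loop: jobs run to completion
    once started, and each job starts, in arrival order, as soon as
    enough GPUs are free."""
    jobs = sorted(jobs, key=lambda j: j.arrival)
    free = cluster_gpus
    running = []            # min-heap of (finish_time, gpus_held)
    now = 0.0
    jct_total = 0.0
    for job in jobs:
        now = max(now, job.arrival)
        # Drain finished jobs until enough GPUs are free for this one.
        while free < job.gpus:
            finish, g = heapq.heappop(running)
            now = max(now, finish)
            free += g
        free -= job.gpus
        heapq.heappush(running, (now + job.duration, job.gpus))
        jct_total += (now + job.duration) - job.arrival
    return jct_total / len(jobs)
```

The point of a harness like this is that swapping the scheduling policy (step 3's "new scenario") only touches the dispatch loop, while the trace replay and JCT accounting stay fixed for an apples-to-apples comparison.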
GPU Scheduler Research Ecosystem:
- Established simulators exist - Tiresias has an open-source simulator on GitHub
- Two main public traces: Philly (Microsoft, 2017, training-focused) and Alibaba PAI (2020, training+inference)
- Evolution: Tiresias→Themis→Gavel→Shockwave shows progression from basic fairness to heterogeneity-awareness
- This means I can build on existing infrastructure rather than starting from scratch
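The fairness idea at the start of that lineage is easy to sketch. Tiresias itself uses a discretized, two-dimensional attained-service priority; the snippet below is only the one-dimensional least-attained-service (LAS) intuition on a single GPU, with names and structure of my own choosing.

```python
def las_schedule(durations, quantum=1.0):
    """Time-slice one GPU among jobs by least attained service: each
    quantum goes to the job that has received the least GPU-time so far
    (ties broken by job id for determinism). durations maps job_id to
    remaining work; returns the order in which jobs complete."""
    remaining = dict(durations)
    attained = {job: 0.0 for job in remaining}
    order = []
    while remaining:
        job = min(remaining, key=lambda j: (attained[j], j))
        step = min(quantum, remaining[job])
        attained[job] += step
        remaining[job] -= step
        if remaining[job] <= 0:
            del remaining[job]
            order.append(job)
    return order
```

The appeal of LAS for GPU clusters is that it approximates shortest-job-first without knowing job durations in advance: short jobs accumulate little attained service and finish quickly, which is exactly the behavior the later fairness-focused schedulers refine.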