Spent the day profiling and then implementing GPU sharing for the Gavel cluster scheduler. Started by discovering that the FGD placement algorithm dominates runtime (node rebuilding is essentially free), then traced a mismatch between the fragmentation metric and the actual workload. The real work was implementing fractional GPU packing at the worker assignment level, gating everything behind an enable_gpu_sharing flag so A/B experiments stay deterministic. Initial results showed no JCT improvement because the LP still treated fractional jobs as full-GPU, so the second half of the day went into making the LP capacity constraint fractional-aware while keeping the objective coefficients unchanged.
The FGD profiling reveals a stark asymmetry: node rebuilding is essentially free, while the placement algorithm itself dominates entirely. This tells us that caching node state between rounds would save almost nothing -- the bottleneck is the O(jobs * nodes * fragmentation) computation inside `schedule_task()`.
This is a mismatch worth noting: the FGD fragmentation metric evaluates how well the cluster can serve a workload with 26% fractional-GPU jobs, but no such jobs actually exist in our simulation. The fragmentation scores are measuring something slightly different from what matters. This could mean FGD is making suboptimal placement decisions -- optimizing for a workload shape that doesn't match reality. A simpler placement heuristic tuned for whole-GPU jobs might perform just as well.
This means **26% of jobs waste GPU capacity** -- a 0.25 GPU job takes a full GPU slot but only uses 25% of its compute. The only recovery is through job pairing, which requires Gavel to find compatible fractional jobs and schedule them together. With ~1,700 active jobs and 26% fractional, there are ~440 fractional jobs that could potentially be paired. Whether they actually get paired depends on Gavel's pairing logic and the LP allocation.
This also explains why utilization is only ~40% at steady state — a significant fraction of “used” GPUs are running fractional jobs at reduced throughput.
**Why this works without changing the LP:** The MaxMinFairness LP computes time-fraction allocations per worker type, but the `_schedule_jobs_on_workers_helper` is the actual capacity gatekeeper. By changing the helper to deduct `gpu_request` (0.25, 0.5) instead of `scale_factor` (1), we allow more fractional jobs to be scheduled per round without touching the LP. The LP will be slightly conservative (treating fractional jobs as full-GPU), but correct.
Why no dispatch loop changes: the simulation processes rounds synchronously -- all jobs complete before the next round starts (`assert(len(running_jobs) == 0)`). Worker capacity therefore resets naturally each round; no cross-round tracking is needed.
This plan implements GPU sharing at the **worker assignment level** rather than the LP level. The key insight is that Gavel already tracks `gpu_request` on jobs (0.25, 0.5, or 1.0), and throughput scaling already works (`base *= gpu_request` at line 2757). What's missing is the capacity accounting -- currently a 0.25-GPU job monopolizes an entire GPU worker slot. By changing `assigned_worker_ids` from a binary set to a fractional capacity dict, multiple fractional jobs can pack onto one physical GPU.
The flag is gated at every change point so that `enable_gpu_sharing=False` (default) produces byte-identical behavior. This is critical for A/B experiments -- any behavioral change must be explicitly opt-in, and integration tests must remain deterministic.
The helper decides *which* jobs run this round by checking if they fit within remaining capacity. Currently it deducts integer `scale_factor` (always 1 for fractional jobs). With sharing, we deduct the actual `gpu_request` (0.25 or 0.5), so `num_workers_left` can go from 1.0 -> 0.75 -> 0.5 -> 0.25 -> 0.0, packing four 0.25-GPU jobs into one slot. The epsilon tolerance (`1e-9`) prevents floating-point drift from blocking valid placements.
The `< 1e-9` threshold is safe for both paths. When sharing is disabled, `num_workers_left` reaches exactly 0 (integer subtraction), and `0 < 1e-9` is `True`. When sharing is enabled, the current requests (0.25, 0.5, 1.0) are all powers of two, so subtraction is exact in binary floating point -- `1.0 - 0.25 - 0.25 - 0.25 - 0.25` is exactly `0.0`. The epsilon future-proofs against non-dyadic fractions: ten 0.1-GPU requests, for example, would leave a residue near 1e-16 rather than reaching exactly 0.
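A minimal sketch of the admission loop described above; the function name and signature are hypothetical stand-ins, not Gavel's actual `_schedule_jobs_on_workers_helper`:

```python
EPS = 1e-9

def try_assign(jobs, num_workers_left, enable_gpu_sharing=True):
    """Greedily admit jobs while fractional capacity remains.

    jobs: list of (job_id, gpu_request) pairs for one worker slot.
    """
    scheduled = []
    for job_id, gpu_request in jobs:
        # Without sharing, every job deducts a whole slot (scale_factor = 1).
        demand = gpu_request if enable_gpu_sharing else 1.0
        if num_workers_left - demand < -EPS:
            continue  # does not fit this round
        num_workers_left -= demand
        scheduled.append(job_id)
        if num_workers_left < EPS:
            break  # slot exhausted; epsilon absorbs any float residue
    return scheduled, num_workers_left
```

Four 0.25-GPU jobs walk `num_workers_left` down 1.0 -> 0.75 -> 0.5 -> 0.25 -> 0.0, exactly the packing sequence described above.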
The verification block checks that each worker is assigned to exactly 1 job. With GPU sharing, a worker can host multiple fractional jobs (e.g., two 0.5-GPU jobs). Instead of counting assignments, we track total GPU capacity used per worker and verify it doesn't exceed 1.0 (with epsilon tolerance).
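A sketch of the relaxed check, assuming assignments arrive as `(worker_id, gpu_request)` pairs (the data shape is illustrative):

```python
EPS = 1e-9

def verify_worker_capacity(assignments):
    """Sum fractional usage per worker and verify no GPU is oversubscribed."""
    used = {}
    for worker_id, gpu_request in assignments:
        used[worker_id] = used.get(worker_id, 0.0) + gpu_request
    for worker_id, total in used.items():
        assert total <= 1.0 + EPS, f"worker {worker_id} over capacity: {total}"
    return used
```

Two 0.5-GPU jobs on one worker now pass verification; a 0.75 + 0.5 pairing correctly trips the assertion.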
Three critical changes in FGD placement for sharing:
1. **Node building**: GPU capacity becomes `max(0.0, 1.0 - used)` instead of binary 0/1. This tells FGD how much room remains on each GPU slot.
2. **Task creation**: Uses actual `gpu_request` (e.g., 0.25) instead of `scale_factor` (always 1 for fractional jobs). FGD then picks a partially-used slot.
3. **Capacity tracking**: After placement, updates the capacity dict rather than adding to a set. The pre-filter changes from `== 0.0` to `< 1e-9` to handle float comparison.
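The three changes can be sketched together; the dict layout and helper names are assumptions for illustration, not FGD's actual structures:

```python
EPS = 1e-9

def build_node_capacity(used_per_gpu):
    # Change 1: fractional free capacity per GPU slot instead of binary 0/1.
    return {gpu: max(0.0, 1.0 - used) for gpu, used in used_per_gpu.items()}

def place_task(capacity, gpu_request):
    # Change 2: place by the actual fractional request; scanning slots in
    # ascending free-capacity order prefers a partially-used slot that fits.
    for gpu, free in sorted(capacity.items(), key=lambda kv: kv[1]):
        # Change 3: `free < EPS` (not `== 0.0`) treats float residue as full.
        if free < EPS:
            continue
        if free + EPS >= gpu_request:
            capacity[gpu] = free - gpu_request  # update the dict, not a set
            return gpu
    return None
```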
Worker time accounting measures GPU utilization. A 0.25-GPU job running for 600s on a worker should only count as 150s of "worker time" (since the other 75% of the GPU is available for other work). Without this scaling, utilization metrics would be inflated -- the scheduler would think the GPU is 100% utilized when it's only 25% used by this job. Note that `_job_time_so_far` is intentionally NOT scaled -- it tracks wall-clock time the job ran, and the job's throughput already reflects its GPU fraction.
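A minimal sketch of the accounting split; the two counters are hypothetical stand-ins for the scheduler's internal dicts:

```python
def account_round(job_time, worker_time, job_id, duration_s, gpu_request):
    # Wall-clock time the job ran: intentionally NOT scaled; the job's
    # throughput already reflects its GPU fraction.
    job_time[job_id] = job_time.get(job_id, 0.0) + duration_s
    # GPU-seconds actually consumed: scaled by the fractional request.
    worker_time[job_id] = worker_time.get(job_id, 0.0) + duration_s * gpu_request
```

A 0.25-GPU job running 600s thus accrues 600s of job time but only 150s of worker time, matching the example above.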
The design keeps all sharing logic behind the `enable_gpu_sharing` gate, so the non-sharing path is byte-identical to before (proven by the integration tests passing deterministically). This is important for A/B experiments: you can run the same Alibaba workload with and without packing and attribute any JCT/utilization difference purely to the sharing feature. The next step is creating experiment configs with `"enable_gpu_sharing": true` and running on FarmShare to measure the impact.
The A/B design uses the same seed, cluster, and load (60 jph) for both arms. The only variable is `enable_gpu_sharing`. Using 3 seeds (0, 1, 2) gives statistical confidence -- if sharing consistently improves JCT across all seeds, the effect is robust. We use FGD placement mode since it was the best-performing config from prior profiling (0.64s/round).
The fact that experiment 3 (sharing) has no "GPUs left unused" warnings is actually a promising early signal. In the baseline, fractional jobs (0.25/0.5 GPU) each monopolize a full GPU, leaving capacity stranded. With sharing enabled, those jobs pack together, which may eliminate the stranded capacity that triggers unused-GPU warnings.
**The JCT is essentially unchanged** -- sharing made virtually no difference (~0.003% improvement). This is surprising but explainable:
The LP bottleneck: The MaxMinFairness LP still uses scale_factor=1 for fractional jobs, so the allocation decision is unchanged. Sharing only affects worker assignment (placement), not how much compute each job is allocated.
Fragmentation nearly doubled (34 -> 61). This is because the FGD fragmentation metric sees partially-filled GPU slots as “fragmented” — a GPU with 0.5 capacity remaining is harder to pack than a fully free GPU. The fragmentation score went up even though packing is working, because the metric penalizes partial slots.
Same completed job count confirms no bugs — sharing didn’t break anything, it just didn’t change the scheduling decisions enough to matter at this load level.
To see real JCT improvement from sharing, the LP itself needs to be fractional-aware — allocating gpu_request instead of scale_factor in the MaxMinFairness policy. That’s the follow-up work mentioned in the plan (“LP optimization is a follow-up”).
This plan fixes a subtle mismatch between the LP formulation and GPU sharing semantics. The key insight is that `scale_factor` (used for distributed training parallelism) and `gpu_request` (the fractional GPU demand) serve different purposes. The LP's capacity constraint should reflect how much physical GPU capacity a job consumes (gpu_request), while the objective coefficients should reflect how much throughput a job gets per unit of time-share (scale_factor). Conflating the two meant fractional jobs appeared to consume full GPU slots, nullifying the sharing benefit.
The `scale_factors_array` helper in `policy.py:37-42` is a generic dict-to-matrix builder -- it takes any `{job_id: value}` dict, not just scale factors. We can reuse it to build the `capacity_array` from `gpu_demands` without any changes to the base class. This is why the plan says "pass a different array" rather than modifying the constraint builder.
Notice that the three solve methods (`_solve_standard`, `_solve_penalty_fresh`, `_solve_dpp`) all receive `scale_factors_array` as a parameter name, but they only use it for the capacity constraint (via `get_base_constraints` or `cache['sf'].value`). By passing `capacity_array` instead, we change the capacity semantics without touching the solve method internals at all. The `coefficients` array (used in the objective) is computed separately and still uses the original `scale_factors_array`. This clean separation is what makes the change safe -- the objective and constraint use independent arrays.
The fix is surgical: only the **capacity constraint** (`sum(capacity * x) <= num_workers`) uses the new `capacity_array`, while the **objective coefficients** (`throughputs * priority_weights * scale_factors`) remain unchanged. This works because:
Capacity: A 0.25-GPU job should consume 0.25 GPU slots, not 1.0. The LP can now fit 4 fractional jobs per GPU.
Objective: Throughput was already scaled by gpu_request at scheduler.py:2800 (base *= gpu_request), so using scale_factor=1 in coefficients is correct — it means “this job runs on 1 worker and gets gpu_request-fraction of the throughput.”
The gpu_demands=None default means all existing code paths (integration tests, non-sharing experiments) are completely unaffected.
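The constraint/objective split can be illustrated with a toy version of the dict-to-matrix broadcast described for `scale_factors_array` (the helper below mimics it; shapes and values are assumptions):

```python
import numpy as np

def dict_to_matrix(values, job_ids, num_worker_types):
    # Broadcast a per-job scalar across worker-type columns.
    return np.array([[values[j]] * num_worker_types for j in job_ids])

job_ids = ["j0", "j1", "j2"]
scale_factors = {"j0": 1.0, "j1": 1.0, "j2": 1.0}
gpu_demands = {"j0": 0.25, "j1": 0.5, "j2": 1.0}

# Capacity constraint uses fractional demand ...
capacity_array = dict_to_matrix(gpu_demands, job_ids, 2)
# ... while the objective coefficients keep the original scale factors.
sf_array = dict_to_matrix(scale_factors, job_ids, 2)

# If every job were allocated full-time on one worker type, the fractional
# jobs would now consume 1.75 GPU slots instead of 3.
x = np.ones((3, 2))
per_type_usage = (capacity_array * x).sum(axis=0)  # [1.75, 1.75]
```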
short-projects
Built out a King County real estate heatmap, then extended it to include Snohomish County. The day started with getting the data pipeline working (fixing encoding issues, discovering ArcGIS had renumbered its layers), then moved to the React/Leaflet frontend with canvas-rendered markers for 22K+ sales. The afternoon was spent enriching the data with building characteristics (beds, sqft, year built) from the Assessor’s residential building file, adding interactive filters, and switching the color scale from absolute price to a relative-to-median approach. Finally, expanded the pipeline to pull Snohomish County sales data and unified the two counties on one map.
Let me first check what's changed in the repo and search my memory for any previous work on the King County project before diving in.
The architecture is a static-data approach: Python scripts produce a `sales_data.json` file that gets bundled with the React app. No backend server needed at runtime. This is a common pattern for data journalism projects where the dataset is fixed at build time.
The pipeline has three stages that must run sequentially: (1) download + filter sales CSV, (2) geocode parcels via ArcGIS batched queries, (3) join and output JSON. The bottleneck will be step 2 -- ~15K unique parcels at 100 per batch is ~150 batches, and with a 0.5s delay between batches that's ~75 seconds minimum, plus API response time. Also, `process_data.py` imports `numpy` but it's not in `requirements.txt` -- I'll fix that.
Classic government data issue -- the King County CSV contains non-UTF-8 characters (likely Windows-1252/latin-1 encoding, common in older municipal data systems). The `0xC3` byte at position 86747 is a dead giveaway. We need to specify `encoding='latin-1'` when reading the CSV, which is a superset of ASCII that handles the extended characters these systems often produce.
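The failure mode and fix in miniature; the byte value comes from the error message, the surrounding text is made up:

```python
raw = b"Caf\xc3 Street"  # 0xC3 followed by a space is invalid UTF-8

try:
    raw.decode("utf-8")
except UnicodeDecodeError:
    # latin-1 maps every byte 0x00-0xFF to a character, so it never fails.
    text = raw.decode("latin-1")

# For the pipeline itself, the same idea is pd.read_csv(path, encoding="latin-1").
```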
The script currently includes all property types (commercial, land, etc.) -- not just residential. PropertyType `11` is likely "Residential" (17,579 of 29,170 sales). However, for a housing heatmap, including all types gives a more complete picture of the market. The design doc mentions "residential only" but the script doesn't actually filter by PropertyType. I'll keep it as-is for now since we want broad coverage, and we can refine later.
The ArcGIS FeatureServer is returning `Invalid Layer or Table ID: 0` -- this means the service endpoint has been restructured since the script was written. ArcGIS REST services organize data in numbered layers, and layer 0 is no longer valid at this URL. I need to discover the correct service URL by querying the service root to see what layers are available.
ArcGIS REST services frequently reorganize their layer IDs without changing the service URL. This is a common gotcha when working with government GIS APIs -- the layer 0 that existed when the script was written has been renumbered to 439. Always query the service root (`/FeatureServer?f=json`) to discover current layer IDs rather than hardcoding them.
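A sketch of the discovery step; the fetch itself would be something like `requests.get(service_url, params={"f": "json"})`, so here we just parse a stub of the service-root response (field names follow the ArcGIS REST format):

```python
def list_layers(service_json):
    """Map layer/table id -> name from a FeatureServer root response."""
    entries = service_json.get("layers", []) + service_json.get("tables", [])
    return {e["id"]: e.get("name", "") for e in entries}

# Stub of what the service root might return after the renumbering.
stub = {"layers": [{"id": 439, "name": "Parcels"}], "tables": []}
```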
The real data paints an interesting picture of King County's housing market:
- **22,603 sales** mapped over the last 12 months
- **Median price: $950K** -- significantly above the national median, reflecting Seattle metro's high costs
- **The 80th percentile is $1.675M** -- meaning 1 in 5 homes sold for over $1.67M
- **Prices range from $50K to $10M** -- the full spectrum from affordable condos to luxury waterfront properties
- The 1.7 MB JSON file is manageable for a static bundle but will benefit from Leaflet's canvas rendering with 22K markers
The production build is 351KB JS (107KB gzipped) which includes React, Leaflet, and React-Leaflet. That's reasonable for a map-heavy app. The 1.7MB `sales_data.json` file will be served alongside it. With `preferCanvas: true` on the MapContainer, Leaflet will render all 22K circle markers to a single HTML5 Canvas element instead of creating individual SVG/DOM elements, which is essential for performance at this scale.
The blog post follows the same structure as the India air quality post: lead with a striking finding, present the data, show the visualization, explain the methodology, then close with sources. Key stylistic choices: no em-dashes (per CLAUDE.md rules), narrative voice, concrete numbers rather than vague claims, and a methodology section that documents the exact filtering and processing decisions so the analysis is reproducible.
The King County Assessor provides a separate `Residential Building` file that can be joined to sales on Major+Minor (the parcel ID). This is the same dataset that powers the famous "King County House Prices" Kaggle dataset. The join would let us add filters like bedrooms, square footage, year built, and condition, turning the map from a simple price visualization into an interactive property explorer. The data is 50 columns wide, so there's plenty to work with.
The data pipeline follows a clean ETL pattern: separate scripts for fetching raw data (`fetch_*.py`), a processing script that joins everything together (`process_data.py`), and a sample generator for frontend development. This separation means we can add a new data source (building characteristics) without touching the sales fetch logic. The PIN (parcel identification number) is the join key -- constructed by concatenating Major (6 chars) + Minor (4 chars) codes, both zero-padded.
For the building data join, we need to handle one-to-many relationships. A single parcel (PIN) can have multiple building records (e.g., main house + accessory dwelling unit). The plan says to keep the largest `SqFtTotLiving` per PIN, which is the right heuristic -- in residential sales, the buyer is primarily paying for the main structure. The `baths` computation (`full + 0.75*three_quarter + 0.5*half`) follows real estate convention where a full bath has all four fixtures, three-quarter has three, and half has two.
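A pandas sketch of the join-key and dedup logic; the column names follow the description above (Major/Minor, `SqFtTotLiving`, the bath-count columns) but should be checked against the actual file:

```python
import pandas as pd

bldg = pd.DataFrame({
    "Major": [123456, 123456], "Minor": [10, 10],
    "SqFtTotLiving": [2400, 600],  # main house + ADU on one parcel
    "BathFullCount": [2, 1], "Bath3qtrCount": [1, 0], "BathHalfCount": [1, 0],
})

# PIN = zero-padded Major (6 chars) + Minor (4 chars)
bldg["pin"] = (bldg["Major"].astype(str).str.zfill(6)
               + bldg["Minor"].astype(str).str.zfill(4))

# One-to-many: keep the largest structure per parcel
bldg = bldg.sort_values("SqFtTotLiving", ascending=False).drop_duplicates("pin")

# Real-estate convention: full + 0.75 * three-quarter + 0.5 * half
bldg["baths"] = (bldg["BathFullCount"]
                 + 0.75 * bldg["Bath3qtrCount"]
                 + 0.5 * bldg["BathHalfCount"])
```

The ADU row drops out, leaving one record per parcel keyed on a ten-character PIN.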
The FilterPanel uses a pattern where building-related filters (beds, sqft, year built) implicitly exclude sales without building data when activated. This is the right UX choice: if someone filters for "3 bedrooms", they clearly want to see houses with known bedroom counts. Sales with missing building data are only hidden when a building filter is active, not when only the price filter is used. The "All" button for bedrooms serves as the visual indicator that no bedroom filter is active.
The filtering logic in App.jsx uses `useMemo` to avoid recomputing on every render. With 22K records, filtering is essentially free (array iteration is sub-millisecond), but memoization prevents unnecessary re-renders of the Map component. The key design decision is that the color scale stays fixed on the full dataset's percentiles -- this means the legend remains stable as users filter, and filtered dots keep their "true" color relative to the whole market. If we recalculated percentiles on the filtered subset, colors would shift confusingly as filters changed.
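The fixed-breakpoint idea in miniature (the frontend does this in JS; the numbers here are illustrative):

```python
import numpy as np

all_prices = np.array([300_000, 500_000, 950_000, 1_675_000, 4_000_000])
# Breakpoints computed once on the FULL dataset...
breaks = np.percentile(all_prices, [20, 40, 60, 80])

def color_bucket(price):
    # ...so a sale keeps its bucket (and color) no matter how the
    # visible subset is filtered.
    return int(np.searchsorted(breaks, price))
```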
The column name mismatch (`NbrBedrooms` vs `Bedrooms`) is a common pitfall with government open data -- column names can vary between documentation and actual files, or change between releases. The `available = [c for c in keep_cols if c in bldg.columns]` pattern in the code was meant to handle missing columns gracefully, but the rename dictionary still referenced the wrong name, causing the resulting DataFrame to lack `beds`. When working with external CSVs, always inspect actual column names before coding against them.
CSS `:hover` can't be done with inline styles in React, so we use `onMouseEnter`/`onMouseLeave` with state. The trick is to put the hover zone on a wrapper that includes both the trigger and the dropdown -- this way the panel stays open while the user moves their mouse down into the sliders. Without this, the dropdown would close the instant you leave the trigger text.
Switching from single-select to multiselect means changing the data model from `beds: null | number` to `beds: number[]` (empty array = "All"). The toggle behavior on click adds/removes from the array. This ripples into the filter function in App.jsx where the bedroom check changes from `===` to `.includes()`.
This is a fundamental shift in what the visualization communicates. The current absolute-price scale mostly just shows geography (waterfront/Seattle = red, rural = green). A relative-to-median scale answers a much more interesting question: "given these characteristics, was this sale a deal or a premium?" Using `log2(price/median)` for the mapping gives a symmetric, perceptually balanced scale -- a home at 2x median looks as "extreme" as one at 0.5x median. The log transform also handles the right-skewed nature of price distributions gracefully.
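A sketch of the mapping (the clamp range is an assumption for illustration):

```python
import math

def relative_score(price, median):
    """log2(price/median): 0 at median, +1 at 2x, -1 at half -- symmetric."""
    return max(-2.0, min(2.0, math.log2(price / median)))
```

A $1.9M sale against a $950K median scores +1.0 and a $475K sale scores -1.0 -- equally "extreme" in either direction, which is the perceptual balance described above.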
The existing KC pipeline has a three-stage architecture: (1) fetch sales data, (2) geocode parcels via ArcGIS, (3) join + export JSON. Snohomish follows the same pattern but with a key simplification -- the Snohomish sales Excel already includes building characteristics (beds, baths, sqft, year built), so we skip the separate building data join that KC requires with its `EXTR_ResBldg.csv`.
The `fetch_sales_snohomish.py` uses flexible column detection (searching for keywords like "sale"+"date", "parcel", etc.) rather than hardcoding column names. This is deliberate -- county data formats change between releases, and the Excel file's exact headers are unknown until we download it. The script will print all columns on first run so we can verify the mappings.
Key change in `process_data.py`: the `filterRanges` computation no longer depends on the `bldg` variable being non-None (which was KC-specific). Instead it checks `has_bldg = merged["sqft"].notna()` across the combined dataset. This works because Snohomish building data arrives via its sales CSV rather than a separate building file, so `sqft` is populated for both counties after concatenation.
The map center shifted from `[47.5, -122.2]` (King County centroid) to `[47.65, -122.2]` and zoom dropped from 10 to 9. This accommodates Snohomish County extending to ~48.3 latitude. Zoom level 9 in Leaflet covers roughly a 50-mile span, which nicely fits both counties (lat range 47.0 to 48.35) in a single viewport.
The Snohomish Excel uses `Prop_Class` numeric codes where 1xx = residential (111 = SFR, 122 = duplex, etc.), 2xx = commercial, 3xx = vacant land. This is different from King County which uses text labels in `PropertyType`. The `sheet_name="AllSales"` parameter is critical -- the default first sheet is just a disclaimer page. Also note: Snohomish has no "baths" column, so that field will be absent from Snohomish sales in the final JSON.
The two counties' latitude ranges barely overlap at ~47.78, which visually creates a natural boundary on the map. The `county` field in each sale object lets the tooltip show which county the property is in. Note that Snohomish sales lack a `baths` field (the Excel file doesn't include it), so the tooltip will show beds but not baths for Snohomish properties -- a graceful degradation since the template already had null-checks.
The current `RangeSlider` uses two `<input type="range">` sliders -- one for min, one for max. With a range like $50K-$10M, a single pixel drag can jump by thousands of dollars, making precision impossible. Replacing these with number inputs that commit on blur/Enter gives the user direct control. The key UX pattern: let users type freely, only update the filter state when they finish editing (blur or Enter), so partial typing like "500" doesn't immediately filter before they finish typing "500000".
The `RangeInputs` component uses a "local state + commit on blur/Enter" pattern. Each input maintains its own `localMin`/`localMax` string state so the user can freely type (even invalid intermediate text like "$50" while heading to "$500,000"). The filter only updates when the user finishes editing (blur or Enter key). The `useEffect` sync ensures the local text resets properly when the parent triggers a "Reset All". Clamping logic prevents min from exceeding max and vice versa.
bhavanaai
The subtitle appeared in two places because Astro's frontmatter (lines 1-9 between `---`) handles data/logic, while the template below renders the HTML. The `description` prop serves double duty here: it's both the SEO meta description fallback and the visual subtitle, so both needed updating to stay consistent.