[multi-vector] Detect L1/L2 cache sizes for tile budget #1134
Open
wuw92 wants to merge 2 commits into
…tile budget

Wires runtime L1d/L2 detection into TileBudget::default() so the multi-vector tile planner sizes A/B tiles against the host's actual cache geometry instead of hardcoded Skylake-X estimates (1.25 MB L2, 48 KB L1d from PR #863).

Detection lives in diskann-quantization, alongside the existing ISA-capability probe in isa.rs. Cache size is fundamentally a CPU/arch property: the OS API is a discovery mechanism, not the concept being captured. Putting the module here mirrors the existing diskann-wide / diskann-vector / diskann-quantization stack, which handles all arch dispatch internally without depending on diskann-platform.

Detection strategy follows what gemm-common / faer / OpenBLAS do: CPUID where available, OS API where required.

- x86_64 (any OS): CPUID via the `raw-cpuid` crate (one path)
- aarch64 Linux: sysfs (/sys/devices/system/cpu/cpu0/cache/...)
- aarch64 macOS: sysctl (hw.perflevel0.*, P-core L2 / cpusperl2)
- Anything else: CacheInfo::FALLBACK (32 KB L1d, 256 KB L2)

On Apple Silicon the per-cluster L2 is divided by cpusperl2 to give a per-core budget.

Windows-on-ARM falls back to the conservative defaults: CI doesn't cover that target and DiskANN production doesn't deploy there; dropping it removes a Win32 codepath and lets the crate avoid pulling windows-sys.

Equality between CPUID and Win32 GetLogicalProcessorInformationEx was verified on Windows x86_64 (32 KB L1d / 512 KB L2 on the test host) during development. Final commit removes the side-by-side test along with the temporary dependency on diskann-platform.

Closes #1062.
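The dispatch described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the field names, the `probe` and `cache_info` functions, and the use of `OnceLock` for memoization are assumptions, and every per-target probe is stubbed to return `None` so the sketch builds without `raw-cpuid` or `libc`. Only `CacheInfo::FALLBACK` and its 32 KB / 256 KB values come from the description.

```rust
use std::sync::OnceLock;

/// Per-core cache sizes in bytes (field names are illustrative).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct CacheInfo {
    pub l1d: usize,
    pub l2: usize,
}

impl CacheInfo {
    /// Conservative defaults named in the description: 32 KB L1d, 256 KB L2.
    pub const FALLBACK: CacheInfo = CacheInfo { l1d: 32 * 1024, l2: 256 * 1024 };
}

// Exactly one `probe` is compiled per target. Each is a stub here; the PR
// fills them with CPUID, sysfs, and sysctl queries respectively.
#[cfg(target_arch = "x86_64")]
fn probe() -> Option<CacheInfo> {
    None // CPUID via `raw-cpuid` in the PR
}

#[cfg(all(target_arch = "aarch64", target_os = "linux"))]
fn probe() -> Option<CacheInfo> {
    None // sysfs in the PR
}

#[cfg(all(target_arch = "aarch64", target_os = "macos"))]
fn probe() -> Option<CacheInfo> {
    None // sysctl hw.perflevel0.* in the PR
}

#[cfg(not(any(
    target_arch = "x86_64",
    all(target_arch = "aarch64", any(target_os = "linux", target_os = "macos"))
)))]
fn probe() -> Option<CacheInfo> {
    None // e.g. Windows-on-ARM: no probe, always fall back
}

/// Hypothetical entry point: probes once, then returns the memoized value.
pub fn cache_info() -> CacheInfo {
    static CACHE: OnceLock<CacheInfo> = OnceLock::new();
    *CACHE.get_or_init(|| probe().unwrap_or(CacheInfo::FALLBACK))
}
```

Keeping the fallback in one place means a failed probe on a supported target and an unsupported target behave identically from the caller's point of view.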
Contributor
Pull request overview
This PR updates the multi-vector distance tiling planner in diskann-quantization to derive its tile budgets from runtime-detected L1d/L2 cache sizes (with per-arch fallbacks), replacing the prior hardcoded Skylake-X assumptions.
Changes:
- Added a memoized cache-size probe (`L1d`, `L2`) with x86_64 CPUID detection and aarch64 Linux/macOS OS-specific probes, plus a conservative fallback.
- Updated `TileBudget::default()` to use detected cache sizes when computing L1/L2-derived budgets for tile planning.
- Introduced target-specific dependencies (`raw-cpuid` on x86_64; `libc` on aarch64 macOS) to support probing.
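To make the effect of the second change concrete, here is a small sketch of deriving budgets from cache sizes. The `TileBudget` shown is a hypothetical stand-in, not the crate's type: its fields, the `from_cache` constructor, and the 1/2 and 3/4 headroom fractions are all assumptions chosen only for illustration.

```rust
/// Per-core cache sizes in bytes (illustrative stand-in).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct CacheInfo {
    pub l1d: usize,
    pub l2: usize,
}

/// Hypothetical stand-in for the planner's budget type.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct TileBudget {
    pub l2_bytes: usize,
    pub l1_bytes: usize,
}

impl TileBudget {
    /// Assumed policy: spend half of L2 and three quarters of L1d on tiles,
    /// leaving the rest for other resident data. The real fractions may differ.
    pub fn from_cache(cache: CacheInfo) -> Self {
        TileBudget {
            l2_bytes: cache.l2 / 2,
            l1_bytes: cache.l1d * 3 / 4,
        }
    }
}

/// The previously hardcoded sizes: 1.25 MB L2, 48 KB L1d.
pub const HARDCODED: CacheInfo = CacheInfo { l1d: 48 * 1024, l2: 1280 * 1024 };
```

Under these assumed fractions, a host with 32 KB L1d and 512 KB L2 gets an L2 budget of 262144 bytes, where the hardcoded sizes would have produced 655360 bytes, more than that host's whole L2.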
Reviewed changes
Copilot reviewed 7 out of 8 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| diskann-quantization/src/multi_vector/distance/mod.rs | Wires in the new cache submodule. |
| diskann-quantization/src/multi_vector/distance/kernels/mod.rs | Switches TileBudget::default() to runtime-detected cache sizes. |
| diskann-quantization/src/multi_vector/distance/cache/mod.rs | Adds memoized cache probing API and basic plausibility/memoization tests. |
| diskann-quantization/src/multi_vector/distance/cache/cpuid.rs | Implements x86_64 cache detection using raw-cpuid. |
| diskann-quantization/src/multi_vector/distance/cache/linux.rs | Implements aarch64 Linux detection via sysfs cache entries. |
| diskann-quantization/src/multi_vector/distance/cache/macos.rs | Implements aarch64 macOS detection via sysctl hw.perflevel0.*. |
| diskann-quantization/Cargo.toml | Adds target-specific dependencies for the detection paths. |
| Cargo.lock | Records the new dependency resolutions. |
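For the sysfs path listed for linux.rs, a probe could look roughly like the sketch below. It is an assumption about the approach, not the file's contents: `read_cache_sizes` is a hypothetical name, and the directory is passed in as a parameter so the walk can be exercised against a temporary directory. It relies on the standard sysfs layout, where each `index*` directory holds `level`, `type`, and `size` files.

```rust
use std::fs;
use std::path::Path;

/// Returns the raw size strings (e.g. "32K") for the L1 data cache and the
/// L2 cache found under a sysfs-style directory such as
/// /sys/devices/system/cpu/cpu0/cache.
pub fn read_cache_sizes(cache_dir: &Path) -> (Option<String>, Option<String>) {
    let (mut l1d, mut l2) = (None, None);
    let Ok(entries) = fs::read_dir(cache_dir) else {
        return (None, None);
    };
    for entry in entries.flatten() {
        let dir = entry.path();
        // Only index0, index1, ... describe caches; skip anything else.
        let is_index = dir
            .file_name()
            .and_then(|name| name.to_str())
            .map_or(false, |name| name.starts_with("index"));
        if !is_index {
            continue;
        }
        // sysfs values end with a newline, so trim before comparing.
        let read = |name: &str| {
            fs::read_to_string(dir.join(name))
                .ok()
                .map(|text| text.trim().to_string())
        };
        let (Some(level), Some(kind), Some(size)) =
            (read("level"), read("type"), read("size"))
        else {
            continue;
        };
        match (level.as_str(), kind.as_str()) {
            // Instruction caches are skipped on purpose.
            ("1", "Data") | ("1", "Unified") => l1d = Some(size),
            ("2", "Data") | ("2", "Unified") => l2 = Some(size),
            _ => {}
        }
    }
    (l1d, l2)
}
```

Returning the raw strings keeps the directory walk separate from suffix parsing, so each part can be tested on its own.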
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #1134 +/- ##
==========================================
+ Coverage 89.40% 89.41% +0.01%
==========================================
Files 485 487 +2
Lines 92079 92126 +47
==========================================
+ Hits 82324 82376 +52
+ Misses 9755 9750 -5
- cpuid.rs: reword the module doc so the CPUID 0x4 / 0x8000001D leaf selection is attributed to raw-cpuid's cache-parameter enumeration. The vendor dispatch is internal to the crate, not visible at our call site, which the original wording obscured.
- linux.rs: parse_size uses checked_mul for the K/M/G suffixes so an oversized sysfs value returns None instead of silently wrapping in release builds, where overflow checks are off. Add a regression test.
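The overflow fix in the second bullet can be illustrated with a sketch along these lines. The name `parse_size` comes from the commit message, but the signature and body here are assumptions, not the code in linux.rs.

```rust
/// Parses a sysfs cache size such as "32K" or "1M" into bytes.
/// Returns None for malformed input or when the result overflows usize.
pub fn parse_size(text: &str) -> Option<usize> {
    let text = text.trim();
    // Split off an optional single-letter binary suffix.
    let (digits, multiplier) = match *text.as_bytes().last()? {
        b'K' => (&text[..text.len() - 1], 1usize << 10),
        b'M' => (&text[..text.len() - 1], 1usize << 20),
        b'G' => (&text[..text.len() - 1], 1usize << 30),
        _ => (text, 1),
    };
    // checked_mul yields None on overflow; a plain `*` would wrap silently
    // in release builds, where overflow checks are off by default.
    digits.parse::<usize>().ok()?.checked_mul(multiplier)
}
```

A regression test in this style only needs an input whose numeric part parses but whose product does not fit, for example `usize::MAX` followed by `K`.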
Why
`TileBudget::default()` was hardcoded to Skylake-X estimates (L2 = 1.25 MB, L1d = 48 KB, from PR #863), so the tile planner mis-sized tiles on every other microarch. This detects the host's real L1d/L2 and feeds it to `TileBudget`.
Detection
Arch-first: CPUID where it exists, an OS API only where it doesn't.
| Target | Detection |
|---|---|
| x86_64 (any OS) | CPUID via `raw-cpuid` |
| aarch64 Linux | sysfs (/sys/devices/system/cpu/cpu0/cache/...) |
| aarch64 macOS | sysctl `hw.perflevel0.*`, L2 ÷ `cpusperl2` |
| Anything else | `CacheInfo::FALLBACK` (32 KB / 256 KB) |

No `windows-sys` dependency. Only Windows-on-ARM falls back.
`cpusperl2`: Apple Silicon shares one L2 per P-core cluster, so the raw value over-budgets L2 ~4×.
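The `cpusperl2` adjustment is a single division, sketched below. The function name and signature are hypothetical, and the numbers in the usage note are illustrative rather than measured on any particular chip.

```rust
/// Converts a per-cluster L2 size into a per-core budget.
/// Returns None when the reported core count is zero, so the caller can
/// fall back to a default instead of dividing by zero.
pub fn per_core_l2(cluster_l2_bytes: usize, cpus_per_l2: usize) -> Option<usize> {
    cluster_l2_bytes.checked_div(cpus_per_l2)
}
```

For example, a 16 MiB cluster L2 shared by 4 cores gives `per_core_l2(16 << 20, 4) == Some(4 << 20)`. Using the undivided 16 MiB would budget four times what each core can count on when all cores in the cluster are busy.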
Benchmark
The multi-vector benchmark on my devbox (Windows x86_64): detected `L1d = 32 KB, L2 = 512 KB` vs the hardcoded `48 KB / 1.25 MB`. Values are ns per inner product, min of 50 measurements, AVX-512 jobs excluded (host is V3). `(Q, D, Dim)` = (queries, docs, dim).
f32: neutral, all shapes within ±2 %.
f16: improves on 8 of 9 shapes, the last neutral.
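The "min of N measurements" statistic can be computed with a harness like the sketch below. This is not the repository's benchmark code: `min_ns_per_call` and its parameters are hypothetical, and it times an arbitrary closure rather than the multi-vector kernels.

```rust
use std::hint::black_box;
use std::time::Instant;

/// Times `kernel` in `samples` batches of `calls_per_sample` calls and
/// returns the smallest observed ns-per-call across batches.
pub fn min_ns_per_call<F: FnMut() -> f32>(
    mut kernel: F,
    calls_per_sample: u32,
    samples: u32,
) -> f64 {
    let calls = calls_per_sample.max(1); // avoid dividing by zero
    let mut best = f64::INFINITY;
    for _ in 0..samples {
        let start = Instant::now();
        for _ in 0..calls {
            // black_box keeps the optimizer from discarding the result.
            black_box(kernel());
        }
        let ns = start.elapsed().as_nanos() as f64 / f64::from(calls);
        best = best.min(ns);
    }
    best
}
```

Taking the minimum rather than the mean filters out samples inflated by scheduler noise or interrupts, which suits comparisons where the expected difference is a few percent.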