[multi-vector] Detect L1/L2 cache sizes for tile budget#1134

Open
wuw92 wants to merge 2 commits into main from users/wuw92/cache-size-detection
Conversation

Contributor

@wuw92 wuw92 commented Jun 5, 2026

Why

TileBudget::default() was hardcoded to L2 = 1.25 MB and L1d = 48 KB (from PR #863), so the tile planner mis-sized tiles on any microarchitecture whose caches differ from those values. This PR detects the host's real L1d/L2 sizes and feeds them to TileBudget.

Detection

Arch-first — CPUID where it exists, an OS API only where it doesn't:

| Arch / OS | Mechanism |
| --- | --- |
| x86_64 (Windows, Linux, macOS) | CPUID via raw-cpuid |
| aarch64 Linux | sysfs |
| aarch64 macOS | sysctl hw.perflevel0.*, L2 ÷ cpusperl2 |
| anything else | CacheInfo::FALLBACK (32 KB / 256 KB) |

  • Windows is covered by the x86_64 CPUID path — no Win32 codepath, no
    windows-sys dependency. Only Windows-on-ARM falls back.
  • macOS divides the cluster L2 by cpusperl2: Apple Silicon shares one L2 per
    P-core cluster, so the raw value over-budgets L2 ~4×.
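
The per-core division described in the macOS bullet can be sketched as below. This is a minimal illustration, not the PR's code: `per_core_l2` is a hypothetical helper name, and the two inputs are assumed to come from the `hw.perflevel0.*` sysctl values mentioned above.

```rust
/// Hypothetical helper: turn a cluster-wide L2 size into a per-core budget.
///
/// `cluster_l2_bytes` is assumed to be the raw L2 size reported for the
/// P-core cluster, and `cpus_per_l2` the number of cores sharing that L2.
/// Returns `None` for zero inputs so a caller can fall back to defaults.
fn per_core_l2(cluster_l2_bytes: usize, cpus_per_l2: usize) -> Option<usize> {
    if cluster_l2_bytes == 0 || cpus_per_l2 == 0 {
        return None;
    }
    Some(cluster_l2_bytes / cpus_per_l2)
}
```

With illustrative inputs of a 16 MiB cluster L2 shared by 4 cores, this yields a 4 MiB per-core budget, which is the ~4× reduction the bullet refers to.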

Benchmark

Multi-vector benchmark on my devbox (Windows x86_64), where detection reports
L1d = 32 KB and L2 = 512 KB versus the hardcoded 48 KB / 1.25 MB. Values are
ns per inner product, minimum of 50 measurements; AVX-512 jobs are excluded
(host is V3). (Q, D, Dim) = (queries, docs, dim).

f32 — neutral, all shapes within ±2 %:

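| Q | D | Dim | Hardcoded (ns) | Detected (ns) | Δ |
| --- | --- | --- | --- | --- | --- |
| 8 | 32 | 128 | 5.281 | 5.227 | −1.0 % |
| 16 | 64 | 256 | 5.137 | 5.176 | +0.8 % |
| 32 | 128 | 384 | 7.654 | 7.690 | +0.5 % |
| 32 | 16 | 256 | 5.244 | 5.322 | +1.5 % |
| 64 | 32 | 264 | 5.361 | 5.293 | −1.3 % |
| 32 | 1250 | 128 | 2.632 | 2.618 | −0.5 % |
| 64 | 1250 | 512 | 10.438 | 10.600 | +1.6 % |
| 64 | 32 | 128 | 2.539 | 2.537 | −0.1 % |
| 32 | 32 | 512 | 10.273 | 10.293 | +0.2 % |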

f16 — improves on 8 of 9 shapes, the last neutral:

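| Q | D | Dim | Hardcoded (ns) | Detected (ns) | Δ |
| --- | --- | --- | --- | --- | --- |
| 8 | 32 | 128 | 41.844 | 8.164 | −80.5 % |
| 16 | 64 | 256 | 14.863 | 6.523 | −56.1 % |
| 32 | 16 | 256 | 23.730 | 8.398 | −64.6 % |
| 64 | 32 | 128 | 7.507 | 3.374 | −55.1 % |
| 32 | 32 | 512 | 22.324 | 13.320 | −40.3 % |
| 64 | 32 | 264 | 10.166 | 6.611 | −35.0 % |
| 32 | 128 | 384 | 10.669 | 9.180 | −14.0 % |
| 32 | 1250 | 128 | 3.093 | 2.735 | −11.6 % |
| 64 | 1250 | 512 | 10.800 | 10.825 | +0.2 % |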
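
For readers checking the Δ columns, the arithmetic is assumed to be a plain relative change of the detected time against the hardcoded time, with each time being the minimum over the measurement runs. The helper names below are illustrative and not taken from the benchmark harness.

```rust
/// Relative change in percent; negative means the detected budget is faster.
fn delta_percent(hardcoded_ns: f64, detected_ns: f64) -> f64 {
    (detected_ns - hardcoded_ns) / hardcoded_ns * 100.0
}

/// Minimum of a set of timing samples, or `None` if there are no samples.
fn min_of(samples: &[f64]) -> Option<f64> {
    samples.iter().copied().reduce(f64::min)
}
```

For the first f16 row, `delta_percent(41.844, 8.164)` evaluates to roughly −80.5, matching the table.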

…tile budget

Wires runtime L1d/L2 detection into TileBudget::default() so the
multi-vector tile planner sizes A/B tiles against the host's actual
cache geometry instead of hardcoded Skylake-X estimates (1.25 MB L2,
48 KB L1d from PR #863).

Detection lives in diskann-quantization, alongside the existing
ISA-capability probe in isa.rs. Cache size is fundamentally a CPU/arch
property: the OS API is a discovery mechanism, not the concept being
captured. Putting the module here mirrors the existing
diskann-wide / diskann-vector / diskann-quantization stack, which
handles all arch dispatch internally without depending on
diskann-platform.

Detection strategy follows what gemm-common / faer / OpenBLAS do:
CPUID where available, OS API where required.

- x86_64 (any OS):     CPUID via the `raw-cpuid` crate (one path)
- aarch64 Linux:       sysfs (/sys/devices/system/cpu/cpu0/cache/...)
- aarch64 macOS:       sysctl (hw.perflevel0.*, P-core L2 / cpusperl2)
- Anything else:       CacheInfo::FALLBACK (32 KB L1d, 256 KB L2)

On Apple Silicon the per-cluster L2 is divided by cpusperl2 to give a
per-core budget. Windows-on-ARM falls back to the conservative
defaults: CI doesn't cover that target and DiskANN production doesn't
deploy there; dropping it removes a Win32 codepath and lets the crate
avoid pulling windows-sys.
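
The dispatch described above might be organized roughly as follows. This is a sketch under assumptions: only `CacheInfo::FALLBACK` and its 32 KB / 256 KB values come from this description; the field names, `probe`, and `detect` are hypothetical, and the per-platform bodies are placeholders that return `None` rather than real CPUID, sysfs, or sysctl calls.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct CacheInfo {
    pub l1d: usize, // bytes (field names are assumptions)
    pub l2: usize,  // bytes
}

impl CacheInfo {
    /// Conservative defaults named in the description: 32 KB L1d, 256 KB L2.
    pub const FALLBACK: CacheInfo = CacheInfo { l1d: 32 * 1024, l2: 256 * 1024 };
}

// Exactly one `probe` is compiled per target. Each body is a placeholder.
#[cfg(target_arch = "x86_64")]
fn probe() -> Option<CacheInfo> {
    None // placeholder: a CPUID-based probe would go here
}

#[cfg(all(target_arch = "aarch64", target_os = "linux"))]
fn probe() -> Option<CacheInfo> {
    None // placeholder: a sysfs-based probe would go here
}

#[cfg(all(target_arch = "aarch64", target_os = "macos"))]
fn probe() -> Option<CacheInfo> {
    None // placeholder: a sysctl-based probe would go here
}

#[cfg(not(any(
    target_arch = "x86_64",
    all(target_arch = "aarch64", any(target_os = "linux", target_os = "macos"))
)))]
fn probe() -> Option<CacheInfo> {
    None // e.g. Windows-on-ARM: no probe, always fall back
}

/// A failed or missing probe degrades to the fallback instead of erroring.
pub fn detect() -> CacheInfo {
    probe().unwrap_or(CacheInfo::FALLBACK)
}
```

Selecting the probe with `cfg` at compile time keeps each target's dependencies out of the others' builds, which is consistent with the stated goal of not pulling windows-sys.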

Equality between CPUID and Win32 GetLogicalProcessorInformationEx was
verified on Windows x86_64 (32 KB L1d / 512 KB L2 on the test host)
during development. Final commit removes the side-by-side test along
with the temporary dependency on diskann-platform.

Closes #1062.
@wuw92 wuw92 requested review from a team and Copilot June 5, 2026 03:54
@wuw92 wuw92 changed the title [diskann-quantization] Detect L1/L2 cache sizes for tile budget [multi-vector] Detect L1/L2 cache sizes for tile budget Jun 5, 2026
Contributor

Copilot AI left a comment

Pull request overview

This PR updates the multi-vector distance tiling planner in diskann-quantization to derive its tile budgets from runtime-detected L1d/L2 cache sizes (with per-arch fallbacks), replacing the prior hardcoded Skylake-X assumptions.

Changes:

  • Added a memoized cache-size probe (L1d, L2) with x86_64 CPUID detection and aarch64 Linux/macOS OS-specific probes, plus a conservative fallback.
  • Updated TileBudget::default() to use detected cache sizes when computing L1/L2-derived budgets for tile planning.
  • Introduced target-specific dependencies (raw-cpuid on x86_64; libc on aarch64 macOS) to support probing.
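
The memoization mentioned in the first bullet could look like the sketch below, which assumes `std::sync::OnceLock` as the caching primitive; the PR may use a different one. `CacheInfo`'s fields, `probe_uncached`, `cache_sizes`, and the `PROBE_CALLS` counter are all hypothetical names, and the counter exists only to make the run-once behavior observable.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct CacheInfo {
    pub l1d: usize, // bytes
    pub l2: usize,  // bytes
}

/// Counts how often the (potentially expensive) probe actually runs.
static PROBE_CALLS: AtomicUsize = AtomicUsize::new(0);

/// Stand-in for the real platform probe; returns the fallback values.
fn probe_uncached() -> CacheInfo {
    PROBE_CALLS.fetch_add(1, Ordering::Relaxed);
    CacheInfo { l1d: 32 * 1024, l2: 256 * 1024 }
}

/// Memoized accessor: the probe runs on first use, later calls reuse the result.
pub fn cache_sizes() -> CacheInfo {
    static CACHED: OnceLock<CacheInfo> = OnceLock::new();
    *CACHED.get_or_init(probe_uncached)
}
```

Calling `cache_sizes()` repeatedly returns the same value while `PROBE_CALLS` stays at 1, so a hot path such as a budget constructor can call it freely.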

Reviewed changes

Copilot reviewed 7 out of 8 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| diskann-quantization/src/multi_vector/distance/mod.rs | Wires in the new cache submodule. |
| diskann-quantization/src/multi_vector/distance/kernels/mod.rs | Switches TileBudget::default() to runtime-detected cache sizes. |
| diskann-quantization/src/multi_vector/distance/cache/mod.rs | Adds memoized cache probing API and basic plausibility/memoization tests. |
| diskann-quantization/src/multi_vector/distance/cache/cpuid.rs | Implements x86_64 cache detection using raw-cpuid. |
| diskann-quantization/src/multi_vector/distance/cache/linux.rs | Implements aarch64 Linux detection via sysfs cache entries. |
| diskann-quantization/src/multi_vector/distance/cache/macos.rs | Implements aarch64 macOS detection via sysctl hw.perflevel0.*. |
| diskann-quantization/Cargo.toml | Adds target-specific dependencies for the detection paths. |
| Cargo.lock | Records the new dependency resolutions. |
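
The contents of linux.rs are not shown in this conversation. As a rough illustration of what a sysfs cache probe involves, the sketch below scans a cache directory such as /sys/devices/system/cpu/cpu0/cache for `index*` entries and returns the raw `size` string of the first one matching a level and type. The function names are hypothetical, and the directory is passed in as a parameter so the logic can be exercised against a fake tree.

```rust
use std::fs;
use std::path::Path;

/// Read a small sysfs-style file and strip the trailing newline.
fn read_trimmed(path: &Path) -> Option<String> {
    fs::read_to_string(path).ok().map(|s| s.trim().to_owned())
}

/// Return the raw `size` string (e.g. "32K") of the first `index*` entry
/// whose `level` equals `level` and whose `type` is one of `types`.
fn find_cache_size(cache_dir: &Path, level: &str, types: &[&str]) -> Option<String> {
    for entry in fs::read_dir(cache_dir).ok()? {
        let dir = entry.ok()?.path();
        let is_index = dir
            .file_name()
            .and_then(|n| n.to_str())
            .map_or(false, |n| n.starts_with("index"));
        if !is_index {
            continue;
        }
        let lvl = read_trimmed(&dir.join("level"));
        let ty = read_trimmed(&dir.join("type"));
        let type_matches = ty.as_deref().map_or(false, |t| types.contains(&t));
        if lvl.as_deref() == Some(level) && type_matches {
            return read_trimmed(&dir.join("size"));
        }
    }
    None
}
```

An L1d lookup would ask for level "1" with type "Data", and an L2 lookup for level "2" with type "Unified"; the returned string still needs to be parsed into bytes.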

Comment thread diskann-quantization/src/multi_vector/distance/cache/cpuid.rs Outdated
Comment thread diskann-quantization/src/multi_vector/distance/cache/linux.rs
codecov-commenter commented Jun 5, 2026

Codecov Report

❌ Patch coverage is 95.91837% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.41%. Comparing base (6168ef0) to head (3be0354).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...uantization/src/multi_vector/distance/cache/mod.rs | 90.47% | 2 Missing ⚠️ |
Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1134      +/-   ##
==========================================
+ Coverage   89.40%   89.41%   +0.01%     
==========================================
  Files         485      487       +2     
  Lines       92079    92126      +47     
==========================================
+ Hits        82324    82376      +52     
+ Misses       9755     9750       -5     
| Flag | Coverage Δ | |
| --- | --- | --- |
| miri | 89.41% <95.91%> (+0.01%) | ⬆️ |
| unittests | 89.07% <95.91%> (+0.01%) | ⬆️ |

Flags with carried forward coverage won't be shown.

| Files with missing lines | Coverage Δ | |
| --- | --- | --- |
| ...ntization/src/multi_vector/distance/cache/cpuid.rs | 100.00% <100.00%> | (ø) |
| ...ntization/src/multi_vector/distance/kernels/mod.rs | 100.00% <100.00%> | (ø) |
| ...uantization/src/multi_vector/distance/cache/mod.rs | 90.47% <90.47%> | (ø) |

... and 5 files with indirect coverage changes

- cpuid.rs: reword the module doc so the CPUID 0x4 / 0x8000001D leaf
  selection is attributed to raw-cpuid's cache-parameter enumeration.
  The vendor dispatch is internal to the crate, not visible at our call
  site, which the original wording obscured.
- linux.rs: parse_size uses checked_mul for the K/M/G suffixes so an
  oversized sysfs value returns None instead of silently wrapping in
  release builds, where overflow checks are off. Add a regression test.
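
The overflow fix in the linux.rs bullet can be illustrated with the sketch below. It is not the PR's `parse_size`; it only assumes the sysfs `size` format of a decimal number with an optional K/M/G suffix and shows how `checked_mul` turns an oversized value into `None` instead of a wrapped result.

```rust
/// Parse a sysfs-style size such as "32K" or "2M" into bytes.
/// Returns `None` on malformed input or if the result overflows `usize`.
fn parse_size(raw: &str) -> Option<usize> {
    let s = raw.trim();
    // The suffix, if present, is a single ASCII byte, so slicing it off is safe.
    let (digits, multiplier) = match s.chars().last()? {
        'K' | 'k' => (&s[..s.len() - 1], 1usize << 10),
        'M' | 'm' => (&s[..s.len() - 1], 1usize << 20),
        'G' | 'g' => (&s[..s.len() - 1], 1usize << 30),
        _ => (s, 1usize),
    };
    // A plain `*` would wrap silently in release builds; `checked_mul` does not.
    digits.parse::<usize>().ok()?.checked_mul(multiplier)
}
```

A value like "32K" parses to 32768, while a number near `usize::MAX` followed by "K" yields `None`, which a caller can treat the same as a failed probe.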
Development

Successfully merging this pull request may close these issues.

Add support for reading L1/L2 cache sizes for different platforms(Windows, Linux, MacOS) for our efficient cache aware multi-vector distance functions.

3 participants