Skip to content

fix: prevent discovery cache corruption under concurrent writers#316

Open
akroshg wants to merge 2 commits into
mainfrom
akroshg/fix-discovery-cache-race
Open

fix: prevent discovery cache corruption under concurrent writers#316
akroshg wants to merge 2 commits into
mainfrom
akroshg/fix-discovery-cache-race

Conversation

@akroshg
Copy link
Copy Markdown
Contributor

@akroshg akroshg commented May 29, 2026

Summary

Two fixes that surfaced while debugging intermittent build failures observed by an Edge consumer running many parallel WebUI builds against a shared environment.

1. fix(discovery): prevent cache corruption under concurrent writers

DiscoveryCache::put staged every write under {key}.tmp in the shared ~/.webui/cache/components/ directory. When multiple processes resolved the same package concurrently, each writer reused the same temp path, so a late writer's rename could clobber the cache entry while another writer was still mid-write — producing zero-length or truncated cache entries that subsequent reads then consumed as cache hits.

Fix: stage writes under {key}.{pid}.{counter}.tmp so concurrent writers no longer collide, and clean up the temp file when the final rename fails.

Test: new test_concurrent_put_does_not_corrupt_cache — 8 threads × 64 KB payload, asserts that the post-storm cache is intact and that no orphan .tmp files are left for the key under test. The straggler check is scoped to our key's prefix so sibling tests writing into the same shared cache cannot make it flake.

2. chore(node): surface native addon load failures in build fallback

When the platform-specific native addon failed to load (corrupted binary, missing symbol, version skew), packages/webui silently fell back to the CLI path with no indication of why, and the warning fired on every build invocation.

Fix: capture the loader error, include it in the WASM fallback warning, the CLI fallback warning, and the "cannot build" error. Gate the CLI fallback warning behind cliFallbackWarned so it fires once per process.

Validation

  • cargo fmt --all -- --check — clean
  • cargo clippy --workspace -- -D warnings — clean
  • cargo test -p microsoft-webui-discovery --lib32 passed, including the new race-regression test
  • Edge-side per-process tmpdir fix (separate change) plus this PR validated over 1540 sequential + parallel Edge builds with 0 failures (vs 26.25% pre-fix)

Test coverage notes

  • cache.rs: covered end-to-end by the new concurrency stress test plus the existing round-trip / invalidation / cache-miss tests.
  • index.ts: not covered by a new automated test. The existing integration.test.ts only exercises the happy path. Fault-injecting an addon load failure would require mocking module-private functions or running with a deliberately broken native addon, which is disproportionate effort for ~50 lines of bookkeeping and string formatting. The change is small, obviously correct on inspection, and ships behind a fallback that is itself already exercised in production.

akroshg and others added 2 commits May 28, 2026 23:25
When several processes built against the same ~/.webui/cache/components/
directory concurrently, every writer staged its bytes under the same
{key}.tmp path. The last finisher's rename clobbered the file in place
while another writer was still mid-write, leaving zero-length or
truncated entries that later reads consumed as cache hits.

Stage each write under a unique {key}.{pid}.{counter}.tmp name so
concurrent writers no longer collide, and clean up the temp file when
the final rename fails. Adds test_concurrent_put_does_not_corrupt_cache
(8 threads / 64 KB payload) to guard against regression.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When the platform-specific native addon failed to load (corrupted
binary, missing symbol, version skew), packages/webui silently fell
back to the CLI path with no indication of *why*, making the failure
mode hard to triage and producing a noisy warning on every build call.

Capture the original loader error and include it in the WASM fallback
warning, the CLI fallback warning, and the "cannot build" error.
Gate the CLI fallback warning behind cliFallbackWarned so it fires
once per process instead of per build invocation.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Comment on lines +204 to +209
// many times per second; spamming a multi-line warning on every call
// is noisy and (for some terminals) actually expensive.
cliFallbackWarned = true;
console.warn(
`[webui] Native addon failed to load; falling back to CLI binary at ${binPath}.` +
describeAddonError(),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we warning and not an error? If native addon failed to load, then the whole thing is broken, right? So spamming might be better or we loose that error if other logs come. A better solution is have better error handling so that we crash on startup with detailed error message

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants