Emit named IN_CREATE/IN_DELETE for inotify directory watches#58
Emit named IN_CREATE/IN_DELETE for inotify directory watches#58chenhunghan wants to merge 3 commits into
Conversation
4aa4da1 to
7fd5187
Compare
| * On failure keep the previous baseline; the next successful snapshot | ||
| * reconciles whatever changed in between. | ||
| */ | ||
| if (dir_snapshot(w->path, &now, &now_n)) { |
There was a problem hiding this comment.
Stale w->path reads the wrong directory after rename
The snapshot/diff design rests on w->path (captured by strdup(path) at inotify_add_watch time, src/syscall/inotify.c:594) being a valid handle to the directory whose entries the diff should compare. The watch's identity, however, is w->host_fd — an O_EVTONLY fd that follows the inode across renames. Once the two diverge, dir_snapshot(w->path, ...) at src/syscall/inotify.c:409 no longer reads the directory the watch is registered for.
I reproduced both failure modes with two small static aarch64 binaries cross-built and run against build/elfuse at this PR's tip (7fd5187).
Scenario A — old path no longer exists
mkdtemp /tmp/A
inotify_add_watch(fd, /tmp/A, IN_CREATE | IN_DELETE)
rename /tmp/A -> /tmp/A-renamed /* watched inode moves with the fd */
touch /tmp/A-renamed/child /* in the watched dir */
mv raises NOTE_RENAME on the watched fd (delivered as IN_MOVE_SELF, correct). The subsequent touch raises NOTE_WRITE on the same fd; process_vnode_event calls dir_snapshot("/tmp/A", ...), opendir returns ENOENT, the function returns false, the keep-baseline branch fires — and no IN_CREATE for child is ever emitted.
Actual output:
watching /tmp/elfuse-renA-KRUJg3 (wd=1)
renamed /tmp/elfuse-renA-KRUJg3 -> /tmp/elfuse-renA-KRUJg3-renamed; old path is now nonexistent
created /tmp/elfuse-renA-KRUJg3-renamed/child in the renamed (watched) dir
IN_CREATE name="child" count: 0 (expected 1 on Linux)
verdict: BUG REPRODUCED (named event lost after rename)
From the caller's perspective the watch silently stops naming children after a rename — exactly the failure mode #55 was opened to fix.
Scenario B — old path is taken by a different inode
mkdtemp /tmp/A
inotify_add_watch(fd, /tmp/A, IN_CREATE | IN_DELETE)
rename /tmp/A -> /tmp/A-renamed /* watched inode moves with the fd */
mkdir /tmp/A /* fresh dir, different inode, unwatched */
touch /tmp/A/decoy /* in the new unwatched dir */
touch /tmp/A-renamed/child /* in the watched dir */
Now opendir("/tmp/A") succeeds, but it reads the new /tmp/A, not the directory the watch tracks. The diff compares the new directory's entries (["decoy"]) to the baseline ([]) and synthesises an IN_CREATE for the decoy. The real child event in the renamed directory never surfaces because the watched dir is never sampled.
Actual output:
step 1: watching /tmp/elfuse-rename-A-nydrPl (wd=1)
step 2: rename /tmp/elfuse-rename-A-nydrPl -> /tmp/elfuse-rename-A-nydrPl-renamed (watched inode moves with the fd)
step 3: mkdir /tmp/elfuse-rename-A-nydrPl again (NEW inode at old path -- unwatched)
step 4a: created /tmp/elfuse-rename-A-nydrPl/decoy in the NEW (unwatched) dir
step 4b: created /tmp/elfuse-rename-A-nydrPl-renamed/child in the WATCHED (renamed) dir
step 5: draining events for ~1s
event: wd=1 mask=0x00000100 (IN_CREATE) len=8 name=decoy
summary:
total IN_CREATE: 1
IN_CREATE name="decoy" (BUG if >0): 1
IN_CREATE name="child" (expected: 1): 0
verdict: BUG REPRODUCED
The emitted event is objectively wrong on both axes: it names a file in a directory the caller never watched, and it suppresses the named event for the actual child of the watched directory. This isn't theoretical — build systems (yarn/pnpm/Cargo) and test runners routinely mv build → build.old; mkdir build; … for atomic swap; the watch on the old build/ will start fabricating events for the unrelated replacement directory after the swap.
Reproducers
Both binaries are small static aarch64 ELFs cross-built with aarch64-linux-gnu-gcc -static -O2 -D_GNU_SOURCE. Sources attached as Scenario A / Scenario B above; they exit non-zero on the buggy path. A correct fix should make both exit zero (Scenario A: emit IN_CREATE name="child"; Scenario B: emit IN_CREATE name="child" and not name="decoy").
This comment was marked as resolved.
This comment was marked as resolved.
EVFILT_VNODE reports that a watched directory changed but not which child, so directory events were queued with no name. fsnotify-based consumers (notably the k0s manifest applier, which re-applies only when a *.yaml entry appears) filter on the entry name and silently drop nameless events, so manifests written after the watch was established were never picked up. Keep a per-watch snapshot of the directory's entry names; on each NOTE_WRITE re-list the directory and diff against the snapshot to emit a named IN_CREATE per added child and IN_DELETE per removed one, matching real inotify semantics. The blocking-read and non-blocking collect paths share one process_vnode_event() helper. The snapshot is allocated on add_watch and freed on rm_watch and inotify_close. Add a regression test (test-inotify Test 6) that watches a fresh directory, creates a child, and asserts a named IN_CREATE for it is delivered; this fails before the fix (the event arrives without a name). Validated with make check on Apple Silicon. (cherry picked from commit 2a5fa28b5257ceb5ed116c231aecc7c4a2ef54ab)
A directory watch snapshots the child-name set on each change and diffs it against the previous baseline to recover the name kqueue omits. dir_snapshot returned an empty list on any failure (opendir error, a mid-read readdir error, or an allocation failure), which the diff could not tell apart from a genuinely empty directory. On a transient failure the IN_DELETE pass then saw every known child as missing and queued a spurious IN_DELETE for each, and the baseline was overwritten with the empty list, so every later change re-reported the whole directory as IN_CREATE. The corruption was permanent. Make dir_snapshot return bool: on failure it frees any partial result and reports false, and readdir read errors are now detected by clearing errno before each call. process_vnode_event only diffs against and advances to a snapshot that succeeded; on failure it keeps the previous baseline so the next successful snapshot reconciles whatever changed in between. The watch-add path documents that a failed initial snapshot is a best-effort empty baseline. Validated with make check on Apple Silicon; test-inotify still passes 6/6.
7fd5187 to
c0a1182
Compare
process_vnode_event() ran opendir()/readdir() (via dir_snapshot) for a directory watch's NOTE_WRITE while holding the global inotify_lock, so a large or slow directory scan blocked every other inotify operation -- on all instances -- for its duration. This is new I/O-under-lock introduced with the named-event diffing. Take the snapshot with the lock released: copy the watch's path and dev/ino under the lock, drop the lock for the opendir/readdir I/O, then re-acquire and re-validate the instance (by guest_fd) and the watch (by host_fd + dev/ino) before applying the diff. A close or host_fd reuse during the window discards the stale snapshot. Both collect paths (collect_events and the blocking read) re-check the instance after the call and return EBADF if it was torn down, mirroring the existing post-kevent re-validation. No change to the emitted events (test-inotify still 6/6); addresses the lock-contention review finding.
|
Rebased onto the latest While here I also addressed the lock-contention review finding in a separate commit (9c28c56): the directory |
There was a problem hiding this comment.
2 issues found across 1 file (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="src/syscall/inotify.c">
<violation number="1" location="src/syscall/inotify.c:417">
P1: Directory diffs are path-based, so a rename makes named IN_CREATE/IN_DELETE report against the stale pathname instead of the watched inode.</violation>
<violation number="2" location="src/syscall/inotify.c:433">
P2: Revalidation uses recycled numeric fds as instance identity, so a close+reopen race can let a stale directory snapshot mutate the new watch.</violation>
</file>
Tip: Review your code locally with the cubic CLI to iterate faster.
Re-trigger cubic
| /* Copy the path + identity, then release the lock for the opendir/ | ||
| * readdir snapshot so filesystem I/O does not block other instances. | ||
| */ | ||
| char *path = strdup(w->path); |
There was a problem hiding this comment.
P1: Directory diffs are path-based, so a rename makes named IN_CREATE/IN_DELETE report against the stale pathname instead of the watched inode.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/syscall/inotify.c, line 417:
<comment>Directory diffs are path-based, so a rename makes named IN_CREATE/IN_DELETE report against the stale pathname instead of the watched inode.</comment>
<file context>
@@ -398,41 +406,77 @@ static int process_vnode_event(inotify_instance_t *inst,
+ /* Copy the path + identity, then release the lock for the opendir/
+ * readdir snapshot so filesystem I/O does not block other instances.
+ */
+ char *path = strdup(w->path);
+ if (path) {
+ dev_t dev = w->dev;
</file context>
| * different file. Any of these discards the stale snapshot. | ||
| */ | ||
| widx = watch_find_by_hostfd(inst, host_fd); | ||
| if (inotify_state[slot].guest_fd != guest_fd || widx < 0) { |
There was a problem hiding this comment.
P2: Revalidation uses recycled numeric fds as instance identity, so a close+reopen race can let a stale directory snapshot mutate the new watch.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/syscall/inotify.c, line 433:
<comment>Revalidation uses recycled numeric fds as instance identity, so a close+reopen race can let a stale directory snapshot mutate the new watch.</comment>
<file context>
@@ -398,41 +406,77 @@ static int process_vnode_event(inotify_instance_t *inst,
+ * different file. Any of these discards the stale snapshot.
+ */
+ widx = watch_find_by_hostfd(inst, host_fd);
+ if (inotify_state[slot].guest_fd != guest_fd || widx < 0) {
+ free_dir_snapshot(now, now_n);
+ return 0;
</file context>
Fixes #55.
Synthesizes named
IN_CREATE/IN_DELETEfor directory watches by snapshotting the directory's entry names and diffing on eachNOTE_WRITE(kqueueEVFILT_VNODEreports only that the directory changed, not which child). Full rationale in the commit message.Adds a regression test (
test-inotifyTest 6) that watches a fresh directory, creates a child, and asserts a namedIN_CREATEis delivered.Validation on Apple Silicon:
make elfuseandmake checkpass; the new test fails before this change (event arrives with an empty name) and passes after.Summary by cubic
Emit named IN_CREATE/IN_DELETE for directory watches by diffing per‑watch snapshots, and keep the previous baseline on snapshot failure. Snapshots now run without holding
inotify_lockto avoid blocking other watches. Fixes nameless directory events so fsnotify tools can match child names (fixes #55).Written for commit 9c28c56. Summary will update on new commits.