fix: make exclude patterns recursive to prevent index pollution#76
fix: make exclude patterns recursive to prevent index pollution#76PatrickSys merged 4 commits intomasterfrom
Conversation
The indexer's exclude patterns were non-recursive (e.g. `coverage/**`), only matching at the project root. Nested occurrences in monorepo packages and worktrees passed through, polluting the index with generated artifacts and worktree copies. - Extract EXCLUDED_DIRECTORY_NAMES and EXCLUDED_GLOB_PATTERNS into src/constants/codebase-context.ts as the single source of truth - Indexer, file-watcher, and project-discovery all import from there - Add missing directories: .cache, .claude, .planning, worktrees, target, vendor, .nx, .turbo, .next, build - Add integration test reproducing the consumer audit failure case (nested coverage/, .claude/worktrees/, worktrees/, dist/)
Greptile SummaryThis PR centralises all exclude-directory logic into a single Changes:
Issues found:
Confidence Score: 4/5
|
| Filename | Overview |
|---|---|
| src/constants/codebase-context.ts | Adds EXCLUDED_DIRECTORY_NAMES, EXCLUDED_GLOB_PATTERNS, and DISCOVERY_ONLY_IGNORED as a single source of truth; minor issue: EXCLUDED_GLOB_PATTERNS should be typed as readonly string[] |
| src/core/indexer.ts | Replaces hardcoded non-recursive exclude patterns (e.g. node_modules/) with EXCLUDED_GLOB_PATTERNS (/{dir}/**) — core fix for nested artifact indexing; logically correct |
| src/core/file-watcher.ts | Adopts shared EXCLUDED_GLOB_PATTERNS; adds previously-missing directories (build, target, vendor, worktrees, .claude) to the ignored list; unnecessary spread is a minor style issue |
| src/utils/project-discovery.ts | Rebuilds IGNORED_DIRECTORY_NAMES from the two shared constants; change is backward-compatible and only adds new directories (.cache, .claude, .nx, .planning, .codebase-context, worktrees) |
| tests/indexer-exclude-patterns.test.ts | New integration test for the audit failure case; has two issues: analyzerRegistry not cleaned up between tests (potential singleton pollution), and the index-format fallback silently produces an empty chunk list that can mask test failures |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[src/constants/codebase-context.ts] -->|EXCLUDED_DIRECTORY_NAMES| B[EXCLUDED_GLOB_PATTERNS\n**\/dir\/**]
A -->|DISCOVERY_ONLY_IGNORED| C[project-discovery\nIGNORED_DIRECTORY_NAMES]
B --> D[src/core/indexer.ts\nexclude: EXCLUDED_GLOB_PATTERNS]
B --> E[src/core/file-watcher.ts\nignored: EXCLUDED_GLOB_PATTERNS]
A -->|EXCLUDED_DIRECTORY_NAMES| C
D -->|glob exclude| F{File path\nmatches **/dir/**?}
E -->|chokidar ignored| F
C -->|Set.has check| G{Directory name\nin ignored set?}
F -->|Yes| H[Skip — not indexed / not watched]
F -->|No| I[Index / Watch]
G -->|Yes| J[Skip — don't recurse]
G -->|No| K[Recurse into directory]
Last reviewed commit: "style: format with p..."
| let tempDir: string; | ||
|
|
||
| beforeEach(async () => { | ||
| analyzerRegistry.register(new GenericAnalyzer()); |
There was a problem hiding this comment.
analyzerRegistry not cleaned up between tests
analyzerRegistry.register(new GenericAnalyzer()) is called in beforeEach but there is no corresponding cleanup in afterEach. Because analyzerRegistry is a module-level singleton (ES module caching means all tests in the same worker share it), each additional test added to this describe block will register another GenericAnalyzer instance, causing the registry to accumulate duplicates across runs. If analyzerRegistry has a clear() or unregister() method, call it in afterEach; otherwise, use vi.resetModules() or move the setup into the single test body.
| analyzerRegistry.register(new GenericAnalyzer()); | |
| analyzerRegistry.clear(); // reset any previously registered analyzers | |
| analyzerRegistry.register(new GenericAnalyzer()); |
(If clear() does not exist, expose one on the registry or isolate module state with vi.isolateModules.)
| const chunks = ( | ||
| Array.isArray(indexRaw) | ||
| ? indexRaw | ||
| : Array.isArray(indexRaw?.chunks) | ||
| ? indexRaw.chunks | ||
| : [] | ||
| ) as Array<{ filePath: string }>; |
There was a problem hiding this comment.
Fragile index-format detection silently produces empty chunk list
If the on-disk format of KEYWORD_INDEX_FILENAME is neither a bare array nor { chunks: [...] }, the chunks variable silently defaults to []. In that scenario:
indexedPathsis[]- The "polluter" assertions (
expect(leaked).toEqual([])) all pass trivially - Only the "legitimate file must be indexed" assertion catches the problem
A silent pass on the polluter checks is dangerous because it looks like the exclusion works when the test is actually not exercising anything. Consider throwing an explicit error when the format is unrecognised:
| const chunks = ( | |
| Array.isArray(indexRaw) | |
| ? indexRaw | |
| : Array.isArray(indexRaw?.chunks) | |
| ? indexRaw.chunks | |
| : [] | |
| ) as Array<{ filePath: string }>; | |
| const chunks = ( | |
| Array.isArray(indexRaw) | |
| ? indexRaw | |
| : Array.isArray(indexRaw?.chunks) | |
| ? indexRaw.chunks | |
| : (() => { throw new Error(`Unexpected index format: ${JSON.stringify(Object.keys(indexRaw ?? {}))}`); })() | |
| ) as Array<{ filePath: string }>; |
src/constants/codebase-context.ts
Outdated
| export const EXCLUDED_GLOB_PATTERNS: string[] = EXCLUDED_DIRECTORY_NAMES.map( | ||
| (dir) => `**/${dir}/**` | ||
| ); |
There was a problem hiding this comment.
EXCLUDED_GLOB_PATTERNS should be readonly string[]
EXCLUDED_DIRECTORY_NAMES is correctly declared as const (giving it a readonly tuple type), but EXCLUDED_GLOB_PATTERNS is typed as a mutable string[]. Any consumer can accidentally push, pop, or splice into it, breaking all three callers (indexer, file-watcher, and project-discovery).
| export const EXCLUDED_GLOB_PATTERNS: string[] = EXCLUDED_DIRECTORY_NAMES.map( | |
| (dir) => `**/${dir}/**` | |
| ); | |
| export const EXCLUDED_GLOB_PATTERNS: readonly string[] = EXCLUDED_DIRECTORY_NAMES.map( | |
| (dir) => `**/${dir}/**` | |
| ); |
| '**/.next/**', | ||
| '**/.cache/**' | ||
| ], | ||
| ignored: [...EXCLUDED_GLOB_PATTERNS], |
There was a problem hiding this comment.
Unnecessary spread of an already-mutable array
EXCLUDED_GLOB_PATTERNS is already typed as string[], so spreading it into a new array literal is redundant — it just allocates a copy.
| ignored: [...EXCLUDED_GLOB_PATTERNS], | |
| ignored: EXCLUDED_GLOB_PATTERNS, |
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 31fdda5df3
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| '.nx', | ||
| '.planning', | ||
| '.turbo', | ||
| 'build', |
There was a problem hiding this comment.
Do not drop nested build scripts from indexing
Adding 'build' to the shared recursive exclude list means mergeConfig() now passes **/build/** to scanFiles() (src/core/indexer.ts:278,1033-1039) and startFileWatcher() ignores the same paths (src/core/file-watcher.ts:46-50). In monorepos that keep checked-in tooling under packages/*/build/ or similar, those source files will silently disappear from search and auto-refresh even though the old build/** rule only skipped the repository root’s build output.
Useful? React with 👍 / 👎.
- Make EXCLUDED_GLOB_PATTERNS readonly to prevent accidental mutation by consumers (spreads at call sites are now intentional, creating mutable copies for APIs that require string[]) - Throw on unrecognized index format in test instead of silently defaulting to empty array (prevents polluter assertions from passing vacuously) - Move analyzerRegistry.register into test body — only one test, no need for beforeEach ceremony
Set.has() requires the argument to match the set's type parameter. Spreading as-const tuples into a Set infers a narrow literal union, which rejects entry.name (plain string) at the call site on line 178.
Summary
EXCLUDED_DIRECTORY_NAMESandEXCLUDED_GLOB_PATTERNSintosrc/constants/codebase-context.ts— single source of truth for the indexer, file-watcher, and project-discovery**/coverage/**instead ofcoverage/**) so nested occurrences in monorepo packages and worktrees are caught.cache,.claude,.planning,worktrees,target,vendor,.nx,.turbo,.next,buildFixes the two high-severity findings from the consumer audit (2026-03-17, SSP_Portal):
.claudedirectories leaking into the indexpackages/ui/coverage/prettify.js) indexed as real codeThis is the only unique delta from the closed PR #75 (commit d1197a9).
Test plan
tests/indexer-exclude-patterns.test.ts) reproduces the exact audit failure case — nestedcoverage/,.claude/worktrees/,worktrees/, anddist/directories