fix(llmo): include subpath in generateDataFolder to prevent SharePoint folder collisions#2315
fix(llmo): include subpath in generateDataFolder to prevent SharePoint folder collisions#2315alinarublea merged 18 commits intomainfrom
Conversation
…t folder collisions Subpath sites on the same domain (e.g. nba.com/kings, nba.com/lakers) were mapping to the same SharePoint folder because only the hostname was used. Now the URL path is included so each subpath site gets a unique folder name. Backward compatible: root-domain sites produce the same folder as before. Fixes LLMO-4186 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
alinarublea
left a comment
There was a problem hiding this comment.
Code review
Fix is correct and the test coverage is solid. Two observations:
url.hostname vs companion PR's url.host
This PR correctly uses url.hostname (no port). The companion fix in spacecat-shared PR #1577 uses url.host (includes port) for resolveCustomerSecretsName. The inconsistency means a subpath site with an explicit port would get a different secret key but the same folder name. Worth aligning both to url.hostname.
Missing test for nested paths
https://nba.com/us/kings → dev/nba-com-us-kings is the expected behavior but there's no test covering it. One extra assertion in the new test would be enough.
Neither is a blocker.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
This PR will trigger a patch release when merged. |
…generateDataFolder
- Split pathname into segments and join with '--' so 'nba.com/us/kings' and
'nba.com/us-kings' produce distinct folder names
- Use split('/').filter(Boolean) to handle repeated slashes correctly
- Update JSDoc with folder name format and examples
- Add trailing-slash idempotence test for subpaths
- Add separator-collision test
Addresses review feedback on LLMO-4186
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
dzehnder
left a comment
There was a problem hiding this comment.
Hey @alinarublea,
Thanks for the focused fix and the upfront self-review. The core change is correct and the tests cover the headline scenarios, but a few things came up around the new identity convention that warrant attention before this lands - mostly because generateDataFolder is the load-bearing identity for a customer's SharePoint data, and a folder collision or a silent rename has user-visible consequences.
Strengths
- The fix correctly closes the headline collision: distinct subpath sites on the same hostname (
nba.com/kingsvsnba.com/lakers) now resolve to distinct folders (src/controllers/llmo/llmo-onboarding.js:272-273). - Backward compatibility for root-domain new onboardings is preserved by the
pathSuffix ? ... : hostternary - root sites land in the same folder names as before. Importantly,composeBaseURL(the upstream caller) lower-cases the input, sopathSuffix.toLowerCase()is belt-and-suspenders but harmless. - The blast radius is right: existing onboarded sites read the persisted
dataFolderfromSiteconfig, so this PR is largely a no-op for them. The fix only changes behavior on the paths that re-derive the folder frombaseURL. - New tests at
test/controllers/llmo/llmo-onboarding.test.js:385-405cover subpath uniqueness, root trailing-slash equivalence, and nested subpaths in bothprodanddev. Both points raised in the self-review (hostname/host alignment, nested-path test) are resolved before review was requested.
Issues
Critical (Must Fix)
-
Separator ambiguity allows two distinct sites to land in the same SharePoint folder (
src/controllers/llmo/llmo-onboarding.js:273)-is used both as the inner-character replacement AND as the host/path separator. The result is non-injective:baseURL A baseURL B both resolve to https://nba.com/us/kingshttps://nba.com/us-kingsnba-com-us-kingshttps://nba.com/a/bhttps://nba.com/a-bnba-com-a-bhttps://x.com//doublehttps://x.com/-doublex-com--doublevalidateSiteNotOnboardedonly checks whether the computed folder already exists in SharePoint, so the second tenant onboarding a colliding URL either receives an "already onboarded" error attributed to another tenant (existence leak), or - if onboarded via the Slack path - is silently mapped onto the prior tenant's folder. SharePoint inheritance then exposes one customer's documents to another. This is the same bug class this PR set out to fix, just at a different boundary: two paying customers in one folder is a confidentiality breach.Pick a separator that cannot appear inside either segment - the smallest change is to keep
-for inner replacement and switch the host/path join to_(valid in SharePoint folder names):const dataFolderName = pathSuffix ? `${host}_${pathSuffix}` : host;
Root-domain folders stay unchanged (
nba-com). Add an injectivity test:nba.com/us/kingsshould not equalnba.com/us-kings. -
Migration impact for any already-onboarded subpath customer is unaddressed and breaks offboarding (
src/controllers/llmo/llmo-onboarding.js:273, with downstream effect atsrc/controllers/llmo/llmo-onboarding.js:1478-1481)The PR description says "Backward compatible," but that holds only for sites with no path. Any customer currently onboarded with a non-root
baseURLhas their data undernba-com(host only). After this ships, every code path that re-derives the folder maps the same site tonba-com-kings. Specifically:performLlmoOffboardingusesllmoConfig?.dataFolder ?? generateDataFolder(baseURL, env.ENV). If a legacy site's persisteddataFolderis missing, offboarding now computes the new name and callsdeleteSharePointFolderon a folder that does not exist - leaving real customer data undeleted (compliance / GDPR concern) and silently completing offboarding while the data persists.validateSiteNotOnboardedfor any subsequent re-onboard would not find the new folder (still under the old name), pass validation, and create a fresh empty folder. The customer's prior dashboards live under the old folder; the API now writes under the new one. Split / partial data, no error surfaced.appendRowsToQueryIndexandenqueueLlmoOnboardingPublishlikewise resolve to the new name and produce broken cross-references.
Two paths forward, pick one and document in the PR description / LLMO-4186:
- Run a one-time DB query against
Sitefor any record with a non-/path inbaseURL. If zero exist, document that explicitly on the PR and finding (2) becomes "ship with a follow-up to backfillllmoConfig.dataFolderso future re-derivations are stable." - If subpath sites do exist, persist their existing
dataFolderinto their LLMO config before deploy, so thellmoConfig?.dataFolder ?? ...fallback hits the cached value, then enable the new derivation only for new onboardings.
Important (Should Fix)
-
JSDoc is stale and now actively misleading (
src/controllers/llmo/llmo-onboarding.js:262-267)The pre-existing block reads
@param {string} domain - The domain name, but the actual parameter isbaseURL, and after this PR the function meaningfully consumes the URL path, not just the hostname. The PR touched the body but left the contract unchanged. Suggested:/** * Generates the SharePoint data folder name from a site's base URL. * Includes the URL path (if any) so subpath sites on the same hostname * map to distinct folders. Root-domain sites produce the same folder * name as before this change. * @param {string} baseURL - The site's base URL (must be a fully-qualified URL). * @param {string} env - The environment ('prod' or anything else, treated as dev). * @returns {string} The data folder name (prefixed with 'dev/' for non-prod). */
This is the only documentation a future caller sees, and the parameter rename + behavior change should be reflected.
-
Missing test for subpath trailing-slash equivalence (
test/controllers/llmo/llmo-onboarding.test.js:393-397)The new trailing-slash idempotence test only covers root URLs. The same property holds for subpaths and is not asserted - even though the implementation depends on
replace(/^\/|\/$/g, ''). A future refactor (e.g. swapping topathname.split('/').filter(Boolean).join('-')) could silently change behavior for subpath URLs and create two folders for the same site. Trailing slashes on subpath URLs are very common in onboarding inputs (Slack pastes, CMS configs).expect(generateDataFolder('https://nba.com/kings/', 'prod')) .to.equal(generateDataFolder('https://nba.com/kings', 'prod'));
Minor (Nice to Have)
-
Empty path segments, double slashes, and percent-encoded characters produce surprising folder names (
src/controllers/llmo/llmo-onboarding.js:271-272)https://nba.com//foo->nba-com--foo(double-dash from leading empty segment;^\/|\/$only strips one slash)https://nba.com/.well-known->nba-com--well-knownhttps://nba.com/k%C3%B6nig->nba-com-k-c3-b6nig(URL.pathnamedoes not decode; the encoded form, the decodedhttps://nba.com/könig, and a developer-typedhttps://nba.com/koenigall produce different names)
Combined with the separator-ambiguity item above, these expand the collision surface. Two cheap fixes: collapse runs of
-after sanitization (.replace(/-+/g, '-')) and trim leading/trailing-from the path segment; optionallydecodeURIComponentthe path before sanitizing. Add the double-slash and dot-prefix cases as explicit tests. -
Pin edge-case behavior in the test suite (
test/controllers/llmo/llmo-onboarding.test.js)No assertions for query strings, fragments, uppercase paths, or percent-encoded characters. Pinning current behavior (whatever it is) prevents an unrelated URL-handling change from silently shifting folder names later. 2-3 tightly-scoped assertions are enough.
Recommendations
- Persist
dataFolderas the source of truth. The architectural smell here is thatgenerateDataFolderis treated as deterministic frombaseURLforever, even though the persisted folder name inSite.config.llmo.dataFolderis the actual identity. A site changing itsbaseURL(TLD migration, www -> apex, http -> https, path canonicalization) would silently re-bind to a new folder. TreatgenerateDataFolderas a one-shot bootstrap function and have offboarding / republish / appendRows resolve from the persisted name only, failing loudly when it is missing rather than silently regenerating. That is a separate refactor, but worth filing. - Cross-link the two functions. This PR plus
resolveCustomerSecretsName(the companion inspacecat-shared) now both encode "site -> stable identity" with different separators. Each is correct for its sink (SharePoint accepts hyphens; AWS Secrets Manager keys conventionally use_). A one-line comment in each pointing to its counterpart - "SharePoint-folder-safe slug; seeresolveCustomerSecretsNamefor the Secrets-Manager-safe equivalent" - prevents future confusion cheaply.
Out of scope, worth tracking
- The companion fix in
spacecat-shared(resolveCustomerSecretsName) is correctly excluded here. The migration concern in finding 2 applies on the secrets side too, since secret keys move with the same baseURL change. composeBaseURLin@adobe/spacecat-shared-utilsaccepts and silently transforms invalid inputs. After this PR,generateDataFolderconsumes more of the URL than before, which makes that upstream validation gate more material. A separate hardening PR adding strict validation inonboardCustomer(src/controllers/llmo/llmo.js:946-984) would benefit every caller, not just this one - but that file is outside this PR's diff.- A shared
siteIdentityComponents(baseURL) -> { host, pathSuffix }helper inspacecat-shared-utils, consumed by bothgenerateDataFolderandresolveCustomerSecretsName, would let identity policy evolve in one place. Not for this PR.
Assessment
Ready to merge? With fixes.
Reasoning: the fix is correct for the targeted reproduction and the blast radius is right, but the new separator scheme reproduces the same class of cross-tenant collision the PR is trying to fix (different shape, same consequence), and the offboarding fallback at line 1481 silently corrupts data deletion for any legacy subpath site. Both are fixable in this repo with small diffs. The migration question can be answered with a single DB query against the Site table - if the answer is "no subpath sites in production today," critical finding 2 becomes a tracked follow-up rather than a blocker.
Next Steps
- Resolve the two Critical items: tighten the separator (separator-ambiguity item) and confirm/persist the migration story for legacy subpath sites.
- Update the JSDoc and add the subpath trailing-slash test.
- The Minor edge-case items can land here or in a small follow-up.
| * | ||
| * Examples: | ||
| * https://nba.com -> nba-com | ||
| * https://nba.com/kings -> nba-com--kings |
There was a problem hiding this comment.
Critical 1 - Separator ambiguity allows distinct sites to land in the same folder
- is used both as the inner-character replacement AND as the host/path separator. Result is non-injective:
| baseURL A | baseURL B | both resolve to |
|---|---|---|
https://nba.com/us/kings |
https://nba.com/us-kings |
nba-com-us-kings |
https://nba.com/a/b |
https://nba.com/a-b |
nba-com-a-b |
https://x.com//double |
https://x.com/-double |
x-com--double |
validateSiteNotOnboarded only checks whether the computed folder already exists, so two tenants with colliding URLs either get an existence-leak error attributed to another tenant or are silently mapped to the same SharePoint folder (cross-tenant data exposure - same bug class this PR is meant to fix).
Smallest fix: keep - for inner replacement, switch the host/path join to _:
const dataFolderName = pathSuffix ? `${host}_${pathSuffix}` : host;Root-domain folders stay unchanged. Add an injectivity test: nba.com/us/kings should not equal nba.com/us-kings.
Critical 2 - Migration impact for already-onboarded subpath customers
The "backward compatible" claim holds only for sites with no path. Any customer currently onboarded with a non-root baseURL has data under nba-com (host only). After this ships, every code path that re-derives the folder maps the same site to nba-com-kings:
performLlmoOffboarding(line 1478-1481) usesllmoConfig?.dataFolder ?? generateDataFolder(baseURL, env.ENV). If a legacy site's persisteddataFolderis missing, offboarding computes the new name and callsdeleteSharePointFolderon a folder that does not exist - leaving real customer data undeleted (GDPR concern) while silently completing offboarding.validateSiteNotOnboardedfor re-onboard finds no folder under the new name, passes validation, and creates an empty one. Customer's prior dashboards live under the old folder; new writes go to the new one. Split/partial data, no error.appendRowsToQueryIndexandenqueueLlmoOnboardingPublishresolve to the new name and produce broken cross-references.
Pick one path:
- Run a DB query against
Sitefor any record with a non-/path inbaseURL. If zero exist, document that on the PR and file a follow-up to backfillllmoConfig.dataFolder. - If subpath sites do exist, persist their existing
dataFolderinto their LLMO config before deploy so the offboarding fallback hits the cached value.
Minor - Empty path segments, double slashes, percent-encoded chars
nba.com//foo->nba-com--foo(double-dash;^\/|\/$strips only one slash)nba.com/.well-known->nba-com--well-knownnba.com/k%C3%B6nig->nba-com-k-c3-b6nig(URL.pathname does not decode; encoded vs decoded vs typed-equivalent all produce different names)
Cheap fixes: collapse runs of - (.replace(/-+/g, '-')), trim leading/trailing - from the path segment, optionally decodeURIComponent the path before sanitizing.
| * The hostname and (if present) each URL path segment are individually sanitized | ||
| * (non-alphanumeric replaced with `-`, lowercased) and joined with `--` as a | ||
| * path-segment delimiter. The double-hyphen delimiter cannot appear in a sanitized | ||
| * segment, so distinct paths cannot collide. |
There was a problem hiding this comment.
Important - JSDoc (lines 262-267) is stale and now actively misleading
The pre-existing block reads @param {string} domain - The domain name, but the actual parameter is baseURL and after this PR the function meaningfully consumes the URL path, not just the hostname. Suggested:
/**
* Generates the SharePoint data folder name from a site's base URL.
* Includes the URL path (if any) so subpath sites on the same hostname
* map to distinct folders. Root-domain sites produce the same folder
* name as before this change.
* @param {string} baseURL - The site's base URL (must be a fully-qualified URL).
* @param {string} env - The environment ('prod' or anything else, treated as dev).
* @returns {string} The data folder name (prefixed with 'dev/' for non-prod).
*/A @since <version> note recording the format change would also help the next person debugging "why is this customer's folder different."
| it('should produce the same folder name for root domain with or without trailing slash', async () => { | ||
| const { generateDataFolder } = await esmock('../../../src/controllers/llmo/llmo-onboarding.js', {}); | ||
|
|
||
| expect(generateDataFolder('https://nba.com', 'prod')).to.equal(generateDataFolder('https://nba.com/', 'prod')); |
There was a problem hiding this comment.
Important - Missing subpath trailing-slash equivalence test
The trailing-slash idempotence test only covers root URLs (this assertion). The same property holds for subpaths and is not asserted, even though the implementation depends on replace(/^\/|\/$/g, ''). A future refactor (e.g. swapping to pathname.split('/').filter(Boolean).join('-')) could silently change behavior for subpath URLs and create two folders for the same site. Trailing slashes on subpath URLs are very common in onboarding inputs (Slack pastes, CMS configs).
expect(generateDataFolder('https://nba.com/kings/', 'prod'))
.to.equal(generateDataFolder('https://nba.com/kings', 'prod'));…lding intent - decodeURIComponent each path segment before sanitizing so %20 and equivalent decoded forms map to the same folder name - Collapse consecutive - runs after sanitizing - Document deliberate case-folding of path segments and updated param names in JSDoc - Add tests for percent-encoded paths, uppercase paths, and double slashes Addresses remaining review feedback on LLMO-4186 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
Thanks for the thorough review @dzehnder. Addressing each item: Critical #1 (separator ambiguity): Already fixed in a prior commit — Critical #2 (migration): No production Site currently has a non-empty path in Important #1 (JSDoc): Updated with full description, case-folding note, and correct parameter name. Important #2 (subpath trailing-slash test): Added. Minor #1 (percent-encoding + run collapse): Added All 13 |
…aFolder The WHATWG URL parser always normalizes percent-encoding in pathnames, so decodeURIComponent on url.pathname segments can never throw. Removing the dead try/catch restores 100% branch coverage required by CI.
…x/subpath-data-folder
dzehnder
left a comment
There was a problem hiding this comment.
Hey @alinarublea,
Thanks for the iteration - the collapse-runs change closes the path-segment collision class for the documented test cases, the JSDoc rewrite is solid, and the percent-encoding/case-folding/double-slash tests cover most of the previously-flagged edge cases. Two issues surfaced when we re-ran the function against adversarial inputs: a regression introduced by commit f96f1f4 (the decodeURIComponent catch removal), and the same host-vs-path asymmetry we found on the companion PR.
Strengths
Previously flagged issues now addressed:
- Critical 1 (separator ambiguity) is closed for the documented test cases. The
replace(/-+/g, '-')collapse-runs step plus the--join means single-segment paths cannot synthesise a--delimiter.nba.com/us/kings->nba-com--us--kingsis now distinct from any single-segment path. - Critical 2 (offboarding fallback) materially mitigated. Verified at
src/controllers/llmo/llmo-onboarding.js:1304thatsiteConfig.updateLlmoDataFolder(dataFolder.trim())runs inperformLlmoOnboardingbeforesite.save(), so any subpath site onboarded after this PR ships has itsdataFolderpersisted; the fallback at line 1481 is dead code for new sites. Author's "no production Site has a non-empty path in baseURL today" claim is the load-bearing assumption and is reasonable given subpath onboarding is new with LLMO-4186. - Important 1 (JSDoc) is fully rewritten with format description, examples, parameter rename to
baseURL, and the case-folding rationale. Accurate against the implementation for paths. - Important 2 (subpath trailing-slash test) added at
test/controllers/llmo/llmo-onboarding.test.js:400-404. - Minor (edge-case coverage): percent-encoded path test, case-folding test, double-slash test all added.
Issues
Important (Should Fix)
-
URIErrorregression on malformed percent-encoded baseURLs (src/controllers/llmo/llmo-onboarding.js:285)Commit f96f1f4 ("remove unreachable decodeURIComponent catch") removed the
try/catchon the premise that "the WHATWG URL parser always normalizes percent-encoding in pathnames, sodecodeURIComponentonurl.pathnamesegments can never throw." That premise is incorrect. The URL parser preserves percent sequences but does not validate that they are well-formed. Verified by running the function:generateDataFolder('https://a.com/%FF') -> URIError: URI malformed generateDataFolder('https://a.com/%E0%A4%A') -> URIError: URI malformed generateDataFolder('https://a.com/%') -> URIError: URI malformedWhy it matters:
baseURLis user-supplied (Slack onboarding command, HTTP onboarding API). A copy-paste with truncated or non-UTF-8 percent encoding now crashes the onboarding flow at line 1226 (and offboarding at line 1481 if persisteddataFolderis missing) with a 500 instead of producing a sanitized folder name. This is a regression from the prior commit which had a working fallback, and contradicts the author's own commitment in the PR comment ("Added decodeURIComponent per segment with safe fallback").Fix: restore the per-segment
try/catch:.map((seg) => { let decoded; try { decoded = decodeURIComponent(seg); } catch { decoded = seg; } return decoded.replace(/[^a-zA-Z0-9]/g, '-').toLowerCase().replace(/-+/g, '-'); });
Add a regression test for
'https://a.com/%FF'. The "100% branch coverage" justification in the commit message is a CI concern that can be solved with/* c8 ignore next */if the catch resists deterministic testing, but a deterministic test is straightforward. -
Hostname regex was not collapse-runs hardened; host-vs-path collision class remains (
src/controllers/llmo/llmo-onboarding.js:283)The path regex now does collapse-runs via the chained
.replace(/[^a-zA-Z0-9]/g, '-').toLowerCase().replace(/-+/g, '-'). The hostname regex on line 283 was not updated to do the same:const host = url.hostname.replace(/[^a-zA-Z0-9]/g, '-').toLowerCase(); // no collapse step
So consecutive non-alphanumeric characters in a hostname produce
--runs, the same delimiter used to join host with path segments. Verified:https://nba--com/ -> nba--com https://nba/com -> nba--com COLLIDE https://a..b/ -> a--b https://a/b -> a--b COLLIDEThe JSDoc claim at line 269 ("the double-hyphen delimiter cannot appear in a sanitized segment, so distinct paths cannot collide") therefore overstates what the code delivers. Severity bounded by (a) admin-only onboarding and (b) DNS will not resolve hostnames with consecutive empty labels, so this stays Important rather than Critical.
Same fix as for the companion PR: add a collapse-runs step to the host:
const host = url.hostname.replace(/[^a-zA-Z0-9]/g, '-').toLowerCase().replace(/-+/g, '-');
And a regression test:
generateDataFolder('https://nba--com/', 'prod') !== generateDataFolder('https://nba/com', 'prod').
Minor (Nice to Have)
-
JSDoc claim is narrower than written (
src/controllers/llmo/llmo-onboarding.js:269-270)Once the hostname is also collapse-runs hardened, the claim becomes accurate. If you choose to scope rather than tighten, change to: "The double-hyphen delimiter cannot appear in a sanitized path segment, so distinct paths on the same host cannot collide."
-
Em-dash in JSDoc (
src/controllers/llmo/llmo-onboarding.js:267)Line 267 contains a literal U+2014 em-dash byte sequence. Project house style is no em-dashes. Replace with
-or restructure the parenthetical. -
Adversarial-collision tests are weaker than companion PR 1577 (
test/controllers/llmo/llmo-onboarding.test.js:413-418)PR 1577 explicitly pins
us..kings != us/kingsANDus__kings != us/kings. PR 2315 only pinsus-kings != us/kings. The implementation is correct (the run-collapse handles all three), but a regression that, say, removed the collapse would be caught by the companion PR and missed here for the..and__variants. Addus..kingsandus__kingsto the not-equal block to match the companion's coverage. -
JSDoc does not mention the run-collapse step (
src/controllers/llmo/llmo-onboarding.js:266-268)The doc lists "percent-decoded and individually sanitized (non-alphanumeric replaced with
-, lowercased)" but omits the.replace(/-+/g, '-')collapse, which is the load-bearing piece for separator-collision avoidance. Add "consecutive-collapsed" to the parenthetical.
Recommendations
- The chained
.replace(/[^a-zA-Z0-9]/g, '-').toLowerCase().replace(/-+/g, '-')is functionally equivalent to the single-pass.replace(/[^a-zA-Z0-9]+/g, '-').toLowerCase()used in PR 1577. Optional cleanup, not blocking. If you do consolidate, make sure to apply it to the host as well. - Worth tracking a follow-up: a shared
siteIdentityComponents(baseURL)helper inspacecat-shared-utils(consumed by both this function andresolveCustomerSecretsName) so the algorithm only needs to be hardened in one place. Out of scope here.
Assessment
Ready to merge? With fixes.
Reasoning: The PR is a real security improvement and closes the originally reported separator-ambiguity bug for paths. The blocker is the URIError regression introduced by f96f1f4 - that crashes onboarding on a class of user input that previously worked, and contradicts the author's stated commitment. The host-vs-path asymmetry is a smaller fix but should land in the same commit since it mirrors the issue on the companion PR. After both, the JSDoc claim will be accurate and the implementation will be defensively correct.
Next Steps
- Restore the per-segment
try/catcharounddecodeURIComponent. Add a malformed-percent-encoding test. - Add the collapse-runs step to the hostname regex (and a host-vs-path regression test).
- The Minor items (JSDoc precision, em-dash, adversarial test parity) are cheap to land in the same commit.
| const url = new URL(baseURL); | ||
| const host = url.hostname.replace(/[^a-zA-Z0-9]/g, '-').toLowerCase(); | ||
| const segments = url.pathname.split('/').filter(Boolean) | ||
| .map((seg) => decodeURIComponent(seg).replace(/[^a-zA-Z0-9]/g, '-').toLowerCase().replace(/-+/g, '-')); |
There was a problem hiding this comment.
Important - URIError regression on malformed percent-encoded baseURLs
Commit f96f1f4 ("remove unreachable decodeURIComponent catch") removed the try/catch on the premise that the URL parser always normalizes percent-encoding in pathnames. That premise is incorrect: the WHATWG URL parser preserves percent sequences but does not validate that they're well-formed. Verified by running the function:
generateDataFolder('https://a.com/%FF') -> URIError: URI malformed
generateDataFolder('https://a.com/%E0%A4%A') -> URIError: URI malformed
generateDataFolder('https://a.com/%') -> URIError: URI malformed
baseURL is user-supplied (Slack onboarding, HTTP onboarding API). A truncated or non-UTF-8 percent-encoded copy-paste now crashes onboarding with a 500 instead of producing a sanitized folder name. This is a regression vs the prior commit, and contradicts the author's PR comment ("Added decodeURIComponent per segment with safe fallback").
Fix: restore the per-segment try/catch:
.map((seg) => {
let decoded;
try { decoded = decodeURIComponent(seg); } catch { decoded = seg; }
return decoded.replace(/[^a-zA-Z0-9]/g, '-').toLowerCase().replace(/-+/g, '-');
});Add a regression test: generateDataFolder('https://a.com/%FF', 'prod') does not throw. The 100% branch coverage justification in the commit message can be addressed with /* c8 ignore next */ if the catch resists deterministic testing, but a deterministic test is straightforward.
| const { hostname } = new URL(baseURL); | ||
| const dataFolderName = hostname.replace(/[^a-zA-Z0-9]/g, '-').toLowerCase(); | ||
| const url = new URL(baseURL); | ||
| const host = url.hostname.replace(/[^a-zA-Z0-9]/g, '-').toLowerCase(); |
There was a problem hiding this comment.
Important - Hostname regex not collapse-runs hardened; host-vs-path collision class remains
The path regex now collapses runs (chained replace(/-+/g, '-') on line 285), but the hostname regex on this line was not updated. Consecutive non-alphanumeric characters in a hostname produce -- runs, the same delimiter used to join host with path segments. Verified:
https://nba--com/ -> nba--com
https://nba/com -> nba--com COLLIDE
https://a..b/ -> a--b
https://a/b -> a--b COLLIDE
The JSDoc claim at line 269 ("the double-hyphen delimiter cannot appear in a sanitized segment, so distinct paths cannot collide") overstates what the code delivers. Severity bounded by admin-only onboarding and the fact that DNS won't resolve hostnames with empty labels.
Fix: add a collapse-runs step to the host:
const host = url.hostname.replace(/[^a-zA-Z0-9]/g, '-').toLowerCase().replace(/-+/g, '-');Add a regression test: generateDataFolder('https://nba--com/', 'prod') !== generateDataFolder('https://nba/com', 'prod').
Same issue exists on the companion PR adobe/spacecat-shared#1577 - flagged there too so both functions stay aligned.
Restore the per-segment try/catch around decodeURIComponent: the WHATWG URL parser preserves syntactically valid percent sequences like %FF without validating UTF-8 correctness, so decodeURIComponent can still throw on non-UTF-8 encoded paths supplied by users. Removing it caused a regression. Also add run-collapse to the hostname regex so consecutive non-alphanumeric characters (e.g. nba--com) don't produce a -- delimiter that collides with the host-to-path separator used to join path segments. Add regression tests for both: malformed percent-encoding does not throw, and hostnames with consecutive non-alnum chars don't collide with subpaths. Add adversarial separator-parity tests to match companion PR 1577. Fix em-dash (U+2014) in JSDoc and note run-collapse in the description.
…x/subpath-data-folder
- Align regex to + quantifier (consistent with companion resolveCustomerSecretsName) - Extract inline sanitize() with leading/trailing - trim and filter(Boolean) on segments, preventing degenerate all-hyphen segments from causing delimiter inflation or ambiguous folder names - Update JSDoc: scope the injectivity claim honestly, document env semantics - Fix percent-encoded segment test to use property assertion - Hoist esmock import to a single before() for the entire describe block - Add tests: port equality, query string/fragment ignored, degenerate segments dropped, throws on invalid URL
…x/subpath-data-folder
…erSecretsName Guard against empty-hostname URLs (e.g. file:///) which would produce a folder name starting with '--'. Adds a corresponding test. Consistent with the http(s)/hostname validation added to resolveCustomerSecretsName.
…x/subpath-data-folder
dzehnder
left a comment
There was a problem hiding this comment.
Hey @alinarublea,
Thanks for the iteration. The URIError regression is fixed, host/path sanitize is now symmetric, and the JSDoc rewrite reflects the actual algorithm. Two Important items came up: the new .filter(Boolean) step introduces an empty-segment-drop collision class that the JSDoc claims doesn't exist, and the same NFD/NFC unicode-normalization gap that the companion PR has. Both are localized one-line fixes plus a regression test each. Two Minor JSDoc framing items round it out.
Strengths
Previously flagged issues now addressed:
- URIError regression on malformed percent-encoded baseURLs (prior Important 1) is fixed. The restored
try { decoded = decodeURIComponent(seg); } catch { /* keep raw */ }block atsrc/controllers/llmo/llmo-onboarding.js:294-300mutates alet decodedand falls through tosanitize, so%FFno longer crashes. Theshould handle malformed percent-encoded path segments without throwingtest (line 380) pins the contract. - Host-vs-path collision (prior Important 2) closed. The
sanitizehelper at line 291 (replace(/[^a-zA-Z0-9]+/g, '-').replace(/^-+|-+$/g, '').toLowerCase()) is applied to BOTH the host and each path segment. Theshould not collide hostname with consecutive non-alnum chars and a subpathtest (line 375) pinsnba--com/!==nba/com. - JSDoc rewrite (prior Minor 1, 4) documents the run-collapse step, the
--delimiter invariant, the lossy-punctuation limitation, and the dev/dev/prefix mapping. Em-dashes (prior Minor 2) are absent from the new generateDataFolder JSDoc - verified by byte scan. - Adversarial collision tests (prior Minor 3) are now stronger than companion PR 1577:
nba.com/us/kingsis asserted distinct fromus-kings,us..kings, ANDus--kings. - Path-traversal via dataFolder is not feasible. Sanitize output alphabet is
[a-z0-9-]plus literaldev/prefix and--delimiter. No/,., null, or control chars survive sanitize. URL.pathname resolves literal..segments at parse time. Verified dataFolder lands in/sites/elmo-ui-data/${dataFolder}/(validateSiteNotOnboarded:404, copyFilesToSharepoint:731, deleteSharePointFolder:1029) - all safe given the sanitized alphabet. - Cross-tenant data-read class is bounded. validateSiteNotOnboarded (line 396, called from llmo.js:1024) checks the folder before SharePoint folder creation using the same generateDataFolder, so two tenants cannot simultaneously hold the same folder via this onboarding path.
- Test refactor is in-scope and clean. The
before(async () => { ({ generateDataFolder } = await esmock(...)) })consolidation at test/controllers/llmo/llmo-onboarding.test.js:314 removes ~17 redundant per-test esmock calls. Tests assert public-contract properties (equal/not.equal/match), not internal regex shape - robust to future sanitize refactors. All 13 prior tests are carried over with identical inputs/expectations; nothing was inadvertently dropped. - Backward-compat verified for known callers (llmo.js:1019 onboarding, onboard-llmo-modal.js Slack flow, offboarding fallback at llmo-onboarding.js:1497). All compose dataFolder into URL paths where the sanitized output is URL-safe.
Issues
Important (Should Fix)
-
.filter(Boolean)introduces a new collision class that the JSDoc claims does not exist (src/controllers/llmo/llmo-onboarding.js:301).The new commit added
.filter(Boolean)after the segment sanitize, which silently drops any segment that reduces to empty. As a result, URLs that differ in path structure produce the same folder:https://nba.com/-/kings->nba-com--kings==https://nba.com/kingshttps://nba.com/us/-/kings->nba-com--us--kings==https://nba.com/us/kingshttps://nba.com/.well-known/foo->nba-com--well-known--foo(the.well-knowndecodes; degenerate cases below)https://nba.com/---->nba-com==https://nba.com(host-only)
This is a NEW collision class introduced in this round. The prior commit (
c73276e) preserved bare-hyphen segments (/-/foo->host--foo---foo), which did NOT collide. The JSDoc at lines 272-274 still claims "URLs that differ in path structure produce distinct folder names" - this is now provably wrong.Concretely:
- The collision is bounded by
validateSiteNotOnboarded(existing-folder gate fires before SharePoint folder creation), so it is NOT a cross-tenant data-read class. - It IS a squatting / onboarding-DoS vector: an attacker (or an admin onboarding via the Slack flow with admin gating) who claims
https://victim.com/foo/%2e%2e/barfirst registers the foldervictim-com--foo--bar. The legitimatehttps://victim.com/foo/baronboarding is then rejected at line 419 and pages #llmo-alerts. Recovery requires manual intervention.
Pick one:
- Reject rather than drop empty-after-sanitize segments: replace
.filter(Boolean)withif (segments.some((s) => !s)) throw new Error('baseURL contains a path segment that sanitizes to empty'). Eliminates the class entirely; rejects a tiny number of degenerate baseURLs that probably shouldn't onboard anyway. - Or accept the collision and weaken the JSDoc to symmetric treatment alongside the punctuation-only class ("path segments differing only in punctuation, OR a path segment that sanitizes to empty, are treated as identical").
Add an adversarial test pinning whichever behavior you choose. The current
should drop degenerate path segments that sanitize to emptytest only pins one input shape; broaden it (/-/kings == /---/kings,/us/-/kings == /us/kings) so a future filter-rule change is caught. -
NFD vs NFC unicode forms produce different folders (
src/controllers/llmo/llmo-onboarding.js:291, thesanitizehelper).The new percent-encoded test uses NFC for both literal and encoded forms, so it passes. NFD is not exercised.
decodeURIComponentdoes no unicode normalization, so visually-identical inputs in different forms produce different folders:https://nba.com/könig(NFCö= U+00F6) ->nba-com--k-nighttps://nba.com/könig(NFDo+ U+0308 combining diaeresis) ->nba-com--ko-nig
This is the same class of silent silent-divergence the PR is closing in the opposite direction (different folders for the same logical tenant). Realistic when authors paste tenant subpaths from different sources (CMS, Word, macOS Finder which uses NFD on HFS+, iOS clipboard). Same finding flagged on companion PR 1577 - keeping the two functions consistent matters because tenants exist in both code paths.
One-line fix:
decoded = decoded.normalize('NFC')(or fold intosanitize). Plus one regression test pinningNFC === NFDequivalence.
Minor (Nice to Have)
-
decodeURIComponentcatch-path adds a narrower collision class (src/controllers/llmo/llmo-onboarding.js:297). WhendecodeURIComponent(seg)throws anddecodedfalls through to the raw segment (e.g.%FF),sanitize('%FF')strips the%as non-alphanumeric and returnsff. Sohttps://a.com/%FFcollides withhttps://a.com/ff, andhttps://a.com/foo%80barcollides withhttps://a.com/foo-bar. This is the deliberate trade-off for the prior URIError-DoS fix (collision over throw). Same squatting/DoS vector as the empty-segment class above, narrower input shape. Worth a one-line JSDoc addition acknowledging this, parallel to the existing punctuation note. -
JSDoc framing: "inherent limitation of lossy sanitization" understates the design choice (
src/controllers/llmo/llmo-onboarding.js:274-277). The collision class isn't inherent - it's a deliberate readable-slug-over-injectivity choice. Aslug + short content-hashscheme (e.g.nba-com--us-kings--ab12cd34where the hash is over the canonical URL) would preserve injectivity while keeping the readable prefix. Same observation flagged on companion PR 1577. Worth one-line wording: "limitation of the chosen readable-slug design (a hash-suffixed identity would avoid this at the cost of readability)." No behavior change requested for this PR; just keeping the architectural alternative visible to future readers. -
Add
@throws {TypeError}to JSDoc (src/controllers/llmo/llmo-onboarding.js:280-285). The new testshould throw on a malformed base URLpins thatgenerateDataFolder('not a url', 'prod')throws TypeError, but the JSDoc only says "must be a fully-qualified URL" without naming the exception. Minor JSDoc-completeness gap.
Recommendations
- For the empty-segment-drop class: option (a) "reject" is structurally cleaner and removes the class entirely. Option (b) "accept and document" is fine if multi-tenant subpath onboarding is gated behind admin Slack flow with low attacker reachability. The LLMO-4186 threat model can guide that choice.
- For NFD/NFC: prefer normalize over document. One line, one test, one less silent-failure class.
- IDN/punycode hostnames (informational, NOT a blocker): WHATWG
URL.hostnamereturns ACE form (xn--…), and run-collapse on host turnsxn--wgv71a.example.comintoxn-wgv71a-example-com, which collides with a hypotheticalxn-wgv71a.example.com. Same squatting class as above, very narrow attacker reachability. Acceptable as-is given the round-1 host-vs-path collision fix required this collapse.
Out of scope, worth tracking
- Companion PR
adobe/spacecat-shared#1577(the SECRETS-side equivalent) carries an identical sanitize shape with_/__separators. The two PRs intentionally use different separators so tenants who exist in both code paths produce orthogonal folder names - by design. A sharedsiteIdentityComponents(baseURL)helper would let the algorithm evolve in one place; out of scope here. - TOCTOU race between
validateSiteNotOnboarded'sfolder.exists()andcopyFilesToSharepoint'sfolder.createFolder()exists at HEADf8b06692, exists atc73276e, predates this PR. Not introduced by this change; flag for a separate hardening PR if it ever bites. - The
llmo-referral-traffic.jsandllmo-referral-traffic.test.jsdiffs in this PR's incremental came in via merge from main (PR 2319, LLMO-4261). NOT introduced by this author, NOT in scope.
Assessment
Ready to merge? With fixes.
Reasoning: the cross-tenant data-read class - the original Critical that this PR was opened to fix - is correctly closed by the host+segment sanitize, the -- delimiter, and the validateSiteNotOnboarded gate. The two Important findings are localized: a 1-3 line code change for empty-segment behavior plus broader test coverage, and a one-line .normalize('NFC') plus one regression test. The Minor items are JSDoc tweaks. After they land, this is approve as-is; the architectural and security posture for the LLMO-4186 onboarding population is sound.
Next Steps
- Pick the empty-segment policy (reject or accept-and-document) and broaden the test.
- Add
.normalize('NFC')tosanitizeand a regression test. - The Minor items (catch-path JSDoc note, "inherent" framing tweak,
@throws {TypeError}) can land in the same commit.
| let decoded = seg; | ||
| try { | ||
| decoded = decodeURIComponent(seg); | ||
| } catch { /* keep raw on percent-encoded sequences that are not valid UTF-8 */ } |
There was a problem hiding this comment.
Important: this .filter(Boolean) introduces a new collision class that the JSDoc (lines 272-274 "URLs that differ in path structure produce distinct folder names") claims does not exist:
https://nba.com/-/kings->nba-com--kings==https://nba.com/kingshttps://nba.com/us/-/kings==https://nba.com/us/kingshttps://nba.com/---->nba-com==https://nba.com
New this round (the prior commit c73276e preserved bare-hyphen segments). Bounded by validateSiteNotOnboarded so it's not cross-tenant data-read, but it IS a squatting / onboarding-DoS vector: an attacker who claims https://victim.com/foo/%2e%2e/bar first locks the legitimate https://victim.com/foo/bar out at line 419.
Pick: replace .filter(Boolean) with if (segments.some((s) => !s)) throw new Error('baseURL contains a path segment that sanitizes to empty') (eliminates the class), OR keep the drop and weaken the JSDoc to symmetric treatment alongside the punctuation-only class. Either way, broaden the existing drop degenerate path segments test to pin multi-position cases (/us/-/kings == /us/kings, /-/kings == /---/kings).
| const { hostname } = new URL(baseURL); | ||
| const dataFolderName = hostname.replace(/[^a-zA-Z0-9]/g, '-').toLowerCase(); | ||
| const url = new URL(baseURL); | ||
| if (!url.hostname) { |
There was a problem hiding this comment.
Important: NFD vs NFC unicode forms produce different folders for visually identical URLs. decodeURIComponent doesn't normalize:
https://nba.com/könig(NFCö= U+00F6) ->nba-com--k-nighttps://nba.com/könig(NFDo+ U+0308) ->nba-com--ko-nig
Same class of silent silent-divergence the PR is closing in the opposite direction (different folders for the same logical tenant). Realistic when authors paste subpaths from different sources (CMS, Word, macOS Finder, iOS clipboard).
One-line fix: decoded = decoded.normalize('NFC') before sanitize, plus one regression test pinning NFC === NFD. Same finding raised on companion PR 1577.
| const sanitize = (s) => s.replace(/[^a-zA-Z0-9]+/g, '-').replace(/^-+|-+$/g, '').toLowerCase(); | ||
| const host = sanitize(url.hostname); | ||
| const segments = url.pathname.split('/').filter(Boolean) | ||
| .map((seg) => { |
There was a problem hiding this comment.
Minor: when decodeURIComponent(seg) throws and decoded falls through to the raw segment, sanitize('%FF') strips the % as non-alphanumeric and returns ff. So https://a.com/%FF collides with https://a.com/ff, and https://a.com/foo%80bar collides with https://a.com/foo-bar. Deliberate trade-off for the prior URIError-DoS fix, but worth a one-line JSDoc addition parallel to the existing punctuation-collision note: "Path segments containing invalid percent-encoding sequences are sanitized as their raw bytes and may collide with the stripped form."
danieljchuser
left a comment
There was a problem hiding this comment.
Hey @alinarublea,
Thanks for the continued iteration. The core fix is solid - the -- delimiter, run-collapse sanitizer, and validateSiteNotOnboarded gate close the original cross-tenant collision class. The test suite is thorough and the JSDoc rewrite documents the sanitization algorithm honestly. Three Important items came up (one new, two carried forward) and two Minor items - none are blockers but all are worth addressing.
Strengths
- Path-traversal is eliminated by construction: the sanitize output alphabet is
[a-z0-9-]plus the staticdev/prefix and--delimiter. No.,..,/, NUL, or SharePoint path separators can survive. Verified across callers (validateSiteNotOnboarded,copyFilesToSharepoint,deleteSharePointFolder). - The
try/catcharounddecodeURIComponentatsrc/controllers/llmo/llmo-onboarding.js:295-299correctly defends againstURIErrorfrom invalid UTF-8 sequences without crashing onboarding. - The hostname guard at
src/controllers/llmo/llmo-onboarding.js:283-285rejectsdata:,javascript:,file:and other opaque/no-host schemes early. Fail-closed is the right posture for a tenant-identity primitive. - Test suite covers 20+ scenarios including trailing slashes, double slashes, port stripping, case-folding, percent normalization, query/fragment stripping, and degenerate segments. The
before()consolidation removes redundant per-testesmockcalls cleanly. - Previously flagged issues now addressed: separator ambiguity (Critical, round 1), URIError regression (Important, round 3), hostname regex hardening (Important, round 3), stale JSDoc (round 1).
Issues
Important (Should Fix)
-
Backward-compat regression for IDN/punycode hostnames -
src/controllers/llmo/llmo-onboarding.js:294The
+quantifier insanitizecollapses consecutive non-alphanumeric characters in the hostname into a single-. Punycode ACE labels start withxn--, so any IDN domain gets a different folder name than the old code produced:- Old:
xn--fiq228c.com->xn--fiq228c-com - New:
xn--fiq228c.com->xn-fiq228c-com
Prior rounds confirmed "no production subpath sites" but did not confirm "no IDN hostnames." If any onboarded site has a punycode hostname, its historical SharePoint data is orphaned under the old folder name.
Fix: either confirm no IDN hostnames exist in production (document on the PR), or apply the run-collapse only to path segments while keeping per-character replacement for the hostname. Add a regression test asserting
xn--fiq228c.compreserves the--in the folder name. - Old:
-
JSDoc injectivity claim contradicts tested behavior -
src/controllers/llmo/llmo-onboarding.js:272-274The doc says "URLs that differ in path structure produce distinct folder names." The test "should drop degenerate path segments that sanitize to empty" asserts that
/-/kingsand/kingsproduce the same folder. The.filter(Boolean)at line 304 drops segments that sanitize to empty, so two distinct path structures yield identical output.Fix: weaken the JSDoc to acknowledge the empty-segment drop: "Path segments that sanitize to empty (e.g.
/-/) are dropped, so/-/kingsand/kingsproduce the same folder." -
NFC vs NFD unicode normalization gap -
src/controllers/llmo/llmo-onboarding.js:300decodeURIComponentreturns whatever bytes the URL carried. NFC precomposed characters sanitize differently than their NFD decomposed equivalents. Two visually identical URLs can produce different folders. Realistic when authors paste subpaths from different sources (CMS, macOS Finder which uses NFD, iOS clipboard).Fix: add
.normalize('NFC')before sanitize:decoded = decodeURIComponent(seg).normalize('NFC');
Minor (Nice to Have)
-
Catch-path fallback undocumented -
src/controllers/llmo/llmo-onboarding.js:297-301. WhendecodeURIComponentthrows, the raw%XXform is kept and sanitized, so%FF->ffcollides with literal/ff. Deliberate trade-off for the URIError-DoS fix. Worth a one-line JSDoc note. -
Missing
@throws {TypeError}in JSDoc -src/controllers/llmo/llmo-onboarding.js:284-287. Two throw paths exist and are tested. Add@throws {TypeError} When baseURL is not a valid URL or has no hostname.
Recommendations
- Consider extracting the
sanitizehelper to a named function so it can be unit-tested directly. With the collision classes now documented, tighter testing ofsanitizeitself would catch regressions earlier. - Companion PR
adobe/spacecat-shared#1577carries near-identical sanitize logic with_/__separators. A sharedsanitizeUrlToSlug(baseURL, { separator })helper would prevent drift and give one place to add NFC normalization.
Assessment
Approved - the cross-tenant collision class is correctly closed and the security posture is sound. The findings above are worth addressing but none are blockers given the bounded blast radius (no production subpath sites, admin-gated onboarding). The IDN hostname question is the most important to verify.
| if (!url.hostname) { | ||
| throw new TypeError('Invalid baseURL: hostname is required'); | ||
| } | ||
| const sanitize = (s) => s.replace(/[^a-zA-Z0-9]+/g, '-').replace(/^-+|-+$/g, '').toLowerCase(); |
There was a problem hiding this comment.
Important: the + quantifier collapses consecutive non-alphanumeric characters, which changes how punycode ACE labels (xn--) are sanitized. Old: xn--fiq228c.com -> xn--fiq228c-com. New: xn--fiq228c.com -> xn-fiq228c-com. If any onboarded site has a punycode hostname, its historical SharePoint data is orphaned. Either confirm no IDN hostnames exist in production, or apply run-collapse only to path segments.
| * dropped. Sanitized parts are joined with `--` as the path-segment delimiter. | ||
| * | ||
| * The `--` delimiter cannot appear inside a sanitized segment (any run of | ||
| * non-alphanumeric characters collapses to a single `-`), so URLs that differ |
There was a problem hiding this comment.
Important: this claim ("URLs that differ in path structure produce distinct folder names") is contradicted by the test which asserts /-/kings and /kings produce the same folder. The .filter(Boolean) at line 304 drops segments that sanitize to empty. Weaken the doc to acknowledge the empty-segment drop, or preserve empty segments as a placeholder.
| .map((seg) => { | ||
| let decoded = seg; | ||
| try { | ||
| decoded = decodeURIComponent(seg); |
There was a problem hiding this comment.
Important: decodeURIComponent returns whatever bytes the URL carried with no Unicode normalization. NFC precomposed characters sanitize differently than their NFD decomposed equivalents. Two visually identical URLs can produce different folders. One-line fix: decoded = decodeURIComponent(seg).normalize('NFC') before sanitize.
Summary
nba.com/kingsandnba.com/lakers) were mapping to the same SharePoint data folder becausegenerateDataFolderused only the hostnameFixes LLMO-4186
Test plan
prodanddevenvironments🤖 Generated with Claude Code