Skip to content

fix(llmo): include subpath in generateDataFolder to prevent SharePoint folder collisions#2315

Merged
alinarublea merged 18 commits intomainfrom
fix/subpath-data-folder
May 6, 2026
Merged

fix(llmo): include subpath in generateDataFolder to prevent SharePoint folder collisions#2315
alinarublea merged 18 commits intomainfrom
fix/subpath-data-folder

Conversation

@alinarublea
Copy link
Copy Markdown
Contributor

Summary

  • Subpath sites on the same domain (e.g. nba.com/kings and nba.com/lakers) were mapping to the same SharePoint data folder because generateDataFolder used only the hostname
  • Now includes the URL path in the folder name, so each subpath site gets a unique folder
  • Backward compatible: root-domain sites (no path) produce the same folder name as before

Fixes LLMO-4186

Test plan

  • New unit tests verify subpath sites produce distinct folder names
  • Existing tests confirm root-domain behavior is unchanged
  • Tests cover both prod and dev environments

🤖 Generated with Claude Code

…t folder collisions

Subpath sites on the same domain (e.g. nba.com/kings, nba.com/lakers) were
mapping to the same SharePoint folder because only the hostname was used.
Now the URL path is included so each subpath site gets a unique folder name.
Backward compatible: root-domain sites produce the same folder as before.

Fixes LLMO-4186

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@codecov
Copy link
Copy Markdown

codecov Bot commented May 4, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

Copy link
Copy Markdown
Contributor Author

@alinarublea alinarublea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code review

Fix is correct and the test coverage is solid. Two observations:

url.hostname vs companion PR's url.host
This PR correctly uses url.hostname (no port). The companion fix in spacecat-shared PR #1577 uses url.host (includes port) for resolveCustomerSecretsName. The inconsistency means a subpath site with an explicit port would get a different secret key but the same folder name. Worth aligning both to url.hostname.

Missing test for nested paths
https://nba.com/us/kingsdev/nba-com-us-kings is the expected behavior but there's no test covering it. One extra assertion in the new test would be enough.

Neither is a blocker.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 4, 2026

This PR will trigger a patch release when merged.

…generateDataFolder

- Split pathname into segments and join with '--' so 'nba.com/us/kings' and
  'nba.com/us-kings' produce distinct folder names
- Use split('/').filter(Boolean) to handle repeated slashes correctly
- Update JSDoc with folder name format and examples
- Add trailing-slash idempotence test for subpaths
- Add separator-collision test

Addresses review feedback on LLMO-4186

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@dzehnder dzehnder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @alinarublea,

Thanks for the focused fix and the upfront self-review. The core change is correct and the tests cover the headline scenarios, but a few things came up around the new identity convention that warrant attention before this lands - mostly because generateDataFolder is the load-bearing identity for a customer's SharePoint data, and a folder collision or a silent rename has user-visible consequences.

Strengths

  • The fix correctly closes the headline collision: distinct subpath sites on the same hostname (nba.com/kings vs nba.com/lakers) now resolve to distinct folders (src/controllers/llmo/llmo-onboarding.js:272-273).
  • Backward compatibility for root-domain new onboardings is preserved by the pathSuffix ? ... : host ternary - root sites land in the same folder names as before. Importantly, composeBaseURL (the upstream caller) lower-cases the input, so pathSuffix.toLowerCase() is belt-and-suspenders but harmless.
  • The blast radius is right: existing onboarded sites read the persisted dataFolder from Site config, so this PR is largely a no-op for them. The fix only changes behavior on the paths that re-derive the folder from baseURL.
  • New tests at test/controllers/llmo/llmo-onboarding.test.js:385-405 cover subpath uniqueness, root trailing-slash equivalence, and nested subpaths in both prod and dev. Both points raised in the self-review (hostname/host alignment, nested-path test) are resolved before review was requested.

Issues

Critical (Must Fix)

  1. Separator ambiguity allows two distinct sites to land in the same SharePoint folder (src/controllers/llmo/llmo-onboarding.js:273)

    - is used both as the inner-character replacement AND as the host/path separator. The result is non-injective:

    baseURL A baseURL B both resolve to
    https://nba.com/us/kings https://nba.com/us-kings nba-com-us-kings
    https://nba.com/a/b https://nba.com/a-b nba-com-a-b
    https://x.com//double https://x.com/-double x-com--double

    validateSiteNotOnboarded only checks whether the computed folder already exists in SharePoint, so the second tenant onboarding a colliding URL either receives an "already onboarded" error attributed to another tenant (existence leak), or - if onboarded via the Slack path - is silently mapped onto the prior tenant's folder. SharePoint inheritance then exposes one customer's documents to another. This is the same bug class this PR set out to fix, just at a different boundary: two paying customers in one folder is a confidentiality breach.

    Pick a separator that cannot appear inside either segment - the smallest change is to keep - for inner replacement and switch the host/path join to _ (valid in SharePoint folder names):

    const dataFolderName = pathSuffix ? `${host}_${pathSuffix}` : host;

    Root-domain folders stay unchanged (nba-com). Add an injectivity test: nba.com/us/kings should not equal nba.com/us-kings.

  2. Migration impact for any already-onboarded subpath customer is unaddressed and breaks offboarding (src/controllers/llmo/llmo-onboarding.js:273, with downstream effect at src/controllers/llmo/llmo-onboarding.js:1478-1481)

    The PR description says "Backward compatible," but that holds only for sites with no path. Any customer currently onboarded with a non-root baseURL has their data under nba-com (host only). After this ships, every code path that re-derives the folder maps the same site to nba-com-kings. Specifically:

    • performLlmoOffboarding uses llmoConfig?.dataFolder ?? generateDataFolder(baseURL, env.ENV). If a legacy site's persisted dataFolder is missing, offboarding now computes the new name and calls deleteSharePointFolder on a folder that does not exist - leaving real customer data undeleted (compliance / GDPR concern) and silently completing offboarding while the data persists.
    • validateSiteNotOnboarded for any subsequent re-onboard would not find the new folder (still under the old name), pass validation, and create a fresh empty folder. The customer's prior dashboards live under the old folder; the API now writes under the new one. Split / partial data, no error surfaced.
    • appendRowsToQueryIndex and enqueueLlmoOnboardingPublish likewise resolve to the new name and produce broken cross-references.

    Two paths forward, pick one and document in the PR description / LLMO-4186:

    • Run a one-time DB query against Site for any record with a non-/ path in baseURL. If zero exist, document that explicitly on the PR and finding (2) becomes "ship with a follow-up to backfill llmoConfig.dataFolder so future re-derivations are stable."
    • If subpath sites do exist, persist their existing dataFolder into their LLMO config before deploy, so the llmoConfig?.dataFolder ?? ... fallback hits the cached value, then enable the new derivation only for new onboardings.

Important (Should Fix)

  1. JSDoc is stale and now actively misleading (src/controllers/llmo/llmo-onboarding.js:262-267)

    The pre-existing block reads @param {string} domain - The domain name, but the actual parameter is baseURL, and after this PR the function meaningfully consumes the URL path, not just the hostname. The PR touched the body but left the contract unchanged. Suggested:

    /**
     * Generates the SharePoint data folder name from a site's base URL.
     * Includes the URL path (if any) so subpath sites on the same hostname
     * map to distinct folders. Root-domain sites produce the same folder
     * name as before this change.
     * @param {string} baseURL - The site's base URL (must be a fully-qualified URL).
     * @param {string} env - The environment ('prod' or anything else, treated as dev).
     * @returns {string} The data folder name (prefixed with 'dev/' for non-prod).
     */

    This is the only documentation a future caller sees, and the parameter rename + behavior change should be reflected.

  2. Missing test for subpath trailing-slash equivalence (test/controllers/llmo/llmo-onboarding.test.js:393-397)

    The new trailing-slash idempotence test only covers root URLs. The same property holds for subpaths and is not asserted - even though the implementation depends on replace(/^\/|\/$/g, ''). A future refactor (e.g. swapping to pathname.split('/').filter(Boolean).join('-')) could silently change behavior for subpath URLs and create two folders for the same site. Trailing slashes on subpath URLs are very common in onboarding inputs (Slack pastes, CMS configs).

    expect(generateDataFolder('https://nba.com/kings/', 'prod'))
      .to.equal(generateDataFolder('https://nba.com/kings', 'prod'));

Minor (Nice to Have)

  1. Empty path segments, double slashes, and percent-encoded characters produce surprising folder names (src/controllers/llmo/llmo-onboarding.js:271-272)

    • https://nba.com//foo -> nba-com--foo (double-dash from leading empty segment; ^\/|\/$ only strips one slash)
    • https://nba.com/.well-known -> nba-com--well-known
    • https://nba.com/k%C3%B6nig -> nba-com-k-c3-b6nig (URL.pathname does not decode; the encoded form, the decoded https://nba.com/könig, and a developer-typed https://nba.com/koenig all produce different names)

    Combined with the separator-ambiguity item above, these expand the collision surface. Two cheap fixes: collapse runs of - after sanitization (.replace(/-+/g, '-')) and trim leading/trailing - from the path segment; optionally decodeURIComponent the path before sanitizing. Add the double-slash and dot-prefix cases as explicit tests.

  2. Pin edge-case behavior in the test suite (test/controllers/llmo/llmo-onboarding.test.js)

    No assertions for query strings, fragments, uppercase paths, or percent-encoded characters. Pinning current behavior (whatever it is) prevents an unrelated URL-handling change from silently shifting folder names later. 2-3 tightly-scoped assertions are enough.

Recommendations

  • Persist dataFolder as the source of truth. The architectural smell here is that generateDataFolder is treated as deterministic from baseURL forever, even though the persisted folder name in Site.config.llmo.dataFolder is the actual identity. A site changing its baseURL (TLD migration, www -> apex, http -> https, path canonicalization) would silently re-bind to a new folder. Treat generateDataFolder as a one-shot bootstrap function and have offboarding / republish / appendRows resolve from the persisted name only, failing loudly when it is missing rather than silently regenerating. That is a separate refactor, but worth filing.
  • Cross-link the two functions. This PR plus resolveCustomerSecretsName (the companion in spacecat-shared) now both encode "site -> stable identity" with different separators. Each is correct for its sink (SharePoint accepts hyphens; AWS Secrets Manager keys conventionally use _). A one-line comment in each pointing to its counterpart - "SharePoint-folder-safe slug; see resolveCustomerSecretsName for the Secrets-Manager-safe equivalent" - prevents future confusion cheaply.

Out of scope, worth tracking

  • The companion fix in spacecat-shared (resolveCustomerSecretsName) is correctly excluded here. The migration concern in finding 2 applies on the secrets side too, since secret keys move with the same baseURL change.
  • composeBaseURL in @adobe/spacecat-shared-utils accepts and silently transforms invalid inputs. After this PR, generateDataFolder consumes more of the URL than before, which makes that upstream validation gate more material. A separate hardening PR adding strict validation in onboardCustomer (src/controllers/llmo/llmo.js:946-984) would benefit every caller, not just this one - but that file is outside this PR's diff.
  • A shared siteIdentityComponents(baseURL) -> { host, pathSuffix } helper in spacecat-shared-utils, consumed by both generateDataFolder and resolveCustomerSecretsName, would let identity policy evolve in one place. Not for this PR.

Assessment

Ready to merge? With fixes.

Reasoning: the fix is correct for the targeted reproduction and the blast radius is right, but the new separator scheme reproduces the same class of cross-tenant collision the PR is trying to fix (different shape, same consequence), and the offboarding fallback at line 1481 silently corrupts data deletion for any legacy subpath site. Both are fixable in this repo with small diffs. The migration question can be answered with a single DB query against the Site table - if the answer is "no subpath sites in production today," critical finding 2 becomes a tracked follow-up rather than a blocker.

Next Steps

  1. Resolve the two Critical items: tighten the separator (separator-ambiguity item) and confirm/persist the migration story for legacy subpath sites.
  2. Update the JSDoc and add the subpath trailing-slash test.
  3. The Minor edge-case items can land here or in a small follow-up.

Comment thread src/controllers/llmo/llmo-onboarding.js Outdated
*
* Examples:
* https://nba.com -> nba-com
* https://nba.com/kings -> nba-com--kings
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Critical 1 - Separator ambiguity allows distinct sites to land in the same folder

- is used both as the inner-character replacement AND as the host/path separator. Result is non-injective:

baseURL A baseURL B both resolve to
https://nba.com/us/kings https://nba.com/us-kings nba-com-us-kings
https://nba.com/a/b https://nba.com/a-b nba-com-a-b
https://x.com//double https://x.com/-double x-com--double

validateSiteNotOnboarded only checks whether the computed folder already exists, so two tenants with colliding URLs either get an existence-leak error attributed to another tenant or are silently mapped to the same SharePoint folder (cross-tenant data exposure - same bug class this PR is meant to fix).

Smallest fix: keep - for inner replacement, switch the host/path join to _:

const dataFolderName = pathSuffix ? `${host}_${pathSuffix}` : host;

Root-domain folders stay unchanged. Add an injectivity test: nba.com/us/kings should not equal nba.com/us-kings.


Critical 2 - Migration impact for already-onboarded subpath customers

The "backward compatible" claim holds only for sites with no path. Any customer currently onboarded with a non-root baseURL has data under nba-com (host only). After this ships, every code path that re-derives the folder maps the same site to nba-com-kings:

  • performLlmoOffboarding (line 1478-1481) uses llmoConfig?.dataFolder ?? generateDataFolder(baseURL, env.ENV). If a legacy site's persisted dataFolder is missing, offboarding computes the new name and calls deleteSharePointFolder on a folder that does not exist - leaving real customer data undeleted (GDPR concern) while silently completing offboarding.
  • validateSiteNotOnboarded for re-onboard finds no folder under the new name, passes validation, and creates an empty one. Customer's prior dashboards live under the old folder; new writes go to the new one. Split/partial data, no error.
  • appendRowsToQueryIndex and enqueueLlmoOnboardingPublish resolve to the new name and produce broken cross-references.

Pick one path:

  • Run a DB query against Site for any record with a non-/ path in baseURL. If zero exist, document that on the PR and file a follow-up to backfill llmoConfig.dataFolder.
  • If subpath sites do exist, persist their existing dataFolder into their LLMO config before deploy so the offboarding fallback hits the cached value.

Minor - Empty path segments, double slashes, percent-encoded chars

  • nba.com//foo -> nba-com--foo (double-dash; ^\/|\/$ strips only one slash)
  • nba.com/.well-known -> nba-com--well-known
  • nba.com/k%C3%B6nig -> nba-com-k-c3-b6nig (URL.pathname does not decode; encoded vs decoded vs typed-equivalent all produce different names)

Cheap fixes: collapse runs of - (.replace(/-+/g, '-')), trim leading/trailing - from the path segment, optionally decodeURIComponent the path before sanitizing.

Comment thread src/controllers/llmo/llmo-onboarding.js Outdated
* The hostname and (if present) each URL path segment are individually sanitized
* (non-alphanumeric replaced with `-`, lowercased) and joined with `--` as a
* path-segment delimiter. The double-hyphen delimiter cannot appear in a sanitized
* segment, so distinct paths cannot collide.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important - JSDoc (lines 262-267) is stale and now actively misleading

The pre-existing block reads @param {string} domain - The domain name, but the actual parameter is baseURL and after this PR the function meaningfully consumes the URL path, not just the hostname. Suggested:

/**
 * Generates the SharePoint data folder name from a site's base URL.
 * Includes the URL path (if any) so subpath sites on the same hostname
 * map to distinct folders. Root-domain sites produce the same folder
 * name as before this change.
 * @param {string} baseURL - The site's base URL (must be a fully-qualified URL).
 * @param {string} env - The environment ('prod' or anything else, treated as dev).
 * @returns {string} The data folder name (prefixed with 'dev/' for non-prod).
 */

A @since <version> note recording the format change would also help the next person debugging "why is this customer's folder different."

it('should produce the same folder name for root domain with or without trailing slash', async () => {
const { generateDataFolder } = await esmock('../../../src/controllers/llmo/llmo-onboarding.js', {});

expect(generateDataFolder('https://nba.com', 'prod')).to.equal(generateDataFolder('https://nba.com/', 'prod'));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important - Missing subpath trailing-slash equivalence test

The trailing-slash idempotence test only covers root URLs (this assertion). The same property holds for subpaths and is not asserted, even though the implementation depends on replace(/^\/|\/$/g, ''). A future refactor (e.g. swapping to pathname.split('/').filter(Boolean).join('-')) could silently change behavior for subpath URLs and create two folders for the same site. Trailing slashes on subpath URLs are very common in onboarding inputs (Slack pastes, CMS configs).

expect(generateDataFolder('https://nba.com/kings/', 'prod'))
  .to.equal(generateDataFolder('https://nba.com/kings', 'prod'));

…lding intent

- decodeURIComponent each path segment before sanitizing so %20 and
  equivalent decoded forms map to the same folder name
- Collapse consecutive - runs after sanitizing
- Document deliberate case-folding of path segments and updated param names in JSDoc
- Add tests for percent-encoded paths, uppercase paths, and double slashes

Addresses remaining review feedback on LLMO-4186

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@alinarublea
Copy link
Copy Markdown
Contributor Author

Thanks for the thorough review @dzehnder. Addressing each item:

Critical #1 (separator ambiguity): Already fixed in a prior commit — nba.com/us/kingsnba-com--us--kings and nba.com/us-kingsnba-com--us-kings are distinct. The double-hyphen delimiter cannot appear in a sanitized segment. Added injectivity test.

Critical #2 (migration): No production Site currently has a non-empty path in baseURL — subpath support is new functionality introduced by LLMO-4186. The llmoConfig?.dataFolder ?? generateDataFolder(...) fallback in offboarding is safe: any newly onboarded subpath site will have the ---format dataFolder persisted in config, so the fallback path is never hit for them. For root-domain sites, the folder name is unchanged.

Important #1 (JSDoc): Updated with full description, case-folding note, and correct parameter name.

Important #2 (subpath trailing-slash test): Added.

Minor #1 (percent-encoding + run collapse): Added decodeURIComponent per segment with safe fallback, plus .replace(/-+/g, '-') to collapse consecutive hyphens. Added tests for %C3%B6, uppercase paths, and double slashes.

All 13 generateDataFolder tests pass.

…aFolder

The WHATWG URL parser always normalizes percent-encoding in pathnames, so
decodeURIComponent on url.pathname segments can never throw. Removing the
dead try/catch restores 100% branch coverage required by CI.
@alinarublea alinarublea requested a review from dzehnder May 4, 2026 15:02
Copy link
Copy Markdown
Contributor

@dzehnder dzehnder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @alinarublea,

Thanks for the iteration - the collapse-runs change closes the path-segment collision class for the documented test cases, the JSDoc rewrite is solid, and the percent-encoding/case-folding/double-slash tests cover most of the previously-flagged edge cases. Two issues surfaced when we re-ran the function against adversarial inputs: a regression introduced by commit f96f1f4 (the decodeURIComponent catch removal), and the same host-vs-path asymmetry we found on the companion PR.

Strengths

Previously flagged issues now addressed:

  • Critical 1 (separator ambiguity) is closed for the documented test cases. The replace(/-+/g, '-') collapse-runs step plus the -- join means single-segment paths cannot synthesise a -- delimiter. nba.com/us/kings -> nba-com--us--kings is now distinct from any single-segment path.
  • Critical 2 (offboarding fallback) materially mitigated. Verified at src/controllers/llmo/llmo-onboarding.js:1304 that siteConfig.updateLlmoDataFolder(dataFolder.trim()) runs in performLlmoOnboarding before site.save(), so any subpath site onboarded after this PR ships has its dataFolder persisted; the fallback at line 1481 is dead code for new sites. Author's "no production Site has a non-empty path in baseURL today" claim is the load-bearing assumption and is reasonable given subpath onboarding is new with LLMO-4186.
  • Important 1 (JSDoc) is fully rewritten with format description, examples, parameter rename to baseURL, and the case-folding rationale. Accurate against the implementation for paths.
  • Important 2 (subpath trailing-slash test) added at test/controllers/llmo/llmo-onboarding.test.js:400-404.
  • Minor (edge-case coverage): percent-encoded path test, case-folding test, double-slash test all added.

Issues

Important (Should Fix)

  1. URIError regression on malformed percent-encoded baseURLs (src/controllers/llmo/llmo-onboarding.js:285)

    Commit f96f1f4 ("remove unreachable decodeURIComponent catch") removed the try/catch on the premise that "the WHATWG URL parser always normalizes percent-encoding in pathnames, so decodeURIComponent on url.pathname segments can never throw." That premise is incorrect. The URL parser preserves percent sequences but does not validate that they are well-formed. Verified by running the function:

    generateDataFolder('https://a.com/%FF')         -> URIError: URI malformed
    generateDataFolder('https://a.com/%E0%A4%A')    -> URIError: URI malformed
    generateDataFolder('https://a.com/%')           -> URIError: URI malformed
    

    Why it matters: baseURL is user-supplied (Slack onboarding command, HTTP onboarding API). A copy-paste with truncated or non-UTF-8 percent encoding now crashes the onboarding flow at line 1226 (and offboarding at line 1481 if persisted dataFolder is missing) with a 500 instead of producing a sanitized folder name. This is a regression from the prior commit which had a working fallback, and contradicts the author's own commitment in the PR comment ("Added decodeURIComponent per segment with safe fallback").

    Fix: restore the per-segment try/catch:

    .map((seg) => {
      let decoded;
      try { decoded = decodeURIComponent(seg); } catch { decoded = seg; }
      return decoded.replace(/[^a-zA-Z0-9]/g, '-').toLowerCase().replace(/-+/g, '-');
    });

    Add a regression test for 'https://a.com/%FF'. The "100% branch coverage" justification in the commit message is a CI concern that can be solved with /* c8 ignore next */ if the catch resists deterministic testing, but a deterministic test is straightforward.

  2. Hostname regex was not collapse-runs hardened; host-vs-path collision class remains (src/controllers/llmo/llmo-onboarding.js:283)

    The path regex now does collapse-runs via the chained .replace(/[^a-zA-Z0-9]/g, '-').toLowerCase().replace(/-+/g, '-'). The hostname regex on line 283 was not updated to do the same:

    const host = url.hostname.replace(/[^a-zA-Z0-9]/g, '-').toLowerCase();   // no collapse step

    So consecutive non-alphanumeric characters in a hostname produce -- runs, the same delimiter used to join host with path segments. Verified:

    https://nba--com/   -> nba--com
    https://nba/com     -> nba--com   COLLIDE
    https://a..b/       -> a--b
    https://a/b         -> a--b       COLLIDE
    

    The JSDoc claim at line 269 ("the double-hyphen delimiter cannot appear in a sanitized segment, so distinct paths cannot collide") therefore overstates what the code delivers. Severity bounded by (a) admin-only onboarding and (b) DNS will not resolve hostnames with consecutive empty labels, so this stays Important rather than Critical.

    Same fix as for the companion PR: add a collapse-runs step to the host:

    const host = url.hostname.replace(/[^a-zA-Z0-9]/g, '-').toLowerCase().replace(/-+/g, '-');

    And a regression test: generateDataFolder('https://nba--com/', 'prod') !== generateDataFolder('https://nba/com', 'prod').

Minor (Nice to Have)

  1. JSDoc claim is narrower than written (src/controllers/llmo/llmo-onboarding.js:269-270)

    Once the hostname is also collapse-runs hardened, the claim becomes accurate. If you choose to scope rather than tighten, change to: "The double-hyphen delimiter cannot appear in a sanitized path segment, so distinct paths on the same host cannot collide."

  2. Em-dash in JSDoc (src/controllers/llmo/llmo-onboarding.js:267)

    Line 267 contains a literal U+2014 em-dash byte sequence. Project house style is no em-dashes. Replace with - or restructure the parenthetical.

  3. Adversarial-collision tests are weaker than companion PR 1577 (test/controllers/llmo/llmo-onboarding.test.js:413-418)

    PR 1577 explicitly pins us..kings != us/kings AND us__kings != us/kings. PR 2315 only pins us-kings != us/kings. The implementation is correct (the run-collapse handles all three), but a regression that, say, removed the collapse would be caught by the companion PR and missed here for the .. and __ variants. Add us..kings and us__kings to the not-equal block to match the companion's coverage.

  4. JSDoc does not mention the run-collapse step (src/controllers/llmo/llmo-onboarding.js:266-268)

    The doc lists "percent-decoded and individually sanitized (non-alphanumeric replaced with -, lowercased)" but omits the .replace(/-+/g, '-') collapse, which is the load-bearing piece for separator-collision avoidance. Add "consecutive - collapsed" to the parenthetical.

Recommendations

  • The chained .replace(/[^a-zA-Z0-9]/g, '-').toLowerCase().replace(/-+/g, '-') is functionally equivalent to the single-pass .replace(/[^a-zA-Z0-9]+/g, '-').toLowerCase() used in PR 1577. Optional cleanup, not blocking. If you do consolidate, make sure to apply it to the host as well.
  • Worth tracking a follow-up: a shared siteIdentityComponents(baseURL) helper in spacecat-shared-utils (consumed by both this function and resolveCustomerSecretsName) so the algorithm only needs to be hardened in one place. Out of scope here.

Assessment

Ready to merge? With fixes.

Reasoning: The PR is a real security improvement and closes the originally reported separator-ambiguity bug for paths. The blocker is the URIError regression introduced by f96f1f4 - that crashes onboarding on a class of user input that previously worked, and contradicts the author's stated commitment. The host-vs-path asymmetry is a smaller fix but should land in the same commit since it mirrors the issue on the companion PR. After both, the JSDoc claim will be accurate and the implementation will be defensively correct.

Next Steps

  1. Restore the per-segment try/catch around decodeURIComponent. Add a malformed-percent-encoding test.
  2. Add the collapse-runs step to the hostname regex (and a host-vs-path regression test).
  3. The Minor items (JSDoc precision, em-dash, adversarial test parity) are cheap to land in the same commit.

Comment thread src/controllers/llmo/llmo-onboarding.js Outdated
const url = new URL(baseURL);
const host = url.hostname.replace(/[^a-zA-Z0-9]/g, '-').toLowerCase();
const segments = url.pathname.split('/').filter(Boolean)
.map((seg) => decodeURIComponent(seg).replace(/[^a-zA-Z0-9]/g, '-').toLowerCase().replace(/-+/g, '-'));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important - URIError regression on malformed percent-encoded baseURLs

Commit f96f1f4 ("remove unreachable decodeURIComponent catch") removed the try/catch on the premise that the URL parser always normalizes percent-encoding in pathnames. That premise is incorrect: the WHATWG URL parser preserves percent sequences but does not validate that they're well-formed. Verified by running the function:

generateDataFolder('https://a.com/%FF')      -> URIError: URI malformed
generateDataFolder('https://a.com/%E0%A4%A') -> URIError: URI malformed
generateDataFolder('https://a.com/%')        -> URIError: URI malformed

baseURL is user-supplied (Slack onboarding, HTTP onboarding API). A truncated or non-UTF-8 percent-encoded copy-paste now crashes onboarding with a 500 instead of producing a sanitized folder name. This is a regression vs the prior commit, and contradicts the author's PR comment ("Added decodeURIComponent per segment with safe fallback").

Fix: restore the per-segment try/catch:

.map((seg) => {
  let decoded;
  try { decoded = decodeURIComponent(seg); } catch { decoded = seg; }
  return decoded.replace(/[^a-zA-Z0-9]/g, '-').toLowerCase().replace(/-+/g, '-');
});

Add a regression test: generateDataFolder('https://a.com/%FF', 'prod') does not throw. The 100% branch coverage justification in the commit message can be addressed with /* c8 ignore next */ if the catch resists deterministic testing, but a deterministic test is straightforward.

Comment thread src/controllers/llmo/llmo-onboarding.js Outdated
const { hostname } = new URL(baseURL);
const dataFolderName = hostname.replace(/[^a-zA-Z0-9]/g, '-').toLowerCase();
const url = new URL(baseURL);
const host = url.hostname.replace(/[^a-zA-Z0-9]/g, '-').toLowerCase();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important - Hostname regex not collapse-runs hardened; host-vs-path collision class remains

The path regex now collapses runs (chained replace(/-+/g, '-') on line 285), but the hostname regex on this line was not updated. Consecutive non-alphanumeric characters in a hostname produce -- runs, the same delimiter used to join host with path segments. Verified:

https://nba--com/  -> nba--com
https://nba/com    -> nba--com   COLLIDE
https://a..b/      -> a--b
https://a/b        -> a--b       COLLIDE

The JSDoc claim at line 269 ("the double-hyphen delimiter cannot appear in a sanitized segment, so distinct paths cannot collide") overstates what the code delivers. Severity bounded by admin-only onboarding and the fact that DNS won't resolve hostnames with empty labels.

Fix: add a collapse-runs step to the host:

const host = url.hostname.replace(/[^a-zA-Z0-9]/g, '-').toLowerCase().replace(/-+/g, '-');

Add a regression test: generateDataFolder('https://nba--com/', 'prod') !== generateDataFolder('https://nba/com', 'prod').

Same issue exists on the companion PR adobe/spacecat-shared#1577 - flagged there too so both functions stay aligned.

Restore the per-segment try/catch around decodeURIComponent: the WHATWG
URL parser preserves syntactically valid percent sequences like %FF without
validating UTF-8 correctness, so decodeURIComponent can still throw on
non-UTF-8 encoded paths supplied by users. Removing it caused a regression.

Also add run-collapse to the hostname regex so consecutive non-alphanumeric
characters (e.g. nba--com) don't produce a -- delimiter that collides with
the host-to-path separator used to join path segments.

Add regression tests for both: malformed percent-encoding does not throw,
and hostnames with consecutive non-alnum chars don't collide with subpaths.
Add adversarial separator-parity tests to match companion PR 1577.

Fix em-dash (U+2014) in JSDoc and note run-collapse in the description.
@alinarublea alinarublea requested a review from dzehnder May 5, 2026 12:17
- Align regex to + quantifier (consistent with companion resolveCustomerSecretsName)
- Extract inline sanitize() with leading/trailing - trim and filter(Boolean)
  on segments, preventing degenerate all-hyphen segments from causing
  delimiter inflation or ambiguous folder names
- Update JSDoc: scope the injectivity claim honestly, document env semantics
- Fix percent-encoded segment test to use property assertion
- Hoist esmock import to a single before() for the entire describe block
- Add tests: port equality, query string/fragment ignored, degenerate segments
  dropped, throws on invalid URL
…erSecretsName

Guard against empty-hostname URLs (e.g. file:///) which would produce a
folder name starting with '--'. Adds a corresponding test. Consistent with
the http(s)/hostname validation added to resolveCustomerSecretsName.
Copy link
Copy Markdown
Contributor

@dzehnder dzehnder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @alinarublea,

Thanks for the iteration. The URIError regression is fixed, host/path sanitize is now symmetric, and the JSDoc rewrite reflects the actual algorithm. Two Important items came up: the new .filter(Boolean) step introduces an empty-segment-drop collision class that the JSDoc claims doesn't exist, and the same NFD/NFC unicode-normalization gap that the companion PR has. Both are localized one-line fixes plus a regression test each. Two Minor JSDoc framing items round it out.

Strengths

Previously flagged issues now addressed:

  • URIError regression on malformed percent-encoded baseURLs (prior Important 1) is fixed. The restored try { decoded = decodeURIComponent(seg); } catch { /* keep raw */ } block at src/controllers/llmo/llmo-onboarding.js:294-300 mutates a let decoded and falls through to sanitize, so %FF no longer crashes. The should handle malformed percent-encoded path segments without throwing test (line 380) pins the contract.
  • Host-vs-path collision (prior Important 2) closed. The sanitize helper at line 291 (replace(/[^a-zA-Z0-9]+/g, '-').replace(/^-+|-+$/g, '').toLowerCase()) is applied to BOTH the host and each path segment. The should not collide hostname with consecutive non-alnum chars and a subpath test (line 375) pins nba--com/ !== nba/com.
  • JSDoc rewrite (prior Minor 1, 4) documents the run-collapse step, the -- delimiter invariant, the lossy-punctuation limitation, and the dev/dev/ prefix mapping. Em-dashes (prior Minor 2) are absent from the new generateDataFolder JSDoc - verified by byte scan.
  • Adversarial collision tests (prior Minor 3) are now stronger than companion PR 1577: nba.com/us/kings is asserted distinct from us-kings, us..kings, AND us--kings.
  • Path-traversal via dataFolder is not feasible. Sanitize output alphabet is [a-z0-9-] plus literal dev/ prefix and -- delimiter. No /, ., null, or control chars survive sanitize. URL.pathname resolves literal .. segments at parse time. Verified dataFolder lands in /sites/elmo-ui-data/${dataFolder}/ (validateSiteNotOnboarded:404, copyFilesToSharepoint:731, deleteSharePointFolder:1029) - all safe given the sanitized alphabet.
  • Cross-tenant data-read class is bounded. validateSiteNotOnboarded (line 396, called from llmo.js:1024) checks the folder before SharePoint folder creation using the same generateDataFolder, so two tenants cannot simultaneously hold the same folder via this onboarding path.
  • Test refactor is in-scope and clean. The before(async () => { ({ generateDataFolder } = await esmock(...)) }) consolidation at test/controllers/llmo/llmo-onboarding.test.js:314 removes ~17 redundant per-test esmock calls. Tests assert public-contract properties (equal/not.equal/match), not internal regex shape - robust to future sanitize refactors. All 13 prior tests are carried over with identical inputs/expectations; nothing was inadvertently dropped.
  • Backward-compat verified for known callers (llmo.js:1019 onboarding, onboard-llmo-modal.js Slack flow, offboarding fallback at llmo-onboarding.js:1497). All compose dataFolder into URL paths where the sanitized output is URL-safe.

Issues

Important (Should Fix)

  1. .filter(Boolean) introduces a new collision class that the JSDoc claims does not exist (src/controllers/llmo/llmo-onboarding.js:301).

    The new commit added .filter(Boolean) after the segment sanitize, which silently drops any segment that reduces to empty. As a result, URLs that differ in path structure produce the same folder:

    • https://nba.com/-/kings -> nba-com--kings == https://nba.com/kings
    • https://nba.com/us/-/kings -> nba-com--us--kings == https://nba.com/us/kings
    • https://nba.com/.well-known/foo -> nba-com--well-known--foo (the .well-known decodes; degenerate cases below)
    • https://nba.com/--- -> nba-com == https://nba.com (host-only)

    This is a NEW collision class introduced in this round. The prior commit (c73276e) preserved bare-hyphen segments (/-/foo -> host--foo---foo), which did NOT collide. The JSDoc at lines 272-274 still claims "URLs that differ in path structure produce distinct folder names" - this is now provably wrong.

    Concretely:

    • The collision is bounded by validateSiteNotOnboarded (existing-folder gate fires before SharePoint folder creation), so it is NOT a cross-tenant data-read class.
    • It IS a squatting / onboarding-DoS vector: an attacker (or an admin onboarding via the Slack flow with admin gating) who claims https://victim.com/foo/%2e%2e/bar first registers the folder victim-com--foo--bar. The legitimate https://victim.com/foo/bar onboarding is then rejected at line 419 and pages #llmo-alerts. Recovery requires manual intervention.

    Pick one:

    • Reject rather than drop empty-after-sanitize segments: replace .filter(Boolean) with if (segments.some((s) => !s)) throw new Error('baseURL contains a path segment that sanitizes to empty'). Eliminates the class entirely; rejects a tiny number of degenerate baseURLs that probably shouldn't onboard anyway.
    • Or accept the collision and weaken the JSDoc to symmetric treatment alongside the punctuation-only class ("path segments differing only in punctuation, OR a path segment that sanitizes to empty, are treated as identical").

    Add an adversarial test pinning whichever behavior you choose. The current should drop degenerate path segments that sanitize to empty test only pins one input shape; broaden it (/-/kings == /---/kings, /us/-/kings == /us/kings) so a future filter-rule change is caught.

  2. NFD vs NFC unicode forms produce different folders (src/controllers/llmo/llmo-onboarding.js:291, the sanitize helper).

    The new percent-encoded test uses NFC for both literal and encoded forms, so it passes. NFD is not exercised. decodeURIComponent does no unicode normalization, so visually-identical inputs in different forms produce different folders:

    • https://nba.com/könig (NFC ö = U+00F6) -> nba-com--k-nig
    • https://nba.com/könig (NFD o + U+0308 combining diaeresis) -> nba-com--ko-nig

    This is the same class of silent silent-divergence the PR is closing in the opposite direction (different folders for the same logical tenant). Realistic when authors paste tenant subpaths from different sources (CMS, Word, macOS Finder which uses NFD on HFS+, iOS clipboard). Same finding flagged on companion PR 1577 - keeping the two functions consistent matters because tenants exist in both code paths.

    One-line fix: decoded = decoded.normalize('NFC') (or fold into sanitize). Plus one regression test pinning NFC === NFD equivalence.

Minor (Nice to Have)

  1. decodeURIComponent catch-path adds a narrower collision class (src/controllers/llmo/llmo-onboarding.js:297). When decodeURIComponent(seg) throws and decoded falls through to the raw segment (e.g. %FF), sanitize('%FF') strips the % as non-alphanumeric and returns ff. So https://a.com/%FF collides with https://a.com/ff, and https://a.com/foo%80bar collides with https://a.com/foo-bar. This is the deliberate trade-off for the prior URIError-DoS fix (collision over throw). Same squatting/DoS vector as the empty-segment class above, narrower input shape. Worth a one-line JSDoc addition acknowledging this, parallel to the existing punctuation note.

  2. JSDoc framing: "inherent limitation of lossy sanitization" understates the design choice (src/controllers/llmo/llmo-onboarding.js:274-277). The collision class isn't inherent - it's a deliberate readable-slug-over-injectivity choice. A slug + short content-hash scheme (e.g. nba-com--us-kings--ab12cd34 where the hash is over the canonical URL) would preserve injectivity while keeping the readable prefix. Same observation flagged on companion PR 1577. Worth one-line wording: "limitation of the chosen readable-slug design (a hash-suffixed identity would avoid this at the cost of readability)." No behavior change requested for this PR; just keeping the architectural alternative visible to future readers.

  3. Add @throws {TypeError} to JSDoc (src/controllers/llmo/llmo-onboarding.js:280-285). The new test should throw on a malformed base URL pins that generateDataFolder('not a url', 'prod') throws TypeError, but the JSDoc only says "must be a fully-qualified URL" without naming the exception. Minor JSDoc-completeness gap.

Recommendations

  • For the empty-segment-drop class: option (a) "reject" is structurally cleaner and removes the class entirely. Option (b) "accept and document" is fine if multi-tenant subpath onboarding is gated behind admin Slack flow with low attacker reachability. The LLMO-4186 threat model can guide that choice.
  • For NFD/NFC: prefer normalize over document. One line, one test, one less silent-failure class.
  • IDN/punycode hostnames (informational, NOT a blocker): WHATWG URL.hostname returns ACE form (xn--…), and run-collapse on host turns xn--wgv71a.example.com into xn-wgv71a-example-com, which collides with a hypothetical xn-wgv71a.example.com. Same squatting class as above, very narrow attacker reachability. Acceptable as-is given the round-1 host-vs-path collision fix required this collapse.

Out of scope, worth tracking

  • Companion PR adobe/spacecat-shared#1577 (the SECRETS-side equivalent) carries an identical sanitize shape with _/__ separators. The two PRs intentionally use different separators so tenants who exist in both code paths produce orthogonal folder names - by design. A shared siteIdentityComponents(baseURL) helper would let the algorithm evolve in one place; out of scope here.
  • TOCTOU race between validateSiteNotOnboarded's folder.exists() and copyFilesToSharepoint's folder.createFolder() exists at HEAD f8b06692, exists at c73276e, predates this PR. Not introduced by this change; flag for a separate hardening PR if it ever bites.
  • The llmo-referral-traffic.js and llmo-referral-traffic.test.js diffs in this PR's incremental came in via merge from main (PR 2319, LLMO-4261). NOT introduced by this author, NOT in scope.

Assessment

Ready to merge? With fixes.

Reasoning: the cross-tenant data-read class - the original Critical that this PR was opened to fix - is correctly closed by the host+segment sanitize, the -- delimiter, and the validateSiteNotOnboarded gate. The two Important findings are localized: a 1-3 line code change for empty-segment behavior plus broader test coverage, and a one-line .normalize('NFC') plus one regression test. The Minor items are JSDoc tweaks. After they land, this is approve as-is; the architectural and security posture for the LLMO-4186 onboarding population is sound.

Next Steps

  1. Pick the empty-segment policy (reject or accept-and-document) and broaden the test.
  2. Add .normalize('NFC') to sanitize and a regression test.
  3. The Minor items (catch-path JSDoc note, "inherent" framing tweak, @throws {TypeError}) can land in the same commit.

let decoded = seg;
try {
decoded = decodeURIComponent(seg);
} catch { /* keep raw on percent-encoded sequences that are not valid UTF-8 */ }
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important: this .filter(Boolean) introduces a new collision class that the JSDoc (lines 272-274 "URLs that differ in path structure produce distinct folder names") claims does not exist:

  • https://nba.com/-/kings -> nba-com--kings == https://nba.com/kings
  • https://nba.com/us/-/kings == https://nba.com/us/kings
  • https://nba.com/--- -> nba-com == https://nba.com

New this round (the prior commit c73276e preserved bare-hyphen segments). Bounded by validateSiteNotOnboarded so it's not cross-tenant data-read, but it IS a squatting / onboarding-DoS vector: an attacker who claims https://victim.com/foo/%2e%2e/bar first locks the legitimate https://victim.com/foo/bar out at line 419.

Pick: replace .filter(Boolean) with if (segments.some((s) => !s)) throw new Error('baseURL contains a path segment that sanitizes to empty') (eliminates the class), OR keep the drop and weaken the JSDoc to symmetric treatment alongside the punctuation-only class. Either way, broaden the existing drop degenerate path segments test to pin multi-position cases (/us/-/kings == /us/kings, /-/kings == /---/kings).

const { hostname } = new URL(baseURL);
const dataFolderName = hostname.replace(/[^a-zA-Z0-9]/g, '-').toLowerCase();
const url = new URL(baseURL);
if (!url.hostname) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important: NFD vs NFC unicode forms produce different folders for visually identical URLs. decodeURIComponent doesn't normalize:

  • https://nba.com/könig (NFC ö = U+00F6) -> nba-com--k-nig
  • https://nba.com/könig (NFD o + U+0308) -> nba-com--ko-nig

Same class of silent silent-divergence the PR is closing in the opposite direction (different folders for the same logical tenant). Realistic when authors paste subpaths from different sources (CMS, Word, macOS Finder, iOS clipboard).

One-line fix: decoded = decoded.normalize('NFC') before sanitize, plus one regression test pinning NFC === NFD. Same finding raised on companion PR 1577.

const sanitize = (s) => s.replace(/[^a-zA-Z0-9]+/g, '-').replace(/^-+|-+$/g, '').toLowerCase();
const host = sanitize(url.hostname);
const segments = url.pathname.split('/').filter(Boolean)
.map((seg) => {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: when decodeURIComponent(seg) throws and decoded falls through to the raw segment, sanitize('%FF') strips the % as non-alphanumeric and returns ff. So https://a.com/%FF collides with https://a.com/ff, and https://a.com/foo%80bar collides with https://a.com/foo-bar. Deliberate trade-off for the prior URIError-DoS fix, but worth a one-line JSDoc addition parallel to the existing punctuation-collision note: "Path segments containing invalid percent-encoding sequences are sanitized as their raw bytes and may collide with the stripped form."

@alinarublea alinarublea requested a review from dzehnder May 5, 2026 13:54
Copy link
Copy Markdown
Contributor

@danieljchuser danieljchuser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @alinarublea,

Thanks for the continued iteration. The core fix is solid - the -- delimiter, run-collapse sanitizer, and validateSiteNotOnboarded gate close the original cross-tenant collision class. The test suite is thorough and the JSDoc rewrite documents the sanitization algorithm honestly. Three Important items came up (one new, two carried forward) and two Minor items - none are blockers but all are worth addressing.

Strengths

  • Path-traversal is eliminated by construction: the sanitize output alphabet is [a-z0-9-] plus the static dev/ prefix and -- delimiter. No ., .., /, NUL, or SharePoint path separators can survive. Verified across callers (validateSiteNotOnboarded, copyFilesToSharepoint, deleteSharePointFolder).
  • The try/catch around decodeURIComponent at src/controllers/llmo/llmo-onboarding.js:295-299 correctly defends against URIError from invalid UTF-8 sequences without crashing onboarding.
  • The hostname guard at src/controllers/llmo/llmo-onboarding.js:283-285 rejects data:, javascript:, file: and other opaque/no-host schemes early. Fail-closed is the right posture for a tenant-identity primitive.
  • Test suite covers 20+ scenarios including trailing slashes, double slashes, port stripping, case-folding, percent normalization, query/fragment stripping, and degenerate segments. The before() consolidation removes redundant per-test esmock calls cleanly.
  • Previously flagged issues now addressed: separator ambiguity (Critical, round 1), URIError regression (Important, round 3), hostname regex hardening (Important, round 3), stale JSDoc (round 1).

Issues

Important (Should Fix)

  1. Backward-compat regression for IDN/punycode hostnames - src/controllers/llmo/llmo-onboarding.js:294

    The + quantifier in sanitize collapses consecutive non-alphanumeric characters in the hostname into a single -. Punycode ACE labels start with xn--, so any IDN domain gets a different folder name than the old code produced:

    • Old: xn--fiq228c.com -> xn--fiq228c-com
    • New: xn--fiq228c.com -> xn-fiq228c-com

    Prior rounds confirmed "no production subpath sites" but did not confirm "no IDN hostnames." If any onboarded site has a punycode hostname, its historical SharePoint data is orphaned under the old folder name.

    Fix: either confirm no IDN hostnames exist in production (document on the PR), or apply the run-collapse only to path segments while keeping per-character replacement for the hostname. Add a regression test asserting xn--fiq228c.com preserves the -- in the folder name.

  2. JSDoc injectivity claim contradicts tested behavior - src/controllers/llmo/llmo-onboarding.js:272-274

    The doc says "URLs that differ in path structure produce distinct folder names." The test "should drop degenerate path segments that sanitize to empty" asserts that /-/kings and /kings produce the same folder. The .filter(Boolean) at line 304 drops segments that sanitize to empty, so two distinct path structures yield identical output.

    Fix: weaken the JSDoc to acknowledge the empty-segment drop: "Path segments that sanitize to empty (e.g. /-/) are dropped, so /-/kings and /kings produce the same folder."

  3. NFC vs NFD unicode normalization gap - src/controllers/llmo/llmo-onboarding.js:300

    decodeURIComponent returns whatever bytes the URL carried. NFC precomposed characters sanitize differently than their NFD decomposed equivalents. Two visually identical URLs can produce different folders. Realistic when authors paste subpaths from different sources (CMS, macOS Finder which uses NFD, iOS clipboard).

    Fix: add .normalize('NFC') before sanitize:

    decoded = decodeURIComponent(seg).normalize('NFC');

Minor (Nice to Have)

  1. Catch-path fallback undocumented - src/controllers/llmo/llmo-onboarding.js:297-301. When decodeURIComponent throws, the raw %XX form is kept and sanitized, so %FF -> ff collides with literal /ff. Deliberate trade-off for the URIError-DoS fix. Worth a one-line JSDoc note.

  2. Missing @throws {TypeError} in JSDoc - src/controllers/llmo/llmo-onboarding.js:284-287. Two throw paths exist and are tested. Add @throws {TypeError} When baseURL is not a valid URL or has no hostname.

Recommendations

  • Consider extracting the sanitize helper to a named function so it can be unit-tested directly. With the collision classes now documented, tighter testing of sanitize itself would catch regressions earlier.
  • Companion PR adobe/spacecat-shared#1577 carries near-identical sanitize logic with _/__ separators. A shared sanitizeUrlToSlug(baseURL, { separator }) helper would prevent drift and give one place to add NFC normalization.

Assessment

Approved - the cross-tenant collision class is correctly closed and the security posture is sound. The findings above are worth addressing but none are blockers given the bounded blast radius (no production subpath sites, admin-gated onboarding). The IDN hostname question is the most important to verify.

if (!url.hostname) {
throw new TypeError('Invalid baseURL: hostname is required');
}
const sanitize = (s) => s.replace(/[^a-zA-Z0-9]+/g, '-').replace(/^-+|-+$/g, '').toLowerCase();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important: the + quantifier collapses consecutive non-alphanumeric characters, which changes how punycode ACE labels (xn--) are sanitized. Old: xn--fiq228c.com -> xn--fiq228c-com. New: xn--fiq228c.com -> xn-fiq228c-com. If any onboarded site has a punycode hostname, its historical SharePoint data is orphaned. Either confirm no IDN hostnames exist in production, or apply run-collapse only to path segments.

* dropped. Sanitized parts are joined with `--` as the path-segment delimiter.
*
* The `--` delimiter cannot appear inside a sanitized segment (any run of
* non-alphanumeric characters collapses to a single `-`), so URLs that differ
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important: this claim ("URLs that differ in path structure produce distinct folder names") is contradicted by the test which asserts /-/kings and /kings produce the same folder. The .filter(Boolean) at line 304 drops segments that sanitize to empty. Weaken the doc to acknowledge the empty-segment drop, or preserve empty segments as a placeholder.

.map((seg) => {
let decoded = seg;
try {
decoded = decodeURIComponent(seg);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important: decodeURIComponent returns whatever bytes the URL carried with no Unicode normalization. NFC precomposed characters sanitize differently than their NFD decomposed equivalents. Two visually identical URLs can produce different folders. One-line fix: decoded = decodeURIComponent(seg).normalize('NFC') before sanitize.

@alinarublea alinarublea deployed to dev-branches May 6, 2026 10:12 — with GitHub Actions Active
@alinarublea alinarublea merged commit 5f817e2 into main May 6, 2026
18 checks passed
@alinarublea alinarublea deleted the fix/subpath-data-folder branch May 6, 2026 10:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants