Skip to content

Conversation

@eLysgaard
Copy link
Contributor

@eLysgaard eLysgaard commented Dec 11, 2025

TLDR

Adds first-class support for KA citations with PDF files. The highlighted text and page number is now shown on hover, and the PDF can be previewed in app by clicking on it

chat-template-ka-citation-support.mov

Summary

  • Add in-app PDF preview that opens in a slide-over sheet when clicking Unity Catalog file citations
  • Implement backend file proxy (/api/files/databricks-file) to securely stream PDFs through the server with proper Databricks
    authentication
  • Bundle PDF.js worker with Vite for reliable PDF rendering

Features

  • PDF Preview Sheet: Slide-over panel displaying PDF files with page navigation
  • Citation Link Component: Clickable links that open the preview instead of navigating away
  • Cited Text Display: Collapsible section showing the quoted text from the citation
  • Error Handling: Graceful handling of 404, permission errors, and load failures with retry option
  • Unity Catalog Integration: Links to open files directly in Databricks Explorer

Technical Details

  • Uses react-pdf library for PDF rendering
  • PDF.js worker bundled via vite-plugin-static-copy for production reliability
  • Backend proxies Databricks Files API requests with authentication
  • Comprehensive test coverage for the files route (230 lines of tests)

Files Changed

  • client/src/components/pdf-preview/ - New PDF preview components
  • client/src/lib/pdf-utils.ts - PDF utility functions
  • server/src/routes/files.ts - Backend file proxy endpoint
  • tests/routes/files.test.ts - Integration tests for files route

Test Plan

  • Click a PDF citation link and verify the preview opens
  • Navigate between pages using prev/next buttons
  • Verify error states show correctly (404, permission denied)
  • Test "Open in Databricks" and download buttons
  • Verify cited text section expands/collapses

eLysgaard and others added 10 commits December 9, 2025 13:38
- Add react-pdf for PDF rendering in a side drawer
- Create backend proxy (/api/files/volumes) for authenticated PDF fetching
- Parse UC PDF links and extract page/text fragment from URL hash
- Show filename, page number, and cited text in tooltip
- Include download and "Open in Catalog" buttons in preview header

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
- Replace @react-pdf-viewer with react-pdf (more actively maintained)
- Add manual text highlighting for cited text fragments
- Only highlight on the initial page from citation
- Change backend endpoint to POST /api/files/databricks-file
- Use official Databricks Files API path (/api/2.0/fs/files/)
- Fetch PDF via POST with path in body, return blob URL
- Add proper error handling for streams to prevent server crashes
- Add CSS styles for highlighted text spans

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
- Create tests/routes/files.test.ts with 15 test cases covering:
  - Request validation (missing/invalid path)
  - File type handling (PDF, JSON, TXT, images, unknown)
  - Databricks API integration (404, 403, 500 errors)
  - Streaming and header forwarding
  - Path normalization and edge cases
- Add Databricks Files API mock handlers to api-mock-handlers.ts
- All tests pass (15 passed, 1 skipped)
- Skip unauthenticated test due to test environment auth defaults
- Add collapsible section in PDF preview sheet header to show original cited text
- Display cited text in a styled box with proper whitespace preservation
- Add toggle functionality with chevron icons (defaults to expanded)
- Only shows when highlightText prop is provided
- Helps users understand what specific content was referenced in citations
- Uninstalled @react-pdf-viewer/highlight package (no longer needed)
- Removed pdf-highlight.css file and import
- Added vite-plugin-static-copy to bundle PDF.js worker file
- Updated Vite config to copy worker from node_modules to /assets/
- Changed worker source from CDN to bundled path for offline support

The PDF.js worker is now bundled with the application instead of
being loaded from unpkg CDN, improving reliability and removing
external dependencies.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
Remove unused highlighting code that was accidentally left in the
PDFViewer component. Highlighting will be added in a separate PR.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Ensure reliable worker file path in vite-plugin-static-copy by adding
pdfjs-dist as an explicit dependency instead of relying on it being
hoisted from react-pdf's node_modules.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Add right padding to header buttons to prevent overlap with the
  sheet's close button
- Fix "Open in Catalog" link to use full workspace URL instead of
  relative path

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Show the page number in the cited text section header to help users
understand which page the citation is from.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
@eLysgaard eLysgaard self-assigned this Dec 11, 2025
Comment on lines +24 to +30
<button
type="button"
onClick={handleClick}
className="cursor-pointer rounded-md bg-muted-foreground px-2 py-0 font-medium text-zinc-200 underline"
>
{children}
</button>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Howcome we are not using the button component that we use elsewhere?
import { Button } from '@/components/ui/button';

Comment on lines +100 to +104
const [error, setError] = useState<PDFError | null>(null);
const [retryKey, setRetryKey] = useState(0);
const [blobUrl, setBlobUrl] = useState<string | null>(null);
const [isLoading, setIsLoading] = useState(false);
const [showCitedText, setShowCitedText] = useState(true);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be worth considering using a reducer instead of this many states.
Seems like the first four states form an implicit state machine:

  • Idle → Loading → Success (with blobUrl) | Error
  • Error → Retry → Loading

Several places update multiple states together:

  • loadPdf start: setIsLoading(true), setError(null)
  • loadPdf success: setBlobUrl(url), (implicitly setIsLoading(false))
  • loadPdf error: setError(...), setIsLoading(false)
  • handleRetry: setError(null), setBlobUrl(null), setRetryKey(prev + 1)

A reducer would make these transitions explicit and prevent impossible states (e.g., having both error and blobUrl set simultaneously).

return (
<Sheet open={open} onOpenChange={handleOpenChange}>
<SheetContent
className="flex w-[70vw] max-w-4xl flex-col gap-0 p-0 sm:max-w-4xl"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dont we have the same class name twice?

Comment on lines +134 to +139
const message = err instanceof Error ? err.message.toLowerCase() : '';
if (message.includes('404')) {
setError({ type: 'NotFoundError' });
} else if (message.includes('403')) {
setError({ type: 'PermissionError' });
} else {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a big fan of string matching on errors. It would be better if our fetchDatabricksFile threw typed errors :/

);

// Fetch the PDF when the sheet opens
useEffect(() => {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a bit confusing with the two separate useEffects that manage blob cleanup. First one revokes currentBlobUrl, second one also revokes it when the sheet closes. Might make sense to consider consolidation even if a double-revoke isn't a big problem

Comment on lines +77 to +79
for (const entry of entries) {
setContainerWidth(entry.contentRect.width);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would we actually ever have multiple entries?

Comment on lines +58 to +61
} else if (hash.startsWith(':~:text=')) {
// Handle case where hash only contains text fragment without page
textFragment = decodeURIComponent(hash.substring(':~:text='.length));
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this only triggers when hash is exactly :~:text= (empty text fragment) because the regex .+ requires at least one character, right? would this ever occur in practice? should we throw an error something instead?

const filePath = `${match[1]}/${match[2]}`;

// Keep the original URL for reference (e.g., for "Open in Catalog" link)
const originalUrl = new URL(href, window.location.origin);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't this the exact same code as line 24?

Comment on lines +125 to +129
if (!response.ok) {
const errorData = await response.json().catch(() => ({}));
const message = errorData.error || response.statusText;
throw new Error(`${response.status}: ${message}`);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could throw typed errors here

Comment on lines +33 to +53
const { path: filePath } = req.body as { path?: string };

if (!filePath || typeof filePath !== 'string') {
return res
.status(400)
.json({ error: 'File path is required in request body' });
}

// Get Databricks host - prefer cached CLI host, fall back to env
let hostUrl = getCachedCliHost();
if (!hostUrl) {
hostUrl = getHostUrl();
}

// Construct the Databricks Files API URL
// Path should start with /Volumes/ for Unity Catalog volumes
// See: https://docs.databricks.com/api/workspace/files/download
const normalizedPath = filePath.startsWith('/')
? filePath
: `/${filePath}`;
const databricksUrl = `${hostUrl}/api/2.0/fs/files${normalizedPath}`;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the comment says the path should start with volumes, wouldn't it make sense to validate it?
Something like

  if (!filePath.match(/^\/?(Volumes\/[^.][^/]*\/[^/]+\/[^/]+\/.+)$/)) {
    return res.status(400).json({ error: 'Invalid file path format' });
  }

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants