Ospp/new llm chunk #19726
base: main
As part of our document LLM support, we are introducing the `LLM_EXTRACT_TEXT` function. This function extracts text from PDF files and writes the extracted text to a specified text file. The extractor type can be specified with the third argument.
User description
What type of PR is this?
Which issue(s) this PR fixes:
issue #18664
What this PR does / why we need it:
As part of our document LLM support, we are introducing the `LLM_CHUNK` function. This function chunks the content of a datalink using one of four available chunking strategies.
Usage:
select llm_chunk("<input datalink>", "fixed_width; <width number>");
or
select llm_chunk("<input datalink>", "<sentence or paragraph or document>");
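The two usage forms suggest that the second argument is either a bare strategy name or `fixed_width` followed by a semicolon and a width. A minimal sketch of how such an argument could be parsed is shown below. The `chunkStrategy` type and `parseStrategy` function are hypothetical names used for illustration only and are not taken from this PR; the actual validation rules in `func_llm.go` may differ.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// chunkStrategy is a hypothetical parsed form of the second llm_chunk argument.
type chunkStrategy struct {
	kind  string // "fixed_width", "sentence", "paragraph", or "document"
	width int    // only meaningful when kind is "fixed_width"
}

// parseStrategy accepts "fixed_width; <n>" or a bare strategy name.
// Assumption: whitespace around the semicolon is ignored and the width must be positive.
func parseStrategy(arg string) (chunkStrategy, error) {
	parts := strings.SplitN(arg, ";", 2)
	kind := strings.TrimSpace(parts[0])
	switch kind {
	case "fixed_width":
		if len(parts) != 2 {
			return chunkStrategy{}, fmt.Errorf("fixed_width requires a width: %q", arg)
		}
		w, err := strconv.Atoi(strings.TrimSpace(parts[1]))
		if err != nil || w <= 0 {
			return chunkStrategy{}, fmt.Errorf("invalid width in %q", arg)
		}
		return chunkStrategy{kind: kind, width: w}, nil
	case "sentence", "paragraph", "document":
		if len(parts) != 1 {
			return chunkStrategy{}, fmt.Errorf("%s takes no parameter: %q", kind, arg)
		}
		return chunkStrategy{kind: kind}, nil
	default:
		return chunkStrategy{}, fmt.Errorf("unknown chunk strategy: %q", arg)
	}
}

func main() {
	for _, arg := range []string{"fixed_width; 11", "sentence", "fixed_width"} {
		s, err := parseStrategy(arg)
		fmt.Println(s, err)
	}
}
```

Rejecting unknown names and missing widths up front keeps the chunking code itself free of argument checks.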
Return Value: a JSON-like string representation of an array of chunks with offset and size:
[[offset0, size0, "chunk"], [offset1, size1, "chunk"],...]
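To make the return shape concrete, here is a rough sketch of fixed-width chunking that produces `[offset, size, "chunk"]` triples. It is not the implementation in this PR: `chunk`, `fixedWidthChunks`, and `formatChunks` are hypothetical names, and the sketch assumes that offset and size are counted in runes rather than bytes, which the description does not state.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// chunk is a hypothetical record for one [offset, size, "chunk"] triple.
type chunk struct {
	offset int
	size   int
	text   string
}

// fixedWidthChunks splits s into pieces of at most width runes.
// Assumption: offset and size are counted in runes, not bytes.
func fixedWidthChunks(s string, width int) []chunk {
	if width <= 0 {
		return nil
	}
	runes := []rune(s)
	var out []chunk
	for start := 0; start < len(runes); start += width {
		end := start + width
		if end > len(runes) {
			end = len(runes) // the last chunk may be shorter than width
		}
		out = append(out, chunk{offset: start, size: end - start, text: string(runes[start:end])})
	}
	return out
}

// formatChunks renders chunks in the [[offset, size, "chunk"], ...] shape.
func formatChunks(chunks []chunk) string {
	parts := make([]string, 0, len(chunks))
	for _, c := range chunks {
		parts = append(parts, fmt.Sprintf("[%d, %d, %s]", c.offset, c.size, strconv.Quote(c.text)))
	}
	return "[" + strings.Join(parts, ", ") + "]"
}

func main() {
	// prints [[0, 4, "hell"], [4, 4, "o wo"], [8, 3, "rld"]]
	fmt.Println(formatChunks(fixedWidthChunks("hello world", 4)))
}
```

Working on runes keeps multi-byte characters, such as those in the Chinese test resource, from being split in the middle; a byte-based implementation would report different offsets for such input.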
Example SQL for fixed width:
Example return:
Example SQL for sentence:
Example return:
PR Type
Enhancement, Tests
Description
Added the `LLM_CHUNK` function to support chunking of text data using various strategies: fixed width, sentence, paragraph, and document.
Added unit tests for the `ChunkString` function, covering different chunking strategies and error scenarios.
Registered the `LLM_CHUNK` function in the function ID registry and added it to the list of supported built-in functions.
Added SQL test cases and expected results for the `LLM_CHUNK` function.
Changes walkthrough 📝
3 files
func_llm.go (pkg/sql/plan/function/func_llm.go): Implement LLM_CHUNK function with multiple chunking strategies
function_id.go (pkg/sql/plan/function/function_id.go): Register LLM_CHUNK function in function ID registry
list_builtIn.go (pkg/sql/plan/function/list_builtIn.go): Add LLM_CHUNK to supported built-in functions
7 files
func_llm_test.go (pkg/sql/plan/function/func_llm_test.go): Add unit tests for LLM_CHUNK function
func_llm_chunk.result (test/distributed/cases/function/func_llm_chunk.result): Add expected results for LLM_CHUNK function tests
func_llm_chunk.sql (test/distributed/cases/function/func_llm_chunk.sql): Add SQL test cases for LLM_CHUNK function
1.txt (test/distributed/resources/llm_test/chunk/1.txt): Add test resource file for chunking tests
2.txt (test/distributed/resources/llm_test/chunk/2.txt): Add test resource file with Chinese characters
3.txt (test/distributed/resources/llm_test/chunk/3.txt): Add test resource file for paragraph chunking
4.txt (test/distributed/resources/llm_test/chunk/4.txt): Add test resource file for sentence chunking
1 file
go.sum (go.sum): Update project dependencies