Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ospp/new llm chunk #19726

Open
wants to merge 16 commits into
base: main
Choose a base branch
from
Open

Conversation

charleschile
Copy link
Contributor

@charleschile charleschile commented Oct 31, 2024

User description

What type of PR is this?

  • API-change
  • BUG
  • Improvement
  • Documentation
  • Feature
  • Test and CI
  • Code Refactoring

Which issue(s) this PR fixes:

issue #18664

What this PR does / why we need it:

As part of our document LLM support, we are introducing the LLM_CHUNK function. This function can chunk the content in datalink with 4 chunk strategy available.

Usage: select llm_chunk("<input datalink>", "fixed_width; <width number>"); or select llm_chunk("<input datalink>", "<sentence or paragraph or document>");

Return Value: a JSON-like string representation of an array of chunks with offset and size: [[offset0, size0, "chunk"], [offset1, size1, "chunk"],...]

Example SQL for fixed with:

 select llm_chunk(cast('file:///Users/charles/Desktop/codes/testData/example.txt' as datalink), "fixed_width; 11");

Example return:

[[0, 11, "hello world"], [11, 11, " this is a "], [22, 11, "test? hello"], [33, 11, " world! thi"], [44, 11, "s is a test"], [55, 11, ". hello wor"], [66, 3, "ld."]] 

Example SQL for sentence:

 select llm_chunk(cast('file:///Users/charles/Desktop/codes/testData/example.txt' as datalink), "sentence");

Example return:

[[0, 27, "hello world this is a test?"], [27, 13, " hello world!"], [40, 16, " this is a test."], [56, 13, " hello world."]]

PR Type

Enhancement, Tests


Description

  • Implemented the LLM_CHUNK function to support chunking of text data using various strategies: fixed width, sentence, paragraph, and document.
  • Added comprehensive unit tests for the ChunkString function, covering different chunking strategies and error scenarios.
  • Registered the LLM_CHUNK function in the function ID registry and added it to the list of supported built-in functions.
  • Updated project dependencies to the latest versions.
  • Added SQL test cases and expected results for the LLM_CHUNK function.
  • Included test resource files for different chunking scenarios, including handling of Chinese characters.

Changes walkthrough 📝

Relevant files
Enhancement
3 files
func_llm.go
Implement LLM_CHUNK function with multiple chunking strategies

pkg/sql/plan/function/func_llm.go

  • Implemented LLM_CHUNK function for chunking text.
  • Added support for fixed width, sentence, paragraph, and document
    chunking strategies.
  • Introduced error handling for invalid chunk strategies.
  • +174/-0 
    function_id.go
    Register LLM_CHUNK function in function ID registry           

    pkg/sql/plan/function/function_id.go

    • Registered LLM_CHUNK function in function ID registry.
    +3/-0     
    list_builtIn.go
    Add LLM_CHUNK to supported built-in functions                       

    pkg/sql/plan/function/list_builtIn.go

  • Added LLM_CHUNK to supported built-in functions.
  • Defined overloads and execution logic for LLM_CHUNK.
  • +21/-0   
    Tests
    7 files
    func_llm_test.go
    Add unit tests for LLM_CHUNK function                                       

    pkg/sql/plan/function/func_llm_test.go

  • Added unit tests for ChunkString function.
  • Covered various chunking strategies and error cases.
  • +114/-0 
    func_llm_chunk.result
    Add expected results for LLM_CHUNK function tests               

    test/distributed/cases/function/func_llm_chunk.result

    • Added expected results for LLM_CHUNK function tests.
    +30/-0   
    func_llm_chunk.sql
    Add SQL test cases for LLM_CHUNK function                               

    test/distributed/cases/function/func_llm_chunk.sql

    • Added SQL test cases for LLM_CHUNK function.
    +24/-0   
    1.txt
    Add test resource file for chunking tests                               

    test/distributed/resources/llm_test/chunk/1.txt

    • Added test resource file for chunking tests.
    +1/-0     
    2.txt
    Add test resource file with Chinese characters                     

    test/distributed/resources/llm_test/chunk/2.txt

    • Added test resource file with Chinese characters.
    +1/-0     
    3.txt
    Add test resource file for paragraph chunking                       

    test/distributed/resources/llm_test/chunk/3.txt

    • Added test resource file for paragraph chunking.
    +3/-0     
    4.txt
    Add test resource file for sentence chunking                         

    test/distributed/resources/llm_test/chunk/4.txt

    • Added test resource file for sentence chunking.
    +3/-0     
    Dependencies
    1 files
    go.sum
    Update project dependencies                                                           

    go.sum

    • Updated dependencies for the project.
    +4/-4     

    💡 PR-Agent usage: Comment /help "your question" on any pull request to receive relevant information

    Copy link
    Contributor

    mergify bot commented Oct 31, 2024

    ⚠️ The sha of the head commit of this PR conflicts with #18471. Mergify cannot evaluate rules on this PR. ⚠️

    Copy link

    PR-Agent was enabled for this repository. To continue using it, please link your git user with your CodiumAI identity here.

    PR Reviewer Guide 🔍

    Here are some key observations to aid the review process:

    ⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
    🧪 PR contains tests
    🔒 Security concerns

    Sensitive information exposure:
    The LLMChunk function reads files based on user input (datalink). This could potentially lead to unauthorized access to sensitive files if proper access controls are not in place. Ensure that the file access is properly restricted and validated to prevent potential security vulnerabilities.

    ⚡ Recommended focus areas for review

    Performance Concern
    The implementation of fixedWidthChunk function uses string-to-rune conversion and slice operations, which may be inefficient for large inputs. Consider using a more efficient string manipulation method.

    Error Handling
    The LLMChunk function does not handle potential errors from the ChunkString function. Consider adding proper error handling and propagation.

    Resource Management
    The function opens a file using ReadFromFile but defers closing it inside a loop. This might lead to resource leaks if many files are processed. Consider moving the defer statement outside the loop or using a separate function for file handling.

    Copy link

    qodo-merge-pro bot commented Oct 31, 2024

    PR-Agent was enabled for this repository. To continue using it, please link your git user with your CodiumAI identity here.

    PR Code Suggestions ✨

    Explore these optional code suggestions:

    CategorySuggestion                                                                                                                                    Score
    Possible issue
    Add input validation for the chunk width to prevent potential issues with invalid sizes

    Consider adding input validation for the width parameter in the fixedWidthChunk
    function to ensure it's greater than zero, preventing potential issues with invalid
    chunk sizes.

    pkg/sql/plan/function/func_llm.go [37-48]

    -func fixedWidthChunk(text string, width int) []Chunk {
    +func fixedWidthChunk(text string, width int) ([]Chunk, error) {
    +    if width <= 0 {
    +        return nil, fmt.Errorf("invalid width: %d, must be greater than zero", width)
    +    }
         var chunks []Chunk
         runes := []rune(text)
         for i := 0; i < len(runes); i += width {
             end := i + width
             if end > len(runes) {
                 end = len(runes)
             }
             chunks = append(chunks, Chunk{Start: i, Length: end - i, Text: string(runes[i:end])})
         }
    -    return chunks
    +    return chunks, nil
     }
    • Apply this suggestion
    Suggestion importance[1-10]: 8

    Why: Adding input validation for the width parameter in fixedWidthChunk is a crucial improvement to prevent invalid chunk sizes, which could lead to runtime errors. The suggestion is well-justified and the improved code correctly implements the validation.

    8
    Performance
    Use a more efficient string concatenation method for better performance when building the result string

    Consider using a more efficient string concatenation method, such as
    strings.Builder, instead of repeatedly using the + operator for building the result
    string in the ChunkString function.

    pkg/sql/plan/function/func_llm.go [114-119]

    -var chunkStrings []string
    -for _, chunk := range chunks {
    -    chunkStrings = append(chunkStrings, fmt.Sprintf("[%d, %d, %q]", chunk.Start, chunk.Length, chunk.Text))
    +var builder strings.Builder
    +builder.WriteString("[")
    +for i, chunk := range chunks {
    +    if i > 0 {
    +        builder.WriteString(", ")
    +    }
    +    fmt.Fprintf(&builder, "[%d, %d, %q]", chunk.Start, chunk.Length, chunk.Text)
     }
    +builder.WriteString("]")
    +return builder.String(), nil
     
    -return "[" + strings.Join(chunkStrings, ", ") + "]", nil
    -
    • Apply this suggestion
    Suggestion importance[1-10]: 7

    Why: The suggestion to use strings.Builder instead of repeatedly using the + operator for string concatenation is valid and can improve performance, especially with large data sets. The improved code accurately reflects the suggested change.

    7
    Use a more memory-efficient method to read file content, especially for large files

    Consider using a more efficient method to read the file content, such as
    bufio.Scanner or bufio.Reader, instead of io.ReadAll which reads the entire file
    into memory at once.

    pkg/sql/plan/function/func_llm.go [153-161]

    -ctx, err := io.ReadAll(r)
    -if err != nil {
    +var builder strings.Builder
    +scanner := bufio.NewScanner(r)
    +for scanner.Scan() {
    +    builder.WriteString(scanner.Text())
    +    builder.WriteString("\n")
    +}
    +if err := scanner.Err(); err != nil {
         return err
     }
     
    -input = string(ctx)
    +input = builder.String()
     if len(input) == 0 {
         return moerr.NewInvalidInputNoCtxf("Empty file is not valid")
     }
    • Apply this suggestion
    Suggestion importance[1-10]: 6

    Why: The suggestion to use bufio.Scanner for reading file content is a reasonable optimization for memory efficiency. However, the current implementation with io.ReadAll may suffice unless dealing with very large files. The improved code correctly implements the suggested change.

    6

    💡 Need additional feedback ? start a PR chat

    Copy link
    Contributor

    mergify bot commented Oct 31, 2024

    ⚠️ The sha of the head commit of this PR conflicts with #18471. Mergify cannot evaluate rules on this PR. ⚠️

    @mergify mergify bot added the kind/feature label Oct 31, 2024
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    kind/feature Review effort [1-5]: 4 size/L Denotes a PR that changes [500,999] lines
    Projects
    None yet
    Development

    Successfully merging this pull request may close these issues.

    3 participants