
[Question]: how to customize the separator #17388

Closed
1 task done
justinzyw opened this issue Dec 29, 2024 · 7 comments
Labels
question Further information is requested

Comments

@justinzyw

Question Validation

  • I have searched both the documentation and discord for an answer.

Question

I am using the following method to build an index. However, the sentence splitter is not suitable for my input documents, and I would like to somehow specify the separators. How can I achieve this? Below is my code:

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Load documents and build the index with the default transformations
documents = SimpleDirectoryReader(doc_dir).load_data()
index_to_build = VectorStoreIndex.from_documents(documents=documents)

justinzyw added the question (Further information is requested) label on Dec 29, 2024

dosubot bot commented Dec 29, 2024

To customize the separator in the sentence splitter when building an index using the VectorStoreIndex.from_documents method, you can pass a custom SentenceSplitter to the transformations list. Here's how you can do it:

from llama_index.core.node_parser import SentenceSplitter
from llama_index.core import VectorStoreIndex

# Customize the SentenceSplitter with your desired separator
text_splitter = SentenceSplitter(chunk_size=512, chunk_overlap=10, separator="your_custom_separator")

# Use the custom SentenceSplitter in the transformations list
index_to_build = VectorStoreIndex.from_documents(
    documents=documents, transformations=[text_splitter]
)

This code snippet demonstrates how to create a custom SentenceSplitter with a specified separator and apply it when building the index [1].
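
If you want the custom splitter applied to every index without passing it each time, a minimal sketch (reusing the text_splitter and documents defined above, and assuming from_documents falls back to the global settings when no transformations are passed) is to register it on Settings, the same mechanism used later in this thread:

from llama_index.core import Settings, VectorStoreIndex

# Register the custom splitter globally so subsequent index builds pick it up
Settings.transformations = [text_splitter]

index_to_build = VectorStoreIndex.from_documents(documents=documents)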

To continue talking to Dosu, mention @dosu.



@justinzyw
Author

@dosu, I am using the token splitter and sentence splitter as below. However, they are not suitable for Chinese inputs. Could you please suggest other professional token splitters and sentence splitters specifically designed for Chinese inputs?

from llama_index.core import Settings
from llama_index.core.node_parser import SentenceSplitter, TokenTextSplitter

token_splitter = TokenTextSplitter(
    separator=" ",
    chunk_size=chunk_size_limit,
    chunk_overlap=chunk_overlap
)

sentence_splitter = SentenceSplitter(
    separator="",
    chunk_size=chunk_size_limit,
    chunk_overlap=chunk_overlap
)

Settings.transformations = [token_splitter, sentence_splitter]

justinzyw reopened this on Dec 30, 2024

dosubot bot commented Dec 30, 2024

The TokenTextSplitter and SentenceSplitter in LlamaIndex can handle Chinese text, but they might not be optimized for it out of the box. For better handling of Chinese text, you might consider using third-party libraries that are specifically designed for Chinese text processing. Here are a few suggestions:

  1. Jieba: A popular Chinese text segmentation library that can be used for tokenization. You can integrate it with your current setup by using Jieba to preprocess your text before passing it to the TokenTextSplitter.

  2. THULAC: Developed by Tsinghua University, this library provides Chinese word segmentation and part-of-speech tagging.

  3. HanLP: A comprehensive library for Chinese natural language processing that includes tokenization, part-of-speech tagging, and more.

To integrate these libraries, you would typically preprocess your text using one of these tools to segment the text into tokens or sentences, and then use the segmented text with LlamaIndex's splitters. This approach allows you to leverage the strengths of specialized Chinese text processing tools while still using LlamaIndex for your indexing needs.

To continue talking to Dosu, mention @dosu.



@justinzyw
Author

@dosu, could you please write me sample code that integrates Jieba into TokenTextSplitter?


dosubot bot commented Dec 30, 2024

To integrate Jieba with the TokenTextSplitter in LlamaIndex, you can use Jieba to tokenize your Chinese text before passing it to the splitter. Here's a sample code demonstrating how to do this:

import jieba
from llama_index.core.node_parser import TokenTextSplitter

# Sample Chinese text
text = "这是一个用于测试的中文句子。"

# Use Jieba to tokenize the text
tokens = list(jieba.cut(text))

# Define a custom tokenizer function for TokenTextSplitter
def jieba_tokenizer(text):
    return list(jieba.cut(text))

# Create a TokenTextSplitter with the custom Jieba tokenizer
token_splitter = TokenTextSplitter(
    chunk_size=10,  # Adjust chunk size as needed
    chunk_overlap=2,  # Adjust chunk overlap as needed
    tokenizer=jieba_tokenizer
)

# Split the text using the TokenTextSplitter
chunks = token_splitter.split_text(" ".join(tokens))

# Output the resulting chunks
for chunk in chunks:
    print(chunk)

In this example, Jieba is used to tokenize the Chinese text, and a custom tokenizer function (jieba_tokenizer) is defined to integrate with the TokenTextSplitter. The text is first tokenized using Jieba, and then the tokens are joined with spaces to form a string that is passed to the TokenTextSplitter. Adjust the chunk_size and chunk_overlap parameters as needed for your specific use case.
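
To connect this back to the original indexing question, here is a minimal sketch (reusing doc_dir from the first post and the Jieba-backed token_splitter defined above) that applies the splitter when building the index via the transformations argument:

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Build the index using the Jieba-backed TokenTextSplitter defined above
documents = SimpleDirectoryReader(doc_dir).load_data()
index_to_build = VectorStoreIndex.from_documents(
    documents=documents, transformations=[token_splitter]
)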

To continue talking to Dosu, mention @dosu.



@justinzyw
Author

@dosu, please provide another example, this time for SentenceSplitter


dosubot bot commented Dec 30, 2024

Jieba itself performs word segmentation rather than sentence segmentation, so for the SentenceSplitter in LlamaIndex the key piece is a custom sentence-boundary function passed via the chunking_tokenizer_fn argument. Here's sample code demonstrating how to do this:

from llama_index.core.node_parser import SentenceSplitter

# Sample Chinese text
text = "这是一个用于测试的中文句子。这里是第二个句子。"

# Define a custom sentence-boundary function
# (this simple version splits on the Chinese period and re-attaches it;
# adjust the logic as needed for more complex sentence splitting)
def chinese_sentence_tokenizer(text):
    return [s + "。" for s in text.split("。") if s]

# Create a SentenceSplitter with the custom sentence tokenizer
sentence_splitter = SentenceSplitter(
    chunk_size=10,  # Adjust chunk size as needed
    chunk_overlap=2,  # Adjust chunk overlap as needed
    chunking_tokenizer_fn=chinese_sentence_tokenizer
)

# Split the text using the SentenceSplitter
chunks = sentence_splitter.split_text(text)

# Output the resulting chunks
for chunk in chunks:
    print(chunk)

In this example, a simple sentence segmentation rule is used: the text is split on the Chinese period (。). You can adjust the chinese_sentence_tokenizer function to implement more sophisticated sentence segmentation if needed. The SentenceSplitter then uses this custom tokenizer to process the text. Adjust the chunk_size and chunk_overlap parameters according to your requirements.
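
If splitting only on the period is too coarse, a slightly more robust version of the chinese_sentence_tokenizer above (a sketch using only the standard-library re module, handling 。, ！ and ？) could look like this:

import re

# Split on Chinese sentence-ending punctuation, keeping each mark attached
# to its sentence, and drop empty or whitespace-only fragments.
def chinese_sentence_tokenizer(text):
    sentences = re.findall(r"[^。！？]+[。！？]?", text)
    return [s for s in sentences if s.strip()]

It can be passed to SentenceSplitter through the same chunking_tokenizer_fn argument as above.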

To continue talking to Dosu, mention @dosu.


