
[Question]: how to customize the separator #17388

Closed
1 task done
justinzyw opened this issue Dec 29, 2024 · 7 comments
Labels
question Further information is requested

Comments

@justinzyw

Question Validation

  • I have searched both the documentation and discord for an answer.

Question

I am using the following method to build an index. However, the sentence splitter is not suitable for my input documents, and I would like to somehow specify the separators. How can I achieve this? Below is my code:

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Load documents and build the index with the default transformations
documents = SimpleDirectoryReader(doc_dir).load_data()
index_to_build = VectorStoreIndex.from_documents(documents=documents)

justinzyw added the question (Further information is requested) label on Dec 29, 2024

dosubot bot commented Dec 29, 2024

To customize the separator in the sentence splitter when building an index using the VectorStoreIndex.from_documents method, you can pass a custom SentenceSplitter to the transformations list. Here's how you can do it:

from llama_index.core.node_parser import SentenceSplitter
from llama_index.core import VectorStoreIndex

# Customize the SentenceSplitter with your desired separator
text_splitter = SentenceSplitter(chunk_size=512, chunk_overlap=10, separator="your_custom_separator")

# Use the custom SentenceSplitter in the transformations list
index_to_build = VectorStoreIndex.from_documents(
    documents=documents, transformations=[text_splitter]
)

This code snippet demonstrates how to create a custom SentenceSplitter with a specified separator and apply it when building the index [1].
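
If you want the custom splitter applied to every index without passing it each time, a minimal sketch (reusing the text_splitter and documents defined above, and assuming from_documents falls back to the global settings when no transformations are passed) is to register it on Settings, the same mechanism used later in this thread:

from llama_index.core import Settings, VectorStoreIndex

# Register the custom splitter globally so subsequent index builds pick it up
Settings.transformations = [text_splitter]

index_to_build = VectorStoreIndex.from_documents(documents=documents)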

To continue talking to Dosu, mention @dosu.



@justinzyw
Author

@dosu, I am using the token splitter and sentence splitter as below. However, they are not suitable for Chinese inputs. Could you please suggest other professional token splitters and sentence splitters specifically designed for Chinese inputs?

from llama_index.core import Settings
from llama_index.core.node_parser import SentenceSplitter, TokenTextSplitter

token_splitter = TokenTextSplitter(
    separator=" ",
    chunk_size=chunk_size_limit,
    chunk_overlap=chunk_overlap
)

sentence_splitter = SentenceSplitter(
    separator="",
    chunk_size=chunk_size_limit,
    chunk_overlap=chunk_overlap
)

Settings.transformations = [token_splitter, sentence_splitter]

justinzyw reopened this on Dec 30, 2024

dosubot bot commented Dec 30, 2024

The TokenTextSplitter and SentenceSplitter in LlamaIndex can handle Chinese text, but they might not be optimized for it out of the box. For better handling of Chinese text, you might consider using third-party libraries that are specifically designed for Chinese text processing. Here are a few suggestions:

  1. Jieba: A popular Chinese text segmentation library that can be used for tokenization. You can integrate it with your current setup by using Jieba to preprocess your text before passing it to the TokenTextSplitter.

  2. THULAC: Developed by Tsinghua University, this library provides Chinese word segmentation and part-of-speech tagging.

  3. HanLP: A comprehensive library for Chinese natural language processing that includes tokenization, part-of-speech tagging, and more.

To integrate these libraries, you would typically preprocess your text using one of these tools to segment the text into tokens or sentences, and then use the segmented text with LlamaIndex's splitters. This approach allows you to leverage the strengths of specialized Chinese text processing tools while still using LlamaIndex for your indexing needs.

To continue talking to Dosu, mention @dosu.



@justinzyw
Author

@dosu, could you please write me sample code that integrates Jieba into TokenTextSplitter?


dosubot bot commented Dec 30, 2024

To integrate Jieba with the TokenTextSplitter in LlamaIndex, you can use Jieba to tokenize your Chinese text before passing it to the splitter. Here's a sample code demonstrating how to do this:

import jieba
from llama_index.core.node_parser import TokenTextSplitter

# Sample Chinese text
text = "这是一个用于测试的中文句子。"

# Use Jieba to tokenize the text
tokens = list(jieba.cut(text))

# Define a custom tokenizer function for TokenTextSplitter
def jieba_tokenizer(text):
    return list(jieba.cut(text))

# Create a TokenTextSplitter with the custom Jieba tokenizer
token_splitter = TokenTextSplitter(
    chunk_size=10,  # Adjust chunk size as needed
    chunk_overlap=2,  # Adjust chunk overlap as needed
    tokenizer=jieba_tokenizer
)

# Split the text using the TokenTextSplitter
chunks = token_splitter.split_text(" ".join(tokens))

# Output the resulting chunks
for chunk in chunks:
    print(chunk)

In this example, Jieba is used to tokenize the Chinese text, and a custom tokenizer function (jieba_tokenizer) is defined to integrate with the TokenTextSplitter. The text is first tokenized using Jieba, and then the tokens are joined with spaces to form a string that is passed to the TokenTextSplitter. Adjust the chunk_size and chunk_overlap parameters as needed for your specific use case.
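
To connect this back to the original indexing question, here is a minimal sketch (reusing doc_dir from the first post and the Jieba-backed token_splitter defined above) that applies the splitter when building the index via the transformations argument:

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Build the index using the Jieba-backed TokenTextSplitter defined above
documents = SimpleDirectoryReader(doc_dir).load_data()
index_to_build = VectorStoreIndex.from_documents(
    documents=documents, transformations=[token_splitter]
)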

To continue talking to Dosu, mention @dosu.



@justinzyw
Author

@dosu, please provide another example, this time for SentenceSplitter


dosubot bot commented Dec 30, 2024

Jieba itself performs word segmentation rather than sentence segmentation, so for the SentenceSplitter in LlamaIndex the key piece is a custom sentence-boundary function passed via the chunking_tokenizer_fn argument. Here's sample code demonstrating how to do this:

from llama_index.core.node_parser import SentenceSplitter

# Sample Chinese text
text = "这是一个用于测试的中文句子。这里是第二个句子。"

# Define a custom sentence-boundary function
# (this simple version splits on the Chinese period and re-attaches it;
# adjust the logic as needed for more complex sentence splitting)
def chinese_sentence_tokenizer(text):
    return [s + "。" for s in text.split("。") if s]

# Create a SentenceSplitter with the custom sentence tokenizer
sentence_splitter = SentenceSplitter(
    chunk_size=10,  # Adjust chunk size as needed
    chunk_overlap=2,  # Adjust chunk overlap as needed
    chunking_tokenizer_fn=chinese_sentence_tokenizer
)

# Split the text using the SentenceSplitter
chunks = sentence_splitter.split_text(text)

# Output the resulting chunks
for chunk in chunks:
    print(chunk)

In this example, a simple sentence segmentation rule is used: the text is split on the Chinese period (。). You can adjust the chinese_sentence_tokenizer function to implement more sophisticated sentence segmentation if needed. The SentenceSplitter then uses this custom tokenizer to process the text. Adjust the chunk_size and chunk_overlap parameters according to your requirements.
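
If splitting only on the period is too coarse, a slightly more robust version of the chinese_sentence_tokenizer above (a sketch using only the standard-library re module, handling 。, ！ and ？) could look like this:

import re

# Split on Chinese sentence-ending punctuation, keeping each mark attached
# to its sentence, and drop empty or whitespace-only fragments.
def chinese_sentence_tokenizer(text):
    sentences = re.findall(r"[^。！？]+[。！？]?", text)
    return [s for s in sentences if s.strip()]

It can be passed to SentenceSplitter through the same chunking_tokenizer_fn argument as above.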

To continue talking to Dosu, mention @dosu.


