[Question]: how to customize the separator #17388
Comments
To customize the separator in the sentence splitter when building an index, you can create a `SentenceSplitter` with your desired separator and pass it to `VectorStoreIndex.from_documents` via the `transformations` argument:

```python
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core import VectorStoreIndex

# Customize the SentenceSplitter with your desired separator
text_splitter = SentenceSplitter(chunk_size=512, chunk_overlap=10, separator="your_custom_separator")

# Use the custom SentenceSplitter in the transformations list
index_to_build = VectorStoreIndex.from_documents(
    documents=documents, transformations=[text_splitter]
)
```

This code snippet demonstrates how to create a custom `SentenceSplitter` and use it when building the index.
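Alternatively, if you want the same splitter applied to every index you build, here is a minimal sketch that registers it globally via `Settings.transformations` (assuming the `Settings` object from `llama_index.core`, which later comments in this thread also use; `doc_dir` is a placeholder path), reusing the `text_splitter` defined above:

```python
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex

# Register the custom splitter globally so from_documents picks it up
# without an explicit transformations argument
Settings.transformations = [text_splitter]

doc_dir = "./data"  # placeholder path to your documents
documents = SimpleDirectoryReader(doc_dir).load_data()
index_to_build = VectorStoreIndex.from_documents(documents=documents)
```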
@dosu, I am using the token splitter and sentence splitter as below. However, they are not suitable for Chinese inputs. Could you please suggest other professional token splitters and sentence splitters specifically designed for Chinese inputs?

```python
token_splitter = TokenTextSplitter(...)
Settings.transformations = [token_splitter, sentence_splitter]
```
The built-in `TokenTextSplitter` and `SentenceSplitter` are not designed specifically for Chinese text, which does not use whitespace between words. For Chinese inputs, dedicated segmentation libraries such as Jieba are commonly used.

To integrate these libraries, you would typically preprocess your text using one of these tools to segment the text into tokens or sentences, and then use the segmented text with LlamaIndex's splitters. This approach allows you to leverage the strengths of specialized Chinese text processing tools while still using LlamaIndex for your indexing needs.
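As a concrete illustration of this preprocess-then-index approach, here is a minimal sketch (the sample texts and chunk sizes are assumptions for illustration; the `Document` class comes from `llama_index.core`):

```python
import jieba
from llama_index.core import Document, VectorStoreIndex
from llama_index.core.node_parser import TokenTextSplitter

# Placeholder Chinese texts; in practice these would come from your document loader
raw_texts = ["这是一个用于测试的中文句子。", "这里是第二个句子。"]

# Pre-segment each text with Jieba and join with spaces so that word
# boundaries become whitespace the default splitters can work with
documents = [Document(text=" ".join(jieba.cut(t))) for t in raw_texts]

# The whitespace-based TokenTextSplitter now splits on Jieba word boundaries
token_splitter = TokenTextSplitter(chunk_size=512, chunk_overlap=20)
index = VectorStoreIndex.from_documents(documents, transformations=[token_splitter])
```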
@dosu, could you please write me sample code that integrates Jieba into TokenTextSplitter?
To integrate Jieba with the `TokenTextSplitter`, you can define a custom tokenizer function that uses Jieba and pass it to the splitter:

```python
import jieba
from llama_index.core.node_parser import TokenTextSplitter

# Sample Chinese text
text = "这是一个用于测试的中文句子。"

# Define a custom tokenizer function for TokenTextSplitter
# (the splitter uses it to count tokens when sizing chunks)
def jieba_tokenizer(text):
    return list(jieba.cut(text))

# Create a TokenTextSplitter with the custom Jieba tokenizer
token_splitter = TokenTextSplitter(
    chunk_size=10,    # Adjust chunk size as needed
    chunk_overlap=2,  # Adjust chunk overlap as needed
    tokenizer=jieba_tokenizer,
)

# Pre-segment the text with Jieba and join with spaces so the splitter's
# default whitespace separator splits on word boundaries
tokens = list(jieba.cut(text))
chunks = token_splitter.split_text(" ".join(tokens))

# Output the resulting chunks
for chunk in chunks:
    print(chunk)
```

In this example, Jieba is used to tokenize the Chinese text, and a custom tokenizer function (`jieba_tokenizer`) is passed to the `TokenTextSplitter` so that chunk sizes are measured in Jieba tokens.
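If you want to build an index from these chunks rather than just print them, a possible follow-up sketch (assuming `Document` from `llama_index.core` and the inherited `get_nodes_from_documents` method on the splitter) looks like this:

```python
from llama_index.core import Document, VectorStoreIndex

# Wrap the pre-segmented text in a Document and let the splitter produce nodes
doc = Document(text=" ".join(jieba.cut("这是一个用于测试的中文句子。")))
nodes = token_splitter.get_nodes_from_documents([doc])

# Build the index directly from the resulting nodes
index = VectorStoreIndex(nodes)
```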
@dosu, please provide another example for SentenceSplitter.
To integrate Jieba with the `SentenceSplitter`, you can supply a custom sentence-chunking function via `chunking_tokenizer_fn`:

```python
import jieba  # Jieba is primarily for word segmentation; it is not used directly in this simple example
from llama_index.core.node_parser import SentenceSplitter

# Sample Chinese text
text = "这是一个用于测试的中文句子。这里是第二个句子。"

# Define a custom sentence tokenizer function
# Note: Jieba is primarily for word segmentation, so you might need to define
# your own sentence segmentation logic; this simple example splits on the Chinese period
def jieba_sentence_tokenizer(text):
    return [s for s in text.split("。") if s]  # Adjust this logic as needed for more complex sentence splitting

# Create a SentenceSplitter with the custom sentence tokenizer
sentence_splitter = SentenceSplitter(
    chunk_size=10,    # Adjust chunk size as needed
    chunk_overlap=2,  # Adjust chunk overlap as needed
    chunking_tokenizer_fn=jieba_sentence_tokenizer,
)

# Split the text using the SentenceSplitter
chunks = sentence_splitter.split_text(text)

# Output the resulting chunks
for chunk in chunks:
    print(chunk)
```

In this example, a simple sentence segmentation rule is used, splitting the text on the Chinese period (。), and that function is passed to the `SentenceSplitter` via `chunking_tokenizer_fn`.
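To plug this back into the index-building code from the original question below, one possible wiring (a sketch reusing `SimpleDirectoryReader` and `VectorStoreIndex.from_documents` as in the question; `doc_dir` is a placeholder path) is:

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

doc_dir = "./data"  # placeholder path to your Chinese documents
documents = SimpleDirectoryReader(doc_dir).load_data()

# Pass the Jieba-aware sentence splitter as the transformation for this index
index_to_build = VectorStoreIndex.from_documents(
    documents=documents, transformations=[sentence_splitter]
)
```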
Question
I am using the following method to build an index. However, the sentence splitter is not suitable for my input documents, and I would like some way to specify the separators. How can I achieve this? Below is my code:
```python
documents = SimpleDirectoryReader(doc_dir).load_data()
index_to_build = VectorStoreIndex.from_documents(documents=documents)
```