How to look into the processed data? #266
Comments
The following works for me:

import numpy as np
from datatrove.pipeline.tokens.merger import load_doc_ends, get_data_reader

def read_tokenized_data(data_file):
    with open(f"{data_file}.index", 'rb') as f:
        doc_ends = load_doc_ends(f)
    reader = get_data_reader(open(data_file, 'rb'), doc_ends, nb_bytes=2)
    decode = lambda x: np.frombuffer(x, dtype=np.uint16).astype(int)
    return map(decode, reader)

from transformers import GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

data_file = 'test/000_test.ds'
for i, input_ids in enumerate(read_tokenized_data(data_file)):
    if i == 5:
        break
    print(len(input_ids))
    print(tokenizer.decode(input_ids))
    print('\n-------------------\n')
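
For anyone curious about what the reader is doing, here is a rough sketch that reads the same files with NumPy only. It rests on assumptions that this thread does not confirm: that the `.ds` file is a flat array of little-endian 2-byte token ids (matching `nb_bytes=2` above) and that the `.index` file is a flat array of unsigned 64-bit cumulative document end positions, counted in tokens. The function name `iter_documents` is hypothetical and not part of datatrove.

```python
import numpy as np


def iter_documents(data_file, token_dtype="<u2"):
    """Yield one array of token ids per document.

    Assumed layout (an assumption, not confirmed by the thread):
      - `data_file` is a flat array of little-endian token ids
      - `data_file + ".index"` holds uint64 cumulative document
        end positions, counted in tokens
    """
    # Document boundaries: one cumulative end position per document.
    doc_ends = np.fromfile(f"{data_file}.index", dtype="<u8").astype(np.int64)
    # Memory-map the tokens so large files are not loaded all at once.
    tokens = np.memmap(data_file, dtype=token_dtype, mode="r")
    start = 0
    for end in doc_ends:
        # Copy the slice out of the memmap as plain Python-sized ints.
        yield np.asarray(tokens[start:end]).astype(np.int64)
        start = end
```

Each yielded array can be passed to `tokenizer.decode` in the same way as `input_ids` above. If the decoded text looks like noise, the layout assumptions probably do not hold for your files, and the datatrove helpers are the safer route.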
Alternatively, you could use:

from datatrove.utils.dataset import DatatroveFileDataset

path = 'test/test_tokenized_00000_00000_shuffled.ds'
dataset = DatatroveFileDataset(
    file_path=path,
    seq_len=2048,
    token_size=2,
)

from transformers import GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

for batch in dataset:
    input_ids = batch['input_ids'].numpy()
    print(tokenizer.decode(input_ids))
    break
Thank you so much! I will give it a try.
Hi,
After running tokenize_from_hf_to_s3.py, I would like to inspect the resulting data, but the output is a binary file (.ds). Is there a way to look into the data? Thanks!