How can we use JSONL data instead of the .Parquet format in the s2s task, and how do we resume from a checkpoint? #197

Open
Lalaramarya opened this issue Feb 4, 2025 · 1 comment

Comments

@Lalaramarya

Thank you for sharing the s2s repo.
I am facing two issues:
1: How can I use my own data in .JSONL format instead of VoiceAssistant-400K in .Parquet format?
I have prepared the data in .jsonl format, following the demo:

{"key": 1, "source_wav": "/DATA/Adarsh/DAS_FR/covost_split/test/common_voice_fr_17299399.mp3.wav", "source_text": "Un vrai travail intéressant va, enfin, être mené sur ce sujet.", "target_wav": "/DATA/Adarsh/DAS_FR/datafolder/cvss_c/fr-en/test/common_voice_fr_17299399.mp3.wav", "target_text": "really interesting work will finally be undertaken on that topic"}
{"key": 2, "source_wav": "/DATA/Adarsh/DAS_FR/covost_split/test/common_voice_fr_17299400.mp3.wav", "source_text": "Une réforme profonde est nécessaire.", "target_wav": "/DATA/Adarsh/DAS_FR/datafolder/cvss_c/fr-en/test/common_voice_fr_17299400.mp3.wav", "target_text": "a profound reform is necessary"}
{"key": 3, "source_wav": "/DATA/Adarsh/DAS_FR/covost_split/test/common_voice_fr_17299401.mp3.wav", "source_text": "Pas si nombreuses que ça", "target_wav": "/DATA/Adarsh/DAS_FR/datafolder/cvss_c/fr-en/test/common_voice_fr_17299401.mp3.wav", "target_text": "not that many"}
{"key": 4, "source_wav": "/DATA/Adarsh/DAS_FR/covost_split/test/common_voice_fr_17300796.mp3.wav", "source_text": "Un comité interministériel du handicap s’est tenu il y a quelques semaines.", "target_wav": "/DATA/Adarsh/DAS_FR/datafolder/cvss_c/fr-en/test/common_voice_fr_17300796.mp3.wav", "target_text": "an inter ministerial committee on disability was held a few weeks back"}

but this throws an error:
File "/DATA/anaconda3/envs/Lalaram_SLAM-Omni/lib/python3.10/site-packages/datasets/builder.py", line 1005, in download_and_prepare
self._download_and_prepare(
File "/DATA/anaconda3/envs/Lalaram_SLAM-Omni/lib/python3.10/site-packages/datasets/builder.py", line 1100, in _download_and_prepare
self._prepare_split(split_generator, **prepare_split_kwargs)
File "/DATA/anaconda3/envs/Lalaram_SLAM-Omni/lib/python3.10/site-packages/datasets/builder.py", line 1860, in _prepare_split
for job_id, done, content in self._prepare_split_single(
File "/DATA/anaconda3/envs/Lalaram_SLAM-Omni/lib/python3.10/site-packages/datasets/builder.py", line 1991, in _prepare_split_single
raise DatasetGenerationCastError.from_cast_error(
datasets.exceptions.DatasetGenerationCastError: An error occurred while generating the dataset

All the data files must have the same columns, but at some point there are 5 new columns (key, target_wav, target_text, source_wav, source_text) and 8 missing columns (round, index, question, answer_cosyvoice_speech_token, split_name, question_audio, answer_snac, answer).

This happened while the JSONL dataset builder was generating data using

/DATA/Lalaram/SLAM_omni/SLAM-LLM/Dataset/VoiceAssistant-400K-SLAM-Omni/data/train-00000-of-00100.jsonl

So, in the .jsonl format as well, should I keep these 8 columns (round, index, question, answer_cosyvoice_speech_token, split_name, question_audio, answer_snac, answer)

or these columns (key, target_wav, target_text, source_wav, source_text), as given in the demo .jsonl file?
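The error message above says that all data files loaded together must share the same columns. A minimal sketch for locating the records that break this rule, assuming the .jsonl files sit in one directory and that the 8 column names listed in the error are the expected schema. The function name `find_schema_mismatches` is hypothetical and not part of SLAM-LLM or `datasets`:

```python
import json
from pathlib import Path

# Assumption: expected schema copied from the "8 missing columns" in the error message.
EXPECTED_COLUMNS = {
    "round", "index", "question", "answer", "split_name",
    "question_audio", "answer_snac", "answer_cosyvoice_speech_token",
}


def find_schema_mismatches(data_dir, expected=EXPECTED_COLUMNS):
    """Return (file name, line number, extra keys, missing keys) per bad record."""
    expected = set(expected)
    problems = []
    for path in sorted(Path(data_dir).glob("*.jsonl")):
        with path.open(encoding="utf-8") as handle:
            for line_no, line in enumerate(handle, start=1):
                if not line.strip():
                    continue  # skip blank lines
                keys = set(json.loads(line))  # top-level keys of the record
                if keys != expected:
                    problems.append(
                        (path.name, line_no,
                         sorted(keys - expected), sorted(expected - keys))
                    )
    return problems
```

Calling `find_schema_mismatches("path/to/data")` on the directory named in the traceback would list every record whose keys differ from the expected set, which shows whether files with two different schemas ended up in the same folder.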

2: I have set the resume condition, but training still starts from step 0 instead of resuming from the step where it stopped.

examples/s2s/model/slam_model_s2s.py:96: FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
ckpt_dict = torch.load(ckpt_path, map_location="cpu")
[2025-02-04 16:16:54][slam_llm.utils.train_utils][INFO] - --> Module Qwen2-0.5b
[2025-02-04 16:16:54][slam_llm.utils.train_utils][INFO] - --> Qwen2-0.5b has 494.032768 Million params

[2025-02-04 16:16:54][slam_llm.utils.train_utils][INFO] - --> Module Qwen2-0.5b
[2025-02-04 16:16:54][slam_llm.utils.train_utils][INFO] - --> Qwen2-0.5b has 494.032768 Million params

[2025-02-04 16:16:54][slam_llm.utils.train_utils][INFO] - --> Module linear
[2025-02-04 16:16:54][slam_llm.utils.train_utils][INFO] - --> linear has 9.702272 Million params

[2025-02-04 16:16:56][slam_model_s2s.py][INFO] - Resize llm embedding layer's vocab size to 156160

[2025-02-04 16:16:56][slam_model_s2s.py][INFO] - loading other parts from: /DATA/Lalaram/SLAM_omni/SLAM-LLM/s2s_train_v4-Qwen2-0.5b-gpu2-btz1-lr1e-4-fp16-epochs10-whisper_small-latency0-group1/s2s_epoch_1_step_99000/model.pt

examples/s2s/model/slam_model_s2s.py:96: FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
ckpt_dict = torch.load(ckpt_path, map_location="cpu")
[2025-02-04 16:16:57][slam_llm.utils.train_utils][INFO] - --> Model s2s
[2025-02-04 16:16:57][slam_llm.utils.train_utils][INFO] - --> s2s has 507.519744 Million params

[rank1]:[W204 16:16:59.253905638 Utils.hpp:110] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function operator())
[rank0]:[W204 16:16:59.254548938 Utils.hpp:110] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function operator())
[2025-02-04 16:16:59][root][INFO] - dataset_config: {'dataset': 'speech_dataset_s2s', 'file': 'examples/s2s/speech_dataset_s2s.py:get_speech_dataset', 'train_data_path': '/DATA/Lalaram/SLAM_omni/SLAM-LLM/Dataset/VoiceAssistant-400K-SLAM-Omni', 'val_data_path': '/DATA/Lalaram/SLAM_omni/SLAM-LLM/Dataset/VoiceAssistant-400K-SLAM-Omni', 'train_split': 'train', 'test_split': 'validation', 'prompt': 'Conduct a spoken conversation with the user. ', 'data_path': None, 'max_words': None, 'max_mel': None, 'fix_length_audio': -1, 'inference_mode': False, 'input_type': 'mel', 'mel_size': 80, 'normalize': False, 'seed': 42, 'manifest_format': 'datasets', 'split_size': 0.01, 'vocab_config': {'text_vocabsize': 151936, 'text_specialtokens': 64, 'audio_vocabsize': 4096, 'audio_specialtokens': 64, 'code_layer': 1, 'padded_text_vocabsize': 152000, 'padded_audio_vocabsize': 4160, 'total_audio_vocabsize': 29120, 'total_vocabsize': 156160, 'eot': 151936, 'pad_t': 151937, 'input_t': 151938, 'answer_t': 151939, 'asr': 151940, 'eoa': 4096, 'pad_a': 4097, 'input_a': 4098, 'answer_a': 4099, 'split': 4100}, 'load_from_cache_file': True, 'task_type': 's2s', 'upsample_text_tokens': False, 'upsampling_factor': 1, 'upsample_method': 'repeat', 'code_type': 'CosyVoice', 'num_latency_tokens': 0, 'do_layershift': False}
[2025-02-04 16:16:59][root][INFO] - dataset_config: {'dataset': 'speech_dataset_s2s', 'file': 'examples/s2s/speech_dataset_s2s.py:get_speech_dataset', 'train_data_path': '/DATA/Lalaram/SLAM_omni/SLAM-LLM/Dataset/VoiceAssistant-400K-SLAM-Omni', 'val_data_path': '/DATA/Lalaram/SLAM_omni/SLAM-LLM/Dataset/VoiceAssistant-400K-SLAM-Omni', 'train_split': 'train', 'test_split': 'validation', 'prompt': 'Conduct a spoken conversation with the user. ', 'data_path': None, 'max_words': None, 'max_mel': None, 'fix_length_audio': -1, 'inference_mode': False, 'input_type': 'mel', 'mel_size': 80, 'normalize': False, 'seed': 42, 'manifest_format': 'datasets', 'split_size': 0.01, 'vocab_config': {'text_vocabsize': 151936, 'text_specialtokens': 64, 'audio_vocabsize': 4096, 'audio_specialtokens': 64, 'code_layer': 1, 'padded_text_vocabsize': 152000, 'padded_audio_vocabsize': 4160, 'total_audio_vocabsize': 29120, 'total_vocabsize': 156160, 'eot': 151936, 'pad_t': 151937, 'input_t': 151938, 'answer_t': 151939, 'asr': 151940, 'eoa': 4096, 'pad_a': 4097, 'input_a': 4098, 'answer_a': 4099, 'split': 4100}, 'load_from_cache_file': True, 'task_type': 's2s', 'upsample_text_tokens': False, 'upsampling_factor': 1, 'upsample_method': 'repeat', 'code_type': 'CosyVoice', 'num_latency_tokens': 0, 'do_layershift': False}
[2025-02-04 16:16:59][datasets][INFO] - PyTorch version 2.4.0+cu124 available.
[2025-02-04 16:16:59][datasets][INFO] - PyTorch version 2.4.0+cu124 available.
Resolving data files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 853/853 [00:00<00:00, 456402.77it/s]
Resolving data files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 853/853 [00:00<00:00, 435566.27it/s]
Resolving data files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 853/853 [00:00<00:00, 440988.70it/s]
Resolving data files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 853/853 [00:00<00:00, 418792.15it/s]
Loading dataset shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 508/508 [00:00<00:00, 12469.17it/s]
Loading dataset shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 508/508 [00:00<00:00, 12037.48it/s]
[2025-02-04 16:16:59][root][INFO] - --> Training Set Length = 458440
Resolving data files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 853/853 [00:00<00:00, 441206.23it/s]
Resolving data files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 853/853 [00:00<00:00, 437216.34it/s]
Resolving data files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 853/853 [00:00<00:00, 447161.77it/s]
Resolving data files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 853/853 [00:00<00:00, 432930.94it/s]
Loading dataset shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 508/508 [00:00<00:00, 11781.62it/s]
Loading dataset shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 508/508 [00:00<00:00, 12704.56it/s]
[2025-02-04 16:17:00][slam_llm.utils.config_utils][INFO] - Using batching strategy: custom
[2025-02-04 16:17:00][slam_llm.utils.config_utils][INFO] - Using batching strategy: custom
[2025-02-04 16:17:00][root][INFO] - --> Validation Set Length = 4631
[2025-02-04 16:17:00][slam_llm.utils.config_utils][INFO] - Using batching strategy: custom
[2025-02-04 16:17:00][slam_llm.utils.config_utils][INFO] - Using batching strategy: custom
/DATA/Lalaram/SLAM_omni/SLAM-LLM/src/slam_llm/utils/train_utils.py:71: FutureWarning: torch.cuda.amp.GradScaler(args...) is deprecated. Please use torch.amp.GradScaler('cuda', args...) instead.
scaler = torch.cuda.amp.GradScaler()
/DATA/Lalaram/SLAM_omni/SLAM-LLM/src/slam_llm/utils/train_utils.py:71: FutureWarning: torch.cuda.amp.GradScaler(args...) is deprecated. Please use torch.amp.GradScaler('cuda', args...) instead.
scaler = torch.cuda.amp.GradScaler()
/DATA/anaconda3/envs/Lalaram_SLAM-Omni/lib/python3.10/site-packages/torch/cuda/memory.py:343: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
warnings.warn(
/DATA/anaconda3/envs/Lalaram_SLAM-Omni/lib/python3.10/site-packages/torch/cuda/memory.py:343: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
warnings.warn(
Training Epoch: 1: 0%| | 0/229220 [00:00<?, ?it/s]/DATA/Lalaram/SLAM_omni/SLAM-LLM/src/slam_llm/utils/train_utils.py:109: FutureWarning: torch.cuda.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cuda', args...) instead.
with autocast():
/DATA/Lalaram/SLAM_omni/SLAM-LLM/src/slam_llm/utils/train_utils.py:109: FutureWarning: torch.cuda.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cuda', args...) instead.
with autocast():
Training Epoch: 1/10, step 2999/229220 completed (loss: 1.524409532546997, acc: 0.6829268336296082): 1%|▊ | 3000/229220 [23:09<29:43:45, 2.11it/s]/DATA/anaconda3/envs/Lalaram_SLAM-Omni/lib/python3.10/site-packages/torch/cuda/memory.py:343: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
warnings.warn(

@cwx-worst-one
Collaborator

Sorry for the delayed response.

  1. We recently fixed support for the .JSONL file format. You can find the updated JSONL demo here: jsonl_demo-en.jsonl. Please note that users need to manually generate the corresponding audio tokens for the response audio. Also, when using .JSONL format data, you must set manifest_format=parquet in the script.
  2. Currently, SLAM-LLM does not support a resume mechanism. We apologize for any inconvenience this may cause.
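In the log above, the weights are loaded from a checkpoint directory named `s2s_epoch_1_step_99000` while the progress bar restarts at 0. The sketch below does not implement resuming; it only recovers the epoch and step encoded in such a directory name, so that the stopping point is known when restarting manually. It assumes the `epoch_<n>_step_<m>` naming seen in that log, and the helper `parse_checkpoint_progress` is hypothetical, not a SLAM-LLM function:

```python
import re
from pathlib import Path

# Assumption: checkpoint directories end in "epoch_<n>_step_<m>", as in the log.
CKPT_PATTERN = re.compile(r"epoch_(\d+)_step_(\d+)$")


def parse_checkpoint_progress(ckpt_path):
    """Return (epoch, step) from the checkpoint directory name, or None."""
    path = Path(ckpt_path)
    # Accept either the checkpoint directory or the model.pt file inside it.
    directory = path.parent if path.suffix == ".pt" else path
    match = CKPT_PATTERN.search(directory.name)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```

For a path ending in `s2s_epoch_1_step_99000/model.pt`, this returns `(1, 99000)`. Optimizer, scheduler, and data-loader state are not restored by this helper.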
