
Commit 908c85d

fix #252
1 parent 72112a6 commit 908c85d

2 files changed (+13, -35 lines)

egs/tts/VALLE_V2/README.md

12 additions & 34 deletions

````diff
@@ -17,26 +17,29 @@ To ensure your transformers library can run the code, we recommend additionally
 pip install -U transformers==4.41.2
 ```
 
-<!-- espeak-ng is required to run G2p. To install it, you could refer to:
-https://github.com/espeak-ng/espeak-ng/blob/master/docs/guide.md
-
-For Linux, it should be `sudo apt-get install espeak-ng`.
-For Windows, refer to the above link.
-If you do not have sudo privilege, you could build the library by following the last section of this readme. -->
-
 ## Inferencing pretrained VALL-E models
 ### Download pretrained weights
-You need to download our pretrained weights from huggingface.
+You need to download our pretrained weights from huggingface. Our models are trained on the MLS dataset (45k hours of English, contains 10-20s speech).
 
 Script to download AR and NAR model checkpoint:
 ```bash
 huggingface-cli download amphion/valle valle_ar_mls_196000.bin valle_nar_mls_164000.bin --local-dir ckpts
 ```
 Script to download codec model (SpeechTokenizer) checkpoint:
 ```bash
-huggingface-cli download amphion/valle speechtokenizer_hubert_avg/SpeechTokenizer.pt speechtokenizer_hubert_avg/config.json --local-dir ckpts
+mkdir -p ckpts/speechtokenizer_hubert_avg && huggingface-cli download amphion/valle SpeechTokenizer.pt config.json --local-dir ckpts/speechtokenizer_hubert_avg
+```
+
+If you cannot access huggingface, consider using the huggingface mirror to download:
+```bash
+HF_ENDPOINT=https://hf-mirror.com huggingface-cli download amphion/valle valle_ar_mls_196000.bin valle_nar_mls_164000.bin --local-dir ckpts
+```
+Script to download codec model (SpeechTokenizer) checkpoint:
+```bash
+mkdir -p ckpts/speechtokenizer_hubert_avg && HF_ENDPOINT=https://hf-mirror.com huggingface-cli download amphion/valle SpeechTokenizer.pt config.json --local-dir ckpts/speechtokenizer_hubert_avg
 ```
 
+
 ### Inference in IPython notebook
 
 We provide our pretrained VALL-E model that is trained on 45k hours MLS dataset.
````
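An aside on this hunk's download commands: the same checkpoints can be fetched from Python with `huggingface_hub`, the library behind `huggingface-cli`. The sketch below is not part of the commit; it assumes `huggingface_hub` is installed, and it sets `HF_ENDPOINT` before the import because the library reads that variable at import time.

```python
# Sketch: Python equivalent of the huggingface-cli commands in the hunk above.
import os

# Optional mirror override, as in the README; huggingface_hub reads
# HF_ENDPOINT when it is imported, so set it before the import below.
# os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"

from huggingface_hub import hf_hub_download

# AR and NAR model checkpoints -> ckpts/
for name in ["valle_ar_mls_196000.bin", "valle_nar_mls_164000.bin"]:
    hf_hub_download(repo_id="amphion/valle", filename=name, local_dir="ckpts")

# Codec (SpeechTokenizer) checkpoint and config -> ckpts/speechtokenizer_hubert_avg/
for name in ["SpeechTokenizer.pt", "config.json"]:
    hf_hub_download(
        repo_id="amphion/valle",
        filename=name,
        local_dir="ckpts/speechtokenizer_hubert_avg",
    )
```

`local_dir` is created if it does not exist, so the explicit `mkdir -p` from the shell version is not needed here. The second hunk of the same README follows.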
````diff
@@ -111,31 +114,6 @@ You should also select a reasonable batch size at the "batch_size" entry (curren
 
 You can change other experiment settings in the `/egs/tts/VALLE_V2/exp_ar_libritts.json` such as the learning rate, optimizer and the dataset.
 
-Here we choose `libritts` dataset we added and set `use_dynamic_dataset` false.
-
-Config `use_dynamic_dataset` is used to solve the problem of inconsistent sequence length and improve gpu utilization, here we set it to false for simplicity.
-
-```json
-"dataset": {
-    "use_dynamic_batchsize": false,
-    "name": "libritts"
-},
-```
-
-We also recommend changing "num_hidden_layers" if your GPU memory is limited.
-
-**Set smaller batch_size if you are out of memory😢😢**
-
-I used batch_size=3 to successfully run on a single card, if you'r out of memory, try smaller.
-
-```json
-"batch_size": 3,
-"max_tokens": 11000,
-"max_sentences": 64,
-"random_seed": 0
-```
-
-
 ### Run the command to Train AR model
 (Make sure your current directory is at the Amphion root directory).
 Run:
````
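The hunk above deletes the README's hand-edited config walkthrough. For anyone who still wants the old behavior (fixed-size batches, a small `batch_size` to avoid OOM), here is a hedged sketch of making the same edit programmatically. It assumes `exp_ar_libritts.json` is plain JSON at the path the README names; the key names come from the deleted snippet, and the exact nesting of `batch_size` in your copy may differ.

```python
# Sketch: apply the settings from the deleted README snippet to the
# experiment config. Key names come from that snippet; the nesting of
# "batch_size" is an assumption -- check your copy of the file.
import json
from pathlib import Path

cfg_path = Path("egs/tts/VALLE_V2/exp_ar_libritts.json")
cfg = json.loads(cfg_path.read_text())

# Fixed-size batches, as the deleted text recommended "for simplicity".
cfg["dataset"]["use_dynamic_batchsize"] = False
cfg["dataset"]["name"] = "libritts"

# The deleted text reports batch_size=3 fits on a single card.
cfg["batch_size"] = 3

cfg_path.write_text(json.dumps(cfg, indent=4))
```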

egs/tts/VALLE_V2/demo.ipynb

1 addition & 1 deletion

````diff
@@ -78,7 +78,7 @@
 "# prepare inference data\n",
 "import librosa\n",
 "import torch\n",
-"wav, _ = librosa.load('./egs/tts/valle_v2/example.wav', sr=16000)\n",
+"wav, _ = librosa.load('./egs/tts/VALLE_V2/example.wav', sr=16000)\n",
 "wav = torch.tensor(wav, dtype=torch.float32)\n",
 "from IPython.display import Audio\n",
 "Audio(wav, rate = 16000)"
````
