From 88cf4176cc1044cc956ae0a802ebbd1e3ec562d6 Mon Sep 17 00:00:00 2001
From: Nathan Fradet <56734983+Natooz@users.noreply.github.com>
Date: Tue, 10 Oct 2023 09:55:09 +0200
Subject: [PATCH] adding best models to hub

---
 README.md               |   6 +++
 model_card.md           | 115 ++++++++++++++++++++++++++++++++++++++++
 push_model_to_hf_hub.py |  52 ++++++++++++++++++
 3 files changed, 173 insertions(+)
 create mode 100644 model_card.md
 create mode 100644 push_model_to_hf_hub.py

diff --git a/README.md b/README.md
index 539bf0c..7651d37 100644
--- a/README.md
+++ b/README.md
@@ -2,6 +2,10 @@
 
 [[Paper](https://arxiv.org/abs/2301.11975)]
 [[Companion website](https://Natooz.github.io/BPE-Symbolic-Music/)]
+[[🤗 TSD 20k](https://huggingface.co/Natooz/Maestro-TSD-bpe20k)]
+[[🤗 REMI 20k](https://huggingface.co/Natooz/Maestro-REMI-bpe20k)]
+
+## Intro
 
 Byte Pair Encoding (BPE) is a compression technique that reduces the sequence length of a corpus by iteratively replacing its most recurrent successions of bytes with newly created symbols.
 It is widely used in NLP, where it automatically builds vocabularies made of words or parts of words.
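+
+As a toy illustration (with made-up token ids; the real BPE training is handled by MidiTok, not by this snippet), a single BPE merge iteration could look like this:
+
+```Python
+from collections import Counter
+
+ids = [3, 8, 3, 8, 5, 3, 8]
+
+# count every succession of two ids and pick the most recurrent one
+pairs = Counter(zip(ids, ids[1:]))
+best = max(pairs, key=pairs.get)  # (3, 8) in this example
+new_id = max(ids) + 1             # the newly created symbol
+
+# replace every occurrence of the most recurrent pair with the new symbol
+merged, i = [], 0
+while i < len(ids):
+    if i + 1 < len(ids) and (ids[i], ids[i + 1]) == best:
+        merged.append(new_id)
+        i += 2
+    else:
+        merged.append(ids[i])
+        i += 1
+
+assert merged == [9, 9, 5, 9]  # shorter sequence, larger vocabulary
+```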
@@ -16,6 +20,8 @@ BPE is fully implemented within [MidiTok](https://github.com/Natooz/MidiTok), al
 
 We invite you to read the paper, and check our [companion website](https://Natooz.github.io/bpe-symbolic-music/) to listen to the generated results!
 
+Finally, the best models are shared on Hugging Face: [TSD 20k](https://huggingface.co/Natooz/Maestro-TSD-bpe20k) and [REMI 20k](https://huggingface.co/Natooz/Maestro-REMI-bpe20k).
+
 ## Steps to reproduce
 
 1. `pip install -r requirements` to install requirements

diff --git a/model_card.md b/model_card.md
new file mode 100644
index 0000000..3479c8a
--- /dev/null
+++ b/model_card.md
@@ -0,0 +1,115 @@
+---
+# For reference on model card metadata, see the spec: https://github.com/huggingface/hub-docs/blob/main/modelcard.md?plain=1
+# Doc / guide: https://huggingface.co/docs/hub/model-cards
+{}
+---
+
+# Model card
+
+This is a generative model from the paper "*Byte Pair Encoding for Symbolic Music*" (EMNLP 2023). It has been trained on the [Maestro dataset](https://magenta.tensorflow.org/datasets/maestro), tokenized with REMI and Byte Pair Encoding (BPE), to generate classical piano music.
+
+## Model Details
+
+### Model Description
+
+The model has a vocabulary of 20k tokens learned with [Byte Pair Encoding (BPE)](https://arxiv.org/abs/2301.11975) using [MidiTok](https://github.com/Natooz/MidiTok).
+
+- **Developed and shared by:** [Nathan Fradet](https://twitter.com/NathanFradet)
+- **Affiliations:** [Sorbonne University (LIP6 lab)](https://www.sorbonne-universite.fr/en) and [Aubay](https://aubay.com/en/)
+- **Model type:** causal autoregressive Transformer
+- **Backbone model:** [GPT2](https://huggingface.co/docs/transformers/model_doc/gpt2)
+- **Music genres:** Classical piano 🎹
+- **License:** Apache 2.0
+
+### Model Sources
+
+- **Repository:** https://github.com/Natooz/BPE-Symbolic-Music
+- **Paper:** https://arxiv.org/abs/2301.11975
+
+## Uses
+
+The model is designed for autoregressive music generation: given a music prompt, it generates a continuation.
+
+## How to Get Started with the Model
+
+Use the code below to get started with the model.
+You will need the `miditok`, `transformers`, and `torch` packages to run it, all of which can be installed with pip.
+You will also need to manually download the `tokenizer.conf` file from the [repo files](https://huggingface.co/Natooz/Maestro-REMI-bpe20k/tree/main).
+
+```Python
+import torch
+from transformers import AutoModelForCausalLM
+from miditok import REMI
+from miditoolkit import MidiFile
+
+torch.set_default_device("cuda")
+model = AutoModelForCausalLM.from_pretrained("Natooz/Maestro-REMI-bpe20k", trust_remote_code=True, torch_dtype="auto")
+tokenizer = REMI(params="tokenizer.conf")
+input_midi = MidiFile("path/to/file.mid")
+input_tokens = tokenizer(input_midi)  # convert the MIDI prompt to tokens
+
+# generate the continuation; the prompt ids are wrapped in a tensor with a batch dimension
+generated_token_ids = model.generate(torch.LongTensor([input_tokens.ids]), max_length=200)
+generated_midi = tokenizer(generated_token_ids)  # convert the generated tokens back to a MIDI
+generated_midi.dump("path/to/continued.mid")
+```
+
+## Training Details
+
+### Training Data
+
+The model has been trained on the [Maestro](https://magenta.tensorflow.org/datasets/maestro) dataset, which contains about 200 hours of classical piano music. The tokenizer was trained with Byte Pair Encoding (BPE) to build a vocabulary of 20k tokens.
+
+### Training Procedure
+
+- **Training regime:** fp16 mixed precision on V100 PCIE 32GB GPUs
+- **Compute Region:** France
+
+### Training hyperparameters
+
+The following hyperparameters were used during training:
+- learning_rate: 0.0001
+- train_batch_size: 64
+- eval_batch_size: 96
+- seed: 444
+- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
+- lr_scheduler_type: cosine_with_restarts
+- lr_scheduler_warmup_ratio: 0.3
+- training_steps: 100000
+
+### Environmental impact
+
+We cannot reliably estimate the amount of CO2eq emitted, as we lack data on the exact power sources used during training. However, the cluster used is mostly powered by nuclear energy, a low-carbon energy source with a reduced direct environmental impact.
+
+## Citation
+
+**BibTeX:**
+
+```bibtex
+@inproceedings{bpe-symbolic-music,
+    title = "Byte Pair Encoding for Symbolic Music",
+    author = "Fradet, Nathan and
+      Gutowski, Nicolas and
+      Chhel, Fabien and
+      Briot, Jean-Pierre",
+    booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
+    month = dec,
+    year = "2023",
+    address = "Singapore",
+    publisher = "Association for Computational Linguistics",
+    url = "https://arxiv.org/abs/2301.11975",
+}
+```

diff --git a/push_model_to_hf_hub.py b/push_model_to_hf_hub.py
new file mode 100644
index 0000000..40e2584
--- /dev/null
+++ b/push_model_to_hf_hub.py
@@ -0,0 +1,52 @@
+#!/usr/bin/python3
+
+"""Pushes the best (20k vocabulary) models to the Hugging Face hub."""
+
+# https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/templates/modelcard_template.md
+MODEL_CARD_KWARGS = {
+    "license": "apache-2.0",
+    "tags": ["music-generation"],
+}
+
+
+if __name__ == "__main__":
+    from argparse import ArgumentParser
+
+    from transformers import Seq2SeqTrainer, AutoModelForCausalLM
+
+    from exp_generation import experiments
+
+    parser = ArgumentParser(description="Script pushing the best models to the Hugging Face hub")
+    parser.add_argument("--hub-token", type=str, help="Hugging Face hub authentication token", required=False, default=None)
+    args = vars(parser.parse_args())
+
+    for exp_ in experiments:
+        for baseline_ in exp_.baselines:
+            # Only the best models, i.e. those with a 20k tokens vocabulary, are pushed
+            if baseline_.tokenization_config.bpe_vocab_size != 20000:
+                continue
+            # Load model
+            model_ = AutoModelForCausalLM.from_pretrained(str(baseline_.run_path))
+
+            model_name = f"{exp_.dataset}-{baseline_.tokenization}-bpe{baseline_.tokenization_config.bpe_vocab_size // 1000}k"
+            model_card_kwargs = {
+                "model_name": model_name,
+                "dataset": exp_.dataset,
+            }
+            model_card_kwargs.update(MODEL_CARD_KWARGS)
+
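+            # A Seq2SeqTrainer is instantiated below only to reuse its
+            # create_model_card utility; no training or evaluation is run here.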
+            # Push to hub
+            trainer = Seq2SeqTrainer(model=model_, args=baseline_.training_config)
+            trainer.create_model_card(**model_card_kwargs)
+            # Save the tokenizer's parameters alongside the model files
+            baseline_.tokenizer.save_params(baseline_.run_path / "tokenizer.conf")
+            model_.push_to_hub(
+                repo_id=model_name,
+                commit_message=f"Uploading {model_name}",
+                token=args["hub_token"],
+                safe_serialization=True,
+            )
+            # trainer.push_to_hub is not used here, as the trainer does not push the
+            # weights as safe tensors; the training results / logs hence have to be
+            # uploaded manually.
+            # trainer.push_to_hub(f"Uploading {model_name}", **model_card_kwargs)
+            # https://github.com/huggingface/transformers/issues/25992
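+
+# Example invocation (the token value below is a placeholder):
+#   python push_model_to_hf_hub.py --hub-token hf_xxx
+# The uploaded weights can then be loaded back as in the model card, e.g.:
+#   AutoModelForCausalLM.from_pretrained("Natooz/Maestro-REMI-bpe20k", trust_remote_code=True)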