code update + page upload
Natooz committed Oct 7, 2023
1 parent f5899b8 commit aa0b134
Showing 998 changed files with 105,895 additions and 2,963 deletions.
79 changes: 79 additions & 0 deletions .github/workflows/hugo.yaml
@@ -0,0 +1,79 @@
# Sample workflow for building and deploying a Hugo site to GitHub Pages
name: Deploy Hugo site to Pages

on:
  # Runs on pushes targeting the default branch
  push:
    branches:
      - main

  # Allows you to run this workflow manually from the Actions tab
  workflow_dispatch:

# Sets permissions of the GITHUB_TOKEN to allow deployment to GitHub Pages
permissions:
  contents: read
  pages: write
  id-token: write

# Allow only one concurrent deployment, skipping runs queued between the run in-progress and latest queued.
# However, do NOT cancel in-progress runs as we want to allow these production deployments to complete.
concurrency:
  group: "pages"
  cancel-in-progress: false

# Default to bash
defaults:
  run:
    shell: bash

jobs:
  # Build job
  build:
    runs-on: ubuntu-latest
    env:
      HUGO_VERSION: 0.111.3
    steps:
      - name: Install Hugo CLI
        run: |
          wget -O ${{ runner.temp }}/hugo.deb https://github.com/gohugoio/hugo/releases/download/v${HUGO_VERSION}/hugo_extended_${HUGO_VERSION}_linux-amd64.deb \
          && sudo dpkg -i ${{ runner.temp }}/hugo.deb
      - name: Install Dart Sass Embedded
        run: sudo snap install dart-sass-embedded
      - name: Checkout
        uses: actions/checkout@v3
        with:
          submodules: recursive
          fetch-depth: 0
      - name: Setup Pages
        id: pages
        uses: actions/configure-pages@v3
      - name: Install Node.js dependencies
        run: "[[ -f package-lock.json || -f npm-shrinkwrap.json ]] && npm ci || true"
      - name: Build with Hugo
        env:
          # For maximum backward compatibility with Hugo modules
          HUGO_ENVIRONMENT: production
          HUGO_ENV: production
        run: |
          cp -r ./page/* . && \
          hugo \
            --gc \
            --minify \
            --baseURL "${{ steps.pages.outputs.base_url }}/"
      - name: Upload artifact
        uses: actions/upload-pages-artifact@v1
        with:
          path: ./public

  # Deployment job
  deploy:
    environment:
      name: github-pages
      url: ${{ steps.deployment.outputs.page_url }}
    runs-on: ubuntu-latest
    needs: build
    steps:
      - name: Deploy to GitHub Pages
        id: deployment
        uses: actions/deploy-pages@v2
146 changes: 135 additions & 11 deletions .gitignore
@@ -1,16 +1,140 @@
# macOS DS_STORE files
**/*.DS_STORE
# Python precompiled files
*.pyc
# PyCharm config files
.idea/
# PyCharm Virtual Environment
# macOS DS_STORE files
*.DS_STORE

data/
data_gen_tokens/
runs/
runs_dis/
runs_classifier/

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
.python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
#
data/**
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# personal test file
test.py
# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Dataset directory
# data/*
# Pyre type checker
.pyre/
113 changes: 37 additions & 76 deletions README.md
@@ -1,87 +1,48 @@
# Byte Pair Encoding for Symbolic Music
# Byte Pair Encoding for Symbolic Music (EMNLP 2023)

Code of the paper [*Byte Pair Encoding for Symbolic Music*](https://arxiv.org/abs/2301.11975).
[Paper](https://arxiv.org/abs/2301.11975)
[Demo website](https://Natooz.github.io/bpe-symbolic-music/)

## Steps to reproduce

1. `pip install -r requirements` to install requirements
2. Download the [GiantMIDI](https://github.com/bytedance/GiantMIDI-Piano/blob/master/disclaimer.md) dataset and put it in `data/`
3. `sh scripts/download_pop909.sh` to download and preprocess the [POP909](https://github.com/music-x-lab/POP909-Dataset) dataset
4. `python scripts/tokenize_datasets.py` to tokenize data and learn BPE
5. `python exp_gen.py` to train generative models and generate results
6. `python exp_cla.py` to train classification models and test them

[Scripts](./scripts) can be run to reproduce the analysis.

## BPE learning

<img src="figures/tokenizations_bpe_token_types/POP909-merged_TSD.png" alt="POP909 TSD" width="400"/><img src="figures/tokenizations_bpe_token_types/POP909-merged_REMI.png" alt="POP909 REMI" width="400"/>

<img src="figures/tokenizations_bpe_token_types/GiantMIDI_TSD.png" alt="GiantMIDI TSD" width="400"/><img src="figures/tokenizations_bpe_token_types/GiantMIDI_REMI.png" alt="GiantMIDI REMI" width="400"/>

In order, the figures above are for POP909 TSD, POP909 REMI, GiantMIDI TSD, and GiantMIDI REMI.

<img src="figures/bpe_nb_tok_combinations.png" alt="GiantMIDI REMI" width="800"/>

## Experiment results

We refer you to the tables of the paper.

## Learned embedding space

### Singular values

#### Generators: POP909 TSD, POP909 REMI, GiantMIDI TSD, and GiantMIDI REMI

<img src="figures/singular_value_gen/singular_value_POP909-merged_TSD.png" alt="POP909 TSD" width="200"/><img src="figures/singular_value_gen/singular_value_POP909-merged_REMI.png" alt="POP909 REMI" width="200"/><img src="figures/singular_value_gen/singular_value_GiantMIDI_TSD.png" alt="GiantMIDI TSD" width="200"/><img src="figures/singular_value_gen/singular_value_GiantMIDI_REMI.png" alt="GiantMIDI REMI" width="200"/>

#### Classifiers: $\mathrm{Cla}\_{small}$ TSD, $\mathrm{Cla}\_{small}$ REMI, $\mathrm{Cla}\_{large}$ TSD, and $\mathrm{Cla}\_{large}$ REMI

<img src="figures/singular_value_cla/singular_value_GiantMIDI_TSD.png" alt="Cla small TSD" width="200"/><img src="figures/singular_value_cla/singular_value_GiantMIDI_REMI.png" alt="Cla small REMI" width="200"/><img src="figures/singular_value_cla/singular_value_GiantMIDI_TSD_LARGE.png" alt="Cla large TSD" width="200"/><img src="figures/singular_value_cla/singular_value_GiantMIDI_REMI_LARGE.png" alt="Cla large REMI" width="200"/>
Byte Pair Encoding (BPE) is a compression technique that reduces the sequence length of a corpus by iteratively replacing its most frequent successions of bytes with newly created symbols. It is widely used in NLP, as it automatically creates vocabularies made of words or parts of words.
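
To make the merge procedure concrete, here is a minimal, illustrative Python sketch of BPE learning on a toy token corpus. It is not the implementation used in this repository (BPE is performed through MidiTok, see below), and the token names are hypothetical.

```python
# Toy BPE learning: repeatedly merge the most frequent adjacent pair (illustration only).
from collections import Counter


def learn_bpe(corpus: list[list[str]], num_merges: int) -> list[tuple[str, str]]:
    """Return the learned merge rules, mutating `corpus` in place."""
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair of symbols across all sequences
        pair_counts = Counter()
        for seq in corpus:
            pair_counts.update(zip(seq, seq[1:]))
        if not pair_counts:
            break
        best = pair_counts.most_common(1)[0][0]  # the most recurrent pair
        merges.append(best)
        new_symbol = best[0] + "+" + best[1]
        # Replace every occurrence of the pair with the new symbol
        for i, seq in enumerate(corpus):
            merged, j = [], 0
            while j < len(seq):
                if j + 1 < len(seq) and (seq[j], seq[j + 1]) == best:
                    merged.append(new_symbol)
                    j += 2
                else:
                    merged.append(seq[j])
                    j += 1
            corpus[i] = merged
    return merges


# Hypothetical note-attribute tokens: two notes, three tokens each
toy_corpus = [["Pitch_60", "Vel_90", "Dur_1.0", "Pitch_64", "Vel_90", "Dur_1.0"]]
print(learn_bpe(toy_corpus, 2))  # e.g. [('Vel_90', 'Dur_1.0'), ...]
print(toy_corpus)                # shorter sequence built from merged symbols
```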

### UMAP Generators
In this paper, we show that it can address two main concerns about how symbolic music was previously tokenized:

Figures are, in order, for no BPE, BPEx4, BPEx10, BPEx20, BPEx50, BPEx100, PVm, and PVDm.
1. The fairly long sequences resulting from using one token per note attribute (e.g. pitch, duration) and per time event. Long sequences are problematic, as the time and space complexity of Transformer models grows quadratically with the input sequence length (a short worked example follows this list).
2. The poor usage of the model's embedding space. Language models first project tokens into a learned embedding space, in which the embeddings (continuous representations of the tokens) are learned to represent their semantic information. This is an essential feature of such models, as it allows them to capture the meaning of the tokens and the data. In symbolic music, tokens usually represent only note attribute values or time values, which carry little information beyond their absolute value. Moreover, vocabularies often range between 200 and 500 tokens, which are then represented in 512 to 1024 dimensions. In such conditions, the embedding space is underused and the potential of the model is poorly exploited.
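
As a rough illustration of the first point, the sketch below compares the number of pairwise attention scores a single self-attention head computes before and after a hypothetical halving of the sequence length (the lengths are made up for the example): halving the sequence divides the quadratic term by four.

```python
# Illustration only: quadratic growth of self-attention cost with sequence length.
def attention_scores(seq_len: int) -> int:
    # One score per (query, key) pair in a single attention head
    return seq_len * seq_len


before, after = 1024, 512  # hypothetical lengths without / with BPE
print(attention_scores(before) / attention_scores(after))  # -> 4.0
```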

#### POP909 TSD
When applied to symbolic music, BPE drastically reduces the sequence length while creating new tokens that can represent whole notes, and even sequences of notes. The model's efficiency is greatly improved, and each token carries more information. This greatly improves the quality of generation, while speeding up inference by up to three times.

<img src="figures/umap_3d_gen/umap_3d_POP909-merged_TSD_noBPE.png" alt="No BPE" width="200"/><img src="figures/umap_3d_gen/umap_3d_POP909-merged_TSD_bpe4.png" alt="BPEx4" width="200"/><img src="figures/umap_3d_gen/umap_3d_POP909-merged_TSD_bpe10.png" alt="BPEx10" width="200"/><img src="figures/umap_3d_gen/umap_3d_POP909-merged_TSD_bpe20.png" alt="BPEx20" width="200"/>
BPE is fully implemented within [MidiTok](https://github.com/Natooz/MidiTok), allowing you to easily benefit from this method on top of most existing tokenizations.
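
Below is a hedged sketch of how BPE could be applied with MidiTok. The method names and arguments shown (`tokenize_midi_dataset`, `learn_bpe`, `apply_bpe_to_dataset`, `tokens_paths`, `vocab_size`) are assumptions that vary across MidiTok versions; refer to the MidiTok documentation and to the scripts in this repository for the exact calls.

```python
# Hedged sketch, not the repository's scripts: MidiTok API names may differ between versions.
from pathlib import Path

from miditok import REMI  # any supported tokenization (TSD, REMI, ...) could be used

midi_paths = list(Path("data").glob("**/*.mid"))
tokenizer = REMI()  # default configuration for the sake of the example

# 1) Tokenize the MIDI files to JSON token files (assumed method name)
tokenizer.tokenize_midi_dataset(midi_paths, Path("tokens_no_bpe"))

# 2) Learn BPE on the tokenized corpus, enlarging the vocabulary (assumed arguments)
tokenizer.learn_bpe(
    vocab_size=2000,
    tokens_paths=list(Path("tokens_no_bpe").glob("**/*.json")),
)

# 3) Convert the corpus with the learned merges (assumed method name)
tokenizer.apply_bpe_to_dataset(Path("tokens_no_bpe"), Path("tokens_bpe"))
```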

<img src="figures/umap_3d_gen/umap_3d_POP909-merged_TSD_bpe50.png" alt="BPEx50" width="200"/><img src="figures/umap_3d_gen/umap_3d_POP909-merged_TSD_bpe100.png" alt="BPEx100" width="200"/><img src="figures/umap_3d_gen/umap_3d_POP909-merged_TSD_PVm.png" alt="PVm" width="200"/><img src="figures/umap_3d_gen/umap_3d_POP909-merged_TSD_PVDm.png" alt="PVDm" width="200"/>
We invite you to read the paper and to check our [companion website](https://Natooz.github.io/bpe-symbolic-music/) to listen to generated results!

#### POP909 REMI

<img src="figures/umap_3d_gen/umap_3d_POP909-merged_REMI_noBPE.png" alt="No BPE" width="200"/><img src="figures/umap_3d_gen/umap_3d_POP909-merged_REMI_bpe4.png" alt="BPEx4" width="200"/><img src="figures/umap_3d_gen/umap_3d_POP909-merged_REMI_bpe10.png" alt="BPEx10" width="200"/><img src="figures/umap_3d_gen/umap_3d_POP909-merged_REMI_bpe20.png" alt="BPEx20" width="200"/>

<img src="figures/umap_3d_gen/umap_3d_POP909-merged_REMI_bpe50.png" alt="BPEx50" width="200"/><img src="figures/umap_3d_gen/umap_3d_POP909-merged_REMI_bpe100.png" alt="BPEx100" width="200"/><img src="figures/umap_3d_gen/umap_3d_POP909-merged_REMI_PVm.png" alt="PVm" width="200"/><img src="figures/umap_3d_gen/umap_3d_POP909-merged_REMI_PVDm.png" alt="PVDm" width="200"/>

#### GiantMIDI TSD

<img src="figures/umap_3d_gen/umap_3d_GiantMIDI_TSD_noBPE.png" alt="No BPE" width="200"/><img src="figures/umap_3d_gen/umap_3d_GiantMIDI_TSD_bpe4.png" alt="BPEx4" width="200"/><img src="figures/umap_3d_gen/umap_3d_GiantMIDI_TSD_bpe10.png" alt="BPEx10" width="200"/><img src="figures/umap_3d_gen/umap_3d_GiantMIDI_TSD_bpe20.png" alt="BPEx20" width="200"/>

<img src="figures/umap_3d_gen/umap_3d_GiantMIDI_TSD_bpe50.png" alt="BPEx50" width="200"/><img src="figures/umap_3d_gen/umap_3d_GiantMIDI_TSD_bpe100.png" alt="BPEx100" width="200"/><img src="figures/umap_3d_gen/umap_3d_GiantMIDI_TSD_PVm.png" alt="PVm" width="200"/><img src="figures/umap_3d_gen/umap_3d_GiantMIDI_TSD_PVDm.png" alt="PVDm" width="200"/>

#### GiantMIDI REMI

<img src="figures/umap_3d_gen/umap_3d_GiantMIDI_REMI_noBPE.png" alt="No BPE" width="200"/><img src="figures/umap_3d_gen/umap_3d_GiantMIDI_REMI_bpe4.png" alt="BPEx4" width="200"/><img src="figures/umap_3d_gen/umap_3d_GiantMIDI_REMI_bpe10.png" alt="BPEx10" width="200"/><img src="figures/umap_3d_gen/umap_3d_GiantMIDI_REMI_bpe20.png" alt="BPEx20" width="200"/>

<img src="figures/umap_3d_gen/umap_3d_GiantMIDI_REMI_bpe50.png" alt="BPEx50" width="200"/><img src="figures/umap_3d_gen/umap_3d_GiantMIDI_REMI_bpe100.png" alt="BPEx100" width="200"/><img src="figures/umap_3d_gen/umap_3d_GiantMIDI_REMI_PVm.png" alt="PVm" width="200"/><img src="figures/umap_3d_gen/umap_3d_GiantMIDI_REMI_PVDm.png" alt="PVDm" width="200"/>


### UMAP Classifiers

These figures are for $\mathrm{Cla}\_{small}$ and TSD. More figures can be found in [figures](./figures).

<img src="figures/umap_2d_cla/umap_2d_GiantMIDI_TSD_noBPE.png" alt="No BPE" width="200"/><img src="figures/umap_2d_cla/umap_2d_GiantMIDI_TSD_bpe4.png" alt="BPEx4" width="200"/><img src="figures/umap_2d_cla/umap_2d_GiantMIDI_TSD_bpe10.png" alt="BPEx10" width="200"/><img src="figures/umap_2d_cla/umap_2d_GiantMIDI_TSD_bpe20.png" alt="BPEx20" width="200"/>

<img src="figures/umap_2d_cla/umap_2d_GiantMIDI_TSD_bpe50.png" alt="BPEx50" width="200"/><img src="figures/umap_2d_cla/umap_2d_GiantMIDI_TSD_bpe100.png" alt="BPEx100" width="200"/><img src="figures/umap_2d_cla/umap_2d_GiantMIDI_TSD_PVm.png" alt="PVm" width="200"/><img src="figures/umap_2d_cla/umap_2d_GiantMIDI_TSD_PVDm.png" alt="PVDm" width="200"/>

### Intrinsic dimension

#### Generators: POP909 TSD, POP909 REMI, GiantMIDI TSD, and GiantMIDI REMI
## Steps to reproduce

<img src="figures/intrinsic_dimension_gen/intrinsic_dim_POP909-merged_TSD.png" alt="POP909 TSD" width="200"/><img src="figures/intrinsic_dimension_gen/intrinsic_dim_POP909-merged_REMI.png" alt="POP909 REMI" width="200"/><img src="figures/intrinsic_dimension_gen/intrinsic_dim_GiantMIDI_TSD.png" alt="GiantMIDI TSD" width="200"/><img src="figures/intrinsic_dimension_gen/intrinsic_dim_GiantMIDI_REMI.png" alt="GiantMIDI REMI" width="200"/>
1. `pip install -r requirements` to install requirements
2. Download the [Maestro](https://magenta.tensorflow.org/datasets/maestro) and [MMD](https://zenodo.org/record/5142664#.YQN3c5NKgWo) datasets and put them in `data/`
3. `python scripts/preprocess_maestro.py` and `python scripts/preprocess_for_octuple.py`
4. `python scripts/tokenize_datasets.py` to tokenize data and learn BPE
5. `python exp_generation.py` to train generative models and generate results
6. `python exp_pretrain.py` to pretrain classification models
7. `python exp_cla.py` to train classification models and test them

#### Classifiers: $\mathrm{Cla}\_{small}$ TSD, $\mathrm{Cla}\_{small}$ REMI, $\mathrm{Cla}\_{large}$ TSD, and $\mathrm{Cla}\_{large}$ REMI
[Scripts](./scripts) can be run to reproduce the analysis.

<img src="figures/intrinsic_dimension_cla/intrinsic_dim_GiantMIDI_TSD.png" alt="Cla small TSD" width="200"/><img src="figures/intrinsic_dimension_cla/intrinsic_dim_GiantMIDI_REMI.png" alt="Cla small REMI" width="200"/><img src="figures/intrinsic_dimension_cla/intrinsic_dim_GiantMIDI_TSD_LARGE.png" alt="Cla large TSD" width="200"/><img src="figures/intrinsic_dimension_cla/intrinsic_dim_GiantMIDI_REMI_LARGE.png" alt="Cla large REMI" width="200"/>
## Citation

(The ACL URL/DOI/pages will be added once the proceedings are published.)
```bibtex
@inproceedings{bpe-symbolic-music,
    title = "Byte Pair Encoding for Symbolic Music",
    author = "Fradet, Nathan and
      Gutowski, Nicolas and
      Chhel, Fabien and
      Briot, Jean-Pierre",
    booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/2301.11975",
}
```
Binary file added analysis/bpe_token_types/GiantMIDI_REMI.pdf
Binary file not shown.
Binary file added analysis/bpe_token_types/GiantMIDI_TSD.pdf
Binary file not shown.
Binary file added analysis/bpe_token_types/MMD_REMIPlus.pdf
Binary file not shown.
Binary file added analysis/bpe_token_types/MMD_TSDPlus.pdf
Binary file not shown.
Binary file added analysis/bpe_token_types/Maestro_REMI.pdf
Binary file not shown.
Binary file added analysis/bpe_token_types/Maestro_TSD.pdf
Binary file not shown.
Binary file not shown.
Binary file added analysis/datasets_features/datasets_onset.pdf
Binary file not shown.
Binary file not shown.
Binary file added analysis/datasets_features/datasets_pitch.pdf
Binary file not shown.
Binary file added analysis/datasets_features/datasets_velocity.pdf
Binary file not shown.
