code update + page upload
Natooz committed Oct 7, 2023
1 parent f5899b8 commit aa0b134
Showing 998 changed files with 105,895 additions and 2,963 deletions.
79 changes: 79 additions & 0 deletions .github/workflows/hugo.yaml
@@ -0,0 +1,79 @@
# Sample workflow for building and deploying a Hugo site to GitHub Pages
name: Deploy Hugo site to Pages

on:
  # Runs on pushes targeting the default branch
  push:
    branches:
      - main

  # Allows you to run this workflow manually from the Actions tab
  workflow_dispatch:

# Sets permissions of the GITHUB_TOKEN to allow deployment to GitHub Pages
permissions:
  contents: read
  pages: write
  id-token: write

# Allow only one concurrent deployment, skipping runs queued between the run in-progress and latest queued.
# However, do NOT cancel in-progress runs as we want to allow these production deployments to complete.
concurrency:
  group: "pages"
  cancel-in-progress: false

# Default to bash
defaults:
  run:
    shell: bash

jobs:
  # Build job
  build:
    runs-on: ubuntu-latest
    env:
      HUGO_VERSION: 0.111.3
    steps:
      - name: Install Hugo CLI
        run: |
          wget -O ${{ runner.temp }}/hugo.deb https://github.com/gohugoio/hugo/releases/download/v${HUGO_VERSION}/hugo_extended_${HUGO_VERSION}_linux-amd64.deb \
          && sudo dpkg -i ${{ runner.temp }}/hugo.deb
      - name: Install Dart Sass Embedded
        run: sudo snap install dart-sass-embedded
      - name: Checkout
        uses: actions/checkout@v3
        with:
          submodules: recursive
          fetch-depth: 0
      - name: Setup Pages
        id: pages
        uses: actions/configure-pages@v3
      - name: Install Node.js dependencies
        run: "[[ -f package-lock.json || -f npm-shrinkwrap.json ]] && npm ci || true"
      - name: Build with Hugo
        env:
          # For maximum backward compatibility with Hugo modules
          HUGO_ENVIRONMENT: production
          HUGO_ENV: production
        run: |
          cp -r ./page/* . && \
          hugo \
            --gc \
            --minify \
            --baseURL "${{ steps.pages.outputs.base_url }}/"
      - name: Upload artifact
        uses: actions/upload-pages-artifact@v1
        with:
          path: ./public

  # Deployment job
  deploy:
    environment:
      name: github-pages
      url: ${{ steps.deployment.outputs.page_url }}
    runs-on: ubuntu-latest
    needs: build
    steps:
      - name: Deploy to GitHub Pages
        id: deployment
        uses: actions/deploy-pages@v2
146 changes: 135 additions & 11 deletions .gitignore
@@ -1,16 +1,140 @@
# macOS DS_STORE files
**/*.DS_STORE
# Python precompiled files
*.pyc
# PyCharm config files
.idea/
# PyCharm Virtual Environment
# macOS DS_STORE files
*.DS_STORE

data/
data_gen_tokens/
runs/
runs_dis/
runs_classifier/

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
.python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
#
data/**
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# personal test file
test.py
# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Dataset directory
# data/*
# Pyre type checker
.pyre/
113 changes: 37 additions & 76 deletions README.md
@@ -1,87 +1,48 @@
# Byte Pair Encoding for Symbolic Music
# Byte Pair Encoding for Symbolic Music (EMNLP 2023)

Code of the paper [*Byte Pair Encoding for Symbolic Music*](https://arxiv.org/abs/2301.11975).
[Paper](https://arxiv.org/abs/2301.11975)
[Demo website](https://Natooz.github.io/bpe-symbolic-music/)

## Steps to reproduce

1. `pip install -r requirements` to install requirements
2. Download the [GiantMIDI](https://github.com/bytedance/GiantMIDI-Piano/blob/master/disclaimer.md) dataset and put it in `data/`
3. `sh scripts/download_pop909.sh` to download and preprocess the [POP909](https://github.com/music-x-lab/POP909-Dataset) dataset
4. `python scripts/tokenize_datasets.py` to tokenize data and learn BPE
5. `python exp_gen.py` to train generative models and generate results
6. `python exp_cla.py` to train classification models and test them

[Scripts](./scripts) can be run to reproduce the analysis.

## BPE learning

<img src="figures/tokenizations_bpe_token_types/POP909-merged_TSD.png" alt="POP909 TSD" width="400"/><img src="figures/tokenizations_bpe_token_types/POP909-merged_REMI.png" alt="POP909 REMI" width="400"/>

<img src="figures/tokenizations_bpe_token_types/GiantMIDI_TSD.png" alt="GiantMIDI TSD" width="400"/><img src="figures/tokenizations_bpe_token_types/GiantMIDI_REMI.png" alt="GiantMIDI REMI" width="400"/>

In order, the figures above are for POP909 TSD, POP909 REMI, GiantMIDI TSD, and GiantMIDI REMI.

<img src="figures/bpe_nb_tok_combinations.png" alt="GiantMIDI REMI" width="800"/>

## Experiment results

We refer you to the tables of the paper.

## Learned embedding space

### Singular values

#### Generators: POP909 TSD, POP909 REMI, GiantMIDI TSD, and GiantMIDI REMI

<img src="figures/singular_value_gen/singular_value_POP909-merged_TSD.png" alt="POP909 TSD" width="200"/><img src="figures/singular_value_gen/singular_value_POP909-merged_REMI.png" alt="POP909 REMI" width="200"/><img src="figures/singular_value_gen/singular_value_GiantMIDI_TSD.png" alt="GiantMIDI TSD" width="200"/><img src="figures/singular_value_gen/singular_value_GiantMIDI_REMI.png" alt="GiantMIDI REMI" width="200"/>

#### Classifiers: $\mathrm{Cla}\_{small}$ TSD, $\mathrm{Cla}\_{small}$ REMI, $\mathrm{Cla}\_{large}$ TSD, and $\mathrm{Cla}\_{large}$ REMI

<img src="figures/singular_value_cla/singular_value_GiantMIDI_TSD.png" alt="Cla small TSD" width="200"/><img src="figures/singular_value_cla/singular_value_GiantMIDI_REMI.png" alt="Cla small REMI" width="200"/><img src="figures/singular_value_cla/singular_value_GiantMIDI_TSD_LARGE.png" alt="Cla large TSD" width="200"/><img src="figures/singular_value_cla/singular_value_GiantMIDI_REMI_LARGE.png" alt="Cla large REMI" width="200"/>
Byte Pair Encoding (BPE) is a compression technique that reduces the sequence length of a corpus by iteratively replacing its most frequent successions of bytes with newly created symbols. It is widely used in NLP, as it automatically creates vocabularies made of words or parts of words.
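
To make the merge procedure concrete, here is a minimal, illustrative Python sketch of BPE learning on a toy token corpus. It is not the implementation used in this repository (BPE is performed through MidiTok, see below), and the token names are hypothetical.

```python
# Toy BPE learning: repeatedly merge the most frequent adjacent pair (illustration only).
from collections import Counter


def learn_bpe(corpus: list[list[str]], num_merges: int) -> list[tuple[str, str]]:
    """Return the learned merge rules, mutating `corpus` in place."""
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair of symbols across all sequences
        pair_counts = Counter()
        for seq in corpus:
            pair_counts.update(zip(seq, seq[1:]))
        if not pair_counts:
            break
        best = pair_counts.most_common(1)[0][0]  # the most recurrent pair
        merges.append(best)
        new_symbol = best[0] + "+" + best[1]
        # Replace every occurrence of the pair with the new symbol
        for i, seq in enumerate(corpus):
            merged, j = [], 0
            while j < len(seq):
                if j + 1 < len(seq) and (seq[j], seq[j + 1]) == best:
                    merged.append(new_symbol)
                    j += 2
                else:
                    merged.append(seq[j])
                    j += 1
            corpus[i] = merged
    return merges


# Hypothetical note-attribute tokens: two notes, three tokens each
toy_corpus = [["Pitch_60", "Vel_90", "Dur_1.0", "Pitch_64", "Vel_90", "Dur_1.0"]]
print(learn_bpe(toy_corpus, 2))  # e.g. [('Vel_90', 'Dur_1.0'), ...]
print(toy_corpus)                # shorter sequence built from merged symbols
```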

### UMAP Generators
In this paper, we show that it can address two main concerns about how symbolic music was previously tokenized:

Figures are, in order, for no BPE, BPEx4, BPEx10, BPEx20, BPEx50, BPEx100, PVm, and PVDm.
1. The fairly long sequences resulting from using one token per note attribute (e.g. pitch, duration) and per time event. Long sequences are problematic, as the time and space complexity of Transformer models grows quadratically with the input sequence length (a short worked example follows this list).
2. The poor usage of the model's embedding space. Language models first project tokens into a learned embedding space, in which the embeddings (continuous representations of the tokens) are learned to represent their semantic information. This is an essential feature of such models, as it allows them to capture the meaning of the tokens and the data. In symbolic music, tokens usually represent only note attribute values or time values, which carry little information beyond their absolute value. Moreover, vocabularies often range between 200 and 500 tokens, which are then represented in 512 to 1024 dimensions. In such conditions, the embedding space is underused and the potential of the model is poorly exploited.
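
As a rough illustration of the first point, the sketch below compares the number of pairwise attention scores a single self-attention head computes before and after a hypothetical halving of the sequence length (the lengths are made up for the example): halving the sequence divides the quadratic term by four.

```python
# Illustration only: quadratic growth of self-attention cost with sequence length.
def attention_scores(seq_len: int) -> int:
    # One score per (query, key) pair in a single attention head
    return seq_len * seq_len


before, after = 1024, 512  # hypothetical lengths without / with BPE
print(attention_scores(before) / attention_scores(after))  # -> 4.0
```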

#### POP909 TSD
When applied to symbolic music, BPE drastically reduces the sequence length while creating new tokens that can represent whole notes, and even sequences of notes. The model's efficiency is greatly improved, and each token carries more information. This greatly improves the quality of generation, while speeding up inference by up to three times.

<img src="figures/umap_3d_gen/umap_3d_POP909-merged_TSD_noBPE.png" alt="No BPE" width="200"/><img src="figures/umap_3d_gen/umap_3d_POP909-merged_TSD_bpe4.png" alt="BPEx4" width="200"/><img src="figures/umap_3d_gen/umap_3d_POP909-merged_TSD_bpe10.png" alt="BPEx10" width="200"/><img src="figures/umap_3d_gen/umap_3d_POP909-merged_TSD_bpe20.png" alt="BPEx20" width="200"/>
BPE is fully implemented within [MidiTok](https://github.com/Natooz/MidiTok), allowing you to easily benefit from this method on top of most existing tokenizations.
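
Below is a hedged sketch of how BPE could be applied with MidiTok. The method names and arguments shown (`tokenize_midi_dataset`, `learn_bpe`, `apply_bpe_to_dataset`, `tokens_paths`, `vocab_size`) are assumptions that vary across MidiTok versions; refer to the MidiTok documentation and to the scripts in this repository for the exact calls.

```python
# Hedged sketch, not the repository's scripts: MidiTok API names may differ between versions.
from pathlib import Path

from miditok import REMI  # any supported tokenization (TSD, REMI, ...) could be used

midi_paths = list(Path("data").glob("**/*.mid"))
tokenizer = REMI()  # default configuration for the sake of the example

# 1) Tokenize the MIDI files to JSON token files (assumed method name)
tokenizer.tokenize_midi_dataset(midi_paths, Path("tokens_no_bpe"))

# 2) Learn BPE on the tokenized corpus, enlarging the vocabulary (assumed arguments)
tokenizer.learn_bpe(
    vocab_size=2000,
    tokens_paths=list(Path("tokens_no_bpe").glob("**/*.json")),
)

# 3) Convert the corpus with the learned merges (assumed method name)
tokenizer.apply_bpe_to_dataset(Path("tokens_no_bpe"), Path("tokens_bpe"))
```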

<img src="figures/umap_3d_gen/umap_3d_POP909-merged_TSD_bpe50.png" alt="BPEx50" width="200"/><img src="figures/umap_3d_gen/umap_3d_POP909-merged_TSD_bpe100.png" alt="BPEx100" width="200"/><img src="figures/umap_3d_gen/umap_3d_POP909-merged_TSD_PVm.png" alt="PVm" width="200"/><img src="figures/umap_3d_gen/umap_3d_POP909-merged_TSD_PVDm.png" alt="PVDm" width="200"/>
We invite you to read the paper and to check our [companion website](https://Natooz.github.io/bpe-symbolic-music/) to listen to generated results!

#### POP909 REMI

<img src="figures/umap_3d_gen/umap_3d_POP909-merged_REMI_noBPE.png" alt="No BPE" width="200"/><img src="figures/umap_3d_gen/umap_3d_POP909-merged_REMI_bpe4.png" alt="BPEx4" width="200"/><img src="figures/umap_3d_gen/umap_3d_POP909-merged_REMI_bpe10.png" alt="BPEx10" width="200"/><img src="figures/umap_3d_gen/umap_3d_POP909-merged_REMI_bpe20.png" alt="BPEx20" width="200"/>

<img src="figures/umap_3d_gen/umap_3d_POP909-merged_REMI_bpe50.png" alt="BPEx50" width="200"/><img src="figures/umap_3d_gen/umap_3d_POP909-merged_REMI_bpe100.png" alt="BPEx100" width="200"/><img src="figures/umap_3d_gen/umap_3d_POP909-merged_REMI_PVm.png" alt="PVm" width="200"/><img src="figures/umap_3d_gen/umap_3d_POP909-merged_REMI_PVDm.png" alt="PVDm" width="200"/>

#### GiantMIDI TSD

<img src="figures/umap_3d_gen/umap_3d_GiantMIDI_TSD_noBPE.png" alt="No BPE" width="200"/><img src="figures/umap_3d_gen/umap_3d_GiantMIDI_TSD_bpe4.png" alt="BPEx4" width="200"/><img src="figures/umap_3d_gen/umap_3d_GiantMIDI_TSD_bpe10.png" alt="BPEx10" width="200"/><img src="figures/umap_3d_gen/umap_3d_GiantMIDI_TSD_bpe20.png" alt="BPEx20" width="200"/>

<img src="figures/umap_3d_gen/umap_3d_GiantMIDI_TSD_bpe50.png" alt="BPEx50" width="200"/><img src="figures/umap_3d_gen/umap_3d_GiantMIDI_TSD_bpe100.png" alt="BPEx100" width="200"/><img src="figures/umap_3d_gen/umap_3d_GiantMIDI_TSD_PVm.png" alt="PVm" width="200"/><img src="figures/umap_3d_gen/umap_3d_GiantMIDI_TSD_PVDm.png" alt="PVDm" width="200"/>

#### GiantMIDI REMI

<img src="figures/umap_3d_gen/umap_3d_GiantMIDI_REMI_noBPE.png" alt="No BPE" width="200"/><img src="figures/umap_3d_gen/umap_3d_GiantMIDI_REMI_bpe4.png" alt="BPEx4" width="200"/><img src="figures/umap_3d_gen/umap_3d_GiantMIDI_REMI_bpe10.png" alt="BPEx10" width="200"/><img src="figures/umap_3d_gen/umap_3d_GiantMIDI_REMI_bpe20.png" alt="BPEx20" width="200"/>

<img src="figures/umap_3d_gen/umap_3d_GiantMIDI_REMI_bpe50.png" alt="BPEx50" width="200"/><img src="figures/umap_3d_gen/umap_3d_GiantMIDI_REMI_bpe100.png" alt="BPEx100" width="200"/><img src="figures/umap_3d_gen/umap_3d_GiantMIDI_REMI_PVm.png" alt="PVm" width="200"/><img src="figures/umap_3d_gen/umap_3d_GiantMIDI_REMI_PVDm.png" alt="PVDm" width="200"/>


### UMAP Classifiers

These figures are for $\mathrm{Cla}\_{small}$ and TSD. More figures can be found in [figures](./figures).

<img src="figures/umap_2d_cla/umap_2d_GiantMIDI_TSD_noBPE.png" alt="No BPE" width="200"/><img src="figures/umap_2d_cla/umap_2d_GiantMIDI_TSD_bpe4.png" alt="BPEx4" width="200"/><img src="figures/umap_2d_cla/umap_2d_GiantMIDI_TSD_bpe10.png" alt="BPEx10" width="200"/><img src="figures/umap_2d_cla/umap_2d_GiantMIDI_TSD_bpe20.png" alt="BPEx20" width="200"/>

<img src="figures/umap_2d_cla/umap_2d_GiantMIDI_TSD_bpe50.png" alt="BPEx50" width="200"/><img src="figures/umap_2d_cla/umap_2d_GiantMIDI_TSD_bpe100.png" alt="BPEx100" width="200"/><img src="figures/umap_2d_cla/umap_2d_GiantMIDI_TSD_PVm.png" alt="PVm" width="200"/><img src="figures/umap_2d_cla/umap_2d_GiantMIDI_TSD_PVDm.png" alt="PVDm" width="200"/>

### Intrinsic dimension

#### Generators: POP909 TSD, POP909 REMI, GiantMIDI TSD, and GiantMIDI REMI
## Steps to reproduce

<img src="figures/intrinsic_dimension_gen/intrinsic_dim_POP909-merged_TSD.png" alt="POP909 TSD" width="200"/><img src="figures/intrinsic_dimension_gen/intrinsic_dim_POP909-merged_REMI.png" alt="POP909 REMI" width="200"/><img src="figures/intrinsic_dimension_gen/intrinsic_dim_GiantMIDI_TSD.png" alt="GiantMIDI TSD" width="200"/><img src="figures/intrinsic_dimension_gen/intrinsic_dim_GiantMIDI_REMI.png" alt="GiantMIDI REMI" width="200"/>
1. `pip install -r requirements` to install requirements
2. Download the [Maestro](https://magenta.tensorflow.org/datasets/maestro) and [MMD](https://zenodo.org/record/5142664#.YQN3c5NKgWo) datasets and put them in `data/`
3. `python scripts/preprocess_maestro.py` and `python scripts/preprocess_for_octuple.py`
4. `python scripts/tokenize_datasets.py` to tokenize data and learn BPE
5. `python exp_generation.py` to train generative models and generate results
6. `python exp_pretrain.py` to pretrain classification models
7. `python exp_cla.py` to train classification models and test them

#### Classifiers: $\mathrm{Cla}\_{small}$ TSD, $\mathrm{Cla}\_{small}$ REMI, $\mathrm{Cla}\_{large}$ TSD, and $\mathrm{Cla}\_{large}$ REMI
[Scripts](./scripts) can be run to reproduce the analysis.

<img src="figures/intrinsic_dimension_cla/intrinsic_dim_GiantMIDI_TSD.png" alt="Cla small TSD" width="200"/><img src="figures/intrinsic_dimension_cla/intrinsic_dim_GiantMIDI_REMI.png" alt="Cla small REMI" width="200"/><img src="figures/intrinsic_dimension_cla/intrinsic_dim_GiantMIDI_TSD_LARGE.png" alt="Cla large TSD" width="200"/><img src="figures/intrinsic_dimension_cla/intrinsic_dim_GiantMIDI_REMI_LARGE.png" alt="Cla large REMI" width="200"/>
## Citation

(The ACL URL/DOI/pages will be added once the proceedings are published.)
```bibtex
@inproceedings{bpe-symbolic-music,
    title = "Byte Pair Encoding for Symbolic Music",
    author = "Fradet, Nathan and
      Gutowski, Nicolas and
      Chhel, Fabien and
      Briot, Jean-Pierre",
    booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/2301.11975",
}
```
Binary file added analysis/bpe_token_types/GiantMIDI_REMI.pdf
Binary file not shown.
Binary file added analysis/bpe_token_types/GiantMIDI_TSD.pdf
Binary file not shown.
Binary file added analysis/bpe_token_types/MMD_REMIPlus.pdf
Binary file not shown.
Binary file added analysis/bpe_token_types/MMD_TSDPlus.pdf
Binary file not shown.
Binary file added analysis/bpe_token_types/Maestro_REMI.pdf
Binary file not shown.
Binary file added analysis/bpe_token_types/Maestro_TSD.pdf
Binary file not shown.
Binary file not shown.
Binary file added analysis/datasets_features/datasets_onset.pdf
Binary file not shown.
Binary file not shown.
Binary file added analysis/datasets_features/datasets_pitch.pdf
Binary file not shown.
Binary file added analysis/datasets_features/datasets_velocity.pdf
Binary file not shown.
