Version 0.5.0
Labbeti committed Dec 8, 2023
1 parent e3c161d commit 45139d6
Showing 59 changed files with 2,024 additions and 431 deletions.
4 changes: 2 additions & 2 deletions .github/workflows/python-package-pip.yaml
@@ -10,7 +10,7 @@ on:

env:
CACHE_NUMBER: 0 # increase to reset cache manually
TMPDIR: '/tmp'
AAC_METRICS_TMP_PATH: '/tmp'

# Cancel workflow if a new push occurs
concurrency:
@@ -23,7 +23,7 @@ jobs:

strategy:
matrix:
os: [ubuntu-latest,windows-latest]
os: [ubuntu-latest,windows-latest,macos-latest]
python-version: ["3.9"]
java-version: ["11"]

1 change: 1 addition & 0 deletions .gitmodules
@@ -1,6 +1,7 @@
[submodule "caption-evaluation-tools"]
path = tests/caption-evaluation-tools
url = https://github.com/audio-captioning/caption-evaluation-tools
ignore = dirty
branch = master

[submodule "fense"]
12 changes: 12 additions & 0 deletions CHANGELOG.md
@@ -2,6 +2,18 @@

All notable changes to this project will be documented in this file.

## [0.5.0] 2023-12-08
### Added
- New `Vocab` metric to compute vocabulary size and vocabulary ratio.
- New `BERTScoreMRefs` metric wrapper to compute BERTScore with multiple references.

### Changed
- Rename metric `FluErr` to `FER`.

### Fixed
- `METEOR` localization issue. ([#9](https://github.com/Labbeti/aac-metrics/issues/9))
- `SPIDErMax` output when `return_all_scores=False`.

## [0.4.6] 2023-10-10
### Added
- Argument `clean_archives` for `SPICE` download.
4 changes: 2 additions & 2 deletions CITATION.cff
@@ -19,5 +19,5 @@ keywords:
- captioning
- audio-captioning
license: MIT
version: 0.4.6
date-released: '2023-10-10'
version: 0.5.0
date-released: '2023-12-08'
111 changes: 57 additions & 54 deletions README.md
@@ -17,20 +17,22 @@ Metrics for evaluating Automated Audio Captioning systems, designed for PyTorch.
</div>

## Why using this package?
- **Easy installation and download**
- **Same results than [caption-evaluation-tools](https://github.com/audio-captioning/caption-evaluation-tools) and [fense](https://github.com/blmoistawinde/fense) repositories**
- **Provides the following metrics:**
- **Easy to install and download**
- **Produces the same results as the [caption-evaluation-tools](https://github.com/audio-captioning/caption-evaluation-tools) and [fense](https://github.com/blmoistawinde/fense) repositories**
- **Provides 12 different metrics:**
- BLEU [[1]](#bleu)
- ROUGE-L [[2]](#rouge-l)
- METEOR [[3]](#meteor)
- CIDEr-D [[4]](#cider)
- SPICE [[5]](#spice)
- SPIDEr [[6]](#spider)
- SPIDEr-max [[7]](#spider-max)
- SBERT-sim [[8]](#fense)
- Fluency Error [[8]](#fense)
- FENSE [[8]](#fense)
- SPIDEr-FL [[9]](#spider-fl)
- BERTScore [[7]](#bertscore)
- SPIDEr-max [[8]](#spider-max)
- SBERT-sim [[9]](#fense)
- FER [[9]](#fense)
- FENSE [[9]](#fense)
- SPIDEr-FL [[10]](#spider-fl)
- Vocab (unique word vocabulary)

## Installation
Install the pip package:
@@ -100,28 +102,37 @@ Each metrics also exists as a python class version, like `aac_metrics.classes.ci
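
For readers skimming this diff, here is a minimal usage sketch of the two APIs referred to above: the functional `evaluate` entry point and the class-based metrics. It is a sketch only; the exact signatures, default metric set, and returned score names are assumptions about this version of the package rather than something recorded in this commit.

```python
from aac_metrics import evaluate
from aac_metrics.classes.cider_d import CIDErD

# One candidate caption per audio clip, several references per clip.
candidates = ["a man is speaking", "rain falls on a roof"]
mult_references = [
    ["a man speaks", "someone is talking"],
    ["heavy rain is falling on a surface", "it is raining hard"],
]

# Functional API: computes a default set of metrics in one call (assumed signature).
corpus_scores, sentence_scores = evaluate(candidates, mult_references)
print(corpus_scores)  # assumed keys such as "bleu_1", "cider_d", "spider", ...

# Class API: instantiate one metric and call it like a function (assumed behavior).
cider_d = CIDErD(return_all_scores=True)
corpus_scores, sentence_scores = cider_d(candidates, mult_references)
```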

## Metrics
### Legacy metrics
| Metric | Python Class | Origin | Range | Short description |
| Metric name | Python Class | Origin | Range | Short description |
|:---|:---|:---|:---|:---|
| BLEU [[1]](#bleu) | `BLEU` | machine translation | [0, 1] | Precision of n-grams |
| ROUGE-L [[2]](#rouge-l) | `ROUGEL` | text summarization | [0, 1] | FScore of the longest common subsequence |
| METEOR [[3]](#meteor) | `METEOR` | machine translation | [0, 1] | Cosine-similarity of frequencies with synonyms matching |
| CIDEr-D [[4]](#cider) | `CIDErD` | image captioning | [0, 10] | Cosine-similarity of TF-IDF computed on n-grams |
| SPICE [[5]](#spice) | `SPICE` | image captioning | [0, 1] | FScore of a semantic graph |
| SPIDEr [[6]](#spider) | `SPIDEr` | image captioning | [0, 5.5] | Mean of CIDEr-D and SPICE |
| BERTScore [[7]](#bertscore) | `BERTScoreMRefs` | text generation | [0, 1] | FScore of BERT embeddings. In contrast to torchmetrics, it supports multiple references per file (see the sketch below). |
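
Since multi-reference support is the key difference from the torchmetrics implementation, here is a hedged sketch of how the new `BERTScoreMRefs` wrapper might be called. The import path matches the `aac_metrics.classes.bert_score_mrefs` module added in this commit, but the constructor arguments, call convention and output keys are assumptions.

```python
from aac_metrics.classes.bert_score_mrefs import BERTScoreMRefs

candidates = ["birds chirp in the distance"]
mult_references = [
    ["birds are singing far away", "several birds chirp", "chirping of birds"],
]

# The wrapper scores each candidate against all of its references and reduces
# the per-reference scores (assumed default behavior; the reduction may be configurable).
bert_score = BERTScoreMRefs(return_all_scores=True)
corpus_scores, sentence_scores = bert_score(candidates, mult_references)
print(corpus_scores)  # assumed keys like "bert_score.f1", "bert_score.precision", ...
```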

### AAC-specific metrics
| Metric name | Python Class | Origin | Range | Short description |
|:---|:---|:---|:---|:---|
| SPIDEr-max [[7]](#spider-max) | `SPIDErMax` | audio captioning | [0, 5.5] | Max of SPIDEr scores for multiples candidates |
| SBERT-sim [[8]](#spider-max) | `SBERTSim` | audio captioning | [-1, 1] | Cosine-similarity of **Sentence-BERT embeddings** |
| Fluency error rate [[8]](#spider-max) | `FluErr` | audio captioning | [0, 1] | Detect fluency errors in sentences with a pretrained model |
| FENSE [[8]](#fense) | `FENSE` | audio captioning | [-1, 1] | Combines SBERT-sim and Fluency Error rate |
| SPIDEr-FL [[9]](#spider-fl) | `SPIDErFL` | audio captioning | [0, 5.5] | Combines SPIDEr and Fluency Error rate |
| SPIDEr-max [[8]](#spider-max) | `SPIDErMax` | audio captioning | [0, 5.5] | Max of SPIDEr scores for multiple candidates |
| SBERT-sim [[9]](#fense) | `SBERTSim` | audio captioning | [-1, 1] | Cosine-similarity of **Sentence-BERT embeddings** |
| Fluency Error Rate [[9]](#fense) | `FER` | audio captioning | [0, 1] | Detects fluency errors in sentences with a pretrained model |
| FENSE [[9]](#fense) | `FENSE` | audio captioning | [-1, 1] | Combines SBERT-sim and Fluency Error rate |
| SPIDEr-FL [[10]](#spider-fl) | `SPIDErFL` | audio captioning | [0, 5.5] | Combines SPIDEr and Fluency Error rate |

### Other metrics
| Metric name | Python Class | Origin | Range | Short description |
|:---|:---|:---|:---|:---|
| Vocabulary | `Vocab` | text generation | [0, +$\infty$[ | Number of unique words in candidates (see the sketch below). |
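
As a quick illustration of the new `Vocab` metric listed above, the sketch below follows the same candidates/references calling convention as the other classes. The constructor arguments and the exact output keys (a vocabulary size plus a candidate-to-reference vocabulary ratio, as suggested by the changelog entry) are assumptions.

```python
from aac_metrics.classes.vocab import Vocab

candidates = ["a dog barks", "a dog barks loudly", "water is running"]
mult_references = [
    ["a dog is barking"],
    ["a dog barks twice"],
    ["water runs from a tap"],
]

vocab = Vocab(return_all_scores=True)
corpus_scores, _ = vocab(candidates, mult_references)
# Assumed outputs: number of unique words used in the candidates and a ratio
# of candidate vocabulary size to reference vocabulary size.
print(corpus_scores)
```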

### AAC metrics not implemented
- CB-Score [[10]](#cb-score)
- SPICE+ [[11]](#spice-plus)
- ACES [[12]](#aces) (can be found here: https://github.com/GlJS/ACES)
### Future directions
This package does not yet include every metric dedicated to audio captioning. Feel free to open a pull request or to ask me by email if you want one of them added. The metrics not yet included are listed here:
- CB-Score [[11]](#cb-score)
- SPICE+ [[12]](#spice-plus)
- ACES [[13]](#aces) (can be found here: https://github.com/GlJS/ACES)
- SBF [[14]](#sbf)
- s2v [[15]](#s2v)

## Requirements
This package has been developed for Ubuntu 20.04, and it is expected to work on most Linux distributions. Windows is not officially supported.
@@ -136,6 +147,7 @@ pyyaml >= 6.0
tqdm >= 4.64.0
sentence-transformers >= 2.2.2
transformers < 4.31.0
torchmetrics >= 0.11.4
```

### External requirements
@@ -154,64 +166,54 @@ No. Most of these metrics use numpy or external java programs to run, which prev
### Do metrics work on Windows/Mac OS?
Maybe. Most of the metrics only need Python to run, which can be done on Windows. However, you might expect errors with the METEOR metric, SPICE-based metrics and the PTB tokenizer, since they require an external Java program to run.

## SPIDEr-max metric
## About SPIDEr-max metric
SPIDEr-max [[7]](#spider-max) is a metric based on SPIDEr that takes into account multiple candidates for the same audio. It computes the maximum of the SPIDEr scores for each candidate to balance the high sensitivity to the frequency of the words generated by the model. For more detail, please see the [documentation about SPIDEr-max](https://aac-metrics.readthedocs.io/en/stable/spider_max.html).
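
To make the "multiple candidates per audio" input shape concrete, here is a hedged sketch of how SPIDEr-max might be computed with this package. The functional name `spider_max` mirrors the `SPIDErMax` class listed above, but its exact module path, signature and output keys are assumptions.

```python
from aac_metrics.functional.spider_max import spider_max

# Several candidate captions per audio clip (e.g. sampled from beam search),
# and several reference captions per clip.
mult_candidates = [
    ["a man speaks", "a man is speaking", "a person talks"],
    ["rain falls", "it is raining", "rain hits a roof"],
]
mult_references = [
    ["a man is talking", "someone speaks"],
    ["heavy rain is falling", "it rains hard"],
]

# SPIDEr-max keeps, for each clip, the best SPIDEr score over its candidates
# (assumed signature and return format).
corpus_scores, sentence_scores = spider_max(mult_candidates, mult_references)
print(corpus_scores)  # assumed to contain a "spider_max" corpus-level score
```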

## References
#### BLEU
[1] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “BLEU: a
method for automatic evaluation of machine translation,” in Proceed-
ings of the 40th Annual Meeting on Association for Computational
Linguistics - ACL ’02. Philadelphia, Pennsylvania: Association
for Computational Linguistics, 2001, p. 311. [Online]. Available:
http://portal.acm.org/citation.cfm?doid=1073083.1073135
[1] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “BLEU: a method for automatic evaluation of machine translation,” in Proceedings of the 40th Annual Meeting on Association for Computational Linguistics - ACL ’02. Philadelphia, Pennsylvania: Association for Computational Linguistics, 2001, p. 311. [Online]. Available: http://portal.acm.org/citation.cfm?doid=1073083.1073135

#### ROUGE-L
[2] C.-Y. Lin, “ROUGE: A package for automatic evaluation of summaries,”
in Text Summarization Branches Out. Barcelona, Spain: Association
for Computational Linguistics, Jul. 2004, pp. 74–81. [Online]. Available:
https://aclanthology.org/W04-1013
[2] C.-Y. Lin, “ROUGE: A package for automatic evaluation of summaries,” in Text Summarization Branches Out. Barcelona, Spain: Association for Computational Linguistics, Jul. 2004, pp. 74–81. [Online]. Available: https://aclanthology.org/W04-1013

#### METEOR
[3] M. Denkowski and A. Lavie, “Meteor Universal: Language Specific
Translation Evaluation for Any Target Language,” in Proceedings of the
Ninth Workshop on Statistical Machine Translation. Baltimore, Maryland,
USA: Association for Computational Linguistics, 2014, pp. 376–380.
[Online]. Available: http://aclweb.org/anthology/W14-3348
[3] M. Denkowski and A. Lavie, “Meteor Universal: Language Specific Translation Evaluation for Any Target Language,” in Proceedings of the Ninth Workshop on Statistical Machine Translation. Baltimore, Maryland, USA: Association for Computational Linguistics, 2014, pp. 376–380. [Online]. Available: http://aclweb.org/anthology/W14-3348

#### CIDEr
[4] R. Vedantam, C. L. Zitnick, and D. Parikh, “CIDEr: Consensus-based
Image Description Evaluation,” arXiv:1411.5726 [cs], Jun. 2015, arXiv:
1411.5726. [Online]. Available: http://arxiv.org/abs/1411.5726
[4] R. Vedantam, C. L. Zitnick, and D. Parikh, “CIDEr: Consensus-based Image Description Evaluation,” arXiv:1411.5726 [cs], Jun. 2015, [Online]. Available: http://arxiv.org/abs/1411.5726

#### SPICE
[5] P. Anderson, B. Fernando, M. Johnson, and S. Gould, “SPICE: Semantic
Propositional Image Caption Evaluation,” arXiv:1607.08822 [cs], Jul. 2016,
arXiv: 1607.08822. [Online]. Available: http://arxiv.org/abs/1607.08822
[5] P. Anderson, B. Fernando, M. Johnson, and S. Gould, “SPICE: Semantic Propositional Image Caption Evaluation,” arXiv:1607.08822 [cs], Jul. 2016, [Online]. Available: http://arxiv.org/abs/1607.08822

#### SPIDEr
[6] S. Liu, Z. Zhu, N. Ye, S. Guadarrama, and K. Murphy, “Improved Image
Captioning via Policy Gradient optimization of SPIDEr,” 2017 IEEE Inter-
national Conference on Computer Vision (ICCV), pp. 873–881, Oct. 2017,
arXiv: 1612.00370. [Online]. Available: http://arxiv.org/abs/1612.00370
[6] S. Liu, Z. Zhu, N. Ye, S. Guadarrama, and K. Murphy, “Improved Image Captioning via Policy Gradient optimization of SPIDEr,” 2017 IEEE International Conference on Computer Vision (ICCV), pp. 873–881, Oct. 2017, arXiv: 1612.00370. [Online]. Available: http://arxiv.org/abs/1612.00370

#### BERTScore
[7] T. Zhang*, V. Kishore*, F. Wu*, K. Q. Weinberger, and Y. Artzi, “BERTScore: Evaluating Text Generation with BERT,” 2020. [Online]. Available: https://openreview.net/forum?id=SkeHuCVFDr

#### SPIDEr-max
[7] E. Labbé, T. Pellegrini, and J. Pinquier, “Is my automatic audio captioning system so bad? spider-max: a metric to consider several caption candidates,” Nov. 2022. [Online]. Available: https://hal.archives-ouvertes.fr/hal-03810396
[8] E. Labbé, T. Pellegrini, and J. Pinquier, “Is my automatic audio captioning system so bad? spider-max: a metric to consider several caption candidates,” Nov. 2022. [Online]. Available: https://hal.archives-ouvertes.fr/hal-03810396

#### FENSE
[8] Z. Zhou, Z. Zhang, X. Xu, Z. Xie, M. Wu, and K. Q. Zhu, Can Audio Captions Be Evaluated with Image Caption Metrics? arXiv, 2022. [Online]. Available: http://arxiv.org/abs/2110.04684
[9] Z. Zhou, Z. Zhang, X. Xu, Z. Xie, M. Wu, and K. Q. Zhu, Can Audio Captions Be Evaluated with Image Caption Metrics? arXiv, 2022. [Online]. Available: http://arxiv.org/abs/2110.04684

#### SPIDEr-FL
[9] DCASE website task6a description: https://dcase.community/challenge2023/task-automated-audio-captioning#evaluation
[10] DCASE website task6a description: https://dcase.community/challenge2023/task-automated-audio-captioning#evaluation

#### CB-score
[11] I. Martín-Morató, M. Harju, and A. Mesaros, “A Summarization Approach to Evaluating Audio Captioning,” Nov. 2022. [Online]. Available: https://dcase.community/documents/workshop2022/proceedings/DCASE2022Workshop_Martin-Morato_35.pdf

#### SPICE-plus
[10] F. Gontier, R. Serizel, and C. Cerisara, “SPICE+: Evaluation of Automatic Audio Captioning Systems with Pre-Trained Language Models,” in ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5. doi: 10.1109/ICASSP49357.2023.10097021.
[12] F. Gontier, R. Serizel, and C. Cerisara, “SPICE+: Evaluation of Automatic Audio Captioning Systems with Pre-Trained Language Models,” in ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5. doi: 10.1109/ICASSP49357.2023.10097021.

#### ACES
[12] G. Wijngaard, E. Formisano, B. L. Giordano, M. Dumontier, “ACES: Evaluating Automated Audio Captioning Models on the Semantics of Sounds”, in EUSIPCO 2023, 2023.
[13] G. Wijngaard, E. Formisano, B. L. Giordano, M. Dumontier, “ACES: Evaluating Automated Audio Captioning Models on the Semantics of Sounds”, in EUSIPCO 2023, 2023.

#### SBF
[14] R. Mahfuz, Y. Guo, A. K. Sridhar, and E. Visser, Detecting False Alarms and Misses in Audio Captions. 2023. [Online]. Available: https://arxiv.org/pdf/2309.03326.pdf

#### s2v
[15] S. Bhosale, R. Chakraborty, and S. K. Kopparapu, “A Novel Metric For Evaluating Audio Caption Similarity,” in ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5. doi: 10.1109/ICASSP49357.2023.10096526.

## Citation
If you use **SPIDEr-max**, you can cite the following paper using BibTex :
@@ -227,20 +229,21 @@
}
```

If you use this software, please consider cite it as below :
If you use this software, please consider citing it as "Labbe, E. (2023). aac-metrics: Metrics for evaluating Automated Audio Captioning systems for PyTorch.", or use the following BibTeX citation:

```
@software{
Labbe_aac-metrics_2023,
Labbe_aac_metrics_2023,
author = {Labbé, Etienne},
license = {MIT},
month = {10},
month = {12},
title = {{aac-metrics}},
url = {https://github.com/Labbeti/aac-metrics/},
version = {0.4.6},
version = {0.5.0},
year = {2023},
}
```

## Contact
Maintainer:
- Etienne Labbé "Labbeti": [email protected]
- Étienne Labbé "Labbeti": [email protected]
7 changes: 7 additions & 0 deletions docs/aac_metrics.classes.bert_score_mrefs.rst
@@ -0,0 +1,7 @@
aac\_metrics.classes.bert\_score\_mrefs module
==============================================

.. automodule:: aac_metrics.classes.bert_score_mrefs
:members:
:undoc-members:
:show-inheritance:
7 changes: 7 additions & 0 deletions docs/aac_metrics.classes.fer.rst
@@ -0,0 +1,7 @@
aac\_metrics.classes.fer module
===============================

.. automodule:: aac_metrics.classes.fer
:members:
:undoc-members:
:show-inheritance:
7 changes: 7 additions & 0 deletions docs/aac_metrics.classes.vocab.rst
@@ -0,0 +1,7 @@
aac\_metrics.classes.vocab module
=================================

.. automodule:: aac_metrics.classes.vocab
:members:
:undoc-members:
:show-inheritance:
7 changes: 7 additions & 0 deletions docs/aac_metrics.functional.bert_score_mrefs.rst
@@ -0,0 +1,7 @@
aac\_metrics.functional.bert\_score\_mrefs module
=================================================

.. automodule:: aac_metrics.functional.bert_score_mrefs
:members:
:undoc-members:
:show-inheritance:
7 changes: 7 additions & 0 deletions docs/aac_metrics.functional.fer.rst
@@ -0,0 +1,7 @@
aac\_metrics.functional.fer module
==================================

.. automodule:: aac_metrics.functional.fer
:members:
:undoc-members:
:show-inheritance:
7 changes: 0 additions & 7 deletions docs/aac_metrics.functional.fluerr.rst

This file was deleted.

7 changes: 7 additions & 0 deletions docs/aac_metrics.functional.vocab.rst
@@ -0,0 +1,7 @@
aac\_metrics.functional.vocab module
====================================

.. automodule:: aac_metrics.functional.vocab
:members:
:undoc-members:
:show-inheritance:
7 changes: 7 additions & 0 deletions docs/aac_metrics.utils.cmdline.rst
@@ -0,0 +1,7 @@
aac\_metrics.utils.cmdline module
=================================

.. automodule:: aac_metrics.utils.cmdline
:members:
:undoc-members:
:show-inheritance:
22 changes: 3 additions & 19 deletions pyproject.toml
@@ -21,15 +21,7 @@ classifiers = [
maintainers = [
{name = "Etienne Labbé (Labbeti)", email = "[email protected]"},
]
dependencies = [
"torch>=1.10.1",
"numpy>=1.21.2",
"pyyaml>=6.0",
"tqdm>=4.64.0",
"sentence-transformers>=2.2.2",
"transformers<4.31.0",
]
dynamic = ["version"]
dynamic = ["version", "dependencies", "optional-dependencies"]

[project.urls]
Homepage = "https://pypi.org/project/aac-metrics/"
@@ -43,19 +35,11 @@ aac-metrics-download = "aac_metrics.download:_main_download"
aac-metrics-eval = "aac_metrics.eval:_main_eval"
aac-metrics-info = "aac_metrics.info:print_install_info"

[project.optional-dependencies]
dev = [
"pytest==7.1.2",
"flake8==4.0.1",
"black==22.8.0",
"scikit-image==0.19.2",
"matplotlib==3.5.2",
"torchmetrics>=0.10",
]

[tool.setuptools.packages.find]
where = ["src"] # list of folders that contain the packages (["."] by default)
include = ["aac_metrics*"] # package names should match these glob patterns (["*"] by default)

[tool.setuptools.dynamic]
version = {attr = "aac_metrics.__version__"}
dependencies = {file = ["requirements.txt"]}
optional-dependencies = {dev = { file = ["requirements-dev.txt"] }}
9 changes: 9 additions & 0 deletions requirements-dev.txt
@@ -0,0 +1,9 @@
# -*- coding: utf-8 -*-

pytest==7.1.2
flake8==4.0.1
black==22.8.0
scikit-image==0.19.2
matplotlib==3.5.2
ipykernel==6.9.1
twine==4.0.1
1 change: 1 addition & 0 deletions requirements.txt
@@ -6,3 +6,4 @@ pyyaml>=6.0
tqdm>=4.64.0
sentence-transformers>=2.2.2
transformers<4.31.0
torchmetrics>=0.11.4