Commit
Showing 59 changed files with 2,024 additions and 431 deletions.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
@@ -17,20 +17,22 @@ Metrics for evaluating Automated Audio Captioning systems, designed for PyTorch.
 </div>

 ## Why use this package?
-- **Easy installation and download**
-- **Same results than [caption-evaluation-tools](https://github.com/audio-captioning/caption-evaluation-tools) and [fense](https://github.com/blmoistawinde/fense) repositories**
-- **Provides the following metrics:**
+- **Easy to install and download**
+- **Produces the same results as the [caption-evaluation-tools](https://github.com/audio-captioning/caption-evaluation-tools) and [fense](https://github.com/blmoistawinde/fense) repositories**
+- **Provides 12 different metrics:**
   - BLEU [[1]](#bleu)
   - ROUGE-L [[2]](#rouge-l)
   - METEOR [[3]](#meteor)
   - CIDEr-D [[4]](#cider)
   - SPICE [[5]](#spice)
   - SPIDEr [[6]](#spider)
-  - SPIDEr-max [[7]](#spider-max)
-  - SBERT-sim [[8]](#fense)
-  - Fluency Error [[8]](#fense)
-  - FENSE [[8]](#fense)
-  - SPIDEr-FL [[9]](#spider-fl)
+  - BERTScore [[7]](#bertscore)
+  - SPIDEr-max [[8]](#spider-max)
+  - SBERT-sim [[9]](#fense)
+  - FER [[9]](#fense)
+  - FENSE [[9]](#fense)
+  - SPIDEr-FL [[10]](#spider-fl)
+  - Vocab (unique word vocabulary)

 ## Installation
 Install the pip package:
@@ -100,28 +102,37 @@ Each metrics also exists as a python class version, like `aac_metrics.classes.ci
 ## Metrics
+### Legacy metrics
-| Metric | Python Class | Origin | Range | Short description |
+| Metric name | Python Class | Origin | Range | Short description |
 |:---|:---|:---|:---|:---|
 | BLEU [[1]](#bleu) | `BLEU` | machine translation | [0, 1] | Precision of n-grams |
 | ROUGE-L [[2]](#rouge-l) | `ROUGEL` | text summarization | [0, 1] | FScore of the longest common subsequence |
 | METEOR [[3]](#meteor) | `METEOR` | machine translation | [0, 1] | Cosine-similarity of frequencies with synonym matching |
 | CIDEr-D [[4]](#cider) | `CIDErD` | image captioning | [0, 10] | Cosine-similarity of TF-IDF computed on n-grams |
 | SPICE [[5]](#spice) | `SPICE` | image captioning | [0, 1] | FScore of a semantic graph |
 | SPIDEr [[6]](#spider) | `SPIDEr` | image captioning | [0, 5.5] | Mean of CIDEr-D and SPICE |
+| BERTScore [[7]](#bertscore) | `BERTScoreMRefs` | text generation | [0, 1] | FScore of BERT embeddings. In contrast to torchmetrics, it supports multiple references per file. |
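As a concrete illustration of the "Precision of n-grams" description for BLEU in the table above, here is a minimal, self-contained sketch of clipped unigram precision. It only illustrates the idea; the package's actual `BLEU` class combines precisions over several n-gram orders with a brevity penalty.

```python
from collections import Counter


def unigram_precision(candidate: str, reference: str) -> float:
    """BLEU-style clipped unigram precision for one candidate/reference pair."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Each candidate word's count is clipped by its count in the reference,
    # so repeating a matching word cannot inflate the score.
    matched = sum(min(cnt, ref_counts[word]) for word, cnt in cand_counts.items())
    total = sum(cand_counts.values())
    return matched / total if total > 0 else 0.0
```

For example, the candidate "a dog barks loudly" against the reference "a dog barks" matches three of four unigrams, giving 0.75.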
+### AAC-specific metrics
+| Metric name | Python Class | Origin | Range | Short description |
+|:---|:---|:---|:---|:---|
-| SPIDEr-max [[7]](#spider-max) | `SPIDErMax` | audio captioning | [0, 5.5] | Max of SPIDEr scores for multiples candidates |
-| SBERT-sim [[8]](#spider-max) | `SBERTSim` | audio captioning | [-1, 1] | Cosine-similarity of **Sentence-BERT embeddings** |
-| Fluency error rate [[8]](#spider-max) | `FluErr` | audio captioning | [0, 1] | Detect fluency errors in sentences with a pretrained model |
-| FENSE [[8]](#fense) | `FENSE` | audio captioning | [-1, 1] | Combines SBERT-sim and Fluency Error rate |
-| SPIDEr-FL [[9]](#spider-fl) | `SPIDErFL` | audio captioning | [0, 5.5] | Combines SPIDEr and Fluency Error rate |
+| SPIDEr-max [[8]](#spider-max) | `SPIDErMax` | audio captioning | [0, 5.5] | Max of SPIDEr scores for multiple candidates |
+| SBERT-sim [[9]](#fense) | `SBERTSim` | audio captioning | [-1, 1] | Cosine-similarity of **Sentence-BERT embeddings** |
+| Fluency Error Rate [[9]](#fense) | `FER` | audio captioning | [0, 1] | Detects fluency errors in sentences with a pretrained model |
+| FENSE [[9]](#fense) | `FENSE` | audio captioning | [-1, 1] | Combines SBERT-sim and Fluency Error Rate |
+| SPIDEr-FL [[10]](#spider-fl) | `SPIDErFL` | audio captioning | [0, 5.5] | Combines SPIDEr and Fluency Error Rate |
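To make the "Combines SBERT-sim and Fluency Error Rate" description for FENSE concrete, here is a hedged sketch of that combination: the Sentence-BERT similarity is scaled down when the fluency-error detector fires. The `threshold` and `penalty` defaults below are illustrative assumptions (not necessarily the package's values), and `sbert_sim`/`error_prob` are placeholder inputs standing in for real model outputs.

```python
def fense_score(sbert_sim: float, error_prob: float,
                threshold: float = 0.5, penalty: float = 0.9) -> float:
    """FENSE-style combination of SBERT similarity and a fluency-error penalty.

    If the detected fluency-error probability exceeds the threshold,
    the similarity score is scaled down by a factor of (1 - penalty).
    """
    if error_prob > threshold:
        return sbert_sim * (1.0 - penalty)
    return sbert_sim
```

With these assumed defaults, a fluent caption keeps its full similarity score, while one flagged as disfluent keeps only 10% of it.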
|
||
### Other metrics | ||
| Metric name | Python Class | Origin | Range | Short description | | ||
|:---|:---|:---|:---|:---| | ||
| Vocabulary | `Vocab` | text generation | [0, +$\infty$[ | Number of unique words in candidates. | | ||
|
||
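The `Vocab` metric above is simple enough to sketch directly as described: the number of unique words across all candidate captions. This is a minimal illustration; the package's own tokenization may differ.

```python
def vocab_size(candidates: list[str]) -> int:
    """Count the unique words across all candidate captions."""
    words: set[str] = set()
    for caption in candidates:
        # Naive whitespace tokenization, lowercased for illustration.
        words.update(caption.lower().split())
    return len(words)
```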
-### AAC metrics not implemented
-- CB-Score [[10]](#cb-score)
-- SPICE+ [[11]](#spice-plus)
-- ACES [[12]](#aces) (can be found here: https://github.com/GlJS/ACES)
+### Future directions
+This package currently does not include all metrics dedicated to audio captioning. Feel free to open a pull request, or ask me by email, if you want them to be included. The metrics not yet included are listed here:
+- CB-Score [[11]](#cb-score)
+- SPICE+ [[12]](#spice-plus)
+- ACES [[13]](#aces) (can be found here: https://github.com/GlJS/ACES)
+- SBF [[14]](#sbf)
+- s2v [[15]](#s2v)

 ## Requirements
 This package has been developed for Ubuntu 20.04, and it is expected to work on most Linux distributions. Windows is not officially supported.
@@ -136,6 +147,7 @@ pyyaml >= 6.0
 tqdm >= 4.64.0
 sentence-transformers >= 2.2.2
 transformers < 4.31.0
+torchmetrics >= 0.11.4
 ```

 ### External requirements
@@ -154,64 +166,54 @@ No. Most of these metrics use numpy or external java programs to run, which prev
 ### Do metrics work on Windows/Mac OS?
 Maybe. Most of the metrics only need Python to run, which can be done on Windows. However, you may encounter errors with the METEOR metric, SPICE-based metrics and the PTB tokenizer, since they require an external Java program to run.

-## SPIDEr-max metric
+## About SPIDEr-max metric
 SPIDEr-max [[8]](#spider-max) is a metric based on SPIDEr that takes into account multiple candidates for the same audio. It computes the maximum of the SPIDEr scores over the candidates, to balance SPIDEr's high sensitivity to the frequency of the words generated by the model. For more details, please see the [documentation about SPIDEr-max](https://aac-metrics.readthedocs.io/en/stable/spider_max.html).
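The core idea can be sketched in a few lines: given several candidate captions per audio and a per-caption SPIDEr scoring function, SPIDEr-max keeps the best score. Here `spider_score` is a placeholder for the real SPIDEr computation (mean of CIDEr-D and SPICE), and the dummy word-overlap scorer exists only so the sketch is runnable.

```python
from typing import Callable

Scorer = Callable[[str, list[str]], float]


def spider_max(candidates: list[str], references: list[str],
               spider_score: Scorer) -> float:
    """Max of per-candidate SPIDEr scores for a single audio clip."""
    return max(spider_score(cand, references) for cand in candidates)


def word_overlap(cand: str, refs: list[str]) -> float:
    """Dummy stand-in for SPIDEr: fraction of candidate words found in the refs."""
    ref_words = set(" ".join(refs).split())
    words = cand.split()
    return sum(w in ref_words for w in words) / len(words)
```

Usage: `spider_max(["a cat", "a dog barks"], ["a dog barks loudly"], word_overlap)` picks the better of the two candidates.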
 ## References
 #### BLEU
 [1] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “BLEU: a method for automatic evaluation of machine translation,” in Proceedings of the 40th Annual Meeting on Association for Computational Linguistics - ACL ’02. Philadelphia, Pennsylvania: Association for Computational Linguistics, 2001, p. 311. [Online]. Available: http://portal.acm.org/citation.cfm?doid=1073083.1073135

 #### ROUGE-L
 [2] C.-Y. Lin, “ROUGE: A package for automatic evaluation of summaries,” in Text Summarization Branches Out. Barcelona, Spain: Association for Computational Linguistics, Jul. 2004, pp. 74–81. [Online]. Available: https://aclanthology.org/W04-1013

 #### METEOR
 [3] M. Denkowski and A. Lavie, “Meteor Universal: Language Specific Translation Evaluation for Any Target Language,” in Proceedings of the Ninth Workshop on Statistical Machine Translation. Baltimore, Maryland, USA: Association for Computational Linguistics, 2014, pp. 376–380. [Online]. Available: http://aclweb.org/anthology/W14-3348

 #### CIDEr
 [4] R. Vedantam, C. L. Zitnick, and D. Parikh, “CIDEr: Consensus-based Image Description Evaluation,” arXiv:1411.5726 [cs], Jun. 2015. [Online]. Available: http://arxiv.org/abs/1411.5726

 #### SPICE
 [5] P. Anderson, B. Fernando, M. Johnson, and S. Gould, “SPICE: Semantic Propositional Image Caption Evaluation,” arXiv:1607.08822 [cs], Jul. 2016. [Online]. Available: http://arxiv.org/abs/1607.08822

 #### SPIDEr
 [6] S. Liu, Z. Zhu, N. Ye, S. Guadarrama, and K. Murphy, “Improved Image Captioning via Policy Gradient optimization of SPIDEr,” 2017 IEEE International Conference on Computer Vision (ICCV), pp. 873–881, Oct. 2017, arXiv: 1612.00370. [Online]. Available: http://arxiv.org/abs/1612.00370

 #### BERTScore
 [7] T. Zhang*, V. Kishore*, F. Wu*, K. Q. Weinberger, and Y. Artzi, “BERTScore: Evaluating Text Generation with BERT,” 2020. [Online]. Available: https://openreview.net/forum?id=SkeHuCVFDr

 #### SPIDEr-max
 [8] E. Labbé, T. Pellegrini, and J. Pinquier, “Is my automatic audio captioning system so bad? SPIDEr-max: a metric to consider several caption candidates,” Nov. 2022. [Online]. Available: https://hal.archives-ouvertes.fr/hal-03810396

 #### FENSE
 [9] Z. Zhou, Z. Zhang, X. Xu, Z. Xie, M. Wu, and K. Q. Zhu, “Can Audio Captions Be Evaluated with Image Caption Metrics?” arXiv, 2022. [Online]. Available: http://arxiv.org/abs/2110.04684

 #### SPIDEr-FL
 [10] DCASE website task 6a description: https://dcase.community/challenge2023/task-automated-audio-captioning#evaluation

 #### CB-score
 [11] I. Martín-Morató, M. Harju, and A. Mesaros, “A Summarization Approach to Evaluating Audio Captioning,” Nov. 2022. [Online]. Available: https://dcase.community/documents/workshop2022/proceedings/DCASE2022Workshop_Martin-Morato_35.pdf

 #### SPICE-plus
 [12] F. Gontier, R. Serizel, and C. Cerisara, “SPICE+: Evaluation of Automatic Audio Captioning Systems with Pre-Trained Language Models,” in ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5. doi: 10.1109/ICASSP49357.2023.10097021.

 #### ACES
 [13] G. Wijngaard, E. Formisano, B. L. Giordano, and M. Dumontier, “ACES: Evaluating Automated Audio Captioning Models on the Semantics of Sounds,” in EUSIPCO 2023, 2023.

 #### SBF
 [14] R. Mahfuz, Y. Guo, A. K. Sridhar, and E. Visser, “Detecting False Alarms and Misses in Audio Captions,” 2023. [Online]. Available: https://arxiv.org/pdf/2309.03326.pdf

 #### s2v
 [15] S. Bhosale, R. Chakraborty, and S. K. Kopparapu, “A Novel Metric For Evaluating Audio Caption Similarity,” in ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5. doi: 10.1109/ICASSP49357.2023.10096526.
 ## Citation
 If you use **SPIDEr-max**, you can cite the following paper using BibTeX:

@@ -227,20 +229,21 @@ If you use **SPIDEr-max**, you can cite the following paper using BibTex :
 }
 ```

-If you use this software, please consider cite it as below :
+If you use this software, please consider citing it as "Labbe, E. (2023). aac-metrics: Metrics for evaluating Automated Audio Captioning systems for PyTorch.", or use the following BibTeX citation:

 ```
 @software{
-Labbe_aac-metrics_2023,
+Labbe_aac_metrics_2023,
 author = {Labbé, Etienne},
 license = {MIT},
-month = {10},
+month = {12},
 title = {{aac-metrics}},
 url = {https://github.com/Labbeti/aac-metrics/},
-version = {0.4.6},
+version = {0.5.0},
 year = {2023},
 }
 ```

 ## Contact
 Maintainer:
-- Etienne Labbé "Labbeti": [email protected]
+- Étienne Labbé "Labbeti": [email protected]
@@ -0,0 +1,7 @@
+aac\_metrics.classes.bert\_score\_mrefs module
+==============================================
+
+.. automodule:: aac_metrics.classes.bert_score_mrefs
+   :members:
+   :undoc-members:
+   :show-inheritance:
@@ -0,0 +1,7 @@
+aac\_metrics.classes.fer module
+===============================
+
+.. automodule:: aac_metrics.classes.fer
+   :members:
+   :undoc-members:
+   :show-inheritance:
@@ -0,0 +1,7 @@
+aac\_metrics.classes.vocab module
+=================================
+
+.. automodule:: aac_metrics.classes.vocab
+   :members:
+   :undoc-members:
+   :show-inheritance:
@@ -0,0 +1,7 @@
+aac\_metrics.functional.bert\_score\_mrefs module
+=================================================
+
+.. automodule:: aac_metrics.functional.bert_score_mrefs
+   :members:
+   :undoc-members:
+   :show-inheritance:
@@ -0,0 +1,7 @@
+aac\_metrics.functional.fer module
+==================================
+
+.. automodule:: aac_metrics.functional.fer
+   :members:
+   :undoc-members:
+   :show-inheritance:
This file was deleted.
@@ -0,0 +1,7 @@
+aac\_metrics.functional.vocab module
+====================================
+
+.. automodule:: aac_metrics.functional.vocab
+   :members:
+   :undoc-members:
+   :show-inheritance:
@@ -0,0 +1,7 @@
+aac\_metrics.utils.cmdline module
+=================================
+
+.. automodule:: aac_metrics.utils.cmdline
+   :members:
+   :undoc-members:
+   :show-inheritance:
@@ -21,15 +21,7 @@ classifiers = [
 maintainers = [
     {name = "Etienne Labbé (Labbeti)", email = "[email protected]"},
 ]
-dependencies = [
-    "torch>=1.10.1",
-    "numpy>=1.21.2",
-    "pyyaml>=6.0",
-    "tqdm>=4.64.0",
-    "sentence-transformers>=2.2.2",
-    "transformers<4.31.0",
-]
-dynamic = ["version"]
+dynamic = ["version", "dependencies", "optional-dependencies"]

 [project.urls]
 Homepage = "https://pypi.org/project/aac-metrics/"
@@ -43,19 +35,11 @@ aac-metrics-download = "aac_metrics.download:_main_download"
 aac-metrics-eval = "aac_metrics.eval:_main_eval"
 aac-metrics-info = "aac_metrics.info:print_install_info"

-[project.optional-dependencies]
-dev = [
-    "pytest==7.1.2",
-    "flake8==4.0.1",
-    "black==22.8.0",
-    "scikit-image==0.19.2",
-    "matplotlib==3.5.2",
-    "torchmetrics>=0.10",
-]

 [tool.setuptools.packages.find]
 where = ["src"]  # list of folders that contain the packages (["."] by default)
 include = ["aac_metrics*"]  # package names should match these glob patterns (["*"] by default)

 [tool.setuptools.dynamic]
 version = {attr = "aac_metrics.__version__"}
+dependencies = {file = ["requirements.txt"]}
+optional-dependencies = {dev = {file = ["requirements-dev.txt"]}}
@@ -0,0 +1,9 @@
+# -*- coding: utf-8 -*-
+
+pytest==7.1.2
+flake8==4.0.1
+black==22.8.0
+scikit-image==0.19.2
+matplotlib==3.5.2
+ipykernel==6.9.1
+twine==4.0.1
@@ -6,3 +6,4 @@ pyyaml>=6.0
 tqdm>=4.64.0
 sentence-transformers>=2.2.2
 transformers<4.31.0
+torchmetrics>=0.11.4