Version 0.5.0
Labbeti committed Dec 8, 2023
1 parent e3c161d commit 45139d6
Showing 59 changed files with 2,024 additions and 431 deletions.
4 changes: 2 additions & 2 deletions .github/workflows/python-package-pip.yaml
@@ -10,7 +10,7 @@ on:

env:
CACHE_NUMBER: 0 # increase to reset cache manually
TMPDIR: '/tmp'
AAC_METRICS_TMP_PATH: '/tmp'

# Cancel workflow if a new push occurs
concurrency:
@@ -23,7 +23,7 @@ jobs:

strategy:
matrix:
os: [ubuntu-latest,windows-latest]
os: [ubuntu-latest,windows-latest,macos-latest]
python-version: ["3.9"]
java-version: ["11"]

1 change: 1 addition & 0 deletions .gitmodules
@@ -1,6 +1,7 @@
[submodule "caption-evaluation-tools"]
path = tests/caption-evaluation-tools
url = https://github.com/audio-captioning/caption-evaluation-tools
ignore = dirty
branch = master

[submodule "fense"]
12 changes: 12 additions & 0 deletions CHANGELOG.md
@@ -2,6 +2,18 @@

All notable changes to this project will be documented in this file.

## [0.5.0] 2023-12-08
### Added
- New `Vocab` metric to compute vocabulary size and vocabulary ratio.
- New `BERTScoreMRefs` metric wrapper to compute BERTScore with multiple references.

### Changed
- Rename metric `FluErr` to `FER`.

### Fixed
- `METEOR` localization issue. ([#9](https://github.com/Labbeti/aac-metrics/issues/9))
- `SPIDErMax` output when `return_all_scores=False`.

## [0.4.6] 2023-10-10
### Added
- Argument `clean_archives` for `SPICE` download.
4 changes: 2 additions & 2 deletions CITATION.cff
@@ -19,5 +19,5 @@ keywords:
- captioning
- audio-captioning
license: MIT
version: 0.4.6
date-released: '2023-10-10'
version: 0.5.0
date-released: '2023-12-08'
111 changes: 57 additions & 54 deletions README.md
@@ -17,20 +17,22 @@ Metrics for evaluating Automated Audio Captioning systems, designed for PyTorch.
</div>

## Why using this package?
- **Easy installation and download**
- **Same results than [caption-evaluation-tools](https://github.com/audio-captioning/caption-evaluation-tools) and [fense](https://github.com/blmoistawinde/fense) repositories**
- **Provides the following metrics:**
- **Easy to install and download**
- **Produces the same results as the [caption-evaluation-tools](https://github.com/audio-captioning/caption-evaluation-tools) and [fense](https://github.com/blmoistawinde/fense) repositories**
- **Provides 12 different metrics:**
- BLEU [[1]](#bleu)
- ROUGE-L [[2]](#rouge-l)
- METEOR [[3]](#meteor)
- CIDEr-D [[4]](#cider)
- SPICE [[5]](#spice)
- SPIDEr [[6]](#spider)
- SPIDEr-max [[7]](#spider-max)
- SBERT-sim [[8]](#fense)
- Fluency Error [[8]](#fense)
- FENSE [[8]](#fense)
- SPIDEr-FL [[9]](#spider-fl)
- BERTScore [[7]](#bertscore)
- SPIDEr-max [[8]](#spider-max)
- SBERT-sim [[9]](#fense)
- FER [[9]](#fense)
- FENSE [[9]](#fense)
- SPIDEr-FL [[10]](#spider-fl)
- Vocab (unique word vocabulary)

## Installation
Install the pip package:
@@ -100,28 +102,37 @@ Each metrics also exists as a python class version, like `aac_metrics.classes.ci
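
For readers skimming this diff, here is a minimal usage sketch of the two APIs referred to above: the functional `evaluate` entry point and the class-based metrics. It is a sketch only; the exact signatures, default metric set, and returned score names are assumptions about this version of the package rather than something recorded in this commit.

```python
from aac_metrics import evaluate
from aac_metrics.classes.cider_d import CIDErD

# One candidate caption per audio clip, several references per clip.
candidates = ["a man is speaking", "rain falls on a roof"]
mult_references = [
    ["a man speaks", "someone is talking"],
    ["heavy rain is falling on a surface", "it is raining hard"],
]

# Functional API: computes a default set of metrics in one call (assumed signature).
corpus_scores, sentence_scores = evaluate(candidates, mult_references)
print(corpus_scores)  # assumed keys such as "bleu_1", "cider_d", "spider", ...

# Class API: instantiate one metric and call it like a function (assumed behavior).
cider_d = CIDErD(return_all_scores=True)
corpus_scores, sentence_scores = cider_d(candidates, mult_references)
```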

## Metrics
### Legacy metrics
| Metric | Python Class | Origin | Range | Short description |
| Metric name | Python Class | Origin | Range | Short description |
|:---|:---|:---|:---|:---|
| BLEU [[1]](#bleu) | `BLEU` | machine translation | [0, 1] | Precision of n-grams |
| ROUGE-L [[2]](#rouge-l) | `ROUGEL` | text summarization | [0, 1] | FScore of the longest common subsequence |
| METEOR [[3]](#meteor) | `METEOR` | machine translation | [0, 1] | Cosine-similarity of frequencies with synonyms matching |
| CIDEr-D [[4]](#cider) | `CIDErD` | image captioning | [0, 10] | Cosine-similarity of TF-IDF computed on n-grams |
| SPICE [[5]](#spice) | `SPICE` | image captioning | [0, 1] | FScore of a semantic graph |
| SPIDEr [[6]](#spider) | `SPIDEr` | image captioning | [0, 5.5] | Mean of CIDEr-D and SPICE |
| BERTScore [[7]](#bertscore) | `BERTScoreMRefs` | text generation | [0, 1] | FScore of BERT embeddings. In contrast to torchmetrics, it supports multiple references per file (see the sketch below). |
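
Since multi-reference support is the key difference from the torchmetrics implementation, here is a hedged sketch of how the new `BERTScoreMRefs` wrapper might be called. The import path matches the `aac_metrics.classes.bert_score_mrefs` module added in this commit, but the constructor arguments, call convention and output keys are assumptions.

```python
from aac_metrics.classes.bert_score_mrefs import BERTScoreMRefs

candidates = ["birds chirp in the distance"]
mult_references = [
    ["birds are singing far away", "several birds chirp", "chirping of birds"],
]

# The wrapper scores each candidate against all of its references and reduces
# the per-reference scores (assumed default behavior; the reduction may be configurable).
bert_score = BERTScoreMRefs(return_all_scores=True)
corpus_scores, sentence_scores = bert_score(candidates, mult_references)
print(corpus_scores)  # assumed keys like "bert_score.f1", "bert_score.precision", ...
```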

### AAC-specific metrics
| Metric name | Python Class | Origin | Range | Short description |
|:---|:---|:---|:---|:---|
| SPIDEr-max [[7]](#spider-max) | `SPIDErMax` | audio captioning | [0, 5.5] | Max of SPIDEr scores for multiples candidates |
| SBERT-sim [[8]](#spider-max) | `SBERTSim` | audio captioning | [-1, 1] | Cosine-similarity of **Sentence-BERT embeddings** |
| Fluency error rate [[8]](#spider-max) | `FluErr` | audio captioning | [0, 1] | Detect fluency errors in sentences with a pretrained model |
| FENSE [[8]](#fense) | `FENSE` | audio captioning | [-1, 1] | Combines SBERT-sim and Fluency Error rate |
| SPIDEr-FL [[9]](#spider-fl) | `SPIDErFL` | audio captioning | [0, 5.5] | Combines SPIDEr and Fluency Error rate |
| SPIDEr-max [[8]](#spider-max) | `SPIDErMax` | audio captioning | [0, 5.5] | Max of SPIDEr scores for multiple candidates |
| SBERT-sim [[9]](#fense) | `SBERTSim` | audio captioning | [-1, 1] | Cosine-similarity of **Sentence-BERT embeddings** |
| Fluency Error Rate [[9]](#fense) | `FER` | audio captioning | [0, 1] | Detects fluency errors in sentences with a pretrained model |
| FENSE [[9]](#fense) | `FENSE` | audio captioning | [-1, 1] | Combines SBERT-sim and Fluency Error rate |
| SPIDEr-FL [[10]](#spider-fl) | `SPIDErFL` | audio captioning | [0, 5.5] | Combines SPIDEr and Fluency Error rate |

### Other metrics
| Metric name | Python Class | Origin | Range | Short description |
|:---|:---|:---|:---|:---|
| Vocabulary | `Vocab` | text generation | [0, +$\infty$[ | Number of unique words in candidates (see the sketch below). |
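
As a quick illustration of the new `Vocab` metric listed above, the sketch below follows the same candidates/references calling convention as the other classes. The constructor arguments and the exact output keys (a vocabulary size plus a candidate-to-reference vocabulary ratio, as suggested by the changelog entry) are assumptions.

```python
from aac_metrics.classes.vocab import Vocab

candidates = ["a dog barks", "a dog barks loudly", "water is running"]
mult_references = [
    ["a dog is barking"],
    ["a dog barks twice"],
    ["water runs from a tap"],
]

vocab = Vocab(return_all_scores=True)
corpus_scores, _ = vocab(candidates, mult_references)
# Assumed outputs: number of unique words used in the candidates and a ratio
# of candidate vocabulary size to reference vocabulary size.
print(corpus_scores)
```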

### AAC metrics not implemented
- CB-Score [[10]](#cb-score)
- SPICE+ [[11]](#spice-plus)
- ACES [[12]](#aces) (can be found here: https://github.com/GlJS/ACES)
### Future directions
This package does not yet include every metric dedicated to audio captioning. Feel free to open a pull request or to ask me by email if you want one of them added. The metrics not yet included are listed here:
- CB-Score [[11]](#cb-score)
- SPICE+ [[12]](#spice-plus)
- ACES [[13]](#aces) (can be found here: https://github.com/GlJS/ACES)
- SBF [[14]](#sbf)
- s2v [[15]](#s2v)

## Requirements
This package has been developed for Ubuntu 20.04, and it is expected to work on most Linux distributions. Windows is not officially supported.
@@ -136,6 +147,7 @@ pyyaml >= 6.0
tqdm >= 4.64.0
sentence-transformers >= 2.2.2
transformers < 4.31.0
torchmetrics >= 0.11.4
```

### External requirements
@@ -154,64 +166,54 @@ No. Most of these metrics use numpy or external java programs to run, which prev
### Do metrics work on Windows/Mac OS?
Maybe. Most of the metrics only need Python to run, which can be done on Windows. However, you might expect errors with the METEOR metric, SPICE-based metrics and the PTB tokenizer, since they require an external Java program to run.

## SPIDEr-max metric
## About SPIDEr-max metric
SPIDEr-max [[7]](#spider-max) is a metric based on SPIDEr that takes into account multiple candidates for the same audio. It computes the maximum of the SPIDEr scores for each candidate to balance the high sensitivity to the frequency of the words generated by the model. For more detail, please see the [documentation about SPIDEr-max](https://aac-metrics.readthedocs.io/en/stable/spider_max.html).
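
To make the "multiple candidates per audio" input shape concrete, here is a hedged sketch of how SPIDEr-max might be computed with this package. The functional name `spider_max` mirrors the `SPIDErMax` class listed above, but its exact module path, signature and output keys are assumptions.

```python
from aac_metrics.functional.spider_max import spider_max

# Several candidate captions per audio clip (e.g. sampled from beam search),
# and several reference captions per clip.
mult_candidates = [
    ["a man speaks", "a man is speaking", "a person talks"],
    ["rain falls", "it is raining", "rain hits a roof"],
]
mult_references = [
    ["a man is talking", "someone speaks"],
    ["heavy rain is falling", "it rains hard"],
]

# SPIDEr-max keeps, for each clip, the best SPIDEr score over its candidates
# (assumed signature and return format).
corpus_scores, sentence_scores = spider_max(mult_candidates, mult_references)
print(corpus_scores)  # assumed to contain a "spider_max" corpus-level score
```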

## References
#### BLEU
[1] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “BLEU: a
method for automatic evaluation of machine translation,” in Proceed-
ings of the 40th Annual Meeting on Association for Computational
Linguistics - ACL ’02. Philadelphia, Pennsylvania: Association
for Computational Linguistics, 2001, p. 311. [Online]. Available:
http://portal.acm.org/citation.cfm?doid=1073083.1073135
[1] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “BLEU: a method for automatic evaluation of machine translation,” in Proceedings of the 40th Annual Meeting on Association for Computational Linguistics - ACL ’02. Philadelphia, Pennsylvania: Association for Computational Linguistics, 2001, p. 311. [Online]. Available: http://portal.acm.org/citation.cfm?doid=1073083.1073135

#### ROUGE-L
[2] C.-Y. Lin, “ROUGE: A package for automatic evaluation of summaries,”
in Text Summarization Branches Out. Barcelona, Spain: Association
for Computational Linguistics, Jul. 2004, pp. 74–81. [Online]. Available:
https://aclanthology.org/W04-1013
[2] C.-Y. Lin, “ROUGE: A package for automatic evaluation of summaries,” in Text Summarization Branches Out. Barcelona, Spain: Association for Computational Linguistics, Jul. 2004, pp. 74–81. [Online]. Available: https://aclanthology.org/W04-1013

#### METEOR
[3] M. Denkowski and A. Lavie, “Meteor Universal: Language Specific
Translation Evaluation for Any Target Language,” in Proceedings of the
Ninth Workshop on Statistical Machine Translation. Baltimore, Maryland,
USA: Association for Computational Linguistics, 2014, pp. 376–380.
[Online]. Available: http://aclweb.org/anthology/W14-3348
[3] M. Denkowski and A. Lavie, “Meteor Universal: Language Specific Translation Evaluation for Any Target Language,” in Proceedings of the Ninth Workshop on Statistical Machine Translation. Baltimore, Maryland, USA: Association for Computational Linguistics, 2014, pp. 376–380. [Online]. Available: http://aclweb.org/anthology/W14-3348

#### CIDEr
[4] R. Vedantam, C. L. Zitnick, and D. Parikh, “CIDEr: Consensus-based
Image Description Evaluation,” arXiv:1411.5726 [cs], Jun. 2015, arXiv:
1411.5726. [Online]. Available: http://arxiv.org/abs/1411.5726
[4] R. Vedantam, C. L. Zitnick, and D. Parikh, “CIDEr: Consensus-based Image Description Evaluation,” arXiv:1411.5726 [cs], Jun. 2015, [Online]. Available: http://arxiv.org/abs/1411.5726

#### SPICE
[5] P. Anderson, B. Fernando, M. Johnson, and S. Gould, “SPICE: Semantic
Propositional Image Caption Evaluation,” arXiv:1607.08822 [cs], Jul. 2016,
arXiv: 1607.08822. [Online]. Available: http://arxiv.org/abs/1607.08822
[5] P. Anderson, B. Fernando, M. Johnson, and S. Gould, “SPICE: Semantic Propositional Image Caption Evaluation,” arXiv:1607.08822 [cs], Jul. 2016, [Online]. Available: http://arxiv.org/abs/1607.08822

#### SPIDEr
[6] S. Liu, Z. Zhu, N. Ye, S. Guadarrama, and K. Murphy, “Improved Image
Captioning via Policy Gradient optimization of SPIDEr,” 2017 IEEE Inter-
national Conference on Computer Vision (ICCV), pp. 873–881, Oct. 2017,
arXiv: 1612.00370. [Online]. Available: http://arxiv.org/abs/1612.00370
[6] S. Liu, Z. Zhu, N. Ye, S. Guadarrama, and K. Murphy, “Improved Image Captioning via Policy Gradient optimization of SPIDEr,” 2017 IEEE International Conference on Computer Vision (ICCV), pp. 873–881, Oct. 2017, arXiv: 1612.00370. [Online]. Available: http://arxiv.org/abs/1612.00370

#### BERTScore
[7] T. Zhang*, V. Kishore*, F. Wu*, K. Q. Weinberger, and Y. Artzi, “BERTScore: Evaluating Text Generation with BERT,” 2020. [Online]. Available: https://openreview.net/forum?id=SkeHuCVFDr

#### SPIDEr-max
[7] E. Labbé, T. Pellegrini, and J. Pinquier, “Is my automatic audio captioning system so bad? spider-max: a metric to consider several caption candidates,” Nov. 2022. [Online]. Available: https://hal.archives-ouvertes.fr/hal-03810396
[8] E. Labbé, T. Pellegrini, and J. Pinquier, “Is my automatic audio captioning system so bad? spider-max: a metric to consider several caption candidates,” Nov. 2022. [Online]. Available: https://hal.archives-ouvertes.fr/hal-03810396

#### FENSE
[8] Z. Zhou, Z. Zhang, X. Xu, Z. Xie, M. Wu, and K. Q. Zhu, Can Audio Captions Be Evaluated with Image Caption Metrics? arXiv, 2022. [Online]. Available: http://arxiv.org/abs/2110.04684
[9] Z. Zhou, Z. Zhang, X. Xu, Z. Xie, M. Wu, and K. Q. Zhu, Can Audio Captions Be Evaluated with Image Caption Metrics? arXiv, 2022. [Online]. Available: http://arxiv.org/abs/2110.04684

#### SPIDEr-FL
[9] DCASE website task6a description: https://dcase.community/challenge2023/task-automated-audio-captioning#evaluation
[10] DCASE website task6a description: https://dcase.community/challenge2023/task-automated-audio-captioning#evaluation

#### CB-score
[11] I. Martín-Morató, M. Harju, and A. Mesaros, “A Summarization Approach to Evaluating Audio Captioning,” Nov. 2022. [Online]. Available: https://dcase.community/documents/workshop2022/proceedings/DCASE2022Workshop_Martin-Morato_35.pdf

#### SPICE-plus
[10] F. Gontier, R. Serizel, and C. Cerisara, “SPICE+: Evaluation of Automatic Audio Captioning Systems with Pre-Trained Language Models,” in ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5. doi: 10.1109/ICASSP49357.2023.10097021.
[12] F. Gontier, R. Serizel, and C. Cerisara, “SPICE+: Evaluation of Automatic Audio Captioning Systems with Pre-Trained Language Models,” in ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5. doi: 10.1109/ICASSP49357.2023.10097021.

#### ACES
[12] G. Wijngaard, E. Formisano, B. L. Giordano, M. Dumontier, “ACES: Evaluating Automated Audio Captioning Models on the Semantics of Sounds”, in EUSIPCO 2023, 2023.
[13] G. Wijngaard, E. Formisano, B. L. Giordano, M. Dumontier, “ACES: Evaluating Automated Audio Captioning Models on the Semantics of Sounds”, in EUSIPCO 2023, 2023.

#### SBF
[14] R. Mahfuz, Y. Guo, A. K. Sridhar, and E. Visser, Detecting False Alarms and Misses in Audio Captions. 2023. [Online]. Available: https://arxiv.org/pdf/2309.03326.pdf

#### s2v
[15] S. Bhosale, R. Chakraborty, and S. K. Kopparapu, “A Novel Metric For Evaluating Audio Caption Similarity,” in ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5. doi: 10.1109/ICASSP49357.2023.10096526.

## Citation
If you use **SPIDEr-max**, you can cite the following paper using BibTex :
@@ -227,20 +229,21 @@
}
```

If you use this software, please consider cite it as below :
If you use this software, please consider citing it as "Labbe, E. (2023). aac-metrics: Metrics for evaluating Automated Audio Captioning systems for PyTorch.", or use the following BibTeX citation:

```
@software{
Labbe_aac-metrics_2023,
Labbe_aac_metrics_2023,
author = {Labbé, Etienne},
license = {MIT},
month = {10},
month = {12},
title = {{aac-metrics}},
url = {https://github.com/Labbeti/aac-metrics/},
version = {0.4.6},
version = {0.5.0},
year = {2023},
}
```

## Contact
Maintainer:
- Etienne Labbé "Labbeti": [email protected]
- Étienne Labbé "Labbeti": [email protected]
7 changes: 7 additions & 0 deletions docs/aac_metrics.classes.bert_score_mrefs.rst
@@ -0,0 +1,7 @@
aac\_metrics.classes.bert\_score\_mrefs module
==============================================

.. automodule:: aac_metrics.classes.bert_score_mrefs
:members:
:undoc-members:
:show-inheritance:
7 changes: 7 additions & 0 deletions docs/aac_metrics.classes.fer.rst
@@ -0,0 +1,7 @@
aac\_metrics.classes.fer module
===============================

.. automodule:: aac_metrics.classes.fer
:members:
:undoc-members:
:show-inheritance:
7 changes: 7 additions & 0 deletions docs/aac_metrics.classes.vocab.rst
@@ -0,0 +1,7 @@
aac\_metrics.classes.vocab module
=================================

.. automodule:: aac_metrics.classes.vocab
:members:
:undoc-members:
:show-inheritance:
7 changes: 7 additions & 0 deletions docs/aac_metrics.functional.bert_score_mrefs.rst
@@ -0,0 +1,7 @@
aac\_metrics.functional.bert\_score\_mrefs module
=================================================

.. automodule:: aac_metrics.functional.bert_score_mrefs
:members:
:undoc-members:
:show-inheritance:
7 changes: 7 additions & 0 deletions docs/aac_metrics.functional.fer.rst
@@ -0,0 +1,7 @@
aac\_metrics.functional.fer module
==================================

.. automodule:: aac_metrics.functional.fer
:members:
:undoc-members:
:show-inheritance:
7 changes: 0 additions & 7 deletions docs/aac_metrics.functional.fluerr.rst

This file was deleted.

7 changes: 7 additions & 0 deletions docs/aac_metrics.functional.vocab.rst
@@ -0,0 +1,7 @@
aac\_metrics.functional.vocab module
====================================

.. automodule:: aac_metrics.functional.vocab
:members:
:undoc-members:
:show-inheritance:
7 changes: 7 additions & 0 deletions docs/aac_metrics.utils.cmdline.rst
@@ -0,0 +1,7 @@
aac\_metrics.utils.cmdline module
=================================

.. automodule:: aac_metrics.utils.cmdline
:members:
:undoc-members:
:show-inheritance:
22 changes: 3 additions & 19 deletions pyproject.toml
@@ -21,15 +21,7 @@ classifiers = [
maintainers = [
{name = "Etienne Labbé (Labbeti)", email = "[email protected]"},
]
dependencies = [
"torch>=1.10.1",
"numpy>=1.21.2",
"pyyaml>=6.0",
"tqdm>=4.64.0",
"sentence-transformers>=2.2.2",
"transformers<4.31.0",
]
dynamic = ["version"]
dynamic = ["version", "dependencies", "optional-dependencies"]

[project.urls]
Homepage = "https://pypi.org/project/aac-metrics/"
@@ -43,19 +35,11 @@ aac-metrics-download = "aac_metrics.download:_main_download"
aac-metrics-eval = "aac_metrics.eval:_main_eval"
aac-metrics-info = "aac_metrics.info:print_install_info"

[project.optional-dependencies]
dev = [
"pytest==7.1.2",
"flake8==4.0.1",
"black==22.8.0",
"scikit-image==0.19.2",
"matplotlib==3.5.2",
"torchmetrics>=0.10",
]

[tool.setuptools.packages.find]
where = ["src"] # list of folders that contain the packages (["."] by default)
include = ["aac_metrics*"] # package names should match these glob patterns (["*"] by default)

[tool.setuptools.dynamic]
version = {attr = "aac_metrics.__version__"}
dependencies = {file = ["requirements.txt"]}
optional-dependencies = {dev = { file = ["requirements-dev.txt"] }}
9 changes: 9 additions & 0 deletions requirements-dev.txt
@@ -0,0 +1,9 @@
# -*- coding: utf-8 -*-

pytest==7.1.2
flake8==4.0.1
black==22.8.0
scikit-image==0.19.2
matplotlib==3.5.2
ipykernel==6.9.1
twine==4.0.1
1 change: 1 addition & 0 deletions requirements.txt
@@ -6,3 +6,4 @@ pyyaml>=6.0
tqdm>=4.64.0
sentence-transformers>=2.2.2
transformers<4.31.0
torchmetrics>=0.11.4