Showing 59 changed files with 1,932 additions and 601 deletions.
|
@@ -135,3 +135,4 @@ tests/caption-evaluation-tools | |
tests/fense | ||
tmp/ | ||
tmp*/ | ||
*.mdb |
@@ -0,0 +1,23 @@ | ||
# -*- coding: utf-8 -*- | ||
|
||
cff-version: 1.2.0 | ||
title: aac-metrics | ||
message: 'If you use this software, please cite it as below.' | ||
type: software | ||
authors: | ||
- given-names: Etienne | ||
family-names: Labbé | ||
email: [email protected] | ||
affiliation: IRIT | ||
orcid: 'https://orcid.org/0000-0002-7219-5463' | ||
repository-code: 'https://github.com/Labbeti/aac-metrics/' | ||
abstract: Metrics for evaluating Automated Audio Captioning systems, designed for PyTorch. | ||
keywords: | ||
- audio | ||
- metrics | ||
- text | ||
- captioning | ||
- audio-captioning | ||
license: MIT | ||
version: 0.4.0 | ||
date-released: '2023-04-13' |
|
@@ -27,18 +27,18 @@ Metrics for evaluating Automated Audio Captioning systems, designed for PyTorch. | |
- SPICE [[5]](#spice) | ||
- SPIDEr [[6]](#spider) | ||
- SPIDEr-max [[7]](#spider-max) | ||
- SBERT [[8]](#fense) | ||
- FluencyError [[8]](#fense) | ||
- SBERT-sim [[8]](#fense) | ||
- Fluency Error [[8]](#fense) | ||
- FENSE [[8]](#fense) | ||
- SPIDErErr | ||
- SPIDEr-FL [[9]](#spider-fl) | ||
|
||
## Installation | ||
Install the pip package: | ||
```bash | ||
pip install aac-metrics | ||
``` | ||
|
||
Download the external code and models needed for METEOR, SPICE, PTBTokenizer and FENSE: | ||
Download the external code and models needed for METEOR, SPICE, SPIDEr, SPIDEr-max, PTBTokenizer, SBERT, FluencyError, FENSE and SPIDEr-FL: | ||
```bash | ||
aac-metrics-download | ||
``` | ||
|
@@ -48,23 +48,31 @@ Notes: | |
- The weights of the FENSE fluency error detector and the SBERT model are respectively stored by default in `$HOME/.cache/torch/hub/fense_data` and `$HOME/.cache/torch/sentence_transformers`. | ||
|
||
## Usage | ||
### Evaluate default AAC metrics | ||
The full evaluation process to compute AAC metrics can be done with `aac_metrics.aac_evaluate` function. | ||
### Evaluate default metrics | ||
The full evaluation pipeline to compute AAC metrics can be done with the `aac_metrics.evaluate` function. | ||
|
||
```python | ||
from aac_metrics import aac_evaluate | ||
from aac_metrics import evaluate | ||
|
||
candidates: list[str] = ["a man is speaking"] | ||
mult_references: list[list[str]] = [["a man speaks.", "someone speaks.", "a man is speaking while a bird is chirping in the background"]] | ||
|
||
corpus_scores, _ = aac_evaluate(candidates, mult_references) | ||
corpus_scores, _ = evaluate(candidates, mult_references) | ||
print(corpus_scores) | ||
# dict containing the score of each aac metric: "bleu_1", "bleu_2", "bleu_3", "bleu_4", "rouge_l", "meteor", "cider_d", "spice", "spider" | ||
# dict containing the score of each metric: "bleu_1", "bleu_2", "bleu_3", "bleu_4", "rouge_l", "meteor", "cider_d", "spice", "spider" | ||
# {"bleu_1": tensor(0.7), "bleu_2": ..., ...} | ||
``` | ||
### Evaluate DCASE2023 metrics | ||
To compute metrics for the DCASE2023 challenge, just set the argument `metrics="dcase2023"` in the `evaluate` function call. | ||
|
||
```python | ||
corpus_scores, _ = evaluate(candidates, mult_references, metrics="dcase2023") | ||
print(corpus_scores) | ||
# dict containing the score of each metric: "meteor", "cider_d", "spice", "spider", "spider_fl", "fluerr" | ||
``` | ||
|
||
### Evaluate a specific metric | ||
Evaluate a specific metric can be done using the `aac_metrics.functional.<metric_name>.<metric_name>` function or the `aac_metrics.classes.<metric_name>.<metric_name>` class. Unlike `aac_evaluate`, the tokenization with PTBTokenizer is not done with these functions, but you can do it manually with `preprocess_mono_sents` and `preprocess_mult_sents` functions. | ||
Evaluating a specific metric can be done using the `aac_metrics.functional.<metric_name>.<metric_name>` function or the `aac_metrics.classes.<metric_name>.<metric_name>` class. Unlike `evaluate`, tokenization with the PTBTokenizer is not done by these functions, but you can do it manually with the `preprocess_mono_sents` and `preprocess_mult_sents` functions. | ||
|
||
```python | ||
from aac_metrics.functional import cider_d | ||
|
@@ -86,7 +94,7 @@ print(sents_scores) | |
Each metric also exists as a Python class version, like `aac_metrics.classes.cider_d.CIDErD`. | ||
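As a minimal sketch of the class-based API (assuming here that `CIDErD` is constructed without arguments and called on preprocessed sentences like its functional counterpart, which this diff does not show):

```python
from aac_metrics.classes.cider_d import CIDErD
from aac_metrics.utils.tokenization import preprocess_mono_sents, preprocess_mult_sents

candidates: list[str] = ["a man is speaking"]
mult_references: list[list[str]] = [["a man speaks.", "someone speaks."]]

# Tokenize with the PTBTokenizer, as for the functional metrics.
candidates = preprocess_mono_sents(candidates)
mult_references = preprocess_mult_sents(mult_references)

# Assumed to return the same (corpus_scores, sents_scores) pair as the functional `cider_d`.
cider_d_metric = CIDErD()
corpus_scores, sents_scores = cider_d_metric(candidates, mult_references)
print(corpus_scores)  # e.g. {"cider_d": tensor(...)}
```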
|
||
## Metrics | ||
### Default AAC metrics | ||
### Legacy metrics | ||
| Metric | Python Class | Origin | Range | Short description | | ||
|:---|:---|:---|:---|:---| | ||
| BLEU [[1]](#bleu) | `BLEU` | machine translation | [0, 1] | Precision of n-grams | | ||
|
@@ -96,83 +104,14 @@ Each metrics also exists as a python class version, like `aac_metrics.classes.ci | |
| SPICE [[5]](#spice) | `SPICE` | image captioning | [0, 1] | FScore of semantic graph | | ||
| SPIDEr [[6]](#spider) | `SPIDEr` | image captioning | [0, 5.5] | Mean of CIDEr-D and SPICE | | ||
|
||
### Other metrics | ||
### AAC-specific metrics | ||
| Metric name | Python Class | Origin | Range | Short description | | ||
|:---|:---|:---|:---|:---| | ||
| SPIDEr-max [[7]](#spider-max) | `SPIDErMax` | audio captioning | [0, 5.5] | Max of SPIDEr scores for multiple candidates | | ||
| SBERT [[7]](#spider-max) | `SBERT` | audio captioning | [-1, 1] | Cosine-similarity of **Sentence-BERT embeddings** | | ||
| FluencyError [[7]](#spider-max) | `FluencyError` | audio captioning | [0, 1] | Use pretrained model to detect fluency errors in sentences | | ||
| FENSE [[8]](#fense) | `FENSE` | audio captioning | [-1, 1] | Combines `SBERT` and `FluencyError` | | ||
| SPIDErErr | `SPIDErErr` | audio captioning | [0, 5.5] | Combines `SPIDEr` and `FluencyError` | | ||
|
||
## SPIDEr-max metric | ||
SPIDEr-max [[7]](#spider-max) is a metric based on SPIDEr that takes into account multiple candidates for the same audio. It computes the maximum of the SPIDEr scores for each candidate to balance the high sensitivity to the frequency of the words generated by the model. | ||
|
||
### SPIDEr-max: why? | ||
The SPIDEr metric used in audio captioning is highly sensitive to the frequencies of the words used. | ||
|
||
Here are 2 examples with the 5 candidates generated by the beam search algorithm, their corresponding SPIDEr scores, and the associated references: | ||
|
||
<div align="center"> | ||
|
||
| Beam search candidates | SPIDEr | | ||
|:---|:---:| | ||
| heavy rain is falling on a roof | 0.562 | | ||
| heavy rain is falling on **a tin roof** | **0.930** | | ||
| a heavy rain is falling on a roof | 0.594 | | ||
| a heavy rain is falling on the ground | 0.335 | | ||
| a heavy rain is falling on the roof | 0.594 | | ||
|
||
| References | | ||
|:---| | ||
| heavy rain falls loudly onto a structure with a thin roof | | ||
| heavy rainfall falling onto a thin structure with a thin roof | | ||
| it is raining hard and the rain hits **a tin roof** | | ||
| rain that is pouring down very hard outside | | ||
| the hard rain is noisy as it hits **a tin roof** | | ||
|
||
_(Candidates and references for the Clotho development-testing file named "rain.wav")_ | ||
|
||
| Beam search candidates | SPIDEr | | ||
|:---|:---:| | ||
| a woman speaks and a sheep bleats | 0.190 | | ||
| a woman **speaks and a goat bleats** | **1.259** | | ||
| a man speaks and a sheep bleats | 0.344 | | ||
| an adult male speaks and a sheep bleats | 0.231 | | ||
| an adult male is speaking and a sheep bleats | 0.189 | | ||
|
||
| References | | ||
|:---| | ||
| a man speaking and laughing followed by a goat bleat | | ||
| a man is speaking in high tone while a goat is bleating one time | | ||
| a man speaks followed by a goat bleat | | ||
| a person **speaks and a goat bleats** | | ||
| a man is talking and snickering followed by a goat bleating | | ||
|
||
_(Candidates and references for an AudioCaps testing file with the id "jid4t-FzUn0")_ | ||
</div> | ||
|
||
Even with very similar candidates, the SPIDEr scores vary drastically. To address this issue, we proposed the SPIDEr-max metric, which takes the maximum value over several candidates for the same audio. SPIDEr-max demonstrates that SPIDEr can exceed state-of-the-art scores on AudioCaps and Clotho, and even human scores on AudioCaps [[7]](#spider-max). | ||
|
||
### SPIDEr-max: usage | ||
Its usage is very similar to the other captioning metrics, the main difference being that it takes a list of multiple candidates per audio as input. | ||
|
||
```python | ||
from aac_metrics.functional import spider_max | ||
from aac_metrics.utils.tokenization import preprocess_mult_sents | ||
|
||
mult_candidates: list[list[str]] = [["a man is speaking", "maybe someone speaking"]] | ||
mult_references: list[list[str]] = [["a man speaks.", "someone speaks.", "a man is speaking while a bird is chirping in the background"]] | ||
|
||
mult_candidates = preprocess_mult_sents(mult_candidates) | ||
mult_references = preprocess_mult_sents(mult_references) | ||
|
||
corpus_scores, sents_scores = spider_max(mult_candidates, mult_references) | ||
print(corpus_scores) | ||
# {"spider": tensor(0.1), ...} | ||
print(sents_scores) | ||
# {"spider": tensor([0.9, ...]), ...} | ||
``` | ||
| SBERT-sim [[8]](#spider-max) | `SBERTSim` | audio captioning | [-1, 1] | Cosine-similarity of **Sentence-BERT embeddings** | | ||
| Fluency Error [[8]](#spider-max) | `FluErr` | audio captioning | [0, 1] | Uses a pretrained model to detect fluency errors in sentences | | ||
| FENSE [[8]](#fense) | `FENSE` | audio captioning | [-1, 1] | Combines SBERT-sim and Fluency Error | | ||
| SPIDEr-FL [[9]](#spider-fl) | `SPIDErFL` | audio captioning | [0, 5.5] | Combines SPIDEr and Fluency Error | | ||
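As a rough sketch of how one of these AAC-specific metrics might be called (assuming the functional `fense` follows the same candidates/references convention as the `cider_d` example above; the keys of the returned dict are illustrative):

```python
from aac_metrics.functional import fense

candidates: list[str] = ["a man is speaking"]
mult_references: list[list[str]] = [["a man speaks.", "someone speaks.", "a man is speaking while a bird is chirping in the background"]]

# FENSE combines SBERT-sim and a fluency error penalty; no PTBTokenizer step is assumed here.
corpus_scores, sents_scores = fense(candidates, mult_references)
print(corpus_scores)  # e.g. {"fense": tensor(...), "sbert_sim": ..., "fluerr": ...}
```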
|
||
## Requirements | ||
### Python packages | ||
|
@@ -187,12 +126,11 @@ sentence-transformers>=2.2.2 | |
``` | ||
|
||
### External requirements | ||
- `java` >= 1.8 is required to compute METEOR, SPICE and use the PTBTokenizer. | ||
- `java` **>= 1.8 and <= 1.11** is required to compute METEOR, SPICE and use the PTBTokenizer. | ||
Most of these functions accept a `java_path` argument to specify the Java executable path (see the sketch after this list). | ||
|
||
- `unzip` command to extract SPICE zipped files. | ||
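A minimal sketch of overriding the Java executable, assuming here that the functional `meteor` exposes the `java_path` argument mentioned above and that the example path is purely hypothetical:

```python
from aac_metrics.functional import meteor
from aac_metrics.utils.tokenization import preprocess_mono_sents, preprocess_mult_sents

candidates = preprocess_mono_sents(["a man is speaking"])
mult_references = preprocess_mult_sents([["a man speaks.", "someone speaks."]])

# Hypothetical path: point the metric to a Java 8-11 executable instead of the one found on PATH.
corpus_scores, _ = meteor(candidates, mult_references, java_path="/usr/lib/jvm/java-11-openjdk/bin/java")
print(corpus_scores)  # e.g. {"meteor": tensor(...)}
```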
|
||
|
||
## Additional notes | ||
### CIDEr or CIDEr-D ? | ||
The CIDEr metric differs from CIDEr-D in that it applies a stemmer to each word before computing the n-grams of the sentences. In AAC, only CIDEr-D is reported and used for SPIDEr in [caption-evaluation-tools](https://github.com/audio-captioning/caption-evaluation-tools), but some papers call it "CIDEr". | ||
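To make the difference concrete, here is a small standalone illustration using NLTK's Porter stemmer (NLTK is not part of aac-metrics; the stemmer choice is only illustrative):

```python
from nltk.stem import PorterStemmer  # pip install nltk

stemmer = PorterStemmer()
candidate = "a dog is running in the rain".split()
reference = "dogs run in heavy rain".split()

# CIDEr-style preprocessing: stem every word before building n-grams.
print([stemmer.stem(w) for w in candidate])  # ['a', 'dog', 'is', 'run', 'in', 'the', 'rain']
print([stemmer.stem(w) for w in reference])  # ['dog', 'run', 'in', 'heavi', 'rain']

# CIDEr-D skips this stemming step, so "running"/"run" and "dog"/"dogs"
# remain distinct unigrams and the candidate gets less n-gram overlap credit.
```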
|
@@ -204,6 +142,9 @@ No. Most of these metrics use numpy or external java programs to run, which prev | |
No. But if torchmetrics is installed, all metric classes will inherit from the base class `torchmetrics.Metric`. | ||
This is because most of the metrics do not use PyTorch tensors to compute scores, and numpy arrays and strings cannot be added to the states of `torchmetrics.Metric`. | ||
|
||
## SPIDEr-max metric | ||
SPIDEr-max [[7]](#spider-max) is a metric based on SPIDEr that takes into account multiple candidates for the same audio. It computes the maximum of the SPIDEr scores over the candidates to compensate for the high sensitivity of SPIDEr to the frequency of the words generated by the model. For more details, please see the [documentation about SPIDEr-max](https://aac-metrics.readthedocs.io/en/stable/spider_max.html). | ||
|
||
## References | ||
#### BLEU | ||
[1] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “BLEU: a | ||
|
@@ -246,10 +187,13 @@ arXiv: 1612.00370. [Online]. Available: http://arxiv.org/abs/1612.00370 | |
[7] E. Labbé, T. Pellegrini, and J. Pinquier, “Is my automatic audio captioning system so bad? spider-max: a metric to consider several caption candidates,” Nov. 2022. [Online]. Available: https://hal.archives-ouvertes.fr/hal-03810396 | ||
|
||
#### FENSE | ||
[8] Z. Zhou, Z. Zhang, X. Xu, Z. Xie, M. Wu, and K. Q. Zhu, Can Audio Captions Be Evaluated with Image Caption Metrics? arXiv, 2022. [Online]. Available: http://arxiv.org/abs/2110.04684 | ||
[8] Z. Zhou, Z. Zhang, X. Xu, Z. Xie, M. Wu, and K. Q. Zhu, Can Audio Captions Be Evaluated with Image Caption Metrics? arXiv, 2022. [Online]. Available: http://arxiv.org/abs/2110.04684 | ||
|
||
#### SPIDEr-FL | ||
[9] DCASE website task6a description: https://dcase.community/challenge2023/task-automated-audio-captioning#evaluation | ||
|
||
## Citation | ||
If you use **SPIDEr-max**, you can cite the following paper using BibTex: | ||
If you use **SPIDEr-max**, you can cite the following paper using BibTeX: | ||
``` | ||
@inproceedings{labbe:hal-03810396, | ||
TITLE = {{Is my automatic audio captioning system so bad? spider-max: a metric to consider several caption candidates}}, | ||
|
@@ -266,6 +210,20 @@ If you use **SPIDEr-max**, you can cite the following paper using BibTex: | |
} | ||
``` | ||
|
||
If you use this software, please consider citing it as below: | ||
``` | ||
@software{ | ||
Labbe_aac-metrics_2023, | ||
author = {Labbé, Etienne}, | ||
license = {MIT}, | ||
month = {4}, | ||
title = {{aac-metrics}}, | ||
url = {https://github.com/Labbeti/aac-metrics/}, | ||
version = {0.4.0}, | ||
year = {2023}, | ||
} | ||
``` | ||
|
||
## Contact | ||
Maintainer: | ||
- Etienne Labbé "Labbeti": [email protected] |
This file was deleted.
@@ -0,0 +1,7 @@ | ||
aac\_metrics.classes.fluerr module | ||
================================== | ||
|
||
.. automodule:: aac_metrics.classes.fluerr | ||
:members: | ||
:undoc-members: | ||
:show-inheritance: |