Showing 28 changed files with 354 additions and 173 deletions.
@@ -1,5 +1,5 @@
-recursive-include . *.py
+recursive-include src *.py
 global-exclude *.pyc
 global-exclude __pycache__

@@ -14,42 +14,42 @@ Audio Captioning metrics source code, designed for Pytorch.
 </div>

 This package is a tool to evaluate sentences produced by automatic models that caption images or audio.
-The results of BLEU [1], ROUGE-L [2], METEOR [3], CIDEr [4], SPICE [5] and SPIDEr [6] are consistents with https://github.com/audio-captioning/caption-evaluation-tools.
+The results of BLEU [[1]](#bleu), ROUGE-L [[2]](#rouge-l), METEOR [[3]](#meteor), CIDEr-D [[4]](#cider), SPICE [[5]](#spice) and SPIDEr [[6]](#spider) are consistent with [caption-evaluation-tools](https://github.com/audio-captioning/caption-evaluation-tools).

 ## Why use this package?
 - Easy installation with pip
-- Consistent with audio caption metrics code https://github.com/audio-captioning/caption-evaluation-tools
+- Consistent with [caption-evaluation-tools](https://github.com/audio-captioning/caption-evaluation-tools)
 - Provides functions and classes to compute metrics separately
-- Provides SPIDEr-max metric as described in the DCASE paper [7].
+- Provides the SPIDEr-max metric described in the DCASE paper [[7]](#spider-max)

 ## Installation
 Install the pip package:
-```
+```bash
 pip install aac-metrics
 ```

 Download the external code needed for METEOR, SPICE and PTBTokenizer:
-```
+```bash
 aac-metrics-download
 ```

 Note: The external code for SPICE, METEOR and PTBTokenizer is stored in the cache directory (default: `$HOME/aac-metrics-cache/`).

 ## Metrics
-### AAC metrics
-| Metric | Origin | Range | Short description |
-|:---:|:---:|:---:|:---:|
-| BLEU [1] | machine translation | [0, 1] | Precision of n-grams |
-| ROUGE-L [2] | machine translation | [0, 1] | FScore of the longest common subsequence |
-| METEOR [3] | machine translation | [0, 1] | Cosine-similarity of frequencies |
-| CIDEr-D [4] | image captioning | [0, 10] | Cosine-similarity of TF-IDF |
-| SPICE [5] | image captioning | [0, 1] | FScore of semantic graph |
-| SPIDEr [6] | image captioning | [0, 5.5] | Mean of CIDEr-D and SPICE |
+### Default AAC metrics
+| Metric | Python Class | Origin | Range | Short description |
+|:---|:---|:---|:---|:---|
+| BLEU [[1]](#bleu) | `CocoBLEU` | machine translation | [0, 1] | Precision of n-grams |
+| ROUGE-L [[2]](#rouge-l) | `CocoRougeL` | machine translation | [0, 1] | FScore of the longest common subsequence |
+| METEOR [[3]](#meteor) | `CocoMETEOR` | machine translation | [0, 1] | Cosine-similarity of frequencies |
+| CIDEr-D [[4]](#cider) | `CocoCIDErD` | image captioning | [0, 10] | Cosine-similarity of TF-IDF |
+| SPICE [[5]](#spice) | `CocoSPICE` | image captioning | [0, 1] | FScore of semantic graph |
+| SPIDEr [[6]](#spider) | `SPIDEr` | image captioning | [0, 5.5] | Mean of CIDEr-D and SPICE |

 ### Other metrics
-| Metric | Origin | Range | Short description |
-|:---:|:---:|:---:|:---:|
-| SPIDEr-max [7] | audio captioning | [0, 5.5] | Max of SPIDEr scores for multiples candidates |
+| Metric name | Python Class | Origin | Range | Short description |
+|:---|:---|:---|:---|:---|
+| SPIDEr-max [[7]](#spider-max) | `SPIDErMax` | audio captioning | [0, 5.5] | Max of SPIDEr scores for multiple candidates |

 ## Usage
 ### Evaluate AAC metrics
@@ -68,7 +68,7 @@ print(global_scores)
 ```

 ### Evaluate a specific metric
-Evaluate a specific metric can be done using the `aac_metrics.functional.<metric_name>.<metric_name>` function. Unlike `aac_evaluate`, the tokenization with PTBTokenizer is not done with these functions, but you can do it before with `preprocess_mono_sents` and `preprocess_mult_sents` functions.
+A specific metric can be evaluated with the `aac_metrics.functional.<metric_name>.<metric_name>` function or the `aac_metrics.classes.<metric_name>.<metric_name>` class. Unlike `aac_evaluate`, these do not apply the PTBTokenizer, but you can tokenize manually beforehand with the `preprocess_mono_sents` and `preprocess_mult_sents` functions.

 ```python
 from aac_metrics.functional import coco_cider_d

@@ -89,8 +89,8 @@ print(local_scores)

 Each metric also exists as a Python class, like `aac_metrics.classes.coco_cider_d.CocoCIDErD`.
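As an illustration, here is a minimal sketch of the class-based usage. The constructor arguments and call convention are assumptions made to mirror the functional example above, not the confirmed API:

```python
from aac_metrics.classes.coco_cider_d import CocoCIDErD

# Toy inputs: one candidate per audio clip, several references per clip.
candidates = ["a man is speaking", "birds are singing"]
mult_references = [
    ["a man speaks", "someone talks"],
    ["birds chirp in a tree", "a bird sings"],
]

cider_d = CocoCIDErD()
# Assumed call convention, mirroring the global_scores/local_scores pair
# from the functional example; the real signature may differ.
global_scores, local_scores = cider_d(candidates, mult_references)
print(global_scores)
```
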
-## SPIDEr-max
-SPIDEr-max [7] is a metric based on SPIDEr that takes into account multiple candidates for the same audio. It computes the maximum of the SPIDEr scores for each candidate to balance the high sensitivity to the frequency of the words generated by the model.
+## SPIDEr-max metric
+SPIDEr-max [[7]](#spider-max) is a metric based on SPIDEr that takes into account multiple candidates for the same audio. It computes the maximum of the SPIDEr scores over the candidates, which counterbalances SPIDEr's high sensitivity to the frequency of the words generated by the model.
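The core idea fits in a few lines. Below is a minimal sketch, where `spider_score` is a hypothetical stand-in for a real SPIDEr scorer (the package's actual implementation is the `SPIDErMax` class listed above):

```python
def spider_score(candidate: str, references: list[str]) -> float:
    # Hypothetical stand-in for a real SPIDEr scorer, used here only so the
    # sketch is self-contained and runnable.
    return 0.0

def spider_max(candidates: list[str], references: list[str]) -> float:
    # SPIDEr-max keeps the best SPIDEr score among all candidates produced
    # by the model for a single audio clip.
    return max(spider_score(c, references) for c in candidates)
```
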
 ### SPIDEr-max: why?
 The SPIDEr metric used in audio captioning is highly sensitive to the frequencies of the words used.

@@ -176,59 +176,73 @@ Most of these functions can specify a java executable path with `java_path` argu

 ## Additional notes
 ### CIDEr or CIDEr-D?
-The CIDEr [4] metric differs from CIDEr-D because it apply a stemmer to each words before computing the n-grams of the sentences. In AAC, only the CIDEr-D is reported and used for SPIDEr, but some papers called it "CIDEr".
+The CIDEr [4] metric differs from CIDEr-D because it applies a stemmer to each word before computing the n-grams of the sentences. In AAC, only CIDEr-D is reported and used for SPIDEr, but some papers call it "CIDEr".
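To see what stemming changes in practice, here is a small illustration using NLTK's Porter stemmer as a stand-in (the exact stemmer used by CIDEr may differ):

```python
# Stemming collapses inflected forms, so stemmed n-grams match more often
# across candidate and reference sentences. Requires: pip install nltk
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
candidate = "dogs are barking loudly".split()
reference = "a dog barks loudly".split()

print([stemmer.stem(w) for w in candidate])  # ['dog', 'are', 'bark', 'loudli']
print([stemmer.stem(w) for w in reference])  # ['a', 'dog', 'bark', 'loudli']
```

After stemming, "dogs"/"dog" and "barking"/"barks" produce matching unigrams even though the surface forms differ; CIDEr-D skips this step.
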
 ### Is torchmetrics needed for this package?
 No. But if torchmetrics is installed, all metric classes will inherit from the base class `torchmetrics.Metric`.
-It is because most of the metrics does not use PyTorch tensors to compute scores and numpy or string cannot be added to states of `torchmetrics.Metric`.
+This is because most of the metrics do not use PyTorch tensors to compute scores, and numpy arrays and strings cannot be added to the states of `torchmetrics.Metric`.

 ***Additional note***: even when torchmetrics is installed, this package does not support multi-GPU testing.
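A common pattern for this kind of optional base class is sketched below; it illustrates the mechanism only and is not necessarily how aac-metrics implements it:

```python
# Choose the metric base class at import time (illustrative pattern).
try:
    from torchmetrics import Metric as _MetricBase  # torchmetrics available
except ImportError:
    _MetricBase = object  # plain fallback base class otherwise

class ExampleMetric(_MetricBase):  # hypothetical class name
    pass
```
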
 ## References
+#### BLEU
 [1] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “BLEU: a method for automatic evaluation of machine translation,” in Proceedings of the 40th Annual Meeting on Association for Computational Linguistics - ACL ’02. Philadelphia, Pennsylvania: Association for Computational Linguistics, 2001, p. 311. [Online]. Available: http://portal.acm.org/citation.cfm?doid=1073083.1073135

+#### ROUGE-L
 [2] C.-Y. Lin, “ROUGE: A package for automatic evaluation of summaries,” in Text Summarization Branches Out. Barcelona, Spain: Association for Computational Linguistics, Jul. 2004, pp. 74–81. [Online]. Available: https://aclanthology.org/W04-1013

+#### METEOR
 [3] M. Denkowski and A. Lavie, “Meteor Universal: Language Specific Translation Evaluation for Any Target Language,” in Proceedings of the Ninth Workshop on Statistical Machine Translation. Baltimore, Maryland, USA: Association for Computational Linguistics, 2014, pp. 376–380. [Online]. Available: http://aclweb.org/anthology/W14-3348

+#### CIDEr
 [4] R. Vedantam, C. L. Zitnick, and D. Parikh, “CIDEr: Consensus-based Image Description Evaluation,” arXiv:1411.5726 [cs], Jun. 2015. [Online]. Available: http://arxiv.org/abs/1411.5726

+#### SPICE
 [5] P. Anderson, B. Fernando, M. Johnson, and S. Gould, “SPICE: Semantic Propositional Image Caption Evaluation,” arXiv:1607.08822 [cs], Jul. 2016. [Online]. Available: http://arxiv.org/abs/1607.08822

+#### SPIDEr
 [6] S. Liu, Z. Zhu, N. Ye, S. Guadarrama, and K. Murphy, “Improved Image Captioning via Policy Gradient optimization of SPIDEr,” 2017 IEEE International Conference on Computer Vision (ICCV), pp. 873–881, Oct. 2017. [Online]. Available: http://arxiv.org/abs/1612.00370

-<!-- TODO : update ref -->
-Note: the following reference is **temporary**:
-
-[7] E. Labbe, T. Pellegrini, J. Pinquier, "IS MY AUTOMATIC AUDIO CAPTIONING SYSTEM SO BAD? SPIDEr-max: A METRIC TO CONSIDER SEVERAL CAPTION CANDIDATES", DCASE2022 Workshop.
+#### SPIDEr-max
+[7] E. Labbé, T. Pellegrini, and J. Pinquier, “Is my automatic audio captioning system so bad? SPIDEr-max: a metric to consider several caption candidates,” in Workshop DCASE, Nov. 2022. [Online]. Available: https://hal.archives-ouvertes.fr/hal-03810396

 ## Cite the aac-metrics package
-The associated paper has been accepted but it will be published after the DCASE2022 workshop.
-
-If you use this code, you can cite with the following **temporary** citation:
+<!-- TODO : update citation and create CITATION.cff file -->
+If you use this code with SPIDEr-max, you can cite the following paper:
 ```
-@inproceedings{Labbe2022,
-    author = "Etienne Labbe, Thomas Pellegrini, Julien Pinquier",
-    title = "IS MY AUTOMATIC AUDIO CAPTIONING SYSTEM SO BAD? SPIDEr-max: A METRIC TO CONSIDER SEVERAL CAPTION CANDIDATES",
-    month = "November",
-    year = "2022",
+@inproceedings{labbe:hal-03810396,
+    TITLE = {{Is my automatic audio captioning system so bad? spider-max: a metric to consider several caption candidates}},
+    AUTHOR = {Labb{\'e}, Etienne and Pellegrini, Thomas and Pinquier, Julien},
+    URL = {https://hal.archives-ouvertes.fr/hal-03810396},
+    BOOKTITLE = {{Workshop DCASE}},
+    ADDRESS = {Nancy, France},
+    YEAR = {2022},
+    MONTH = Nov,
+    KEYWORDS = {audio captioning ; evaluation metric ; beam search ; multiple candidates},
+    PDF = {https://hal.archives-ouvertes.fr/hal-03810396/file/Labbe_DCASE2022.pdf},
+    HAL_ID = {hal-03810396},
+    HAL_VERSION = {v1},
 }
 ```

 ## Contact
 Maintainer:
 - Etienne Labbé "Labbeti": [email protected]
@@ -6,17 +6,18 @@
 __name__ = "aac-metrics"
 __author__ = "Etienne Labbé (Labbeti)"
 __author_email__ = "[email protected]"
 __license__ = "MIT"
 __maintainer__ = "Etienne Labbé (Labbeti)"
 __status__ = "Development"
-__version__ = "0.1.1"
+__version__ = "0.1.2"


-from .functional.evaluate import aac_evaluate
 from .classes.coco_bleu import CocoBLEU
 from .classes.coco_cider_d import CocoCIDErD
 from .classes.coco_meteor import CocoMETEOR
 from .classes.coco_rouge_l import CocoRougeL
 from .classes.coco_spice import CocoSPICE
+from .classes.evaluate import AACEvaluate
 from .classes.spider import SPIDEr
+from .functional.evaluate import aac_evaluate