Version 0.4.0

Labbeti committed Apr 13, 2023
1 parent 056b048 commit bfedab2

Showing 59 changed files with 1,932 additions and 601 deletions.
58 changes: 27 additions & 31 deletions .github/workflows/python-package-pip.yaml
@@ -10,52 +10,44 @@ on:

jobs:
build:
runs-on: ubuntu-latest
runs-on: ${{ matrix.os }}

strategy:
matrix:
os: [ubuntu-latest]
python-version: ["3.9"]
java-version: ["11"]

steps:
# --- INSTALLATIONS ---
- name: Checkout repository and submodules

- name: Checkout repository
uses: actions/checkout@v2
with:
submodules: recursive

- name: Set up Python 3.9
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v2
with:
python-version: 3.9

- name: Set up Java 11
python-version: ${{ matrix.python-version }}
cache: 'pip'

- name: Set up Java ${{ matrix.java-version }}
uses: actions/setup-java@v2
with:
distribution: 'temurin'
java-version: '11'

- name: Load cache of pip dependencies
uses: actions/cache@master
id: cache_requirements
with:
path: ${{ env.pythonLocation }}/lib/python3.9/site-packages/*
key: ${{ runner.os }}-pip-${{ hashFiles('setup.cfg') }}
restore-keys: |
${{ runner.os }}-pip-
${{ runner.os }}-
java-version: ${{ matrix.java-version }}
java-package: jre

- name: Install pip dev dependencies + package if needed
if: steps.cache_requirements.outputs.cache-hit != 'true'
- name: Install package
run: |
python -m pip install --upgrade pip
python -m pip install -e .[dev]
- name: Install package if needed
if: steps.cache_requirements.outputs.cache-hit == 'true'
run: |
python -m pip install -e . --no-dependencies
- name: Load cache of external code
- name: Load cache of external code and data
uses: actions/cache@master
id: cache_external
with:
path: /home/runner/.cache/aac-metrics-/*
path: /home/runner/.cache/aac-metrics/*
key: ${{ runner.os }}-${{ hashFiles('install_spice.sh') }}
restore-keys: |
${{ runner.os }}-
@@ -68,15 +60,19 @@ jobs:
- name: Check format with Black
run: |
python -m black --check --diff src
- name: Print install info
run: |
aac-metrics-info
- name: Print Java version
run: |
java --version
- name: Install external code if needed
if: steps.cache_external.outputs.cache-hit != 'true'
run: |
aac-metrics-download
- name: Print install info
run: |
aac-metrics-info
- name: Test with pytest
run: |
1 change: 1 addition & 0 deletions .gitignore
@@ -135,3 +135,4 @@ tests/caption-evaluation-tools
tests/fense
tmp/
tmp*/
*.mdb
1 change: 1 addition & 0 deletions .gitmodules
@@ -2,6 +2,7 @@
path = tests/caption-evaluation-tools
url = https://github.com/audio-captioning/caption-evaluation-tools
branch = master

[submodule "fense"]
path = tests/fense
url = https://github.com/blmoistawinde/fense
16 changes: 16 additions & 0 deletions CHANGELOG.md
@@ -2,6 +2,22 @@

All notable changes to this project will be documented in this file.

## [0.4.0] 2023-04-13
### Added
- Argument `return_probs` for fluency error metric.

### Changed
- Rename `SPIDErErr` to `SPIDErFL` to match DCASE2023 metric name.
- Rename `SBERT` to `SBERTSim` to avoid confusion with SBERT model name.
- Rename `FluencyError` to `FluErr`.
- Check that the Java executable version is between 8 and 11.

### Fixed
- `SPIDErFL` sentence scores output when using `return_all_scores=True`.
- Argument `reset_state` in `SPIDErFL`, `SBERTSim`, `FluErr` and `FENSE` when using their functional interface.
- Class and function factories now support the SPICE and CIDEr-D metrics.
- `SBERTSim` class instantiation.
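
For reference, a minimal sketch of how the renamed metrics might be imported after this release; only `aac_metrics.classes.fluerr` is confirmed by the docs added in this commit, the other module paths are assumed from the `aac_metrics.classes.<metric_name>` convention.

```python
# Assumed import paths after the 0.4.0 renames
# (SPIDErErr -> SPIDErFL, SBERT -> SBERTSim, FluencyError -> FluErr).
from aac_metrics.classes.fluerr import FluErr        # confirmed by docs/aac_metrics.classes.fluerr.rst
from aac_metrics.classes.sbert_sim import SBERTSim   # assumed module name
from aac_metrics.classes.spider_fl import SPIDErFL   # assumed module name
```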

## [0.3.0] 2023-02-27
### Added
- Parameters `timeout` and `separate_cache_dir` in `SPICE` function and class.
23 changes: 23 additions & 0 deletions CITATION.cff
@@ -0,0 +1,23 @@
# -*- coding: utf-8 -*-

cff-version: 1.2.0
title: aac-metrics
message: 'If you use this software, please cite it as below.'
type: software
authors:
- given-names: Etienne
family-names: Labbé
email: [email protected]
affiliation: IRIT
orcid: 'https://orcid.org/0000-0002-7219-5463'
repository-code: 'https://github.com/Labbeti/aac-metrics/'
abstract: Metrics for evaluating Automated Audio Captioning systems, designed for PyTorch.
keywords:
- audio
- metrics
- text
- captioning
- audio-captioning
license: MIT
version: 0.4.0
date-released: '2023-04-13'
136 changes: 47 additions & 89 deletions README.md
@@ -27,18 +27,18 @@ Metrics for evaluating Automated Audio Captioning systems, designed for PyTorch.
- SPICE [[5]](#spice)
- SPIDEr [[6]](#spider)
- SPIDEr-max [[7]](#spider-max)
- SBERT [[8]](#fense)
- FluencyError [[8]](#fense)
- SBERT-sim [[8]](#fense)
- Fluency Error [[8]](#fense)
- FENSE [[8]](#fense)
- SPIDErErr
- SPIDEr-FL [[9]](#spider-fl)

## Installation
Install the pip package:
```bash
pip install aac-metrics
```

Download the external code and models needed for METEOR, SPICE, PTBTokenizer and FENSE:
Download the external code and models needed for METEOR, SPICE, SPIDEr, SPIDEr-max, PTBTokenizer, SBERT, FluencyError, FENSE and SPIDEr-FL:
```bash
aac-metrics-download
```
@@ -48,23 +48,31 @@ Notes:
- The weights of the FENSE fluency error detector and the SBERT model are respectively stored by default in `$HOME/.cache/torch/hub/fense_data` and `$HOME/.cache/torch/sentence_transformers`.

## Usage
### Evaluate default AAC metrics
The full evaluation process to compute AAC metrics can be done with the `aac_metrics.aac_evaluate` function.
### Evaluate default metrics
The full evaluation pipeline to compute AAC metrics can be done with the `aac_metrics.evaluate` function.

```python
from aac_metrics import aac_evaluate
from aac_metrics import evaluate

candidates: list[str] = ["a man is speaking"]
mult_references: list[list[str]] = [["a man speaks.", "someone speaks.", "a man is speaking while a bird is chirping in the background"]]

corpus_scores, _ = aac_evaluate(candidates, mult_references)
corpus_scores, _ = evaluate(candidates, mult_references)
print(corpus_scores)
# dict containing the score of each aac metric: "bleu_1", "bleu_2", "bleu_3", "bleu_4", "rouge_l", "meteor", "cider_d", "spice", "spider"
# dict containing the score of each metric: "bleu_1", "bleu_2", "bleu_3", "bleu_4", "rouge_l", "meteor", "cider_d", "spice", "spider"
# {"bleu_1": tensor(0.7), "bleu_2": ..., ...}
```
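
The second return value (discarded with `_` above) holds the per-sentence scores. A minimal sketch, assuming `evaluate` returns `(corpus_scores, sents_scores)` like the functional metrics shown further below; keys and values are illustrative:

```python
corpus_scores, sents_scores = evaluate(candidates, mult_references)
print(sents_scores)
# dict containing one score per sentence for each metric
# {"bleu_1": tensor([0.7]), "spider": tensor([0.1]), ...}
```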
### Evaluate DCASE2023 metrics
To compute metrics for the DCASE2023 challenge, just set the argument `metrics="dcase2023"` in the `evaluate` function call.

```python
corpus_scores, _ = evaluate(candidates, mult_references, metrics="dcase2023")
print(corpus_scores)
# dict containing the score of each metric: "meteor", "cider_d", "spice", "spider", "spider_fl", "fluerr"
```

### Evaluate a specific metric
Evaluating a specific metric can be done using the `aac_metrics.functional.<metric_name>.<metric_name>` function or the `aac_metrics.classes.<metric_name>.<metric_name>` class. Unlike `aac_evaluate`, the tokenization with PTBTokenizer is not done with these functions, but you can do it manually with the `preprocess_mono_sents` and `preprocess_mult_sents` functions.
Evaluating a specific metric can be done using the `aac_metrics.functional.<metric_name>.<metric_name>` function or the `aac_metrics.classes.<metric_name>.<metric_name>` class. Unlike `evaluate`, the tokenization with PTBTokenizer is not done with these functions, but you can do it manually with the `preprocess_mono_sents` and `preprocess_mult_sents` functions.

```python
from aac_metrics.functional import cider_d
@@ -86,7 +94,7 @@ print(sents_scores)
Each metric also exists as a Python class version, like `aac_metrics.classes.cider_d.CIDErD`.

## Metrics
### Default AAC metrics
### Legacy metrics
| Metric | Python Class | Origin | Range | Short description |
|:---|:---|:---|:---|:---|
| BLEU [[1]](#bleu) | `BLEU` | machine translation | [0, 1] | Precision of n-grams |
@@ -96,83 +104,14 @@
| SPICE [[5]](#spice) | `SPICE` | image captioning | [0, 1] | FScore of semantic graph |
| SPIDEr [[6]](#spider) | `SPIDEr` | image captioning | [0, 5.5] | Mean of CIDEr-D and SPICE |

### Other metrics
### AAC-specific metrics
| Metric name | Python Class | Origin | Range | Short description |
|:---|:---|:---|:---|:---|
| SPIDEr-max [[7]](#spider-max) | `SPIDErMax` | audio captioning | [0, 5.5] | Max of SPIDEr scores for multiple candidates |
| SBERT [[7]](#spider-max) | `SBERT` | audio captioning | [-1, 1] | Cosine-similarity of **Sentence-BERT embeddings** |
| FluencyError [[7]](#spider-max) | `FluencyError` | audio captioning | [0, 1] | Use pretrained model to detect fluency errors in sentences |
| FENSE [[8]](#fense) | `FENSE` | audio captioning | [-1, 1] | Combines `SBERT` and `FluencyError` |
| SPIDErErr | `SPIDErErr` | audio captioning | [0, 5.5] | Combines `SPIDEr` and `FluencyError` |

## SPIDEr-max metric
SPIDEr-max [[7]](#spider-max) is a metric based on SPIDEr that takes into account multiple candidates for the same audio. It computes the maximum of the SPIDEr scores over the candidates to balance the high sensitivity of SPIDEr to the frequency of the words generated by the model.

### SPIDEr-max: why?
The SPIDEr metric used in audio captioning is highly sensitive to the frequencies of the words used.

Here are two examples with the 5 candidates generated by the beam search algorithm, their corresponding SPIDEr scores, and the associated references:

<div align="center">

| Beam search candidates | SPIDEr |
|:---|:---:|
| heavy rain is falling on a roof | 0.562 |
| heavy rain is falling on **a tin roof** | **0.930** |
| a heavy rain is falling on a roof | 0.594 |
| a heavy rain is falling on the ground | 0.335 |
| a heavy rain is falling on the roof | 0.594 |

| References |
|:---|
| heavy rain falls loudly onto a structure with a thin roof |
| heavy rainfall falling onto a thin structure with a thin roof |
| it is raining hard and the rain hits **a tin roof** |
| rain that is pouring down very hard outside |
| the hard rain is noisy as it hits **a tin roof** |

_(Candidates and references for the Clotho development-testing file named "rain.wav")_

| Beam search candidates | SPIDEr |
|:---|:---:|
| a woman speaks and a sheep bleats | 0.190 |
| a woman **speaks and a goat bleats** | **1.259** |
| a man speaks and a sheep bleats | 0.344 |
| an adult male speaks and a sheep bleats | 0.231 |
| an adult male is speaking and a sheep bleats | 0.189 |

| References |
|:---|
| a man speaking and laughing followed by a goat bleat |
| a man is speaking in high tone while a goat is bleating one time |
| a man speaks followed by a goat bleat |
| a person **speaks and a goat bleats** |
| a man is talking and snickering followed by a goat bleating |

_(Candidates and references for an AudioCaps testing file with the id "jid4t-FzUn0")_
</div>

Even with very similar candidates, the SPIDEr scores vary drastically. To address this issue, we proposed the SPIDEr-max metric, which takes the maximum SPIDEr value over several candidates for the same audio. SPIDEr-max demonstrates that SPIDEr can exceed state-of-the-art scores on AudioCaps and Clotho, and even human scores on AudioCaps [[7]](#spider-max).

### SPIDEr-max: usage
Its usage is very similar to the other captioning metrics, the main difference being that it takes a list of multiple candidates as input.

```python
from aac_metrics.functional import spider_max
from aac_metrics.utils.tokenization import preprocess_mult_sents

mult_candidates: list[list[str]] = [["a man is speaking", "maybe someone speaking"]]
mult_references: list[list[str]] = [["a man speaks.", "someone speaks.", "a man is speaking while a bird is chirping in the background"]]

mult_candidates = preprocess_mult_sents(mult_candidates)
mult_references = preprocess_mult_sents(mult_references)

corpus_scores, sents_scores = spider_max(mult_candidates, mult_references)
print(corpus_scores)
# {"spider": tensor(0.1), ...}
print(sents_scores)
# {"spider": tensor([0.9, ...]), ...}
```
| SBERT-sim [[8]](#fense) | `SBERTSim` | audio captioning | [-1, 1] | Cosine-similarity of **Sentence-BERT embeddings** |
| Fluency Error [[8]](#fense) | `FluErr` | audio captioning | [0, 1] | Uses a pretrained model to detect fluency errors in sentences |
| FENSE [[8]](#fense) | `FENSE` | audio captioning | [-1, 1] | Combines SBERT-sim and Fluency Error |
| SPIDEr-FL [[9]](#spider-fl) | `SPIDErFL` | audio captioning | [0, 5.5] | Combines SPIDEr and Fluency Error |
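
As a quick illustration of the class interface listed above, a minimal sketch using `FENSE`; the module path `aac_metrics.classes.fense` and the `(candidates, mult_references)` call signature are assumed from the `aac_metrics.classes.<metric_name>.<metric_name>` convention described earlier.

```python
from aac_metrics.classes.fense import FENSE  # module path assumed from the naming convention

candidates: list[str] = ["a man is speaking"]
mult_references: list[list[str]] = [["a man speaks.", "someone speaks."]]

fense = FENSE()  # assumed default constructor
scores = fense(candidates, mult_references)
print(scores)
```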

## Requirements
### Python packages
@@ -187,12 +126,11 @@ sentence-transformers>=2.2.2
```

### External requirements
- `java` >= 1.8 is required to compute METEOR, SPICE and use the PTBTokenizer.
- `java` **>= 1.8 and <= 1.11** is required to compute METEOR, SPICE and use the PTBTokenizer.
Most of these functions accept a `java_path` argument to specify the Java executable path (see the sketch after this list).

- `unzip` command to extract SPICE zipped files.
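
A minimal sketch of overriding the Java executable; it assumes `meteor` is exposed from `aac_metrics.functional` like `cider_d` above, returns `(corpus_scores, sents_scores)`, and the path below is only an example.

```python
from aac_metrics.functional import meteor  # assumed to be exposed like `cider_d`
from aac_metrics.utils.tokenization import preprocess_mono_sents, preprocess_mult_sents

candidates: list[str] = ["a man is speaking"]
mult_references: list[list[str]] = [["a man speaks.", "someone speaks."]]

candidates = preprocess_mono_sents(candidates)
mult_references = preprocess_mult_sents(mult_references)

# `java_path` points to the Java executable to use (example path).
corpus_scores, _ = meteor(candidates, mult_references, java_path="/usr/bin/java")
print(corpus_scores)
```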


## Additional notes
### CIDEr or CIDEr-D ?
The CIDEr metric differs from CIDEr-D because it applies a stemmer to each word before computing the n-grams of the sentences. In AAC, only CIDEr-D is reported and used for SPIDEr in [caption-evaluation-tools](https://github.com/audio-captioning/caption-evaluation-tools), but some papers call it "CIDEr".
Expand All @@ -204,6 +142,9 @@ No. Most of these metrics use numpy or external java programs to run, which prev
No. But if torchmetrics is installed, all metric classes will inherit from the base class `torchmetrics.Metric`.
This is because most of the metrics do not use PyTorch tensors to compute scores, and numpy arrays and strings cannot be added to the states of `torchmetrics.Metric`.
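
A minimal sketch of that behaviour, assuming torchmetrics is installed and that `CIDErD` can be constructed with default arguments:

```python
import torchmetrics
from aac_metrics.classes.cider_d import CIDErD

# Per the note above, metric classes are expected to subclass
# torchmetrics.Metric when torchmetrics is available.
print(isinstance(CIDErD(), torchmetrics.Metric))  # expected: True
```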

## SPIDEr-max metric
SPIDEr-max [[7]](#spider-max) is a metric based on SPIDEr that takes into account multiple candidates for the same audio. It computes the maximum of the SPIDEr scores over the candidates to balance the high sensitivity of SPIDEr to the frequency of the words generated by the model. For more details, please see the [documentation about SPIDEr-max](https://aac-metrics.readthedocs.io/en/stable/spider_max.html).

## References
#### BLEU
[1] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “BLEU: a
@@ -246,10 +187,13 @@ arXiv: 1612.00370. [Online]. Available: http://arxiv.org/abs/1612.00370
[7] E. Labbé, T. Pellegrini, and J. Pinquier, “Is my automatic audio captioning system so bad? spider-max: a metric to consider several caption candidates,” Nov. 2022. [Online]. Available: https://hal.archives-ouvertes.fr/hal-03810396

#### FENSE
[8] Z. Zhou, Z. Zhang, X. Xu, Z. Xie, M. Wu, and K. Q. Zhu, Can Audio Captions Be Evaluated with Image Caption Metrics? arXiv, 2022. [Online]. Available: http://arxiv.org/abs/2110.04684

#### SPIDEr-FL
[9] DCASE website task6a description: https://dcase.community/challenge2023/task-automated-audio-captioning#evaluation

## Citation
If you use **SPIDEr-max**, you can cite the following paper using BibTeX:
```
@inproceedings{labbe:hal-03810396,
TITLE = {{Is my automatic audio captioning system so bad? spider-max: a metric to consider several caption candidates}},
Expand All @@ -266,6 +210,20 @@ If you use **SPIDEr-max**, you can cite the following paper using BibTex:
}
```

If you use this software, please consider citing it as below:
```
@software{
Labbe_aac-metrics_2023,
author = {Labbé, Etienne},
license = {MIT},
month = {4},
title = {{aac-metrics}},
url = {https://github.com/Labbeti/aac-metrics/},
version = {0.4.0},
year = {2023},
}
```

## Contact
Maintainer:
- Etienne Labbé "Labbeti": [email protected]
7 changes: 0 additions & 7 deletions docs/aac_metrics.classes.fluency_error.rst

This file was deleted.

7 changes: 7 additions & 0 deletions docs/aac_metrics.classes.fluerr.rst
@@ -0,0 +1,7 @@
aac\_metrics.classes.fluerr module
==================================

.. automodule:: aac_metrics.classes.fluerr
:members:
:undoc-members:
:show-inheritance:
