Version 0.4.0

Labbeti committed Apr 13, 2023
1 parent 056b048 commit bfedab2

Showing 59 changed files with 1,932 additions and 601 deletions.
58 changes: 27 additions & 31 deletions .github/workflows/python-package-pip.yaml
@@ -10,52 +10,44 @@ on:

jobs:
build:
runs-on: ubuntu-latest
runs-on: ${{ matrix.os }}

strategy:
matrix:
os: [ubuntu-latest]
python-version: ["3.9"]
java-version: ["11"]

steps:
# --- INSTALLATIONS ---
- name: Checkout repository and submodules

- name: Checkout repository
uses: actions/checkout@v2
with:
submodules: recursive

- name: Set up Python 3.9
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v2
with:
python-version: 3.9

- name: Set up Java 11
python-version: ${{ matrix.python-version }}
cache: 'pip'

- name: Set up Java ${{ matrix.java-version }}
uses: actions/setup-java@v2
with:
distribution: 'temurin'
java-version: '11'

- name: Load cache of pip dependencies
uses: actions/cache@master
id: cache_requirements
with:
path: ${{ env.pythonLocation }}/lib/python3.9/site-packages/*
key: ${{ runner.os }}-pip-${{ hashFiles('setup.cfg') }}
restore-keys: |
${{ runner.os }}-pip-
${{ runner.os }}-
java-version: ${{ matrix.java-version }}
java-package: jre

- name: Install pip dev dependencies + package if needed
if: steps.cache_requirements.outputs.cache-hit != 'true'
- name: Install package
run: |
python -m pip install --upgrade pip
python -m pip install -e .[dev]
- name: Install package if needed
if: steps.cache_requirements.outputs.cache-hit == 'true'
run: |
python -m pip install -e . --no-dependencies
- name: Load cache of external code
- name: Load cache of external code and data
uses: actions/cache@master
id: cache_external
with:
path: /home/runner/.cache/aac-metrics-/*
path: /home/runner/.cache/aac-metrics/*
key: ${{ runner.os }}-${{ hashFiles('install_spice.sh') }}
restore-keys: |
${{ runner.os }}-
@@ -68,15 +60,19 @@ jobs:
- name: Check format with Black
run: |
python -m black --check --diff src
- name: Print install info
run: |
aac-metrics-info
- name: Print Java version
run: |
java --version
- name: Install external code if needed
if: steps.cache_external.outputs.cache-hit != 'true'
run: |
aac-metrics-download
- name: Print install info
run: |
aac-metrics-info
- name: Test with pytest
run: |
1 change: 1 addition & 0 deletions .gitignore
@@ -135,3 +135,4 @@ tests/caption-evaluation-tools
tests/fense
tmp/
tmp*/
*.mdb
1 change: 1 addition & 0 deletions .gitmodules
@@ -2,6 +2,7 @@
path = tests/caption-evaluation-tools
url = https://github.com/audio-captioning/caption-evaluation-tools
branch = master

[submodule "fense"]
path = tests/fense
url = https://github.com/blmoistawinde/fense
16 changes: 16 additions & 0 deletions CHANGELOG.md
@@ -2,6 +2,22 @@

All notable changes to this project will be documented in this file.

## [0.4.0] 2023-04-13
### Added
- Argument `return_probs` for fluency error metric.

### Changed
- Rename `SPIDErErr` to `SPIDErFL` to match DCASE2023 metric name.
- Rename `SBERT` to `SBERTSim` to avoid confusion with SBERT model name.
- Rename `FluencyError` to `FluErr`.
- Check that the Java executable version is between 8 and 11.

### Fixed
- `SPIDErFL` sentence scores output when using `return_all_scores=True`.
- Argument `reset_state` in `SPIDErFL`, `SBERTSim`, `FluErr` and `FENSE` when using their functional interface.
- Class and function factories now support the SPICE and CIDEr-D metrics.
- `SBERTSim` class instantiation.
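
For reference, a minimal sketch of how the renamed metrics might be imported after this release; only `aac_metrics.classes.fluerr` is confirmed by the docs added in this commit, the other module paths are assumed from the `aac_metrics.classes.<metric_name>` convention.

```python
# Assumed import paths after the 0.4.0 renames
# (SPIDErErr -> SPIDErFL, SBERT -> SBERTSim, FluencyError -> FluErr).
from aac_metrics.classes.fluerr import FluErr        # confirmed by docs/aac_metrics.classes.fluerr.rst
from aac_metrics.classes.sbert_sim import SBERTSim   # assumed module name
from aac_metrics.classes.spider_fl import SPIDErFL   # assumed module name
```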

## [0.3.0] 2023-02-27
### Added
- Parameters `timeout` and `separate_cache_dir` in `SPICE` function and class.
23 changes: 23 additions & 0 deletions CITATION.cff
@@ -0,0 +1,23 @@
# -*- coding: utf-8 -*-

cff-version: 1.2.0
title: aac-metrics
message: 'If you use this software, please cite it as below.'
type: software
authors:
- given-names: Etienne
family-names: Labbé
email: [email protected]
affiliation: IRIT
orcid: 'https://orcid.org/0000-0002-7219-5463'
repository-code: 'https://github.com/Labbeti/aac-metrics/'
abstract: Metrics for evaluating Automated Audio Captioning systems, designed for PyTorch.
keywords:
- audio
- metrics
- text
- captioning
- audio-captioning
license: MIT
version: 0.4.0
date-released: '2023-04-13'
136 changes: 47 additions & 89 deletions README.md
@@ -27,18 +27,18 @@ Metrics for evaluating Automated Audio Captioning systems, designed for PyTorch.
- SPICE [[5]](#spice)
- SPIDEr [[6]](#spider)
- SPIDEr-max [[7]](#spider-max)
- SBERT [[8]](#fense)
- FluencyError [[8]](#fense)
- SBERT-sim [[8]](#fense)
- Fluency Error [[8]](#fense)
- FENSE [[8]](#fense)
- SPIDErErr
- SPIDEr-FL [[9]](#spider-fl)

## Installation
Install the pip package:
```bash
pip install aac-metrics
```

Download the external code and models needed for METEOR, SPICE, PTBTokenizer and FENSE:
Download the external code and models needed for METEOR, SPICE, SPIDEr, SPIDEr-max, PTBTokenizer, SBERT, FluencyError, FENSE and SPIDEr-FL:
```bash
aac-metrics-download
```
@@ -48,23 +48,31 @@ Notes:
- The weights of the FENSE fluency error detector and the SBERT model are respectively stored by default in `$HOME/.cache/torch/hub/fense_data` and `$HOME/.cache/torch/sentence_transformers`.

## Usage
### Evaluate default AAC metrics
The full evaluation process to compute AAC metrics can be done with the `aac_metrics.aac_evaluate` function.
### Evaluate default metrics
The full evaluation pipeline to compute AAC metrics can be done with the `aac_metrics.evaluate` function.

```python
from aac_metrics import aac_evaluate
from aac_metrics import evaluate

candidates: list[str] = ["a man is speaking"]
mult_references: list[list[str]] = [["a man speaks.", "someone speaks.", "a man is speaking while a bird is chirping in the background"]]

corpus_scores, _ = aac_evaluate(candidates, mult_references)
corpus_scores, _ = evaluate(candidates, mult_references)
print(corpus_scores)
# dict containing the score of each aac metric: "bleu_1", "bleu_2", "bleu_3", "bleu_4", "rouge_l", "meteor", "cider_d", "spice", "spider"
# dict containing the score of each metric: "bleu_1", "bleu_2", "bleu_3", "bleu_4", "rouge_l", "meteor", "cider_d", "spice", "spider"
# {"bleu_1": tensor(0.7), "bleu_2": ..., ...}
```
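
The second return value (discarded with `_` above) holds the per-sentence scores. A minimal sketch, assuming `evaluate` returns `(corpus_scores, sents_scores)` like the functional metrics shown further below; keys and values are illustrative:

```python
corpus_scores, sents_scores = evaluate(candidates, mult_references)
print(sents_scores)
# dict containing one score per sentence for each metric
# {"bleu_1": tensor([0.7]), "spider": tensor([0.1]), ...}
```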
### Evaluate DCASE2023 metrics
To compute metrics for the DCASE2023 challenge, just set the argument `metrics="dcase2023"` in the `evaluate` function call.

```python
corpus_scores, _ = evaluate(candidates, mult_references, metrics="dcase2023")
print(corpus_scores)
# dict containing the score of each metric: "meteor", "cider_d", "spice", "spider", "spider_fl", "fluerr"
```

### Evaluate a specific metric
Evaluating a specific metric can be done using the `aac_metrics.functional.<metric_name>.<metric_name>` function or the `aac_metrics.classes.<metric_name>.<metric_name>` class. Unlike `aac_evaluate`, the tokenization with PTBTokenizer is not done with these functions, but you can do it manually with the `preprocess_mono_sents` and `preprocess_mult_sents` functions.
Evaluating a specific metric can be done using the `aac_metrics.functional.<metric_name>.<metric_name>` function or the `aac_metrics.classes.<metric_name>.<metric_name>` class. Unlike `evaluate`, the tokenization with PTBTokenizer is not done with these functions, but you can do it manually with the `preprocess_mono_sents` and `preprocess_mult_sents` functions.

```python
from aac_metrics.functional import cider_d
@@ -86,7 +94,7 @@ print(sents_scores)
Each metric also exists as a Python class version, like `aac_metrics.classes.cider_d.CIDErD`.

## Metrics
### Default AAC metrics
### Legacy metrics
| Metric | Python Class | Origin | Range | Short description |
|:---|:---|:---|:---|:---|
| BLEU [[1]](#bleu) | `BLEU` | machine translation | [0, 1] | Precision of n-grams |
@@ -96,83 +104,14 @@
| SPICE [[5]](#spice) | `SPICE` | image captioning | [0, 1] | FScore of semantic graph |
| SPIDEr [[6]](#spider) | `SPIDEr` | image captioning | [0, 5.5] | Mean of CIDEr-D and SPICE |

### Other metrics
### AAC-specific metrics
| Metric name | Python Class | Origin | Range | Short description |
|:---|:---|:---|:---|:---|
| SPIDEr-max [[7]](#spider-max) | `SPIDErMax` | audio captioning | [0, 5.5] | Max of SPIDEr scores for multiple candidates |
| SBERT [[7]](#spider-max) | `SBERT` | audio captioning | [-1, 1] | Cosine-similarity of **Sentence-BERT embeddings** |
| FluencyError [[7]](#spider-max) | `FluencyError` | audio captioning | [0, 1] | Use pretrained model to detect fluency errors in sentences |
| FENSE [[8]](#fense) | `FENSE` | audio captioning | [-1, 1] | Combines `SBERT` and `FluencyError` |
| SPIDErErr | `SPIDErErr` | audio captioning | [0, 5.5] | Combines `SPIDEr` and `FluencyError` |

## SPIDEr-max metric
SPIDEr-max [[7]](#spider-max) is a metric based on SPIDEr that takes into account multiple candidates for the same audio. It computes the maximum of the SPIDEr scores over the candidates to balance the high sensitivity of SPIDEr to the frequency of the words generated by the model.

### SPIDEr-max: why?
The SPIDEr metric used in audio captioning is highly sensitive to the frequencies of the words used.

Here are two examples with the 5 candidates generated by the beam search algorithm, their corresponding SPIDEr scores, and the associated references:

<div align="center">

| Beam search candidates | SPIDEr |
|:---|:---:|
| heavy rain is falling on a roof | 0.562 |
| heavy rain is falling on **a tin roof** | **0.930** |
| a heavy rain is falling on a roof | 0.594 |
| a heavy rain is falling on the ground | 0.335 |
| a heavy rain is falling on the roof | 0.594 |

| References |
|:---|
| heavy rain falls loudly onto a structure with a thin roof |
| heavy rainfall falling onto a thin structure with a thin roof |
| it is raining hard and the rain hits **a tin roof** |
| rain that is pouring down very hard outside |
| the hard rain is noisy as it hits **a tin roof** |

_(Candidates and references for the Clotho development-testing file named "rain.wav")_

| Beam search candidates | SPIDEr |
|:---|:---:|
| a woman speaks and a sheep bleats | 0.190 |
| a woman **speaks and a goat bleats** | **1.259** |
| a man speaks and a sheep bleats | 0.344 |
| an adult male speaks and a sheep bleats | 0.231 |
| an adult male is speaking and a sheep bleats | 0.189 |

| References |
|:---|
| a man speaking and laughing followed by a goat bleat |
| a man is speaking in high tone while a goat is bleating one time |
| a man speaks followed by a goat bleat |
| a person **speaks and a goat bleats** |
| a man is talking and snickering followed by a goat bleating |

_(Candidates and references for an AudioCaps testing file with the id "jid4t-FzUn0")_
</div>

Even with very similar candidates, the SPIDEr scores vary drastically. To address this issue, we proposed the SPIDEr-max metric, which takes the maximum SPIDEr value over several candidates for the same audio. SPIDEr-max demonstrates that SPIDEr can exceed state-of-the-art scores on AudioCaps and Clotho, and even human scores on AudioCaps [[7]](#spider-max).

### SPIDEr-max: usage
Its usage is very similar to the other captioning metrics, the main difference being that it takes a list of multiple candidates as input.

```python
from aac_metrics.functional import spider_max
from aac_metrics.utils.tokenization import preprocess_mult_sents

mult_candidates: list[list[str]] = [["a man is speaking", "maybe someone speaking"]]
mult_references: list[list[str]] = [["a man speaks.", "someone speaks.", "a man is speaking while a bird is chirping in the background"]]

mult_candidates = preprocess_mult_sents(mult_candidates)
mult_references = preprocess_mult_sents(mult_references)

corpus_scores, sents_scores = spider_max(mult_candidates, mult_references)
print(corpus_scores)
# {"spider": tensor(0.1), ...}
print(sents_scores)
# {"spider": tensor([0.9, ...]), ...}
```
| SBERT-sim [[8]](#fense) | `SBERTSim` | audio captioning | [-1, 1] | Cosine-similarity of **Sentence-BERT embeddings** |
| Fluency Error [[8]](#fense) | `FluErr` | audio captioning | [0, 1] | Uses a pretrained model to detect fluency errors in sentences |
| FENSE [[8]](#fense) | `FENSE` | audio captioning | [-1, 1] | Combines SBERT-sim and Fluency Error |
| SPIDEr-FL [[9]](#spider-fl) | `SPIDErFL` | audio captioning | [0, 5.5] | Combines SPIDEr and Fluency Error |
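
As a quick illustration of the class interface listed above, a minimal sketch using `FENSE`; the module path `aac_metrics.classes.fense` and the `(candidates, mult_references)` call signature are assumed from the `aac_metrics.classes.<metric_name>.<metric_name>` convention described earlier.

```python
from aac_metrics.classes.fense import FENSE  # module path assumed from the naming convention

candidates: list[str] = ["a man is speaking"]
mult_references: list[list[str]] = [["a man speaks.", "someone speaks."]]

fense = FENSE()  # assumed default constructor
scores = fense(candidates, mult_references)
print(scores)
```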

## Requirements
### Python packages
@@ -187,12 +126,11 @@ sentence-transformers>=2.2.2
```

### External requirements
- `java` >= 1.8 is required to compute METEOR, SPICE and use the PTBTokenizer.
- `java` **>= 1.8 and <= 1.11** is required to compute METEOR, SPICE and use the PTBTokenizer.
Most of these functions accept a `java_path` argument to specify the Java executable path (see the sketch after this list).

- `unzip` command to extract SPICE zipped files.
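
A minimal sketch of overriding the Java executable; it assumes `meteor` is exposed from `aac_metrics.functional` like `cider_d` above, returns `(corpus_scores, sents_scores)`, and the path below is only an example.

```python
from aac_metrics.functional import meteor  # assumed to be exposed like `cider_d`
from aac_metrics.utils.tokenization import preprocess_mono_sents, preprocess_mult_sents

candidates: list[str] = ["a man is speaking"]
mult_references: list[list[str]] = [["a man speaks.", "someone speaks."]]

candidates = preprocess_mono_sents(candidates)
mult_references = preprocess_mult_sents(mult_references)

# `java_path` points to the Java executable to use (example path).
corpus_scores, _ = meteor(candidates, mult_references, java_path="/usr/bin/java")
print(corpus_scores)
```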


## Additional notes
### CIDEr or CIDEr-D ?
The CIDEr metric differs from CIDEr-D because it applies a stemmer to each word before computing the n-grams of the sentences. In AAC, only CIDEr-D is reported and used for SPIDEr in [caption-evaluation-tools](https://github.com/audio-captioning/caption-evaluation-tools), but some papers call it "CIDEr".
Expand All @@ -204,6 +142,9 @@ No. Most of these metrics use numpy or external java programs to run, which prev
No. But if torchmetrics is installed, all metric classes will inherit from the base class `torchmetrics.Metric`.
This is because most of the metrics do not use PyTorch tensors to compute scores, and numpy arrays and strings cannot be added to the states of `torchmetrics.Metric`.
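
A minimal sketch of that behaviour, assuming torchmetrics is installed and that `CIDErD` can be constructed with default arguments:

```python
import torchmetrics
from aac_metrics.classes.cider_d import CIDErD

# Per the note above, metric classes are expected to subclass
# torchmetrics.Metric when torchmetrics is available.
print(isinstance(CIDErD(), torchmetrics.Metric))  # expected: True
```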

## SPIDEr-max metric
SPIDEr-max [[7]](#spider-max) is a metric based on SPIDEr that takes into account multiple candidates for the same audio. It computes the maximum of the SPIDEr scores over the candidates to balance the high sensitivity of SPIDEr to the frequency of the words generated by the model. For more details, please see the [documentation about SPIDEr-max](https://aac-metrics.readthedocs.io/en/stable/spider_max.html).

## References
#### BLEU
[1] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “BLEU: a
@@ -246,10 +187,13 @@ arXiv: 1612.00370. [Online]. Available: http://arxiv.org/abs/1612.00370
[7] E. Labbé, T. Pellegrini, and J. Pinquier, “Is my automatic audio captioning system so bad? spider-max: a metric to consider several caption candidates,” Nov. 2022. [Online]. Available: https://hal.archives-ouvertes.fr/hal-03810396

#### FENSE
[8] Z. Zhou, Z. Zhang, X. Xu, Z. Xie, M. Wu, and K. Q. Zhu, Can Audio Captions Be Evaluated with Image Caption Metrics? arXiv, 2022. [Online]. Available: http://arxiv.org/abs/2110.04684

#### SPIDEr-FL
[9] DCASE website task6a description: https://dcase.community/challenge2023/task-automated-audio-captioning#evaluation

## Citation
If you use **SPIDEr-max**, you can cite the following paper using BibTeX:
```
@inproceedings{labbe:hal-03810396,
TITLE = {{Is my automatic audio captioning system so bad? spider-max: a metric to consider several caption candidates}},
Expand All @@ -266,6 +210,20 @@ If you use **SPIDEr-max**, you can cite the following paper using BibTex:
}
```

If you use this software, please consider citing it as below:
```
@software{
Labbe_aac-metrics_2023,
author = {Labbé, Etienne},
license = {MIT},
month = {4},
title = {{aac-metrics}},
url = {https://github.com/Labbeti/aac-metrics/},
version = {0.4.0},
year = {2023},
}
```

## Contact
Maintainer:
- Etienne Labbé "Labbeti": [email protected]
7 changes: 0 additions & 7 deletions docs/aac_metrics.classes.fluency_error.rst

This file was deleted.

7 changes: 7 additions & 0 deletions docs/aac_metrics.classes.fluerr.rst
@@ -0,0 +1,7 @@
aac\_metrics.classes.fluerr module
==================================

.. automodule:: aac_metrics.classes.fluerr
:members:
:undoc-members:
:show-inheritance:
