Version 0.4.6
Labbeti committed Oct 10, 2023
1 parent a5f056f commit e3c161d
Showing 23 changed files with 603 additions and 625 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/python-package-pip.yaml
@@ -50,7 +50,7 @@ jobs:
- name: Install package
shell: bash
# note: ${GITHUB_REF##*/} gives the branch name
# note 2: dev is not the branch here, but the dev dependencies
# note 2: dev is NOT the branch here, but the dev dependencies
run: |
python -m pip install "aac-metrics[dev] @ git+https://github.com/Labbeti/aac-metrics@${GITHUB_REF##*/}"
16 changes: 16 additions & 0 deletions CHANGELOG.md
@@ -2,6 +2,22 @@

All notable changes to this project will be documented in this file.

## [0.4.6] 2023-10-10
### Added
- Argument `clean_archives` for `SPICE` download.

### Changed
- Check if a newline character is present in the sentences before PTB tokenization. ([#6](https://github.com/Labbeti/aac-metrics/issues/6))
- `SPICE` no longer requires bash script files for installation.

### Fixed
- `transformers` dependency constrained to `<4.31.0` to avoid errors with the `FENSE` and `FluErr` metrics.
- `SPICE` crash message and error output files.
- Default value for `Evaluate` `metrics` argument.

### Deleted
- Removed the now-useless `use_shell` option for download.

## [0.4.5] 2023-09-12
### Added
- Argument `use_shell` for `METEOR` and `SPICE` metrics and `download` function to fix Windows-OS specific error.
4 changes: 2 additions & 2 deletions CITATION.cff
@@ -19,5 +19,5 @@ keywords:
- captioning
- audio-captioning
license: MIT
version: 0.4.5
date-released: '2023-09-12'
version: 0.4.6
date-released: '2023-10-10'
1 change: 0 additions & 1 deletion MANIFEST.in
@@ -3,5 +3,4 @@ recursive-include src *.py
global-exclude *.pyc
global-exclude __pycache__

include src/aac_metrics/install_spice.sh
recursive-include data *.csv
38 changes: 27 additions & 11 deletions README.md
@@ -103,27 +103,31 @@ Each metric also exists as a python class version, like `aac_metrics.classes.ci
| Metric | Python Class | Origin | Range | Short description |
|:---|:---|:---|:---|:---|
| BLEU [[1]](#bleu) | `BLEU` | machine translation | [0, 1] | Precision of n-grams |
| ROUGE-L [[2]](#rouge-l) | `ROUGEL` | machine translation | [0, 1] | FScore of the longest common subsequence |
| ROUGE-L [[2]](#rouge-l) | `ROUGEL` | text summarization | [0, 1] | FScore of the longest common subsequence |
| METEOR [[3]](#meteor) | `METEOR` | machine translation | [0, 1] | Cosine-similarity of frequencies with synonyms matching |
| CIDEr-D [[4]](#cider) | `CIDErD` | image captioning | [0, 10] | Cosine-similarity of TF-IDF computed on n-grams |
| SPICE [[5]](#spice) | `SPICE` | image captioning | [0, 1] | FScore of semantic graph |
| SPICE [[5]](#spice) | `SPICE` | image captioning | [0, 1] | FScore of a semantic graph |
| SPIDEr [[6]](#spider) | `SPIDEr` | image captioning | [0, 5.5] | Mean of CIDEr-D and SPICE |

### AAC-specific metrics
| Metric name | Python Class | Origin | Range | Short description |
|:---|:---|:---|:---|:---|
| SPIDEr-max [[7]](#spider-max) | `SPIDErMax` | audio captioning | [0, 5.5] | Max of SPIDEr scores for multiple candidates |
| SBERT-sim [[8]](#spider-max) | `SBERTSim` | audio captioning | [-1, 1] | Cosine-similarity of **Sentence-BERT embeddings** |
| Fluency Error [[8]](#spider-max) | `FluErr` | audio captioning | [0, 1] | Use a pretrained model to detect fluency errors in sentences |
| FENSE [[8]](#fense) | `FENSE` | audio captioning | [-1, 1] | Combines SBERT-sim and Fluency Error |
| SPIDEr-FL [[9]](#spider-fl) | `SPIDErFL` | audio captioning | [0, 5.5] | Combines SPIDEr and Fluency Error |
| Fluency error rate [[8]](#spider-max) | `FluErr` | audio captioning | [0, 1] | Detect fluency errors in sentences with a pretrained model |
| FENSE [[8]](#fense) | `FENSE` | audio captioning | [-1, 1] | Combines SBERT-sim and Fluency Error rate |
| SPIDEr-FL [[9]](#spider-fl) | `SPIDErFL` | audio captioning | [0, 5.5] | Combines SPIDEr and Fluency Error rate |
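
Each metric in the tables above is also available as a Python class under `aac_metrics.classes`. Below is a minimal usage sketch, assuming the class-based metrics follow the same call convention as the `evaluate` function (a list of candidates plus a list of reference lists, returning corpus-level and sentence-level score dicts); exact signatures may differ slightly:
```
from aac_metrics.classes.cider_d import CIDErD

candidates = ["a man is speaking while a dog barks"]
mult_references = [["a man speaks while a dog is barking", "someone talks and a dog barks"]]

# return_all_scores is the keyword used by the metric factory of this package.
cider_d = CIDErD(return_all_scores=True)
corpus_scores, sents_scores = cider_d(candidates, mult_references)
print(corpus_scores["cider_d"])  # assumed key name; inspect the returned dict
```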

### AAC metrics not implemented
- CB-Score [[10]](#cb-score)
- SPICE+ [[11]](#spice-plus)
- ACES [[12]](#aces) (can be found here: https://github.com/GlJS/ACES)

## Requirements
This package has been developed for Ubuntu 20.04, and it is expected to work on most Linux distributions. Windows is not officially supported.

### Python packages


The pip requirements are automatically installed when using `pip install` on this repository.
```
torch >= 1.10.1
@@ -141,11 +145,14 @@ Most of these functions can specify a java executable path with `java_path` argument
- `unzip` command to extract SPICE zipped files.

## Additional notes
### CIDEr or CIDEr-D ?
### CIDEr or CIDEr-D?
The CIDEr metric differs from CIDEr-D in that it applies a stemmer to each word before computing the n-grams of the sentences. In AAC, only CIDEr-D is reported and used for SPIDEr in [caption-evaluation-tools](https://github.com/audio-captioning/caption-evaluation-tools), but some papers call it "CIDEr".

### Does metrics work on multi-GPU ?
No. Most of these metrics use numpy or external java programs to run, which prevents multi-GPU testing for now.
### Do metrics work on multi-GPU?
No. Most of these metrics use numpy or external java programs to run, which prevents multi-GPU testing in parallel.

### Do metrics work on Windows/Mac OS?
Maybe. Most of the metrics only need python to run, which can be done on Windows. However, you may encounter errors with the METEOR metric, SPICE-based metrics and the PTB tokenizer, since they require an external java program to run.

## SPIDEr-max metric
SPIDEr-max [[7]](#spider-max) is a metric based on SPIDEr that takes into account multiple candidates for the same audio. It computes the maximum of the SPIDEr scores over the candidates to compensate for SPIDEr's high sensitivity to the frequency of the words generated by the model. For more details, please see the [documentation about SPIDEr-max](https://aac-metrics.readthedocs.io/en/stable/spider_max.html).
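
A hedged sketch of `SPIDErMax` usage; the class name and module come from this repository, while the exact input format (a list of candidate lists per audio, plus the usual reference lists) is an assumption based on the description above:
```
from aac_metrics.classes.spider_max import SPIDErMax

# Several candidates per audio, e.g. from beam search or multiple decoding seeds.
mult_candidates = [["a dog barks loudly", "a dog is barking", "barking of a dog"]]
mult_references = [["a dog barks while a man speaks", "a dog is barking nearby"]]

spider_max = SPIDErMax(return_all_scores=True)
corpus_scores, sents_scores = spider_max(mult_candidates, mult_references)
```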
@@ -197,6 +204,15 @@ arXiv: 1612.00370. [Online]. Available: http://arxiv.org/abs/1612.00370
#### SPIDEr-FL
[9] DCASE website task6a description: https://dcase.community/challenge2023/task-automated-audio-captioning#evaluation

#### CB-score
[10] I. Martín-Morató, M. Harju, and A. Mesaros, “A Summarization Approach to Evaluating Audio Captioning,” Nov. 2022. [Online]. Available: https://dcase.community/documents/workshop2022/proceedings/DCASE2022Workshop_Martin-Morato_35.pdf

#### SPICE-plus
[11] F. Gontier, R. Serizel, and C. Cerisara, “SPICE+: Evaluation of Automatic Audio Captioning Systems with Pre-Trained Language Models,” in ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5. doi: 10.1109/ICASSP49357.2023.10097021.

#### ACES
[12] G. Wijngaard, E. Formisano, B. L. Giordano, M. Dumontier, “ACES: Evaluating Automated Audio Captioning Models on the Semantics of Sounds”, in EUSIPCO 2023, 2023.

## Citation
If you use **SPIDEr-max**, you can cite the following paper using BibTeX:
```
@@ -217,10 +233,10 @@ If you use this software, please consider citing it as below:
Labbe_aac-metrics_2023,
author = {Labbé, Etienne},
license = {MIT},
month = {9},
month = {10},
title = {{aac-metrics}},
url = {https://github.com/Labbeti/aac-metrics/},
version = {0.4.5},
version = {0.4.6},
year = {2023},
}
```
1 change: 1 addition & 0 deletions docs/installation.rst
@@ -26,3 +26,4 @@ The python requirements are automatically installed when using pip on this repository
pyyaml>=6.0
tqdm>=4.64.0
sentence-transformers>=2.2.2
transformers<4.31.0
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -27,6 +27,7 @@ dependencies = [
"pyyaml>=6.0",
"tqdm>=4.64.0",
"sentence-transformers>=2.2.2",
"transformers<4.31.0",
]
dynamic = ["version"]

@@ -50,7 +51,6 @@ dev = [
"scikit-image==0.19.2",
"matplotlib==3.5.2",
"torchmetrics>=0.10",
"transformers<4.31.0",
]

[tool.setuptools.packages.find]
28 changes: 19 additions & 9 deletions src/aac_metrics/__init__.py
@@ -1,28 +1,32 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-

"""Audio Captioning metrics package.
"""
"""Metrics for evaluating Automated Audio Captioning systems, designed for PyTorch. """


__name__ = "aac-metrics"
__author__ = "Etienne Labbé (Labbeti)"
__author_email__ = "[email protected]"
__license__ = "MIT"
__maintainer__ = "Etienne Labbé (Labbeti)"
__name__ = "aac-metrics"
__status__ = "Development"
__version__ = "0.4.5"
__version__ = "0.4.6"


from .classes.base import AACMetric
from .classes.bleu import BLEU
from .classes.cider_d import CIDErD
from .classes.evaluate import DCASE2023Evaluate, _get_metric_factory_classes
from .classes.evaluate import Evaluate, DCASE2023Evaluate, _get_metric_factory_classes
from .classes.fluerr import FluErr
from .classes.fense import FENSE
from .classes.meteor import METEOR
from .classes.rouge_l import ROUGEL
from .classes.sbert_sim import SBERTSim
from .classes.spice import SPICE
from .classes.spider import SPIDEr
from .functional.evaluate import dcase2023_evaluate, evaluate
from .classes.spider_fl import SPIDErFL
from .classes.spider_max import SPIDErMax
from .functional.evaluate import evaluate, dcase2023_evaluate
from .utils.paths import (
get_default_cache_path,
get_default_java_path,
@@ -34,16 +38,22 @@


__all__ = [
"AACMetric",
"BLEU",
"CIDErD",
"Evaluate",
"DCASE2023Evaluate",
"FENSE",
"FluErr",
"METEOR",
"ROUGEL",
"SBERTSim",
"SPICE",
"SPIDEr",
"dcase2023_evaluate",
"SPIDErFL",
"SPIDErMax",
"evaluate",
"dcase2023_evaluate",
"get_default_cache_path",
"get_default_java_path",
"get_default_tmp_path",
Expand All @@ -58,8 +68,8 @@ def load_metric(name: str, **kwargs) -> AACMetric:
"""Load a metric class by name.
:param name: The name of the metric.
Must be one of ("bleu_1", "bleu_2", "bleu_3", "bleu_4", "meteor", "rouge_l", "cider_d", "spice", "spider", "fense").
:param **kwargs: The keyword optional arguments passed to the metric.
Can be one of ("bleu_1", "bleu_2", "bleu_3", "bleu_4", "meteor", "rouge_l", "cider_d", "spice", "spider", "fense").
:param **kwargs: The keyword optional arguments passed to the metric factory.
:returns: The Metric object built.
"""
name = name.lower().strip()
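
A short usage sketch of `load_metric`, using only names that appear in its docstring above; the chosen metric name is illustrative and the keyword is simply forwarded to the metric factory:
```
from aac_metrics import load_metric

# "cider_d" is one of the accepted metric names listed in the docstring;
# keyword arguments are passed through to the metric factory.
cider_d = load_metric("cider_d", return_all_scores=True)
```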
2 changes: 1 addition & 1 deletion src/aac_metrics/__main__.py
@@ -6,7 +6,7 @@ def _print_usage() -> None:
print(
"Command line usage :\n"
"- Download models and external code : aac-metrics-download ...\n"
"- Print scores from candidate and references file : aac-metrics-evaluate -i [FILEPATH]\n"
"- Print scores from candidate and references file : aac-metrics-eval -i [FILEPATH]\n"
"- Print package version : aac-metrics-info\n"
"- Show this usage page : aac-metrics\n"
)
47 changes: 29 additions & 18 deletions src/aac_metrics/classes/evaluate.py
@@ -5,7 +5,7 @@
import pickle
import zlib

from typing import Callable, Iterable, Union
from typing import Any, Callable, Iterable, Union

import torch

@@ -22,7 +22,11 @@
from aac_metrics.classes.spice import SPICE
from aac_metrics.classes.spider import SPIDEr
from aac_metrics.classes.spider_fl import SPIDErFL
from aac_metrics.functional.evaluate import METRICS_SETS, evaluate
from aac_metrics.functional.evaluate import (
DEFAULT_METRICS_SET_NAME,
METRICS_SETS,
evaluate,
)


pylog = logging.getLogger(__name__)
@@ -41,7 +45,9 @@ class Evaluate(list[AACMetric], AACMetric[tuple[dict[str, Tensor], dict[str, Tensor]]]):
def __init__(
self,
preprocess: bool = True,
metrics: Union[str, Iterable[str], Iterable[AACMetric]] = "aac",
metrics: Union[
str, Iterable[str], Iterable[AACMetric]
] = DEFAULT_METRICS_SET_NAME,
cache_path: str = ...,
java_path: str = ...,
tmp_path: str = ...,
@@ -171,74 +177,79 @@ def _get_metric_factory_classes(
tmp_path: str = ...,
device: Union[str, torch.device, None] = "auto",
verbose: int = 0,
init_kwds: dict[str, Any] = ...,
) -> dict[str, Callable[[], AACMetric]]:
return {
if init_kwds is ...:
init_kwds = {}

init_kwds = init_kwds | dict(return_all_scores=return_all_scores)

factory = {
"bleu": lambda: BLEU(
return_all_scores=return_all_scores,
**init_kwds,
),
"bleu_1": lambda: BLEU(
return_all_scores=return_all_scores,
n=1,
**init_kwds,
),
"bleu_2": lambda: BLEU(
return_all_scores=return_all_scores,
n=2,
),
"bleu_3": lambda: BLEU(
return_all_scores=return_all_scores,
n=3,
**init_kwds,
),
"bleu_4": lambda: BLEU(
return_all_scores=return_all_scores,
n=4,
**init_kwds,
),
"meteor": lambda: METEOR(
return_all_scores=return_all_scores,
cache_path=cache_path,
java_path=java_path,
verbose=verbose,
**init_kwds,
),
"rouge_l": lambda: ROUGEL(
return_all_scores=return_all_scores,
**init_kwds,
),
"cider_d": lambda: CIDErD(
return_all_scores=return_all_scores,
**init_kwds,
),
"spice": lambda: SPICE(
return_all_scores=return_all_scores,
cache_path=cache_path,
java_path=java_path,
tmp_path=tmp_path,
verbose=verbose,
**init_kwds,
),
"spider": lambda: SPIDEr(
return_all_scores=return_all_scores,
cache_path=cache_path,
java_path=java_path,
tmp_path=tmp_path,
verbose=verbose,
**init_kwds,
),
"sbert_sim": lambda: SBERTSim(
return_all_scores=return_all_scores,
device=device,
verbose=verbose,
**init_kwds,
),
"fluerr": lambda: FluErr(
return_all_scores=return_all_scores,
device=device,
verbose=verbose,
),
"fense": lambda: FENSE(
return_all_scores=return_all_scores,
device=device,
verbose=verbose,
**init_kwds,
),
"spider_fl": lambda: SPIDErFL(
return_all_scores=return_all_scores,
cache_path=cache_path,
java_path=java_path,
tmp_path=tmp_path,
device=device,
verbose=verbose,
**init_kwds,
),
}
return factory
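
A hedged sketch of how the refactored factory above could be exercised. `_get_metric_factory_classes` is underscore-prefixed (internal API), so this is illustrative only and uses only names visible in the diff:
```
from aac_metrics.classes.evaluate import _get_metric_factory_classes

# Build the name -> constructor mapping; return_all_scores is merged into
# init_kwds (see `init_kwds | dict(...)` above) and forwarded to every metric.
factories = _get_metric_factory_classes(return_all_scores=True)

# Each value is a zero-argument callable that builds a ready-to-use metric.
cider_d = factories["cider_d"]()
```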