Commit

Version 0.4.5
Labbeti committed Sep 12, 2023
1 parent 228d77b commit a5f056f
Showing 34 changed files with 629 additions and 166 deletions.
4 changes: 3 additions & 1 deletion .github/workflows/python-package-pip.yaml
@@ -10,6 +10,7 @@ on:

env:
CACHE_NUMBER: 0 # increase to reset cache manually
TMPDIR: '/tmp'

# Cancel workflow if a new push occurs
concurrency:
@@ -49,8 +50,9 @@ jobs:
- name: Install package
shell: bash
# note: ${GITHUB_REF##*/} gives the branch name
# note 2: 'dev' here selects the dev dependencies extra, not a branch name
run: |
python -m pip install "aac-metrics[${GITHUB_REF_NAME}] @ git+https://github.com/Labbeti/aac-metrics@${GITHUB_REF##*/}"
python -m pip install "aac-metrics[dev] @ git+https://github.com/Labbeti/aac-metrics@${GITHUB_REF##*/}"
- name: Load cache of external code and data
uses: actions/cache@master
11 changes: 11 additions & 0 deletions CHANGELOG.md
@@ -2,6 +2,17 @@

All notable changes to this project will be documented in this file.

## [0.4.5] 2023-09-12
### Added
- Argument `use_shell` for the `METEOR` and `SPICE` metrics and the `download` function, to fix a Windows-specific error.
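
A minimal sketch of the new argument (the example sentences are illustrative, and the exact defaults of the other parameters may differ):

```python
from aac_metrics.functional import meteor

candidates = ["a man is speaking", "rain falls"]
mult_references = [
    ["a man speaks.", "someone speaks."],
    ["rain is falling hard on a surface", "heavy rain hits a roof"],
]

# Requires the external METEOR code fetched by `aac-metrics-download`.
# use_shell=True launches the external Java tool through the shell,
# which works around the Windows-specific startup error.
corpus_scores, _ = meteor(candidates, mult_references, use_shell=True)
```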

### Changed
- Rename `evaluate.py` script to `eval.py`.

### Fixed
- Workflow on main branch.
- Examples in the README and documentation now use at least 2 sentences, and a warning was added to all metrics that require at least 2 candidates.

## [0.4.4] 2023-08-14
### Added
- `Evaluate` class now implements a `__hash__` and `tolist()` methods.
4 changes: 2 additions & 2 deletions CITATION.cff
@@ -19,5 +19,5 @@ keywords:
- captioning
- audio-captioning
license: MIT
version: 0.4.4
date-released: '2023-08-14'
version: 0.4.5
date-released: '2023-09-12'
26 changes: 14 additions & 12 deletions README.md
@@ -49,8 +49,8 @@ aac-metrics-download
```

Notes:
- The external code for SPICE, METEOR and PTBTokenizer is stored in `$HOME/.cache/aac-metrics`.
- The weights of the FENSE fluency error detector and the SBERT model are respectively stored by default in `$HOME/.cache/torch/hub/fense_data` and `$HOME/.cache/torch/sentence_transformers`.
- The external code for SPICE, METEOR and PTBTokenizer is stored in `~/.cache/aac-metrics`.
- The weights of the FENSE fluency error detector and the SBERT model are respectively stored by default in `~/.cache/torch/hub/fense_data` and `~/.cache/torch/sentence_transformers`.

## Usage
### Evaluate default metrics
@@ -59,13 +59,13 @@ The full evaluation pipeline to compute AAC metrics can be done with `aac_metric
```python
from aac_metrics import evaluate

candidates: list[str] = ["a man is speaking"]
mult_references: list[list[str]] = [["a man speaks.", "someone speaks.", "a man is speaking while a bird is chirping in the background"]]
candidates: list[str] = ["a man is speaking", "rain falls"]
mult_references: list[list[str]] = [["a man speaks.", "someone speaks.", "a man is speaking while a bird is chirping in the background"], ["rain is falling hard on a surface"]]

corpus_scores, _ = evaluate(candidates, mult_references)
print(corpus_scores)
# dict containing the score of each metric: "bleu_1", "bleu_2", "bleu_3", "bleu_4", "rouge_l", "meteor", "cider_d", "spice", "spider"
# {"bleu_1": tensor(0.7), "bleu_2": ..., ...}
# {"bleu_1": tensor(0.4278), "bleu_2": ..., ...}
```
### Evaluate DCASE2023 metrics
To compute the metrics for the DCASE2023 challenge, just set the argument `metrics="dcase2023"` in the `evaluate` function call.
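
For instance, a minimal call might look like this (the sentences reuse the previous example; the exact set of metrics selected by `"dcase2023"` is defined by the package):

```python
from aac_metrics import evaluate

candidates = ["a man is speaking", "rain falls"]
mult_references = [
    ["a man speaks.", "someone speaks.", "a man is speaking while a bird is chirping in the background"],
    ["rain is falling hard on a surface"],
]

# metrics="dcase2023" selects the DCASE2023 challenge metric set instead of the default metrics.
corpus_scores, _ = evaluate(candidates, mult_references, metrics="dcase2023")
print(corpus_scores)
```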
@@ -83,17 +83,17 @@ Evaluate a specific metric can be done using the `aac_metrics.functional.<metric
from aac_metrics.functional import cider_d
from aac_metrics.utils.tokenization import preprocess_mono_sents, preprocess_mult_sents

candidates: list[str] = ["a man is speaking"]
mult_references: list[list[str]] = [["a man speaks.", "someone speaks.", "a man is speaking while a bird is chirping in the background"]]
candidates: list[str] = ["a man is speaking", "rain falls"]
mult_references: list[list[str]] = [["a man speaks.", "someone speaks.", "a man is speaking while a bird is chirping in the background"], ["rain is falling hard on a surface"]]

candidates = preprocess_mono_sents(candidates)
mult_references = preprocess_mult_sents(mult_references)

corpus_scores, sents_scores = cider_d(candidates, mult_references)
print(corpus_scores)
# {"cider_d": tensor(0.1)}
# {"cider_d": tensor(0.9614)}
print(sents_scores)
# {"cider_d": tensor([0.9, ...])}
# {"cider_d": tensor([1.3641, 0.5587])}
```

Each metric also exists as a Python class version, like `aac_metrics.classes.cider_d.CIDErD`.
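
A small sketch of the class-based API (the `update`/`compute` pattern follows the class diffs shown later in this commit; the constructor argument shown is illustrative):

```python
from aac_metrics.classes.cider_d import CIDErD
from aac_metrics.utils.tokenization import preprocess_mono_sents, preprocess_mult_sents

candidates = preprocess_mono_sents(["a man is speaking", "rain falls"])
mult_references = preprocess_mult_sents([
    ["a man speaks.", "someone speaks.", "a man is speaking while a bird is chirping in the background"],
    ["rain is falling hard on a surface"],
])

cider = CIDErD(return_all_scores=True)
cider.update(candidates, mult_references)  # can be called once per batch to accumulate
corpus_scores, sents_scores = cider.compute()
print(corpus_scores)
```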
@@ -119,7 +119,8 @@ Each metrics also exists as a python class version, like `aac_metrics.classes.ci
| SPIDEr-FL [[9]](#spider-fl) | `SPIDErFL` | audio captioning | [0, 5.5] | Combines SPIDEr and Fluency Error |

## Requirements
This package has been developed for Ubuntu 20.04, and it is expected to work on most Linux distributions.
This package has been developed for Ubuntu 20.04, and it is expected to work on most Linux distributions. Windows is not officially supported.

### Python packages


@@ -130,6 +131,7 @@ numpy >= 1.21.2
pyyaml >= 6.0
tqdm >= 4.64.0
sentence-transformers >= 2.2.2
transformers < 4.31.0
```

### External requirements
@@ -215,10 +217,10 @@ If you use this software, please consider cite it as below :
Labbe_aac-metrics_2023,
author = {Labbé, Etienne},
license = {MIT},
month = {8},
month = {9},
title = {{aac-metrics}},
url = {https://github.com/Labbeti/aac-metrics/},
version = {0.4.4},
version = {0.4.5},
year = {2023},
}
```
7 changes: 7 additions & 0 deletions docs/aac_metrics.eval.rst
@@ -0,0 +1,7 @@
aac\_metrics.eval module
========================

.. automodule:: aac_metrics.eval
:members:
:undoc-members:
:show-inheritance:
24 changes: 12 additions & 12 deletions docs/usage.rst
@@ -4,19 +4,19 @@ Usage
Evaluate default AAC metrics
############################

The full evaluation process to compute AAC metrics can be done with `aac_metrics.aac_evaluate` function.
The full evaluation process to compute AAC metrics can be done with `aac_metrics.dcase2023_evaluate` function.

.. code-block:: python
from aac_metrics import aac_evaluate
from aac_metrics import evaluate
candidates: list[str] = ["a man is speaking", ...]
mult_references: list[list[str]] = [["a man speaks.", "someone speaks.", "a man is speaking while a bird is chirping in the background"], ...]
candidates: list[str] = ["a man is speaking", "rain falls"]
mult_references: list[list[str]] = [["a man speaks.", "someone speaks.", "a man is speaking while a bird is chirping in the background"], ["rain is falling hard on a surface"]]
corpus_scores, _ = aac_evaluate(candidates, mult_references)
corpus_scores, _ = evaluate(candidates, mult_references)
print(corpus_scores)
# dict containing the score of each aac metric: "bleu_1", "bleu_2", "bleu_3", "bleu_4", "rouge_l", "meteor", "cider_d", "spice", "spider"
# {"bleu_1": tensor(0.7), "bleu_2": ..., ...}
# {"bleu_1": tensor(0.4278), "bleu_2": ..., ...}
Evaluate a specific metric
@@ -25,24 +25,24 @@ Evaluate a specific metric
Evaluating a specific metric can be done using the `aac_metrics.functional.<metric_name>.<metric_name>` function or the `aac_metrics.classes.<metric_name>.<metric_name>` class.

.. warning::
Unlike `aac_evaluate`, the tokenization with PTBTokenizer is not done with these functions, but you can do it manually with `preprocess_mono_sents` and `preprocess_mult_sents` functions.
Unlike `dcase2023_evaluate`, the tokenization with PTBTokenizer is not done with these functions, but you can do it manually with `preprocess_mono_sents` and `preprocess_mult_sents` functions.

.. code-block:: python
from aac_metrics.functional import cider_d
from aac_metrics.utils.tokenization import preprocess_mono_sents, preprocess_mult_sents
candidates: list[str] = ["a man is speaking", ...]
mult_references: list[list[str]] = [["a man speaks.", "someone speaks.", "a man is speaking while a bird is chirping in the background"], ...]
candidates: list[str] = ["a man is speaking", "rain falls"]
mult_references: list[list[str]] = [["a man speaks.", "someone speaks.", "a man is speaking while a bird is chirping in the background"], ["rain is falling hard on a surface"]]
candidates = preprocess_mono_sents(candidates)
mult_references = preprocess_mult_sents(mult_references)
corpus_scores, sents_scores = cider_d(candidates, mult_references)
print(corpus_scores)
# {"cider_d": tensor(0.1)}
# {"cider_d": tensor(0.9614)}
print(sents_scores)
# {"cider_d": tensor([0.9, ...])}
# {"cider_d": tensor([1.3641, 0.5587])}
Each metric also exists as a Python class version, like `aac_metrics.classes.cider_d.CIDErD`.
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -39,7 +39,7 @@ Changelog = "https://github.com/Labbeti/aac-metrics/blob/main/CHANGELOG.md"
[project.scripts]
aac-metrics = "aac_metrics.__main__:_print_usage"
aac-metrics-download = "aac_metrics.download:_main_download"
aac-metrics-evaluate = "aac_metrics.evaluate:_main_evaluate"
aac-metrics-eval = "aac_metrics.eval:_main_eval"
aac-metrics-info = "aac_metrics.info:print_install_info"

[project.optional-dependencies]
9 changes: 5 additions & 4 deletions src/aac_metrics/__init__.py
@@ -10,7 +10,7 @@
__license__ = "MIT"
__maintainer__ = "Etienne Labbé (Labbeti)"
__status__ = "Development"
__version__ = "0.4.4"
__version__ = "0.4.5"


from .classes.base import AACMetric
@@ -65,9 +65,10 @@ def load_metric(name: str, **kwargs) -> AACMetric:
name = name.lower().strip()

factory = _get_metric_factory_classes(**kwargs)
if name in factory:
return factory[name]()
else:
if name not in factory:
raise ValueError(
f"Invalid argument {name=}. (expected one of {tuple(factory.keys())})"
)

metric = factory[name]()
return metric
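
As a usage note, the refactored lookup still returns a metric instance for valid names and now raises before calling the factory for invalid ones; a rough sketch, assuming `"cider_d"` is one of the registered names:

```python
from aac_metrics import load_metric

metric = load_metric("cider_d")  # the name is lower-cased and stripped before the lookup

try:
    load_metric("not_a_metric")
except ValueError as err:
    # The error message lists the expected metric names.
    print(err)
```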
13 changes: 7 additions & 6 deletions src/aac_metrics/classes/bleu.py
@@ -53,12 +53,13 @@ def __init__(

def compute(self) -> Union[tuple[dict[str, Tensor], dict[str, Tensor]], Tensor]:
return _bleu_compute(
self._cooked_cands,
self._cooked_mrefs,
self._return_all_scores,
self._n,
self._option,
self._verbose,
cooked_cands=self._cooked_cands,
cooked_mrefs=self._cooked_mrefs,
return_all_scores=self._return_all_scores,
n=self._n,
option=self._option,
verbose=self._verbose,
return_1_to_n=False,
)

def extra_repr(self) -> str:
26 changes: 13 additions & 13 deletions src/aac_metrics/classes/cider_d.py
@@ -49,13 +49,13 @@ def __init__(

def compute(self) -> Union[tuple[dict[str, Tensor], dict[str, Tensor]], Tensor]:
return _cider_d_compute(
self._cooked_cands,
self._cooked_mrefs,
self._return_all_scores,
self._n,
self._sigma,
self._return_tfidf,
self._scale,
cooked_cands=self._cooked_cands,
cooked_mrefs=self._cooked_mrefs,
return_all_scores=self._return_all_scores,
n=self._n,
sigma=self._sigma,
return_tfidf=self._return_tfidf,
scale=self._scale,
)

def extra_repr(self) -> str:
@@ -75,10 +75,10 @@ def update(
mult_references: list[list[str]],
) -> None:
self._cooked_cands, self._cooked_mrefs = _cider_d_update(
candidates,
mult_references,
self._n,
self._tokenizer,
self._cooked_cands,
self._cooked_mrefs,
candidates=candidates,
mult_references=mult_references,
n=self._n,
tokenizer=self._tokenizer,
prev_cooked_cands=self._cooked_cands,
prev_cooked_mrefs=self._cooked_mrefs,
)
2 changes: 1 addition & 1 deletion src/aac_metrics/classes/evaluate.py
@@ -108,7 +108,7 @@ def __hash__(self) -> int:
class DCASE2023Evaluate(Evaluate):
"""Evaluate candidates with multiple references with DCASE2023 Audio Captioning metrics.
For more information, see :func:`~aac_metrics.functional.evaluate.aac_evaluate`.
For more information, see :func:`~aac_metrics.functional.evaluate.dcase2023_evaluate`.
"""

def __init__(
28 changes: 14 additions & 14 deletions src/aac_metrics/classes/fense.py
@@ -42,7 +42,7 @@ def __init__(
device: Union[str, torch.device, None] = "auto",
batch_size: int = 32,
reset_state: bool = True,
return_probs: bool = True,
return_probs: bool = False,
penalty: float = 0.9,
verbose: int = 0,
) -> None:
@@ -66,19 +66,19 @@ def __init__(

def compute(self) -> Union[tuple[dict[str, Tensor], dict[str, Tensor]], Tensor]:
return fense(
self._candidates,
self._mult_references,
self._return_all_scores,
self._sbert_model,
self._echecker,
self._echecker_tokenizer,
self._error_threshold,
self._device,
self._batch_size,
self._reset_state,
self._return_probs,
self._penalty,
self._verbose,
candidates=self._candidates,
mult_references=self._mult_references,
return_all_scores=self._return_all_scores,
sbert_model=self._sbert_model,
echecker=self._echecker,
echecker_tokenizer=self._echecker_tokenizer,
error_threshold=self._error_threshold,
device=self._device,
batch_size=self._batch_size,
reset_state=self._reset_state,
return_probs=self._return_probs,
penalty=self._penalty,
verbose=self._verbose,
)

def extra_repr(self) -> str:
22 changes: 11 additions & 11 deletions src/aac_metrics/classes/fluerr.py
@@ -44,7 +44,7 @@ def __init__(
device: Union[str, torch.device, None] = "auto",
batch_size: int = 32,
reset_state: bool = True,
return_probs: bool = True,
return_probs: bool = False,
verbose: int = 0,
) -> None:
echecker, echecker_tokenizer = _load_echecker_and_tokenizer(echecker, None, device, reset_state, verbose) # type: ignore
@@ -64,16 +64,16 @@ def __init__(

def compute(self) -> Union[tuple[dict[str, Tensor], dict[str, Tensor]], Tensor]:
return fluerr(
self._candidates,
self._return_all_scores,
self._echecker,
self._echecker_tokenizer,
self._error_threshold,
self._device,
self._batch_size,
self._reset_state,
self._return_probs,
self._verbose,
candidates=self._candidates,
return_all_scores=self._return_all_scores,
echecker=self._echecker,
echecker_tokenizer=self._echecker_tokenizer,
error_threshold=self._error_threshold,
device=self._device,
batch_size=self._batch_size,
reset_state=self._reset_state,
return_probs=self._return_probs,
verbose=self._verbose,
)

def extra_repr(self) -> str:
