Version 0.3.0
Labbeti committed Feb 27, 2023
1 parent dd8dff3 commit 056b048
Showing 97 changed files with 2,289 additions and 2,343 deletions.
25 changes: 25 additions & 0 deletions .readthedocs.yaml
@@ -0,0 +1,25 @@
# .readthedocs.yaml
# Read the Docs configuration file
# See https://docs.readthedocs.io/en/stable/config-file/v2.html for details

# Required
version: 2

# Set the version of Python and other tools you might need
build:
os: ubuntu-20.04
tools:
python: "3.9"

# Build documentation in the docs/ directory with Sphinx
sphinx:
configuration: docs/conf.py

# If using Sphinx, optionally build your docs in additional formats such as PDF
# formats:
# - pdf
python:
install:
- method: pip
path: .
- requirements: docs/requirements.txt
16 changes: 16 additions & 0 deletions CHANGELOG.md
@@ -2,6 +2,22 @@

All notable changes to this project will be documented in this file.

## [0.3.0] 2023-02-27
### Added
- Parameters `timeout` and `separate_cache_dir` in `SPICE` function and class.
- Documentation pages with sphinx.
- Parameter `language` in `METEOR` function and class.
- Options to download only `PTBTokenizer`, `METEOR`, `SPICE` or `FENSE` in `download.py`.
- `SBERT` and `FluencyError` metrics extracted from `FENSE`.
- `SPIDErErr` metric which combines `SPIDEr` with `FluencyError`.
- Parameter `reset_state` in `SBERT`, `FluencyError`, `SPIDErErr` and `FENSE` functions and classes.
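
The new parameters above can be exercised together. The snippet below is only a sketch: it assumes the functional entry points follow the `aac_metrics.functional.<metric_name>` pattern documented in the README, and that the keyword names match the changelog entries exactly.

```python
# Sketch of the 0.3.0 additions; keyword names are taken from the changelog
# entries above, exact signatures may differ from the released API.
from aac_metrics.functional import spice, meteor, fense
from aac_metrics.utils.tokenization import preprocess_mono_sents, preprocess_mult_sents

raw_cands = ["a man is speaking"]
raw_mrefs = [["a man speaks.", "someone speaks."]]

# SPICE and METEOR work on PTB-tokenized sentences.
cands = preprocess_mono_sents(raw_cands)
mrefs = preprocess_mult_sents(raw_mrefs)

spice_scores = spice(cands, mrefs, timeout=60, separate_cache_dir=True)
meteor_scores = meteor(cands, mrefs, language="en")

# FENSE relies on Sentence-BERT, so raw sentences are passed here;
# `reset_state` is assumed to reset the internal model state between calls.
fense_scores = fense(raw_cands, raw_mrefs, reset_state=True)
```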

### Changed
- Fix README typo and SPIDEr-max tables.

### Fixed
- Workflow badge broken by GitHub changes (https://github.com/badges/shields/issues/8671).

## [0.2.0] 2022-12-14
### Added
- `FENSE` class and function metric, with fluency error rate and raw output probabilities.
126 changes: 69 additions & 57 deletions README.md
@@ -5,63 +5,57 @@
# Audio Captioning metrics (aac-metrics)

<a href="https://www.python.org/"><img alt="Python" src="https://img.shields.io/badge/-Python 3.9+-blue?style=for-the-badge&logo=python&logoColor=white"></a>
<a href="https://pytorch.org/get-started/locally/"><img alt="PyTorch" src="https://img.shields.io/badge/-PyTorch 1.10.1-ee4c2c?style=for-the-badge&logo=pytorch&logoColor=white"></a>
<a href="https://pytorch.org/get-started/locally/"><img alt="PyTorch" src="https://img.shields.io/badge/-PyTorch 1.10.1+-ee4c2c?style=for-the-badge&logo=pytorch&logoColor=white"></a>
<a href="https://black.readthedocs.io/en/stable/"><img alt="Code style: black" src="https://img.shields.io/badge/code%20style-black-black.svg?style=for-the-badge&labelColor=gray"></a>
<a href="https://github.com/Labbeti/aac-metrics/actions"><img alt="Build" src="https://img.shields.io/github/workflow/status/Labbeti/aac-metrics/Python%20package%20using%20Pip/main?style=for-the-badge&logo=github"></a>
<a href="https://github.com/Labbeti/aac-metrics/actions"><img alt="Build" src="https://img.shields.io/github/actions/workflow/status/Labbeti/aac-metrics/python-package-pip.yaml?branch=main&style=for-the-badge&logo=github"></a>
<a href='https://aac-metrics.readthedocs.io/en/stable/?badge=stable'>
<img src='https://readthedocs.org/projects/aac-metrics/badge/?version=stable&style=for-the-badge' alt='Documentation Status' />
</a>

Metrics for evaluating Automated Audio Captioning systems, designed for PyTorch.

</div>

This package is a tool to evaluate sentences produced by automated captioning systems.

## Why use this package?
- **Easy installation and download**
- **Provides functions and classes to compute each metric separately, returning PyTorch tensors directly**
- **Same results as the [caption-evaluation-tools](https://github.com/audio-captioning/caption-evaluation-tools) and [fense](https://github.com/blmoistawinde/fense) repositories**
- **Provides the following metrics:**
- BLEU [[1]](#bleu)
- ROUGE-L [[2]](#rouge-l)
- METEOR [[3]](#meteor)
- CIDEr-D [[4]](#cider)
- SPICE [[5]](#spice)
- SPIDEr [[6]](#spider)
- SPIDEr-max [[7]](#spider-max)
- SBERT [[8]](#fense)
- FluencyError [[8]](#fense)
- FENSE [[8]](#fense)
- SPIDErErr

## Installation
Install the pip package:
```bash
pip install aac-metrics
```

Download the external code and models needed for METEOR, SPICE, PTBTokenizer and FENSE:
```bash
aac-metrics-download
```

Notes:
- The external code for SPICE, METEOR and PTBTokenizer is stored in `$HOME/.cache/aac-metrics`.
- The weights of the FENSE fluency error detector and the SBERT model are stored by default in `$HOME/.cache/torch/hub/fense_data` and `$HOME/.cache/torch/sentence_transformers`, respectively (a quick way to check these locations is sketched below).
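
The check below is a small, stdlib-only sketch for verifying that `aac-metrics-download` populated the default locations listed above (adjust the paths if you relocated the caches):

```python
# Verify the default cache locations listed above (standard library only).
from pathlib import Path

home = Path.home()
expected_dirs = [
    home / ".cache" / "aac-metrics",                      # METEOR, SPICE and PTBTokenizer code
    home / ".cache" / "torch" / "hub" / "fense_data",     # FENSE fluency error detector weights
    home / ".cache" / "torch" / "sentence_transformers",  # SBERT model weights
]

for path in expected_dirs:
    status = "ok" if path.is_dir() else "missing (run `aac-metrics-download`?)"
    print(f"{path}: {status}")
```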

## Usage
### Evaluate default AAC metrics
The full evaluation pipeline for the default AAC metrics can be run with the `aac_metrics.aac_evaluate` function.

```python
from aac_metrics import aac_evaluate

candidates: list[str] = ["a man is speaking"]
mult_references: list[list[str]] = [["a man speaks.", "someone speaks.", "a man is speaking while a bird is chirping in the background"]]

corpus_scores, _ = aac_evaluate(candidates, mult_references)
print(corpus_scores)
@@ -76,8 +70,8 @@ Evaluating a specific metric can be done using the `aac_metrics.functional.<metric_name>` function.
from aac_metrics.functional import cider_d
from aac_metrics.utils.tokenization import preprocess_mono_sents, preprocess_mult_sents

candidates: list[str] = ["a man is speaking"]
mult_references: list[list[str]] = [["a man speaks.", "someone speaks.", "a man is speaking while a bird is chirping in the background"]]

candidates = preprocess_mono_sents(candidates)
mult_references = preprocess_mult_sents(mult_references)
@@ -91,6 +85,26 @@ print(sents_scores)

Each metric also exists as a Python class version, like `aac_metrics.classes.cider_d.CIDErD`.
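
A minimal sketch of the class-based API is shown below. It assumes the metric objects are callable on the same pre-tokenized inputs as their functional counterparts and use the default constructor arguments; check the class docstrings for the exact options.

```python
# Sketch of the class-based API, assuming metric objects are callable on the
# same (candidates, mult_references) inputs as the functional versions.
from aac_metrics.classes.cider_d import CIDErD
from aac_metrics.utils.tokenization import preprocess_mono_sents, preprocess_mult_sents

candidates = preprocess_mono_sents(["a man is speaking"])
mult_references = preprocess_mult_sents(
    [["a man speaks.", "someone speaks.", "a man is speaking while a bird is chirping in the background"]]
)

cider_d = CIDErD()
corpus_scores, sents_scores = cider_d(candidates, mult_references)
print(corpus_scores)  # expected: a dict like {"cider_d": tensor(...)}
print(sents_scores)   # expected: a dict like {"cider_d": tensor([...])}
```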

## Metrics
### Default AAC metrics
| Metric | Python Class | Origin | Range | Short description |
|:---|:---|:---|:---|:---|
| BLEU [[1]](#bleu) | `BLEU` | machine translation | [0, 1] | Precision of n-grams |
| ROUGE-L [[2]](#rouge-l) | `ROUGEL` | machine translation | [0, 1] | FScore of the longest common subsequence |
| METEOR [[3]](#meteor) | `METEOR` | machine translation | [0, 1] | Cosine-similarity of frequencies with synonyms matching |
| CIDEr-D [[4]](#cider) | `CIDErD` | image captioning | [0, 10] | Cosine-similarity of TF-IDF computed on n-grams |
| SPICE [[5]](#spice) | `SPICE` | image captioning | [0, 1] | FScore of semantic graph |
| SPIDEr [[6]](#spider) | `SPIDEr` | image captioning | [0, 5.5] | Mean of CIDEr-D and SPICE |

### Other metrics
| Metric name | Python Class | Origin | Range | Short description |
|:---|:---|:---|:---|:---|
| SPIDEr-max [[7]](#spider-max) | `SPIDErMax` | audio captioning | [0, 5.5] | Max of SPIDEr scores over multiple candidates |
| SBERT [[8]](#fense) | `SBERT` | audio captioning | [-1, 1] | Cosine-similarity of **Sentence-BERT embeddings** |
| FluencyError [[8]](#fense) | `FluencyError` | audio captioning | [0, 1] | Uses a pretrained model to detect fluency errors in sentences |
| FENSE [[8]](#fense) | `FENSE` | audio captioning | [-1, 1] | Combines `SBERT` and `FluencyError` |
| SPIDErErr | `SPIDErErr` | audio captioning | [0, 5.5] | Combines `SPIDEr` and `FluencyError` |
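
The FENSE-family metrics above are called the same way as the other metrics. The sketch below assumes `fense` is exposed in `aac_metrics.functional` (the metric was added in 0.2.0) and that it accepts raw, untokenized sentences, since it relies on Sentence-BERT embeddings rather than n-gram matching.

```python
# Sketch: FENSE on raw sentences (no PTB tokenization assumed to be required).
from aac_metrics.functional import fense

candidates = ["a man is speaking"]
mult_references = [["a man speaks.", "someone speaks."]]

corpus_scores, sents_scores = fense(candidates, mult_references)
print(corpus_scores)  # expected keys along the lines of "fense", "sbert", "fluency_error"
```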

## SPIDEr-max metric
SPIDEr-max [[7]](#spider-max) is a metric based on SPIDEr that takes into account multiple candidates for the same audio. It computes the maximum of the SPIDEr scores for each candidate to balance the high sensitivity to the frequency of the words generated by the model.

@@ -99,7 +113,7 @@ The SPIDEr metric used in audio captioning is highly sensitive to the frequencies of the words generated by the model.

Here are two examples with the 5 candidates generated by the beam search algorithm, their corresponding SPIDEr scores and the associated references:

<div align="center">

| Beam search candidates | SPIDEr |
|:---|:---:|
@@ -136,10 +150,9 @@ _(Candidates and references for the Clotho development-testing file named "rain.
| a man is talking and snickering followed by a goat bleating |

_(Candidates and references for an AudioCaps testing file with the id "jid4t-FzUn0")_
</div>


Even with very similar candidates, the SPIDEr scores vary drastically. To address this issue, we proposed the SPIDEr-max metric, which takes the maximum value over several candidates for the same audio. SPIDEr-max demonstrates that SPIDEr can exceed state-of-the-art scores on AudioCaps and Clotho, and even human scores on AudioCaps [[7]](#spider-max).
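
The aggregation itself is simple: score every candidate produced by the decoder for a given audio and keep the best one. The toy sketch below uses made-up scores (not the ones from the tables above) just to show the idea independently of the package:

```python
# Toy illustration of the SPIDEr-max aggregation; the scores are made up.
beam_spider_scores = [
    [0.112, 0.342, 0.127, 0.130, 0.117],  # SPIDEr of the 5 beam candidates for audio 1
    [0.081, 0.085, 0.261, 0.090, 0.088],  # SPIDEr of the 5 beam candidates for audio 2
]

spider_max_per_audio = [max(scores) for scores in beam_spider_scores]
corpus_spider_max = sum(spider_max_per_audio) / len(spider_max_per_audio)
print(spider_max_per_audio)  # [0.342, 0.261]
print(corpus_spider_max)     # 0.3015
```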

### SPIDEr-max: usage
This usage is very similar to the other captioning metrics, the main difference being that it takes a list of multiple candidates per audio as input.
@@ -148,8 +161,8 @@
from aac_metrics.functional import spider_max
from aac_metrics.utils.tokenization import preprocess_mult_sents

mult_candidates: list[list[str]] = [["a man is speaking", "maybe someone speaking"]]
mult_references: list[list[str]] = [["a man speaks.", "someone speaks.", "a man is speaking while a bird is chirping in the background"]]

mult_candidates = preprocess_mult_sents(mult_candidates)
mult_references = preprocess_mult_sents(mult_references)
@@ -174,7 +187,6 @@ sentence-transformers>=2.2.2
```

### External requirements

- `java` >= 1.8 is required to compute METEOR, SPICE and use the PTBTokenizer.
Most of these functions accept a custom Java executable path through the `java_path` argument (see the sketch below).
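
For example, if Java is not on your `PATH`, something along the following lines should work. The `java_path` keyword is documented above; the example path and the rest of the call are assumptions, and note that the PTB tokenization step also needs Java.

```python
# Sketch: pointing a Java-backed metric to a specific Java executable.
# The `java_path` argument is documented above; the path below is hypothetical.
from aac_metrics.functional import meteor
from aac_metrics.utils.tokenization import preprocess_mono_sents, preprocess_mult_sents

candidates = preprocess_mono_sents(["a man is speaking"])
mult_references = preprocess_mult_sents([["a man speaks.", "someone speaks."]])

corpus_scores, _ = meteor(
    candidates,
    mult_references,
    java_path="/usr/lib/jvm/java-11-openjdk/bin/java",  # hypothetical location
)
print(corpus_scores)
```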

@@ -183,9 +195,9 @@

## Additional notes
### CIDEr or CIDEr-D ?
The CIDEr metric [[4]](#cider) differs from CIDEr-D in that CIDEr applies a stemmer to each word before computing the n-grams of the sentences. In AAC, only CIDEr-D is reported and used for SPIDEr in [caption-evaluation-tools](https://github.com/audio-captioning/caption-evaluation-tools), but some papers call it "CIDEr".
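
To make the difference concrete, the sketch below shows what stemming changes about the n-gram units; NLTK's Porter stemmer is used here purely as an illustration and is not a dependency of this package.

```python
# Illustration of the stemming step applied by CIDEr but not by CIDEr-D.
# NLTK's PorterStemmer stands in for the stemmer used by the original code.
from nltk.stem import PorterStemmer

tokens = "a man is speaking while birds are chirping".split()

stemmer = PorterStemmer()
stemmed = [stemmer.stem(tok) for tok in tokens]

print(tokens)   # unigram units counted by CIDEr-D
print(stemmed)  # unigram units counted by CIDEr, e.g. "chirping" -> "chirp"
```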

### Do these metrics work on multi-GPU?
No. Most of these metrics rely on NumPy or on external Java programs, which prevents multi-GPU evaluation for now.

### Is torchmetrics needed for this package ?
@@ -201,7 +213,7 @@ Linguistics - ACL ’02. Philadelphia, Pennsylvania: Association
for Computational Linguistics, 2001, p. 311. [Online]. Available:
http://portal.acm.org/citation.cfm?doid=1073083.1073135

#### ROUGE-L
[2] C.-Y. Lin, “ROUGE: A package for automatic evaluation of summaries,”
in Text Summarization Branches Out. Barcelona, Spain: Association
for Computational Linguistics, Jul. 2004, pp. 74–81. [Online]. Available:
@@ -237,20 +249,20 @@ arXiv: 1612.00370. [Online]. Available: http://arxiv.org/abs/1612.00370
[8] Z. Zhou, Z. Zhang, X. Xu, Z. Xie, M. Wu, and K. Q. Zhu, Can Audio Captions Be Evaluated with Image Caption Metrics? arXiv, 2022. [Online]. Available: http://arxiv.org/abs/2110.04684

## Citation
If you use **SPIDEr-max**, you can cite the following paper using BibTeX:
```
@inproceedings{labbe:hal-03810396,
TITLE = {{Is my automatic audio captioning system so bad? spider-max: a metric to consider several caption candidates}},
AUTHOR = {Labb{\'e}, Etienne and Pellegrini, Thomas and Pinquier, Julien},
URL = {https://hal.archives-ouvertes.fr/hal-03810396},
BOOKTITLE = {{Workshop DCASE}},
ADDRESS = {Nancy, France},
YEAR = {2022},
MONTH = Nov,
KEYWORDS = {audio captioning ; evaluation metric ; beam search ; multiple candidates},
PDF = {https://hal.archives-ouvertes.fr/hal-03810396/file/Labbe_DCASE2022.pdf},
HAL_ID = {hal-03810396},
HAL_VERSION = {v1},
}
```

20 changes: 20 additions & 0 deletions docs/Makefile
@@ -0,0 +1,20 @@
# Minimal makefile for Sphinx documentation
#

# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS ?=
SPHINXBUILD ?= sphinx-build
SOURCEDIR = .
BUILDDIR = _build

# Put it first so that "make" without argument is like "make help".
help:
@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

.PHONY: help Makefile

# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
21 changes: 21 additions & 0 deletions docs/_static/css/custom.css
@@ -0,0 +1,21 @@

/* --- */
/* Add linebreaks for arguments from https://github.com/sphinx-doc/sphinx/issues/1514#issuecomment-742703082 */

/* Newlines (\a) and spaces (\20) before each parameter */
.sig-param::before {
content: "\a\20\20\20\20\20\20\20\20\20\20\20\20\20\20\20\20";
white-space: pre;
}

/* Newline after the last parameter (so the closing bracket is on a new line) */
dt em.sig-param:last-of-type::after {
content: "\a";
white-space: pre;
}

/* To have blue background of width of the block (instead of width of content) */
dl.class > dt:first-of-type {
display: block !important;
}
/* --- */
7 changes: 7 additions & 0 deletions docs/aac_metrics.classes.base.rst
@@ -0,0 +1,7 @@
aac\_metrics.classes.base module
================================

.. automodule:: aac_metrics.classes.base
:members:
:undoc-members:
:show-inheritance:
7 changes: 7 additions & 0 deletions docs/aac_metrics.classes.bleu.rst
@@ -0,0 +1,7 @@
aac\_metrics.classes.bleu module
================================

.. automodule:: aac_metrics.classes.bleu
:members:
:undoc-members:
:show-inheritance:
7 changes: 7 additions & 0 deletions docs/aac_metrics.classes.cider_d.rst
@@ -0,0 +1,7 @@
aac\_metrics.classes.cider\_d module
====================================

.. automodule:: aac_metrics.classes.cider_d
:members:
:undoc-members:
:show-inheritance:
7 changes: 7 additions & 0 deletions docs/aac_metrics.classes.evaluate.rst
@@ -0,0 +1,7 @@
aac\_metrics.classes.evaluate module
====================================

.. automodule:: aac_metrics.classes.evaluate
:members:
:undoc-members:
:show-inheritance:
7 changes: 7 additions & 0 deletions docs/aac_metrics.classes.fense.rst
@@ -0,0 +1,7 @@
aac\_metrics.classes.fense module
=================================

.. automodule:: aac_metrics.classes.fense
:members:
:undoc-members:
:show-inheritance:
7 changes: 7 additions & 0 deletions docs/aac_metrics.classes.fluency_error.rst
@@ -0,0 +1,7 @@
aac\_metrics.classes.fluency\_error module
==========================================

.. automodule:: aac_metrics.classes.fluency_error
:members:
:undoc-members:
:show-inheritance:
7 changes: 7 additions & 0 deletions docs/aac_metrics.classes.meteor.rst
@@ -0,0 +1,7 @@
aac\_metrics.classes.meteor module
==================================

.. automodule:: aac_metrics.classes.meteor
:members:
:undoc-members:
:show-inheritance:
7 changes: 7 additions & 0 deletions docs/aac_metrics.classes.rouge_l.rst
@@ -0,0 +1,7 @@
aac\_metrics.classes.rouge\_l module
====================================

.. automodule:: aac_metrics.classes.rouge_l
:members:
:undoc-members:
:show-inheritance:
