Version 0.1.1
Labbeti committed Sep 30, 2022
1 parent 7cdd3ac commit fc27b01
Showing 32 changed files with 1,446 additions and 174 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/python-package-pip.yaml
@@ -41,7 +41,7 @@ jobs:
${{ runner.os }}-
- name: Install pip dev dependencies + package if needed
if: steps.cache_requirements.outputs.cache-hit == 'false'
if: steps.cache_requirements.outputs.cache-hit != 'true'
run: |
python -m pip install --upgrade pip
python -m pip install -e .[dev]
12 changes: 12 additions & 0 deletions CHANGELOG.md
@@ -2,6 +2,18 @@

All notable changes to this project will be documented in this file.

## [0.1.1] 2022-09-30
### Added
- Documentation for metric functions and classes.
- A second larger example for unit testing.

### Changed
- Updated README information, references and description.

### Fixed
- SPIDEr-max computation with correct global and local outputs.
- Unit testing for computing SPICE metric from caption-evaluation-tools.

## [0.1.0] 2022-09-28
### Added
- BLEU, METEOR, ROUGE-l, SPICE, CIDEr and SPIDEr metrics functions and modules.
26 changes: 26 additions & 0 deletions COCO_LICENCE
@@ -0,0 +1,26 @@
Copyright (c) 2015, Xinlei Chen, Hao Fang, Tsung-Yi Lin, and Ramakrishna Vedantam
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

The views and conclusions contained in the software and documentation are those
of the authors and should not be interpreted as representing official policies,
either expressed or implied, of the FreeBSD Project.
1 change: 1 addition & 0 deletions MANIFEST.in
@@ -4,3 +4,4 @@ global-exclude *.pyc
global-exclude __pycache__

include install_spice.sh
recursive-include examples *.csv
138 changes: 106 additions & 32 deletions README.md
@@ -16,32 +16,50 @@ Audio Captioning metrics source code, designed for Pytorch.
This package is a tool to evaluate sentences produced by automatic models that caption images or audio.
The results of BLEU [1], ROUGE-L [2], METEOR [3], CIDEr [4], SPICE [5] and SPIDEr [6] are consistent with https://github.com/audio-captioning/caption-evaluation-tools.

## Why use this package?
- Easy installation with pip
- Consistent with the audio captioning metrics code at https://github.com/audio-captioning/caption-evaluation-tools
- Provides functions and classes to compute metrics separately
- Provides the SPIDEr-max metric as described in the DCASE paper [7]

## Installation
Install the pip package:
```
pip install https://github.com/Labbeti/aac-metrics
pip install aac-metrics
```

Download the external code needed for METEOR, SPICE and PTBTokenizer:
```
aac-metrics-download
```

<!-- ## Why using this package?
- Easy installation with pip
- Consistent with audio caption metrics https://github.com/audio-captioning/caption-evaluation-tools
- Removes code boilerplate inherited from python 2
- Provides functions and classes to compute metrics separately -->
Note: The external code for SPICE, METEOR and PTBTokenizer is stored in the cache directory (default: `$HOME/aac-metrics-cache/`)

## Metrics
### AAC metrics
| Metric | Origin | Range | Short description |
|:---:|:---:|:---:|:---:|
| BLEU [1] | machine translation | [0, 1] | Precision of n-grams |
| ROUGE-L [2] | machine translation | [0, 1] | FScore of the longest common subsequence |
| METEOR [3] | machine translation | [0, 1] | Cosine-similarity of frequencies |
| CIDEr-D [4] | image captioning | [0, 10] | Cosine-similarity of TF-IDF |
| SPICE [5] | image captioning | [0, 1] | FScore of semantic graph |
| SPIDEr [6] | image captioning | [0, 5.5] | Mean of CIDEr-D and SPICE |

### Other metrics
| Metric | Origin | Range | Short description |
|:---:|:---:|:---:|:---:|
| SPIDEr-max [7] | audio captioning | [0, 5.5] | Max of SPIDEr scores over multiple candidates |
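
As a quick sanity check on the ranges above: SPIDEr is defined as the mean of CIDEr-D and SPICE, so its upper bound is (10 + 1) / 2 = 5.5. The snippet below only illustrates that arithmetic; it is not the package's internal code.

```python
def spider_from_parts(cider_d_score: float, spice_score: float) -> float:
    # SPIDEr is the arithmetic mean of CIDEr-D (range [0, 10]) and
    # SPICE (range [0, 1]), hence its overall range of [0, 5.5].
    return (cider_d_score + spice_score) / 2.0

print(spider_from_parts(10.0, 1.0))  # 5.5, the upper bound of SPIDEr
```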

## Examples
## Usage
### Evaluate AAC metrics
The full evaluation process to compute AAC metrics can be done with the `aac_metrics.aac_evaluate` function.

### Evaluate all metrics
```python
from aac_metrics import aac_evaluate

candidates = ["a man is speaking", ...]
mult_references = [["a man speaks.", "someone speaks.", "a man is speaking while a bird is chirping in the background"], ...]
candidates: list[str] = ["a man is speaking", ...]
mult_references: list[list[str]] = [["a man speaks.", "someone speaks.", "a man is speaking while a bird is chirping in the background"], ...]

global_scores, _ = aac_evaluate(candidates, mult_references)
print(global_scores)
@@ -50,11 +68,17 @@ print(global_scores)
```

### Evaluate a specific metric
Evaluating a specific metric can be done with the `aac_metrics.functional.<metric_name>.<metric_name>` function. Unlike `aac_evaluate`, these functions do not apply the PTBTokenizer to the sentences, but you can do it beforehand with the `preprocess_mono_sents` and `preprocess_mult_sents` functions.

```python
from aac_metrics.functional import coco_cider_d
from aac_metrics.utils.tokenization import preprocess_mono_sents, preprocess_mult_sents

candidates: list[str] = ["a man is speaking", ...]
mult_references: list[list[str]] = [["a man speaks.", "someone speaks.", "a man is speaking while a bird is chirping in the background"], ...]

candidates = [...]
mult_references = [[...], ...]
candidates = preprocess_mono_sents(candidates)
mult_references = preprocess_mult_sents(mult_references)

global_scores, local_scores = coco_cider_d(candidates, mult_references)
print(global_scores)
@@ -63,18 +87,72 @@ print(local_scores)
# {"cider_d": tensor([0.9, ...])}
```

### Experimental SPIDEr-max metric
Each metric also exists as a Python class version, like `aac_metrics.classes.coco_cider_d.CocoCIDErD` (see the sketch below).
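
A minimal sketch of the class-based API is shown below. It assumes the class can be called like the functional `coco_cider_d` and returns the same `(global_scores, local_scores)` pair; the exact constructor arguments may differ.

```python
from aac_metrics.classes.coco_cider_d import CocoCIDErD
from aac_metrics.utils.tokenization import preprocess_mono_sents, preprocess_mult_sents

candidates: list[str] = ["a man is speaking"]
mult_references: list[list[str]] = [["a man speaks.", "someone speaks."]]

# As with the functional API, tokenization is done beforehand.
candidates = preprocess_mono_sents(candidates)
mult_references = preprocess_mult_sents(mult_references)

cider_d = CocoCIDErD()
global_scores, local_scores = cider_d(candidates, mult_references)
print(global_scores)  # e.g. {"cider_d": tensor(...)}
```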

## SPIDEr-max
SPIDEr-max [7] is a metric based on SPIDEr that takes into account multiple candidates for the same audio. It computes the maximum of the SPIDEr scores over these candidates to counterbalance SPIDEr's high sensitivity to the frequency of the words generated by the model.

### SPIDEr-max: why?
The SPIDEr metric used in audio captioning is highly sensitive to the frequency of the words used.

Here are a few examples of candidates and references for two different audio files, with their associated SPIDEr scores:

| Candidates | SPIDEr |
|:---|:---:|
| heavy rain is falling on a roof | 0.562 |
| heavy rain is falling on a **tin** roof | **0.930** |
| a heavy rain is falling on a roof | 0.594 |
| a heavy rain is falling on the ground | 0.335 |
| a heavy rain is falling on the roof | 0.594 |

| References |
|:---|
| heavy rain falls loudly onto a structure with a thin roof |
| heavy rainfall falling onto a thin structure with a thin roof |
| it is raining hard and the rain hits a tin roof |
| rain that is pouring down very hard outside |
| the hard rain is noisy as it hits a tin roof |

(References for the Clotho development-testing file named "rain.wav")

| Candidates | SPIDEr |
|:---|:---:|
| a woman speaks and a sheep bleats | 0.190 |
| a woman speaks and a **goat** bleats | **1.259** |
| a man speaks and a sheep bleats | 0.344 |
| an adult male speaks and a sheep bleats | 0.231 |
| an adult male is speaking and a sheep bleats | 0.189 |

| References |
|:---|
| a man speaking and laughing followed by a goat bleat |
| a man is speaking in high tone while a goat is bleating one time |
| a man speaks followed by a goat bleat |
| a person speaks and a goat bleats |
| a man is talking and snickering followed by a goat bleating |

(References for an AudioCaps testing file (id: "jid4t-FzUn0"))

Even with very similar candidates, the SPIDEr scores vary drastically. To address this issue, we propose the SPIDEr-max metric, which takes the maximum SPIDEr value over several candidates for the same audio.
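
To make the reduction concrete, here is a tiny illustrative snippet using the SPIDEr scores from the first table above (the values come from the table, not from running the metric):

```python
# SPIDEr scores of the five candidates for the "rain.wav" example above.
spider_scores = [0.562, 0.930, 0.594, 0.335, 0.594]

# SPIDEr-max keeps the best SPIDEr score among the candidates for this audio.
spider_max_score = max(spider_scores)
print(spider_max_score)  # 0.930
```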

### SPIDEr-max: usage
Its usage is very similar to the other captioning metrics; the main difference is that it takes a list of multiple candidates per audio as input.

```python
from aac_metrics.functional import spider_max
from aac_metrics.utils.tokenization import preprocess_mult_sents

mult_candidates: list[list[str]] = [["a man is speaking", "maybe someone speaking"], ...]
mult_references: list[list[str]] = [["a man speaks.", "someone speaks.", "a man is speaking while a bird is chirping in the background"], ...]

mult_candidates = [[...], ...]
mult_references = [[...], ...]
mult_candidates = preprocess_mult_sents(mult_candidates)
mult_references = preprocess_mult_sents(mult_references)

global_scores, local_scores = spider_max(mult_candidates, mult_references)
print(global_scores)
# {"spider": tensor(0.1)}
# {"spider": tensor(0.1), ...}
print(local_scores)
# {"spider": tensor([0.9, ...])}
# {"spider": tensor([0.9, ...]), ...}
```

## Requirements
@@ -95,23 +173,14 @@ Most of these functions can specify a java executable path with `java_path` argument

- `unzip` command to extract SPICE zipped files.

## Metrics

### Coco metrics
| Metric | Origin | Range | Short description |
|:---:|:---:|:---:|:---:|
| BLEU [1] | machine translation | [0, 1] | Precision of n-grams |
| ROUGE-L [2] | machine translation | [0, 1] | Longest common subsequence |
| METEOR [3] | machine translation | [0, 1] | Cosine-similarity of frequencies |
| CIDEr [4] | image captioning | [0, 10] | Cosine-similarity of TF-IDF |
| SPICE [5] | image captioning | [0, 1] | FScore of semantic graph |
| SPIDEr [6] | image captioning | [0, 5.5] | Mean of CIDEr and SPICE |
## Additional notes
### CIDEr or CIDEr-D?
The CIDEr [4] metric differs from CIDEr-D because it applies a stemmer to each word before computing the n-grams of the sentences. In AAC, only CIDEr-D is reported and used for SPIDEr, but some papers call it "CIDEr".
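
The snippet below illustrates how stemming changes the n-grams. It is only an illustration using NLTK's `PorterStemmer`, which is an assumption here and not necessarily the stemmer used by the original CIDEr code.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
tokens = "heavy rain is falling on a roof".split()
stemmed = [stemmer.stem(t) for t in tokens]
print(stemmed)  # ['heavi', 'rain', 'is', 'fall', 'on', 'a', 'roof']

# Bigrams built on stemmed tokens (CIDEr-style) differ from bigrams built
# on raw tokens (CIDEr-D-style), e.g. ('is', 'fall') vs ('is', 'falling').
bigrams_raw = list(zip(tokens, tokens[1:]))
bigrams_stemmed = list(zip(stemmed, stemmed[1:]))
print(bigrams_raw[2], bigrams_stemmed[2])
```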

### Other metrics
<!-- TODO : cite workshop paper for SPIDEr-max -->
| Metric | Origin | Range | Short description |
|:---:|:---:|:---:|:---:|
| SPIDEr-max | audio captioning | [0, 5.5] | Max of multiples candidates SPIDEr scores |
### Is torchmetrics needed for this package?
No. But if torchmetrics is installed, all metric classes will inherit from the base class `torchmetrics.Metric`.
This is because most of the metrics do not use PyTorch tensors to compute scores, and numpy arrays or strings cannot be added to the states of `torchmetrics.Metric`.
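
One common way to implement this kind of optional base class is sketched below. This is an illustrative pattern, not the actual aac-metrics source code.

```python
try:
    from torchmetrics import Metric as _BaseMetric  # optional dependency
except ImportError:
    class _BaseMetric:  # plain fallback base when torchmetrics is absent
        pass

class MyCaptioningMetric(_BaseMetric):
    """Toy metric that keeps python strings as state instead of torch tensors."""

    def __init__(self) -> None:
        super().__init__()
        self._candidates: list[str] = []
        self._mult_references: list[list[str]] = []

    def update(self, candidates: list[str], mult_references: list[list[str]]) -> None:
        # Strings cannot be registered with torchmetrics' add_state(),
        # so they are stored as plain python attributes.
        self._candidates += candidates
        self._mult_references += mult_references

    def compute(self) -> float:
        # Placeholder score: average number of references per candidate.
        return sum(len(refs) for refs in self._mult_references) / max(len(self._candidates), 1)
```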

## References
[1] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “BLEU: a
@@ -145,6 +214,11 @@ Captioning via Policy Gradient optimization of SPIDEr,” 2017 IEEE Inter-
national Conference on Computer Vision (ICCV), pp. 873–881, Oct. 2017,
arXiv: 1612.00370. [Online]. Available: http://arxiv.org/abs/1612.00370

<!-- TODO : update ref -->
Note: the following reference is **temporary**:

[7] E. Labbe, T. Pellegrini, J. Pinquier, "IS MY AUTOMATIC AUDIO CAPTIONING SYSTEM SO BAD? SPIDEr-max: A METRIC TO CONSIDER SEVERAL CAPTION CANDIDATES", DCASE2022 Workshop.

## Cite the aac-metrics package
The associated paper has been accepted, but it will be published after the DCASE2022 workshop.
