Commit

Version 0.4.5
Labbeti committed Sep 12, 2023
1 parent 228d77b commit a5f056f
Showing 34 changed files with 629 additions and 166 deletions.
4 changes: 3 additions & 1 deletion .github/workflows/python-package-pip.yaml
@@ -10,6 +10,7 @@ on:

env:
CACHE_NUMBER: 0 # increase to reset cache manually
TMPDIR: '/tmp'

# Cancel workflow if a new push occurs
concurrency:
@@ -49,8 +50,9 @@ jobs:
- name: Install package
shell: bash
# note: ${GITHUB_REF##*/} gives the branch name
# note 2: 'dev' here selects the dev dependencies extra, not a branch name
run: |
python -m pip install "aac-metrics[${GITHUB_REF_NAME}] @ git+https://github.com/Labbeti/aac-metrics@${GITHUB_REF##*/}"
python -m pip install "aac-metrics[dev] @ git+https://github.com/Labbeti/aac-metrics@${GITHUB_REF##*/}"
- name: Load cache of external code and data
uses: actions/cache@master
11 changes: 11 additions & 0 deletions CHANGELOG.md
@@ -2,6 +2,17 @@

All notable changes to this project will be documented in this file.

## [0.4.5] 2023-09-12
### Added
- Argument `use_shell` for the `METEOR` and `SPICE` metrics and the `download` function, to fix a Windows-specific error.
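
A minimal sketch of the new argument (the example sentences are illustrative, and the exact defaults of the other parameters may differ):

```python
from aac_metrics.functional import meteor

candidates = ["a man is speaking", "rain falls"]
mult_references = [
    ["a man speaks.", "someone speaks."],
    ["rain is falling hard on a surface", "heavy rain hits a roof"],
]

# Requires the external METEOR code fetched by `aac-metrics-download`.
# use_shell=True launches the external Java tool through the shell,
# which works around the Windows-specific startup error.
corpus_scores, _ = meteor(candidates, mult_references, use_shell=True)
```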

### Changed
- Rename `evaluate.py` script to `eval.py`.

### Fixed
- Workflow on main branch.
- Examples in the README and documentation now use at least 2 sentences, and a warning was added to all metrics that require at least 2 candidates.

## [0.4.4] 2023-08-14
### Added
- `Evaluate` class now implements a `__hash__` and `tolist()` methods.
4 changes: 2 additions & 2 deletions CITATION.cff
@@ -19,5 +19,5 @@ keywords:
- captioning
- audio-captioning
license: MIT
version: 0.4.4
date-released: '2023-08-14'
version: 0.4.5
date-released: '2023-09-12'
26 changes: 14 additions & 12 deletions README.md
@@ -49,8 +49,8 @@ aac-metrics-download
```

Notes:
- The external code for SPICE, METEOR and PTBTokenizer is stored in `$HOME/.cache/aac-metrics`.
- The weights of the FENSE fluency error detector and the SBERT model are respectively stored by default in `$HOME/.cache/torch/hub/fense_data` and `$HOME/.cache/torch/sentence_transformers`.
- The external code for SPICE, METEOR and PTBTokenizer is stored in `~/.cache/aac-metrics`.
- The weights of the FENSE fluency error detector and the SBERT model are respectively stored by default in `~/.cache/torch/hub/fense_data` and `~/.cache/torch/sentence_transformers`.

## Usage
### Evaluate default metrics
@@ -59,13 +59,13 @@ The full evaluation pipeline to compute AAC metrics can be done with `aac_metric
```python
from aac_metrics import evaluate

candidates: list[str] = ["a man is speaking"]
mult_references: list[list[str]] = [["a man speaks.", "someone speaks.", "a man is speaking while a bird is chirping in the background"]]
candidates: list[str] = ["a man is speaking", "rain falls"]
mult_references: list[list[str]] = [["a man speaks.", "someone speaks.", "a man is speaking while a bird is chirping in the background"], ["rain is falling hard on a surface"]]

corpus_scores, _ = evaluate(candidates, mult_references)
print(corpus_scores)
# dict containing the score of each metric: "bleu_1", "bleu_2", "bleu_3", "bleu_4", "rouge_l", "meteor", "cider_d", "spice", "spider"
# {"bleu_1": tensor(0.7), "bleu_2": ..., ...}
# {"bleu_1": tensor(0.4278), "bleu_2": ..., ...}
```
### Evaluate DCASE2023 metrics
To compute the metrics for the DCASE2023 challenge, just set the argument `metrics="dcase2023"` in the `evaluate` function call.
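
For instance, a minimal call might look like this (the sentences reuse the previous example; the exact set of metrics selected by `"dcase2023"` is defined by the package):

```python
from aac_metrics import evaluate

candidates = ["a man is speaking", "rain falls"]
mult_references = [
    ["a man speaks.", "someone speaks.", "a man is speaking while a bird is chirping in the background"],
    ["rain is falling hard on a surface"],
]

# metrics="dcase2023" selects the DCASE2023 challenge metric set instead of the default metrics.
corpus_scores, _ = evaluate(candidates, mult_references, metrics="dcase2023")
print(corpus_scores)
```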
@@ -83,17 +83,17 @@ Evaluate a specific metric can be done using the `aac_metrics.functional.<metric
from aac_metrics.functional import cider_d
from aac_metrics.utils.tokenization import preprocess_mono_sents, preprocess_mult_sents

candidates: list[str] = ["a man is speaking"]
mult_references: list[list[str]] = [["a man speaks.", "someone speaks.", "a man is speaking while a bird is chirping in the background"]]
candidates: list[str] = ["a man is speaking", "rain falls"]
mult_references: list[list[str]] = [["a man speaks.", "someone speaks.", "a man is speaking while a bird is chirping in the background"], ["rain is falling hard on a surface"]]

candidates = preprocess_mono_sents(candidates)
mult_references = preprocess_mult_sents(mult_references)

corpus_scores, sents_scores = cider_d(candidates, mult_references)
print(corpus_scores)
# {"cider_d": tensor(0.1)}
# {"cider_d": tensor(0.9614)}
print(sents_scores)
# {"cider_d": tensor([0.9, ...])}
# {"cider_d": tensor([1.3641, 0.5587])}
```

Each metric also exists as a Python class version, like `aac_metrics.classes.cider_d.CIDErD`.
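
A small sketch of the class-based API (the `update`/`compute` pattern follows the class diffs shown later in this commit; the constructor argument shown is illustrative):

```python
from aac_metrics.classes.cider_d import CIDErD
from aac_metrics.utils.tokenization import preprocess_mono_sents, preprocess_mult_sents

candidates = preprocess_mono_sents(["a man is speaking", "rain falls"])
mult_references = preprocess_mult_sents([
    ["a man speaks.", "someone speaks.", "a man is speaking while a bird is chirping in the background"],
    ["rain is falling hard on a surface"],
])

cider = CIDErD(return_all_scores=True)
cider.update(candidates, mult_references)  # can be called once per batch to accumulate
corpus_scores, sents_scores = cider.compute()
print(corpus_scores)
```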
@@ -119,7 +119,8 @@ Each metrics also exists as a python class version, like `aac_metrics.classes.ci
| SPIDEr-FL [[9]](#spider-fl) | `SPIDErFL` | audio captioning | [0, 5.5] | Combines SPIDEr and Fluency Error |

## Requirements
This package has been developed for Ubuntu 20.04, and it is expected to work on most Linux distributions.
This package has been developed for Ubuntu 20.04, and it is expected to work on most Linux distributions. Windows is not officially supported.

### Python packages


@@ -130,6 +131,7 @@ numpy >= 1.21.2
pyyaml >= 6.0
tqdm >= 4.64.0
sentence-transformers >= 2.2.2
transformers < 4.31.0
```

### External requirements
@@ -215,10 +217,10 @@ If you use this software, please consider cite it as below :
Labbe_aac-metrics_2023,
author = {Labbé, Etienne},
license = {MIT},
month = {8},
month = {9},
title = {{aac-metrics}},
url = {https://github.com/Labbeti/aac-metrics/},
version = {0.4.4},
version = {0.4.5},
year = {2023},
}
```
7 changes: 7 additions & 0 deletions docs/aac_metrics.eval.rst
@@ -0,0 +1,7 @@
aac\_metrics.eval module
========================

.. automodule:: aac_metrics.eval
:members:
:undoc-members:
:show-inheritance:
24 changes: 12 additions & 12 deletions docs/usage.rst
@@ -4,19 +4,19 @@ Usage
Evaluate default AAC metrics
############################

The full evaluation process to compute AAC metrics can be done with `aac_metrics.aac_evaluate` function.
The full evaluation process to compute AAC metrics can be done with `aac_metrics.dcase2023_evaluate` function.

.. code-block:: python
from aac_metrics import aac_evaluate
from aac_metrics import evaluate
candidates: list[str] = ["a man is speaking", ...]
mult_references: list[list[str]] = [["a man speaks.", "someone speaks.", "a man is speaking while a bird is chirping in the background"], ...]
candidates: list[str] = ["a man is speaking", "rain falls"]
mult_references: list[list[str]] = [["a man speaks.", "someone speaks.", "a man is speaking while a bird is chirping in the background"], ["rain is falling hard on a surface"]]
corpus_scores, _ = aac_evaluate(candidates, mult_references)
corpus_scores, _ = evaluate(candidates, mult_references)
print(corpus_scores)
# dict containing the score of each aac metric: "bleu_1", "bleu_2", "bleu_3", "bleu_4", "rouge_l", "meteor", "cider_d", "spice", "spider"
# {"bleu_1": tensor(0.7), "bleu_2": ..., ...}
# {"bleu_1": tensor(0.4278), "bleu_2": ..., ...}
Evaluate a specific metric
@@ -25,24 +25,24 @@ Evaluate a specific metric
Evaluating a specific metric can be done using the `aac_metrics.functional.<metric_name>.<metric_name>` function or the `aac_metrics.classes.<metric_name>.<metric_name>` class.

.. warning::
Unlike `aac_evaluate`, the tokenization with PTBTokenizer is not done with these functions, but you can do it manually with `preprocess_mono_sents` and `preprocess_mult_sents` functions.
Unlike `dcase2023_evaluate`, the tokenization with PTBTokenizer is not done with these functions, but you can do it manually with `preprocess_mono_sents` and `preprocess_mult_sents` functions.

.. code-block:: python
from aac_metrics.functional import cider_d
from aac_metrics.utils.tokenization import preprocess_mono_sents, preprocess_mult_sents
candidates: list[str] = ["a man is speaking", ...]
mult_references: list[list[str]] = [["a man speaks.", "someone speaks.", "a man is speaking while a bird is chirping in the background"], ...]
candidates: list[str] = ["a man is speaking", "rain falls"]
mult_references: list[list[str]] = [["a man speaks.", "someone speaks.", "a man is speaking while a bird is chirping in the background"], ["rain is falling hard on a surface"]]
candidates = preprocess_mono_sents(candidates)
mult_references = preprocess_mult_sents(mult_references)
corpus_scores, sents_scores = cider_d(candidates, mult_references)
print(corpus_scores)
# {"cider_d": tensor(0.1)}
# {"cider_d": tensor(0.9614)}
print(sents_scores)
# {"cider_d": tensor([0.9, ...])}
# {"cider_d": tensor([1.3641, 0.5587])}
Each metric also exists as a Python class version, like `aac_metrics.classes.cider_d.CIDErD`.
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -39,7 +39,7 @@ Changelog = "https://github.com/Labbeti/aac-metrics/blob/main/CHANGELOG.md"
[project.scripts]
aac-metrics = "aac_metrics.__main__:_print_usage"
aac-metrics-download = "aac_metrics.download:_main_download"
aac-metrics-evaluate = "aac_metrics.evaluate:_main_evaluate"
aac-metrics-eval = "aac_metrics.eval:_main_eval"
aac-metrics-info = "aac_metrics.info:print_install_info"

[project.optional-dependencies]
9 changes: 5 additions & 4 deletions src/aac_metrics/__init__.py
@@ -10,7 +10,7 @@
__license__ = "MIT"
__maintainer__ = "Etienne Labbé (Labbeti)"
__status__ = "Development"
__version__ = "0.4.4"
__version__ = "0.4.5"


from .classes.base import AACMetric
@@ -65,9 +65,10 @@ def load_metric(name: str, **kwargs) -> AACMetric:
name = name.lower().strip()

factory = _get_metric_factory_classes(**kwargs)
if name in factory:
return factory[name]()
else:
if name not in factory:
raise ValueError(
f"Invalid argument {name=}. (expected one of {tuple(factory.keys())})"
)

metric = factory[name]()
return metric
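
As a usage note, the refactored lookup still returns a metric instance for valid names and now raises before calling the factory for invalid ones; a rough sketch, assuming `"cider_d"` is one of the registered names:

```python
from aac_metrics import load_metric

metric = load_metric("cider_d")  # the name is lower-cased and stripped before the lookup

try:
    load_metric("not_a_metric")
except ValueError as err:
    # The error message lists the expected metric names.
    print(err)
```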
13 changes: 7 additions & 6 deletions src/aac_metrics/classes/bleu.py
@@ -53,12 +53,13 @@ def __init__(

def compute(self) -> Union[tuple[dict[str, Tensor], dict[str, Tensor]], Tensor]:
return _bleu_compute(
self._cooked_cands,
self._cooked_mrefs,
self._return_all_scores,
self._n,
self._option,
self._verbose,
cooked_cands=self._cooked_cands,
cooked_mrefs=self._cooked_mrefs,
return_all_scores=self._return_all_scores,
n=self._n,
option=self._option,
verbose=self._verbose,
return_1_to_n=False,
)

def extra_repr(self) -> str:
26 changes: 13 additions & 13 deletions src/aac_metrics/classes/cider_d.py
@@ -49,13 +49,13 @@ def __init__(

def compute(self) -> Union[tuple[dict[str, Tensor], dict[str, Tensor]], Tensor]:
return _cider_d_compute(
self._cooked_cands,
self._cooked_mrefs,
self._return_all_scores,
self._n,
self._sigma,
self._return_tfidf,
self._scale,
cooked_cands=self._cooked_cands,
cooked_mrefs=self._cooked_mrefs,
return_all_scores=self._return_all_scores,
n=self._n,
sigma=self._sigma,
return_tfidf=self._return_tfidf,
scale=self._scale,
)

def extra_repr(self) -> str:
@@ -75,10 +75,10 @@ def update(
mult_references: list[list[str]],
) -> None:
self._cooked_cands, self._cooked_mrefs = _cider_d_update(
candidates,
mult_references,
self._n,
self._tokenizer,
self._cooked_cands,
self._cooked_mrefs,
candidates=candidates,
mult_references=mult_references,
n=self._n,
tokenizer=self._tokenizer,
prev_cooked_cands=self._cooked_cands,
prev_cooked_mrefs=self._cooked_mrefs,
)
2 changes: 1 addition & 1 deletion src/aac_metrics/classes/evaluate.py
@@ -108,7 +108,7 @@ def __hash__(self) -> int:
class DCASE2023Evaluate(Evaluate):
"""Evaluate candidates with multiple references with DCASE2023 Audio Captioning metrics.
For more information, see :func:`~aac_metrics.functional.evaluate.aac_evaluate`.
For more information, see :func:`~aac_metrics.functional.evaluate.dcase2023_evaluate`.
"""

def __init__(
28 changes: 14 additions & 14 deletions src/aac_metrics/classes/fense.py
@@ -42,7 +42,7 @@ def __init__(
device: Union[str, torch.device, None] = "auto",
batch_size: int = 32,
reset_state: bool = True,
return_probs: bool = True,
return_probs: bool = False,
penalty: float = 0.9,
verbose: int = 0,
) -> None:
@@ -66,19 +66,19 @@ def __init__(

def compute(self) -> Union[tuple[dict[str, Tensor], dict[str, Tensor]], Tensor]:
return fense(
self._candidates,
self._mult_references,
self._return_all_scores,
self._sbert_model,
self._echecker,
self._echecker_tokenizer,
self._error_threshold,
self._device,
self._batch_size,
self._reset_state,
self._return_probs,
self._penalty,
self._verbose,
candidates=self._candidates,
mult_references=self._mult_references,
return_all_scores=self._return_all_scores,
sbert_model=self._sbert_model,
echecker=self._echecker,
echecker_tokenizer=self._echecker_tokenizer,
error_threshold=self._error_threshold,
device=self._device,
batch_size=self._batch_size,
reset_state=self._reset_state,
return_probs=self._return_probs,
penalty=self._penalty,
verbose=self._verbose,
)

def extra_repr(self) -> str:
22 changes: 11 additions & 11 deletions src/aac_metrics/classes/fluerr.py
@@ -44,7 +44,7 @@ def __init__(
device: Union[str, torch.device, None] = "auto",
batch_size: int = 32,
reset_state: bool = True,
return_probs: bool = True,
return_probs: bool = False,
verbose: int = 0,
) -> None:
echecker, echecker_tokenizer = _load_echecker_and_tokenizer(echecker, None, device, reset_state, verbose) # type: ignore
@@ -64,16 +64,16 @@ def __init__(

def compute(self) -> Union[tuple[dict[str, Tensor], dict[str, Tensor]], Tensor]:
return fluerr(
self._candidates,
self._return_all_scores,
self._echecker,
self._echecker_tokenizer,
self._error_threshold,
self._device,
self._batch_size,
self._reset_state,
self._return_probs,
self._verbose,
candidates=self._candidates,
return_all_scores=self._return_all_scores,
echecker=self._echecker,
echecker_tokenizer=self._echecker_tokenizer,
error_threshold=self._error_threshold,
device=self._device,
batch_size=self._batch_size,
reset_state=self._reset_state,
return_probs=self._return_probs,
verbose=self._verbose,
)

def extra_repr(self) -> str:
