
refactoring answer metrics (BLEU, METEOR, ROUGE, etc.) #398

Open
Eastsidegunn opened this issue Dec 26, 2023 · 2 comments

@Eastsidegunn
Contributor

We have a few metrics that fall into categories, and metrics within a category could be made controllable with a few parameters (an n-gram-based metric could let the user choose n). That would make them more flexible!

In my opinion, we should:

  • give the metrics default parameter values (perhaps matching the defaults of the wrapped metric library); a sketch follows below

Any more ideas?
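For illustration, a minimal sketch of the idea. `NGramMetric` and its `n` parameter are hypothetical names, and NLTK's `sentence_bleu` stands in for the wrapped library; this is not RAGchain's actual API:

```python
# A minimal sketch, not RAGchain code. NGramMetric and its `n` parameter
# are hypothetical; NLTK's sentence_bleu stands in for the wrapped library.
from nltk.translate.bleu_score import sentence_bleu


class NGramMetric:
    """An n-gram-category metric whose n is a constructor parameter,
    defaulting to the wrapped library's default (4 for BLEU)."""

    def __init__(self, n: int = 4):
        self.n = n

    def compute(self, prediction: str, reference: str) -> float:
        # Uniform weights over 1..n grams, mirroring NLTK's default for n=4.
        weights = tuple(1.0 / self.n for _ in range(self.n))
        return sentence_bleu([reference.split()], prediction.split(), weights=weights)
```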

@Eastsidegunn Eastsidegunn self-assigned this Dec 26, 2023
@vkehfdl1
Contributor

It would give the metrics great flexibility, but I don't think it has to be implemented in RAGchain.
There are mainly two reasons.

  1. Each metric might have a conventional setting.
    We target RAG workflows and their research, so there must be conventional hyperparameters or setups for each metric. Ideally, every benchmark should run with the same settings, so it is better that we suggest the conventional setup for each metric.
  2. Users can calculate scores with their own metrics after getting their results.
    We return the whole pd.DataFrame that contains the question, answer, ground-truth answer, etc. Users can easily compute scores with their own metric from this DataFrame. So, if someone wants to score their results with a new metric, they can do that easily; see the sketch below. (Maybe we can write a guide for that later. I did it once with the Rare F1 metric.)
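For illustration, a minimal sketch of point 2, scoring the returned DataFrame with a custom token-level F1. The column names ('question', 'answer', 'gt_answer') are assumptions here, not necessarily RAGchain's actual schema:

```python
# A minimal sketch: score the evaluator's result DataFrame with your own metric.
# Column names are assumed for illustration.
from collections import Counter

import pandas as pd


def token_f1(pred: str, gt: str) -> float:
    """Token-level F1 between a prediction and a ground-truth answer."""
    pred_tokens, gt_tokens = pred.split(), gt.split()
    overlap = sum((Counter(pred_tokens) & Counter(gt_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gt_tokens)
    return 2 * precision * recall / (precision + recall)


result_df = pd.DataFrame({
    "question": ["Who wrote Hamlet?"],
    "answer": ["Shakespeare wrote it"],
    "gt_answer": ["William Shakespeare"],
})
result_df["f1"] = result_df.apply(lambda r: token_f1(r["answer"], r["gt_answer"]), axis=1)
print(result_df["f1"].mean())
```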

Plus, I think it would make our evaluator too complicated to use. Sometimes a framework should restrict flexibility for ease of use.

@Eastsidegunn
Contributor Author

Ok,

I think that adding an EM (Exact Match) metric is one well-defined step.
(That is my conclusion from surveying many benchmarks.)
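For illustration, a minimal EM sketch using the common SQuAD-style answer normalization (lowercasing, stripping punctuation, articles, and extra whitespace); this is an illustration, not existing RAGchain code:

```python
# A minimal EM (Exact Match) sketch with SQuAD-style normalization.
import re
import string


def normalize(text: str) -> str:
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)  # drop English articles
    return " ".join(text.split())  # collapse extra whitespace


def exact_match(prediction: str, ground_truth: str) -> float:
    return float(normalize(prediction) == normalize(ground_truth))
```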

Actually, I can't identify a setup that I could suggest as conventional.
The BLEU and ROUGE scores wrap official (maybe...? at least commonly used) libraries,
and they have many variations depending on n or on perspective.
This problem clearly needs to be solved by our evaluator.

In my view, the evaluator should expose the following functions (see the sketch after this list):

  • add a custom normalizer and tokenizer
  • add a metric function to the existing metric list
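For illustration, a minimal sketch of those two extension points. All names here (`Evaluator`, `add_metric`, the normalizer/tokenizer parameters) are hypothetical, not an existing RAGchain API:

```python
# A minimal sketch of an evaluator with a pluggable normalizer, tokenizer,
# and metric list. Names are hypothetical.
from typing import Callable, Dict, List

Normalizer = Callable[[str], str]
Tokenizer = Callable[[str], List[str]]
MetricFn = Callable[[List[str], List[str]], float]  # (pred_tokens, gt_tokens) -> score


class Evaluator:
    def __init__(self, normalizer: Normalizer = str.lower,
                 tokenizer: Tokenizer = str.split):
        self.normalizer = normalizer  # custom normalizer (extension point 1)
        self.tokenizer = tokenizer    # custom tokenizer (extension point 1)
        self.metrics: Dict[str, MetricFn] = {}

    def add_metric(self, name: str, fn: MetricFn) -> None:
        """Add a metric function to the existing metric list (extension point 2)."""
        self.metrics[name] = fn

    def evaluate(self, prediction: str, ground_truth: str) -> Dict[str, float]:
        pred = self.tokenizer(self.normalizer(prediction))
        gt = self.tokenizer(self.normalizer(ground_truth))
        return {name: fn(pred, gt) for name, fn in self.metrics.items()}


# Usage: register a custom metric without touching the evaluator itself.
ev = Evaluator()
ev.add_metric("unigram_recall",
              lambda pred, gt: sum(t in pred for t in gt) / len(gt) if gt else 0.0)
print(ev.evaluate("The cat sat", "the cat"))  # {'unigram_recall': 1.0}
```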

How about a metric_expanded version?
Like metric.py, metric_expanded would be a collection of various metrics that are not officially(?) accepted.
The name is tentative.
