[Feature] update save config file #84

Merged (2 commits) on Nov 21, 2024

@akiFQC (Collaborator) commented Nov 21, 2024

Related Issue / PR

#46

Behavior change after merging this PR

When running an evaluation with JMTEB, save a config file named `jmteb_config.yaml` in the `--save_dir` directory.

What was done to achieve the behavior change

Added code for saving the config to `src/jmteb/__main__.py`.
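
The PR description does not show the added code, so the following is only a minimal sketch of the saving step, written with the standard library alone. `to_yaml_lines` and `save_config` are hypothetical helpers, not functions from JMTEB, and the tiny emitter handles only nested dicts of scalars; a real implementation would normally delegate serialization to a YAML library or to the argument parser.

```python
import tempfile
from pathlib import Path


def to_yaml_lines(value: dict, indent: int = 0) -> list:
    """Render a nested dict of scalars as YAML lines (hypothetical helper).

    Simplified stand-in for a real YAML serializer: it handles only
    nested dicts with bool, None, number, and plain string values.
    """
    pad = " " * indent
    lines = []
    for key, item in value.items():
        if isinstance(item, dict) and item:
            # Non-empty mapping: emit the key, then the children indented by 2.
            lines.append(f"{pad}{key}:")
            lines.extend(to_yaml_lines(item, indent + 2))
        elif isinstance(item, bool):
            # YAML booleans are lowercase.
            lines.append(f"{pad}{key}: {'true' if item else 'false'}")
        elif item is None:
            lines.append(f"{pad}{key}: null")
        else:
            # Numbers, plain strings, and empty dicts ("{}") are written as-is.
            lines.append(f"{pad}{key}: {item}")
    return lines


def save_config(config: dict, save_dir: str, filename: str = "jmteb_config.yaml") -> Path:
    """Write the config into save_dir (created if missing) and return the file path."""
    out_dir = Path(save_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / filename
    out_path.write_text("\n".join(to_yaml_lines(config)) + "\n", encoding="utf-8")
    return out_path


if __name__ == "__main__":
    # Illustrative values only; a real run would pass the parsed CLI config.
    demo = {"log_predictions": True, "embedder": {"init_args": {"batch_size": 16}}}
    print(save_config(demo, tempfile.mkdtemp()).read_text(encoding="utf-8"))
```

Creating the directory with `exist_ok=True` keeps the call safe when `--save_dir` already holds earlier results.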

Verification

  • Confirmed that the tests pass
  • Confirmed that the merge target is the dev branch

@akiFQC requested a review from lsz05 November 21, 2024 05:09

@akiFQC (Collaborator, Author) commented Nov 21, 2024

This is a sample of `jmteb_config.yaml`:

evaluators:
  amazon_counterfactual_classification:
    class_path: jmteb.evaluators.ClassificationEvaluator
    init_args:
      train_dataset:
        class_path: jmteb.evaluators.classification.data.HfClassificationDataset
        init_args:
          path: sbintuitions/JMTEB
          split: train
          name: amazon_counterfactual_classification
          text_key: text
          label_key: label
      val_dataset:
        class_path: jmteb.evaluators.classification.data.HfClassificationDataset
        init_args:
          path: sbintuitions/JMTEB
          split: validation
          name: amazon_counterfactual_classification
          text_key: text
          label_key: label
      test_dataset:
        class_path: jmteb.evaluators.classification.data.HfClassificationDataset
        init_args:
          path: sbintuitions/JMTEB
          split: test
          name: amazon_counterfactual_classification
          text_key: text
          label_key: label
      average: macro
      log_predictions: false
  amazon_review_classification:
    class_path: jmteb.evaluators.ClassificationEvaluator
    init_args:
      train_dataset:
        class_path: jmteb.evaluators.classification.data.HfClassificationDataset
        init_args:
          path: sbintuitions/JMTEB
          split: train
          name: amazon_review_classification
          text_key: text
          label_key: label
      val_dataset:
        class_path: jmteb.evaluators.classification.data.HfClassificationDataset
        init_args:
          path: sbintuitions/JMTEB
          split: validation
          name: amazon_review_classification
          text_key: text
          label_key: label
      test_dataset:
        class_path: jmteb.evaluators.classification.data.HfClassificationDataset
        init_args:
          path: sbintuitions/JMTEB
          split: test
          name: amazon_review_classification
          text_key: text
          label_key: label
      average: macro
      log_predictions: false
  esci:
    class_path: jmteb.evaluators.RerankingEvaluator
    init_args:
      val_query_dataset:
        class_path: jmteb.evaluators.reranking.data.HfRerankingQueryDataset
        init_args:
          path: sbintuitions/JMTEB
          split: validation
          name: esci-query
          query_key: query
          retrieved_docs_key: retrieved_docs
          relevance_scores_key: relevance_scores
      test_query_dataset:
        class_path: jmteb.evaluators.reranking.data.HfRerankingQueryDataset
        init_args:
          path: sbintuitions/JMTEB
          split: test
          name: esci-query
          query_key: query
          retrieved_docs_key: retrieved_docs
          relevance_scores_key: relevance_scores
      doc_dataset:
        class_path: jmteb.evaluators.reranking.data.HfRerankingDocDataset
        init_args:
          path: sbintuitions/JMTEB
          split: corpus
          name: esci-corpus
          id_key: docid
          text_key: text
      log_predictions: false
      top_n_docs_to_log: 5
  jagovfaqs_22k:
    class_path: jmteb.evaluators.RetrievalEvaluator
    init_args:
      val_query_dataset:
        class_path: jmteb.evaluators.retrieval.data.HfRetrievalQueryDataset
        init_args:
          path: sbintuitions/JMTEB
          split: validation
          name: jagovfaqs_22k-query
          query_key: query
          relevant_docs_key: relevant_docs
      test_query_dataset:
        class_path: jmteb.evaluators.retrieval.data.HfRetrievalQueryDataset
        init_args:
          path: sbintuitions/JMTEB
          split: test
          name: jagovfaqs_22k-query
          query_key: query
          relevant_docs_key: relevant_docs
      doc_dataset:
        class_path: jmteb.evaluators.retrieval.data.HfRetrievalDocDataset
        init_args:
          path: sbintuitions/JMTEB
          split: corpus
          name: jagovfaqs_22k-corpus
          id_key: docid
          text_key: text
      doc_chunk_size: 1000000
      log_predictions: false
      top_n_docs_to_log: 5
  jaqket:
    class_path: jmteb.evaluators.RetrievalEvaluator
    init_args:
      val_query_dataset:
        class_path: jmteb.evaluators.retrieval.data.HfRetrievalQueryDataset
        init_args:
          path: sbintuitions/JMTEB
          split: validation
          name: jaqket-query
          query_key: query
          relevant_docs_key: relevant_docs
      test_query_dataset:
        class_path: jmteb.evaluators.retrieval.data.HfRetrievalQueryDataset
        init_args:
          path: sbintuitions/JMTEB
          split: test
          name: jaqket-query
          query_key: query
          relevant_docs_key: relevant_docs
      doc_dataset:
        class_path: jmteb.evaluators.retrieval.data.HfRetrievalDocDataset
        init_args:
          path: sbintuitions/JMTEB
          split: corpus
          name: jaqket-corpus
          id_key: docid
          text_key: text
      doc_chunk_size: 1000000
      log_predictions: false
      top_n_docs_to_log: 5
  jsick:
    class_path: jmteb.evaluators.STSEvaluator
    init_args:
      val_dataset:
        class_path: jmteb.evaluators.sts.data.HfSTSDataset
        init_args:
          path: sbintuitions/JMTEB
          split: validation
          name: jsick
          sentence1_key: sentence1
          sentence2_key: sentence2
          label_key: label
      test_dataset:
        class_path: jmteb.evaluators.sts.data.HfSTSDataset
        init_args:
          path: sbintuitions/JMTEB
          split: test
          name: jsick
          sentence1_key: sentence1
          sentence2_key: sentence2
          label_key: label
      log_predictions: false
  jsts:
    class_path: jmteb.evaluators.STSEvaluator
    init_args:
      val_dataset:
        class_path: jmteb.evaluators.sts.data.HfSTSDataset
        init_args:
          path: sbintuitions/JMTEB
          split: train
          name: jsts
          sentence1_key: sentence1
          sentence2_key: sentence2
          label_key: label
      test_dataset:
        class_path: jmteb.evaluators.sts.data.HfSTSDataset
        init_args:
          path: sbintuitions/JMTEB
          split: test
          name: jsts
          sentence1_key: sentence1
          sentence2_key: sentence2
          label_key: label
      log_predictions: false
  livedoor_news:
    class_path: jmteb.evaluators.ClusteringEvaluator
    init_args:
      val_dataset:
        class_path: jmteb.evaluators.clustering.data.HfClusteringDataset
        init_args:
          path: sbintuitions/JMTEB
          split: validation
          name: livedoor_news
          text_key: text
          label_key: label
      test_dataset:
        class_path: jmteb.evaluators.clustering.data.HfClusteringDataset
        init_args:
          path: sbintuitions/JMTEB
          split: test
          name: livedoor_news
          text_key: text
          label_key: label
      log_predictions: false
  massive_intent_classification:
    class_path: jmteb.evaluators.ClassificationEvaluator
    init_args:
      train_dataset:
        class_path: jmteb.evaluators.classification.data.HfClassificationDataset
        init_args:
          path: sbintuitions/JMTEB
          split: train
          name: massive_intent_classification
          text_key: text
          label_key: label
      val_dataset:
        class_path: jmteb.evaluators.classification.data.HfClassificationDataset
        init_args:
          path: sbintuitions/JMTEB
          split: validation
          name: massive_intent_classification
          text_key: text
          label_key: label
      test_dataset:
        class_path: jmteb.evaluators.classification.data.HfClassificationDataset
        init_args:
          path: sbintuitions/JMTEB
          split: test
          name: massive_intent_classification
          text_key: text
          label_key: label
      average: macro
      log_predictions: false
  massive_scenario_classification:
    class_path: jmteb.evaluators.ClassificationEvaluator
    init_args:
      train_dataset:
        class_path: jmteb.evaluators.classification.data.HfClassificationDataset
        init_args:
          path: sbintuitions/JMTEB
          split: train
          name: massive_scenario_classification
          text_key: text
          label_key: label
      val_dataset:
        class_path: jmteb.evaluators.classification.data.HfClassificationDataset
        init_args:
          path: sbintuitions/JMTEB
          split: validation
          name: massive_scenario_classification
          text_key: text
          label_key: label
      test_dataset:
        class_path: jmteb.evaluators.classification.data.HfClassificationDataset
        init_args:
          path: sbintuitions/JMTEB
          split: test
          name: massive_scenario_classification
          text_key: text
          label_key: label
      average: macro
      log_predictions: false
  mewsc16:
    class_path: jmteb.evaluators.ClusteringEvaluator
    init_args:
      val_dataset:
        class_path: jmteb.evaluators.clustering.data.HfClusteringDataset
        init_args:
          path: sbintuitions/JMTEB
          split: validation
          name: mewsc16_ja
          text_key: text
          label_key: label
      test_dataset:
        class_path: jmteb.evaluators.clustering.data.HfClusteringDataset
        init_args:
          path: sbintuitions/JMTEB
          split: test
          name: mewsc16_ja
          text_key: text
          label_key: label
      log_predictions: false
  mrtydi:
    class_path: jmteb.evaluators.RetrievalEvaluator
    init_args:
      val_query_dataset:
        class_path: jmteb.evaluators.retrieval.data.HfRetrievalQueryDataset
        init_args:
          path: sbintuitions/JMTEB
          split: validation
          name: mrtydi-query
          query_key: query
          relevant_docs_key: relevant_docs
      test_query_dataset:
        class_path: jmteb.evaluators.retrieval.data.HfRetrievalQueryDataset
        init_args:
          path: sbintuitions/JMTEB
          split: test
          name: mrtydi-query
          query_key: query
          relevant_docs_key: relevant_docs
      doc_dataset:
        class_path: jmteb.evaluators.retrieval.data.HfRetrievalDocDataset
        init_args:
          path: sbintuitions/JMTEB
          split: corpus
          name: mrtydi-corpus
          id_key: docid
          text_key: text
      doc_chunk_size: 10000
      log_predictions: false
      top_n_docs_to_log: 5
  nlp_journal_abs_intro:
    class_path: jmteb.evaluators.RetrievalEvaluator
    init_args:
      val_query_dataset:
        class_path: jmteb.evaluators.retrieval.data.HfRetrievalQueryDataset
        init_args:
          path: sbintuitions/JMTEB
          split: validation
          name: nlp_journal_abs_intro-query
          query_key: query
          relevant_docs_key: relevant_docs
      test_query_dataset:
        class_path: jmteb.evaluators.retrieval.data.HfRetrievalQueryDataset
        init_args:
          path: sbintuitions/JMTEB
          split: test
          name: nlp_journal_abs_intro-query
          query_key: query
          relevant_docs_key: relevant_docs
      doc_dataset:
        class_path: jmteb.evaluators.retrieval.data.HfRetrievalDocDataset
        init_args:
          path: sbintuitions/JMTEB
          split: corpus
          name: nlp_journal_abs_intro-corpus
          id_key: docid
          text_key: text
      doc_chunk_size: 1000000
      log_predictions: false
      top_n_docs_to_log: 5
  nlp_journal_title_abs:
    class_path: jmteb.evaluators.RetrievalEvaluator
    init_args:
      val_query_dataset:
        class_path: jmteb.evaluators.retrieval.data.HfRetrievalQueryDataset
        init_args:
          path: sbintuitions/JMTEB
          split: validation
          name: nlp_journal_title_abs-query
          query_key: query
          relevant_docs_key: relevant_docs
      test_query_dataset:
        class_path: jmteb.evaluators.retrieval.data.HfRetrievalQueryDataset
        init_args:
          path: sbintuitions/JMTEB
          split: test
          name: nlp_journal_title_abs-query
          query_key: query
          relevant_docs_key: relevant_docs
      doc_dataset:
        class_path: jmteb.evaluators.retrieval.data.HfRetrievalDocDataset
        init_args:
          path: sbintuitions/JMTEB
          split: corpus
          name: nlp_journal_title_abs-corpus
          id_key: docid
          text_key: text
      doc_chunk_size: 1000000
      log_predictions: false
      top_n_docs_to_log: 5
  nlp_journal_title_intro:
    class_path: jmteb.evaluators.RetrievalEvaluator
    init_args:
      val_query_dataset:
        class_path: jmteb.evaluators.retrieval.data.HfRetrievalQueryDataset
        init_args:
          path: sbintuitions/JMTEB
          split: validation
          name: nlp_journal_title_intro-query
          query_key: query
          relevant_docs_key: relevant_docs
      test_query_dataset:
        class_path: jmteb.evaluators.retrieval.data.HfRetrievalQueryDataset
        init_args:
          path: sbintuitions/JMTEB
          split: test
          name: nlp_journal_title_intro-query
          query_key: query
          relevant_docs_key: relevant_docs
      doc_dataset:
        class_path: jmteb.evaluators.retrieval.data.HfRetrievalDocDataset
        init_args:
          path: sbintuitions/JMTEB
          split: corpus
          name: nlp_journal_title_intro-corpus
          id_key: docid
          text_key: text
      doc_chunk_size: 1000000
      log_predictions: false
      top_n_docs_to_log: 5
  paws_x_ja:
    class_path: jmteb.evaluators.PairClassificationEvaluator
    init_args:
      val_dataset:
        class_path: jmteb.evaluators.pair_classification.data.HfPairClassificationDataset
        init_args:
          path: sbintuitions/JMTEB
          split: validation
          name: paws_x_ja
          sentence1_key: sentence1
          sentence2_key: sentence2
          label_key: label
      test_dataset:
        class_path: jmteb.evaluators.pair_classification.data.HfPairClassificationDataset
        init_args:
          path: sbintuitions/JMTEB
          split: test
          name: paws_x_ja
          sentence1_key: sentence1
          sentence2_key: sentence2
          label_key: label
save_dir: /somewhere/checkpoints/checkpoint-100/jmteb_evaluation
overwrite_cache: true
log_predictions: true
embedder:
  class_path: jmteb.embedders.DataParallelSentenceBertEmbedder
  init_args:
    model_name_or_path: /somewhere/checkpoints/checkpoint-100
    batch_size: 16384
    normalize_embeddings: false
    max_seq_length: 512
    add_eos: false
    model_kwargs:
      torch_dtype: torch.bfloat16
    auto_find_batch_size: true
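
Every component in the sample above is described by a `class_path` plus `init_args` pair, nested where one component takes another as an argument. As a rough illustration of how such a spec can be turned back into objects, here is a minimal sketch, assuming the config follows the jsonargparse-style convention. `instantiate` is a hypothetical helper rather than JMTEB code, and standard-library classes stand in for the evaluators and datasets so the example runs without JMTEB installed.

```python
import datetime
import importlib
from typing import Any


def instantiate(spec: Any) -> Any:
    """Recursively build objects from {"class_path": ..., "init_args": ...} specs.

    Values without a "class_path" key (scalars, plain dicts) are returned unchanged.
    """
    if isinstance(spec, dict) and "class_path" in spec:
        # Split "package.module.ClassName" into module path and class name.
        module_name, _, class_name = spec["class_path"].rpartition(".")
        cls = getattr(importlib.import_module(module_name), class_name)
        # Nested specs inside init_args are instantiated first.
        kwargs = {k: instantiate(v) for k, v in spec.get("init_args", {}).items()}
        return cls(**kwargs)
    return spec


if __name__ == "__main__":
    # Stand-in spec: a timezone whose offset is itself a nested spec.
    spec = {
        "class_path": "datetime.timezone",
        "init_args": {
            "offset": {"class_path": "datetime.timedelta", "init_args": {"hours": 9}},
            "name": "JST",
        },
    }
    tz = instantiate(spec)
    print(datetime.datetime(2024, 11, 21, tzinfo=tz).isoformat())
```

Because the saved file records the full class paths and arguments, the same run can in principle be reconstructed later from `jmteb_config.yaml` alone.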

@lsz05 (Collaborator) left a comment

Thank you! LGTM!

@lsz05 changed the title from "update save config file" to "[Feature] update save config file" on Nov 21, 2024
@lsz05 merged commit ccadd5d into dev on Nov 21, 2024 (3 checks passed)
@lsz05 mentioned this pull request on Dec 11, 2024 (1 task)