Commit ab737ee

feat: conll2003 and ontonotes ner configs (#1691)

Authored by LogicZMaksimka, IgnatovFedor, and vaskonov
Co-authored-by: Fedor Ignatov <[email protected]>
Co-authored-by: vasily <[email protected]>

Parent: 9447636

File tree: 8 files changed, +260 −63 lines

README.md

Lines changed: 31 additions & 60 deletions

````diff
@@ -1,62 +1,29 @@
+# DeepPavlov 1.0
+
 [![License Apache 2.0](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](LICENSE)
 ![Python 3.6, 3.7, 3.8, 3.9, 3.10, 3.11](https://img.shields.io/badge/python-3.6%20%7C%203.7%20%7C%203.8%20%7C%203.9%20%7C%203.10%20%7C%203.11-green.svg)
 [![Downloads](https://pepy.tech/badge/deeppavlov)](https://pepy.tech/project/deeppavlov)
-<img align="right" height="27%" width="27%" src="docs/_static/deeppavlov_logo.png"/>
+[![Static Badge](https://img.shields.io/badge/DeepPavlov%20Community-blue)](https://forum.deeppavlov.ai/)
+[![Static Badge](https://img.shields.io/badge/DeepPavlov%20Demo-blue)](https://demo.deeppavlov.ai/)
 
-DeepPavlov is an open-source conversational AI library built on [PyTorch](https://pytorch.org/).
 
-DeepPavlov is designed for
-* development of production ready chat-bots and complex conversational systems,
-* research in the area of NLP and, particularly, of dialog systems.
+DeepPavlov 1.0 is an open-source NLP framework built on [PyTorch](https://pytorch.org/) and [transformers](https://github.com/huggingface/transformers). DeepPavlov 1.0 is created for modular and configuration-driven development of state-of-the-art NLP models and supports a wide range of NLP model applications. DeepPavlov 1.0 is designed for practitioners with limited knowledge of NLP/ML.
 
 ## Quick Links
 
-* Demo [*demo.deeppavlov.ai*](https://demo.deeppavlov.ai/)
-* Documentation [*docs.deeppavlov.ai*](http://docs.deeppavlov.ai/)
-* Model List [*docs:features/*](http://docs.deeppavlov.ai/en/master/features/overview.html)
-* Contribution Guide [*docs:contribution_guide/*](http://docs.deeppavlov.ai/en/master/devguides/contribution_guide.html)
-* Issues [*github/issues/*](https://github.com/deeppavlov/DeepPavlov/issues)
-* Forum [*forum.deeppavlov.ai*](https://forum.deeppavlov.ai/)
-* Blogs [*medium.com/deeppavlov*](https://medium.com/deeppavlov)
-* [Extended colab tutorials](https://github.com/deeppavlov/dp_tutorials)
-* Docker Hub [*hub.docker.com/u/deeppavlov/*](https://hub.docker.com/u/deeppavlov/)
-* Docker Images Documentation [*docs:docker-images/*](http://docs.deeppavlov.ai/en/master/intro/installation.html#docker-images)
-
-Please leave us [your feedback](https://forms.gle/i64fowQmiVhMMC7f9) on how we can improve the DeepPavlov framework.
-
-**Models**
-
-[Named Entity Recognition](http://docs.deeppavlov.ai/en/master/features/models/NER.html) | [Intent/Sentence Classification](http://docs.deeppavlov.ai/en/master/features/models/classification.html) |
-
-[Question Answering over Text (SQuAD)](http://docs.deeppavlov.ai/en/master/features/models/SQuAD.html) | [Knowledge Base Question Answering](http://docs.deeppavlov.ai/en/master/features/models/KBQA.html)
-
-[Sentence Similarity/Ranking](http://docs.deeppavlov.ai/en/master/features/models/neural_ranking.html) | [TF-IDF Ranking](http://docs.deeppavlov.ai/en/master/features/models/tfidf_ranking.html)
-
-[Syntactic Parsing](http://docs.deeppavlov.ai/en/master/features/models/syntax_parser.html) | [Morphological Tagging](http://docs.deeppavlov.ai/en/master/features/models/morpho_tagger.html)
-
-[Automatic Spelling Correction](http://docs.deeppavlov.ai/en/master/features/models/spelling_correction.html) | [Entity Extraction](http://docs.deeppavlov.ai/en/master/features/models/entity_extraction.html)
-
-[Open Domain Questions Answering](http://docs.deeppavlov.ai/en/master/features/models/ODQA.html) | [Russian SuperGLUE](http://docs.deeppavlov.ai/en/master/features/models/superglue.html)
-
-[Relation Extraction](http://docs.deeppavlov.ai/en/master/features/models/relation_extraction.html)
-
-**Embeddings**
-
-[BERT embeddings for the Russian, Polish, Bulgarian, Czech, and informal English](http://docs.deeppavlov.ai/en/master/features/pretrained_vectors.html#bert)
+|Name|Description|
+|--|--|
+| ⭐️ [*Demo*](https://demo.deeppavlov.ai/)|Check out our NLP models in the online demo|
+| 📚 [*Documentation*](http://docs.deeppavlov.ai/)|How to use DeepPavlov 1.0 and its features|
+| 🚀 [*Model List*](http://docs.deeppavlov.ai/en/master/features/overview.html)|Find the NLP model you need in the list of available models|
+| 🪐 [*Contribution Guide*](http://docs.deeppavlov.ai/en/master/devguides/contribution_guide.html)|Please read the contribution guidelines before making a contribution|
+| 🎛 [*Issues*](https://github.com/deeppavlov/DeepPavlov/issues)|If you have an issue with DeepPavlov, please let us know|
+| 💬 [*Forum*](https://forum.deeppavlov.ai/)|Ask questions and discuss DeepPavlov on the forum|
+| 📦 [*Blogs*](https://medium.com/deeppavlov)|Read about our current development|
+| 🦙 [Extended colab tutorials](https://github.com/deeppavlov/dp_tutorials)|Check out the code tutorials for our models|
+| 🌌 [*Docker Hub*](https://hub.docker.com/u/deeppavlov/)|Check out the Docker images for rapid deployment|
+| 👩‍🏫 [*Feedback*](https://forms.gle/i64fowQmiVhMMC7f9)|Please leave us your feedback to make DeepPavlov better|
 
-[ELMo embeddings for the Russian language](http://docs.deeppavlov.ai/en/master/features/pretrained_vectors.html#elmo)
-
-[FastText embeddings for the Russian language](http://docs.deeppavlov.ai/en/master/features/pretrained_vectors.html#fasttext)
-
-**Auto ML**
-
-[Tuning Models](http://docs.deeppavlov.ai/en/master/features/hypersearch.html)
-
-**Integrations**
-
-[REST API](http://docs.deeppavlov.ai/en/master/integrations/rest_api.html) | [Socket API](http://docs.deeppavlov.ai/en/master/integrations/socket_api.html)
-
-[Amazon AWS](http://docs.deeppavlov.ai/en/master/integrations/aws_ec2.html)
 
 ## Installation
 
@@ -65,11 +32,14 @@ Please leave us [your feedback](https://forms.gle/i64fowQmiVhMMC7f9) on how we c
 
 1. Create and activate a virtual environment:
 * `Linux`
+
 ```
 python -m venv env
 source ./env/bin/activate
 ```
+
 2. Install the package inside the environment:
+
 ```
 pip install deeppavlov
 ```
@@ -122,7 +92,7 @@ Dataset will be downloaded regardless of whether there was `-d` flag or not.
 
 To train on your own data you need to modify dataset reader path in the
 [train config doc](http://docs.deeppavlov.ai/en/master/intro/config_description.html#train-config).
-The data format is specified in the corresponding model doc page.
+The data format is specified in the corresponding model doc page.
 
 There are even more actions you can perform with configs:
 
@@ -131,20 +101,19 @@ python -m deeppavlov <action> <config_path> [-d] [-i]
 ```
 
 * `<action>` can be
-* `install` to install model requirements (same as `-i`),
-* `download` to download model's data (same as `-d`),
-* `train` to train the model on the data specified in the config file,
-* `evaluate` to calculate metrics on the same dataset,
-* `interact` to interact via CLI,
-* `riseapi` to run a REST API server (see
+* `install` to install model requirements (same as `-i`),
+* `download` to download model's data (same as `-d`),
+* `train` to train the model on the data specified in the config file,
+* `evaluate` to calculate metrics on the same dataset,
+* `interact` to interact via CLI,
+* `riseapi` to run a REST API server (see
 [doc](http://docs.deeppavlov.ai/en/master/integrations/rest_api.html)),
-* `predict` to get prediction for samples from *stdin* or from
+* `predict` to get prediction for samples from *stdin* or from
 *<file_path>* if `-f <file_path>` is specified.
 * `<config_path>` specifies path (or name) of model's config file
 * `-d` downloads required data
 * `-i` installs model requirements
 
-
 ### Python
 
 To get predictions from a model interactively through Python, run
@@ -157,7 +126,9 @@ model = build_model(<config_path>, install=True, download=True)
 # get predictions for 'input_text1', 'input_text2'
 model(['input_text1', 'input_text2'])
 ```
+
 where
+
 * `install=True` installs model requirements (optional),
 * `download=True` downloads required data from web - pretrained model files and embeddings (optional),
 * `<config_path>` is model name (e.g. `'ner_ontonotes_bert_mult'`), path to the chosen model's config file (e.g.
@@ -174,7 +145,7 @@ model = train_model(<config_path>, install=True, download=True)
 
 To train on your own data you need to modify dataset reader path in the
 [train config doc](http://docs.deeppavlov.ai/en/master/intro/config_description.html#train-config).
-The data format is specified in the corresponding model doc page.
+The data format is specified in the corresponding model doc page.
 
 You can also calculate metrics on the dataset specified in your config file:
 
````
deeppavlov/_meta.py

Lines changed: 1 addition & 1 deletion

````diff
@@ -1,4 +1,4 @@
-__version__ = '1.6.0'
+__version__ = '1.7.0'
 __author__ = 'Neural Networks and Deep Learning lab, MIPT'
 __description__ = 'An open source library for building end-to-end dialog systems and training chatbots.'
 __keywords__ = ['NLP', 'NER', 'SQUAD', 'Intents', 'Chatbot']
````
New config file: CoNLL-2003 NER with a DeBERTa encoder and a CRF head

Lines changed: 134 additions & 0 deletions

```json
{
  "dataset_reader": {
    "class_name": "conll2003_reader",
    "data_path": "{DOWNLOADS_PATH}/conll2003/",
    "dataset_name": "conll2003",
    "provide_pos": false
  },
  "dataset_iterator": {
    "class_name": "data_learning_iterator"
  },
  "chainer": {
    "in": [
      "x"
    ],
    "in_y": [
      "y"
    ],
    "pipe": [
      {
        "class_name": "torch_transformers_ner_preprocessor",
        "vocab_file": "{TRANSFORMER}",
        "do_lower_case": false,
        "max_seq_length": 512,
        "max_subword_length": 15,
        "token_masking_prob": 0.0,
        "in": [
          "x"
        ],
        "out": [
          "x_tokens",
          "x_subword_tokens",
          "x_subword_tok_ids",
          "startofword_markers",
          "attention_mask",
          "tokens_offsets"
        ]
      },
      {
        "id": "tag_vocab",
        "class_name": "simple_vocab",
        "unk_token": [
          "O"
        ],
        "pad_with_zeros": true,
        "save_path": "{MODEL_PATH}/tag.dict",
        "load_path": "{MODEL_PATH}/tag.dict",
        "fit_on": [
          "y"
        ],
        "in": [
          "y"
        ],
        "out": [
          "y_ind"
        ]
      },
      {
        "class_name": "torch_transformers_sequence_tagger",
        "n_tags": "#tag_vocab.len",
        "pretrained_bert": "{TRANSFORMER}",
        "attention_probs_keep_prob": 0.5,
        "use_crf": true,
        "encoder_layer_ids": [
          -1
        ],
        "save_path": "{MODEL_PATH}/model",
        "load_path": "{MODEL_PATH}/model",
        "in": [
          "x_subword_tok_ids",
          "attention_mask",
          "startofword_markers"
        ],
        "in_y": [
          "y_ind"
        ],
        "out": [
          "y_pred_ind",
          "probas"
        ]
      },
      {
        "ref": "tag_vocab",
        "in": [
          "y_pred_ind"
        ],
        "out": [
          "y_pred"
        ]
      }
    ],
    "out": [
      "x_tokens",
      "y_pred"
    ]
  },
  "train": {
    "metrics": [
      {
        "name": "ner_f1",
        "inputs": [
          "y",
          "y_pred"
        ]
      },
      {
        "name": "ner_token_f1",
        "inputs": [
          "y",
          "y_pred"
        ]
      }
    ],
    "evaluation_targets": [
      "valid",
      "test"
    ],
    "class_name": "torch_trainer"
  },
  "metadata": {
    "variables": {
      "ROOT_PATH": "~/.deeppavlov",
      "DOWNLOADS_PATH": "{ROOT_PATH}/downloads",
      "MODELS_PATH": "{ROOT_PATH}/models",
      "TRANSFORMER": "microsoft/deberta-v3-base",
      "MODEL_PATH": "{MODELS_PATH}/ner_conll2003_deberta_crf"
    },
    "download": [
      {
        "url": "http://files.deeppavlov.ai/v1/ner/ner_conll2003_deberta_crf.tar.gz",
        "subdir": "{MODEL_PATH}"
      }
    ]
  }
}
```
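The `simple_vocab` component in the config above fits on the training tags (`"fit_on": ["y"]`) and maps unseen tags to the `"O"` unknown token. A minimal sketch of that behavior, with class and method names invented here for illustration (not DeepPavlov's implementation):

```python
# Sketch of a tag vocabulary like the config's `simple_vocab` step:
# fit on gold tag sequences, then convert tags to indices, sending
# any unknown tag to the "O" unk_token. Names are illustrative only.
from collections import Counter

class SimpleTagVocab:
    def __init__(self, unk_token="O"):
        self.unk_token = unk_token
        self.t2i = {}   # tag -> index
        self.i2t = []   # index -> tag

    def fit(self, tagged_sentences):
        counts = Counter(t for sent in tagged_sentences for t in sent)
        # unk token first so unknown/padding tags share a stable index
        for tag in [self.unk_token] + sorted(t for t in counts if t != self.unk_token):
            self.t2i[tag] = len(self.i2t)
            self.i2t.append(tag)

    def __call__(self, tags):
        return [self.t2i.get(t, self.t2i[self.unk_token]) for t in tags]

vocab = SimpleTagVocab()
vocab.fit([["B-PER", "I-PER", "O"], ["B-LOC", "O"]])
print(vocab(["B-PER", "O", "B-MISC"]))  # B-MISC was never seen -> maps to "O"
```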
New config file: OntoNotes NER with a DeBERTa encoder and a CRF head

Lines changed: 86 additions & 0 deletions

```json
{
  "dataset_reader": {
    "class_name": "conll2003_reader",
    "data_path": "{DOWNLOADS_PATH}/ontonotes/",
    "dataset_name": "ontonotes",
    "provide_pos": false
  },
  "dataset_iterator": {
    "class_name": "data_learning_iterator"
  },
  "chainer": {
    "in": ["x"],
    "in_y": ["y"],
    "pipe": [
      {
        "class_name": "torch_transformers_ner_preprocessor",
        "vocab_file": "{TRANSFORMER}",
        "do_lower_case": false,
        "max_seq_length": 512,
        "max_subword_length": 15,
        "token_masking_prob": 0.0,
        "in": ["x"],
        "out": ["x_tokens", "x_subword_tokens", "x_subword_tok_ids", "startofword_markers", "attention_mask", "tokens_offsets"]
      },
      {
        "id": "tag_vocab",
        "class_name": "simple_vocab",
        "unk_token": ["O"],
        "pad_with_zeros": true,
        "save_path": "{MODEL_PATH}/tag.dict",
        "load_path": "{MODEL_PATH}/tag.dict",
        "fit_on": ["y"],
        "in": ["y"],
        "out": ["y_ind"]
      },
      {
        "class_name": "torch_transformers_sequence_tagger",
        "n_tags": "#tag_vocab.len",
        "pretrained_bert": "{TRANSFORMER}",
        "attention_probs_keep_prob": 0.5,
        "use_crf": true,
        "encoder_layer_ids": [-1],
        "save_path": "{MODEL_PATH}/model",
        "load_path": "{MODEL_PATH}/model",
        "in": ["x_subword_tok_ids", "attention_mask", "startofword_markers"],
        "in_y": ["y_ind"],
        "out": ["y_pred_ind", "probas"]
      },
      {
        "ref": "tag_vocab",
        "in": ["y_pred_ind"],
        "out": ["y_pred"]
      }
    ],
    "out": ["x_tokens", "y_pred"]
  },
  "train": {
    "metrics": [
      {
        "name": "ner_f1",
        "inputs": ["y", "y_pred"]
      },
      {
        "name": "ner_token_f1",
        "inputs": ["y", "y_pred"]
      }
    ],
    "evaluation_targets": ["valid", "test"],
    "class_name": "torch_trainer"
  },
  "metadata": {
    "variables": {
      "ROOT_PATH": "~/.deeppavlov",
      "DOWNLOADS_PATH": "{ROOT_PATH}/downloads",
      "MODELS_PATH": "{ROOT_PATH}/models",
      "TRANSFORMER": "microsoft/deberta-v3-base",
      "MODEL_PATH": "{MODELS_PATH}/ner_ontonotes_deberta_crf"
    },
    "download": [
      {
        "url": "http://files.deeppavlov.ai/v1/ner/ner_ontonotes_deberta_crf.tar.gz",
        "subdir": "{MODEL_PATH}"
      }
    ]
  }
}
```
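The `metadata.variables` block in these configs chains placeholders: `MODEL_PATH` depends on `MODELS_PATH`, which depends on `ROOT_PATH`. A sketch of resolving them in declaration order (this is an assumption about the mechanics for illustration, not DeepPavlov's actual resolver code):

```python
# Sketch: resolve {VARIABLE} placeholders in a config's metadata.variables
# section, in declaration order, as the chained definitions require.
variables = {
    "ROOT_PATH": "~/.deeppavlov",
    "DOWNLOADS_PATH": "{ROOT_PATH}/downloads",
    "MODELS_PATH": "{ROOT_PATH}/models",
    "TRANSFORMER": "microsoft/deberta-v3-base",
    "MODEL_PATH": "{MODELS_PATH}/ner_ontonotes_deberta_crf",
}

resolved = {}
for name, value in variables.items():
    # substitute every already-resolved variable into the current value
    for key, val in resolved.items():
        value = value.replace("{" + key + "}", val)
    resolved[name] = value

print(resolved["MODEL_PATH"])  # → ~/.deeppavlov/models/ner_ontonotes_deberta_crf
```

Because dicts preserve insertion order, each definition only ever references variables declared before it.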

deeppavlov/core/common/requirements_registry.json

Lines changed: 3 additions & 1 deletion

````diff
@@ -148,7 +148,9 @@
     ],
     "torch_transformers_ner_preprocessor": [
         "{DEEPPAVLOV_PATH}/requirements/pytorch.txt",
-        "{DEEPPAVLOV_PATH}/requirements/transformers.txt"
+        "{DEEPPAVLOV_PATH}/requirements/transformers.txt",
+        "{DEEPPAVLOV_PATH}/requirements/sentencepiece.txt",
+        "{DEEPPAVLOV_PATH}/requirements/protobuf.txt"
     ],
     "torch_transformers_nll_ranker": [
         "{DEEPPAVLOV_PATH}/requirements/pytorch.txt",
````
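The registry maps a component class name to the requirement files it needs, with `{DEEPPAVLOV_PATH}` expanded at lookup time. A sketch of that lookup (the function and the example path are assumptions for illustration; only the registry entry itself comes from this commit):

```python
# Sketch: look up a component's requirement files in a registry like
# requirements_registry.json and expand the {DEEPPAVLOV_PATH} placeholder.
# The entry below is the one modified by this commit.
import json

registry_json = """
{
  "torch_transformers_ner_preprocessor": [
    "{DEEPPAVLOV_PATH}/requirements/pytorch.txt",
    "{DEEPPAVLOV_PATH}/requirements/transformers.txt",
    "{DEEPPAVLOV_PATH}/requirements/sentencepiece.txt",
    "{DEEPPAVLOV_PATH}/requirements/protobuf.txt"
  ]
}
"""
registry = json.loads(registry_json)

def requirements_for(component, deeppavlov_path="/opt/deeppavlov"):
    # unknown components simply have no extra requirements
    return [p.replace("{DEEPPAVLOV_PATH}", deeppavlov_path)
            for p in registry.get(component, [])]

paths = requirements_for("torch_transformers_ner_preprocessor")
print(paths[-1])  # → /opt/deeppavlov/requirements/protobuf.txt
```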

deeppavlov/requirements/protobuf.txt

Lines changed: 1 addition & 0 deletions

````diff
@@ -0,0 +1 @@
+protobuf<=3.20
````
deeppavlov/requirements/sentencepiece.txt (new file; the path is inferred from the registry entry that references it)

Lines changed: 1 addition & 0 deletions

````diff
@@ -0,0 +1 @@
+sentencepiece==0.2.0
````
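With several single-line requirement files in play, an installer (as invoked by the `-i` flag) needs to combine them into one deduplicated list. A minimal sketch under stated assumptions: only the `sentencepiece` and `protobuf` pins come from this commit, the other file contents are hypothetical placeholders, and the merge logic is illustrative rather than DeepPavlov's actual code:

```python
# Sketch: merge requirement lines from several per-component files,
# keeping first occurrence and dropping duplicates. Only the
# sentencepiece/protobuf pins are real; other entries are hypothetical.
files = {
    "pytorch.txt": ["torch"],                       # hypothetical content
    "transformers.txt": ["transformers"],           # hypothetical content
    "sentencepiece.txt": ["sentencepiece==0.2.0"],  # added in this commit
    "protobuf.txt": ["protobuf<=3.20"],             # added in this commit
    "extra.txt": ["transformers"],                  # duplicate, dropped below
}

merged, seen = [], set()
for name in files:                 # dicts preserve insertion order
    for line in files[name]:
        if line not in seen:
            seen.add(line)
            merged.append(line)

print(merged)
```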
