Merge branch 'release/0.3.0'
emfomy committed Jun 26, 2022
2 parents d3abf92 + b6cb39e commit e494c42
Showing 9 changed files with 82 additions and 77 deletions.
56 changes: 29 additions & 27 deletions README.rst
@@ -177,14 +177,15 @@ Model Performance
Model #Parameters Perplexity† WS (F1)‡ POS (ACC)‡ NER (F1)‡
================================ =========== =========== ======== ========== =========
ckiplab/albert-tiny-chinese 4M 4.80 96.66% 94.48% 71.17%
ckiplab/albert-base-chinese 10M 2.65 97.33% 95.30% 79.47%
ckiplab/albert-base-chinese 11M 2.65 97.33% 95.30% 79.47%
ckiplab/bert-tiny-chinese 12M 8.07 96.98% 95.11% 74.21%
ckiplab/bert-base-chinese 102M 1.88 97.60% 95.67% 81.18%
ckiplab/gpt2-base-chinese 102M 14.40 -- -- --
ckiplab/gpt2-base-chinese 102M 8.36 -- -- --
-------------------------------- ----------- ----------- -------- ---------- ---------

-------------------------------- ----------- ----------- -------- ---------- ---------
voidful/albert_chinese_tiny 4M 74.93 -- -- --
voidful/albert_chinese_base 10M 22.34 -- -- --
voidful/albert_chinese_base 11M 22.34 -- -- --
bert-base-chinese 102M 2.53 -- -- --
================================ =========== =========== ======== ========== =========

@@ -298,15 +299,15 @@ NLP Tools Usage
2. Load models
""""""""""""""

| We provide three levels (1–3) of drivers. Level 1 is the fastest, and level 3 (default) is the most accurate.
| 我們的工具分為三個等級(1—3)。等級一最快,等級三(預設值)最精準
| We provide several pretrained models for the NLP tools.
| 我們提供了一些適用於自然語言工具的預訓練的模型
.. code-block:: python

   # Initialize drivers
   ws_driver = CkipWordSegmenter(level=3)
   pos_driver = CkipPosTagger(level=3)
   ner_driver = CkipNerChunker(level=3)
   ws_driver = CkipWordSegmenter(model="bert-base")
   pos_driver = CkipPosTagger(model="bert-base")
   ner_driver = CkipNerChunker(model="bert-base")
| One may also load their own checkpoints using our drivers.
| 也可以運用我們的工具於自己訓練的模型上。
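| For instance, a minimal sketch (``model_name`` and ``device`` are documented in ``driver.py`` below; the checkpoint path is a hypothetical placeholder):

.. code-block:: python

   # Load your own fine-tuned checkpoint instead of a built-in model.
   # "path/to/your-ws-checkpoint" is a placeholder, not a real path.
   ws_driver = CkipWordSegmenter(model_name="path/to/your-ws-checkpoint")

   # device=-1 (the default) runs on CPU; a CUDA device id selects a GPU.
   pos_driver = CkipPosTagger(model="bert-base", device=0)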
@@ -354,7 +355,7 @@ NLP Tools Usage
| The POS driver will automatically segment the sentence internally using these characters ``',,。::;;!!??'`` while running the model. (The output sentences will be concatenated back.) You may set ``delim_set`` to any characters you want.
| You may set ``use_delim=False`` to disable this feature, or set ``use_delim=True`` in the WS and NER drivers to enable this feature.
| 詞性標記工具會自動用 ``',,。::;;!!??'`` 等字元在執行模型前切割句子(輸出的句子會自動接回)。可設定 ``delim_set`` 參數使用別的字元做切割。
| 另外可指定 ``use_delim=False`` 已停用此功能,或於斷詞、實體辨識時指定 ``use_delim=False`` 已啟用此功能。
| 另外可指定 ``use_delim=False`` 以停用此功能,或於斷詞、實體辨識時指定 ``use_delim=True`` 以啟用此功能。
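| A short sketch of these options (``use_delim`` and ``delim_set`` as documented in the drivers' ``__call__`` API; the delimiter choice is illustrative):

.. code-block:: python

   # POS: split only on the full stop instead of the default set.
   pos = pos_driver(ws, delim_set="。")

   # POS: disable internal splitting entirely.
   pos = pos_driver(ws, use_delim=False)

   # WS / NER: opt in to internal splitting.
   ws = ws_driver(text, use_delim=True)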
.. code-block:: python
@@ -429,18 +430,19 @@ NLP Tools Performance
CKIP Transformers v.s. Monpa & Jieba
""""""""""""""""""""""""""""""""""""

===== ======================== =========== ============= =============== ============
Level Tool WS (F1) POS (Acc) WS+POS (F1) NER (F1)
===== ======================== =========== ============= =============== ============
3 CKIP BERT Base **97.60%** **95.67%** **94.19%** **81.18%**
2 CKIP ALBERT Base 97.33% 95.30% 93.52% 79.47%
1 CKIP ALBERT Tiny 96.66% 94.48% 92.25% 71.17%
----- ------------------------ ----------- ------------- --------------- ------------
======================== =========== ============= =============== ============
Tool WS (F1) POS (Acc) WS+POS (F1) NER (F1)
======================== =========== ============= =============== ============
CKIP BERT Base **97.60%** **95.67%** **94.19%** **81.18%**
CKIP ALBERT Base 97.33% 95.30% 93.52% 79.47%
CKIP BERT Tiny 96.98% 95.08% 93.13% 74.20%
CKIP ALBERT Tiny 96.66% 94.48% 92.25% 71.17%
------------------------ ----------- ------------- --------------- ------------

----- ------------------------ ----------- ------------- --------------- ------------
-- Monpa† 92.58% -- 83.88% --
-- Jieba 81.18% -- -- --
===== ======================== =========== ============= =============== ============
------------------------ ----------- ------------- --------------- ------------
Monpa† 92.58% -- 83.88% --
Jieba 81.18% -- -- --
======================== =========== ============= =============== ============

| † Monpa provides only 3 types of tags in NER.
| † Monpa 的實體辨識僅提供三種標記而已。
@@ -451,12 +453,12 @@ CKIP Transformers v.s. CkipTagger
| The following results are tested on a different dataset.†
| 以下實驗在另一個資料集測試。†
===== ======================== =========== ============= =============== ============
Level Tool WS (F1) POS (Acc) WS+POS (F1) NER (F1)
===== ======================== =========== ============= =============== ============
3 CKIP BERT Base **97.84%** 96.46% **94.91%** **79.20%**
-- CkipTagger 97.33% **97.20%** 94.75% 77.87%
===== ======================== =========== ============= =============== ============
======================== =========== ============= =============== ============
Tool WS (F1) POS (Acc) WS+POS (F1) NER (F1)
======================== =========== ============= =============== ============
CKIP BERT Base **97.84%** 96.46% **94.91%** **79.20%**
CkipTagger 97.33% **97.20%** 94.75% 77.87%
======================== =========== ============= =============== ============

| † Here we retrained/tested our BERT model using the same dataset as CkipTagger.
| † 我們重新訓練/測試我們的 BERT 模型於跟 CkipTagger 相同的資料集。
@@ -466,7 +468,7 @@ License

|GPL-3.0|

Copyright (c) 2020 `CKIP Lab <https://ckip.iis.sinica.edu.tw>`__ under the `GPL-3.0 License <https://www.gnu.org/licenses/gpl-3.0.html>`__.
Copyright (c) 2021 `CKIP Lab <https://ckip.iis.sinica.edu.tw>`__ under the `GPL-3.0 License <https://www.gnu.org/licenses/gpl-3.0.html>`__.

.. |GPL-3.0| image:: https://www.gnu.org/graphics/gplv3-with-text-136x68.png
:target: https://www.gnu.org/licenses/gpl-3.0.html
4 changes: 2 additions & 2 deletions ckip_transformers/__init__.py
@@ -7,10 +7,10 @@

__author_name__ = "Mu Yang"
__author_email__ = "[email protected]"
__copyright__ = "2020 CKIP Lab"
__copyright__ = "2021 CKIP Lab"

__title__ = "CKIP Transformers"
__version__ = "0.2.8"
__version__ = "0.3.0"
__description__ = "CKIP Transformers"
__license__ = "GPL-3.0"

2 changes: 1 addition & 1 deletion ckip_transformers/nlp/__init__.py
@@ -6,7 +6,7 @@
"""

__author__ = "Mu Yang <http://muyang.pro>"
__copyright__ = "2020 CKIP Lab"
__copyright__ = "2021 CKIP Lab"
__license__ = "GPL-3.0"

from .driver import (
67 changes: 35 additions & 32 deletions ckip_transformers/nlp/driver.py
@@ -6,7 +6,7 @@
"""

__author__ = "Mu Yang <http://muyang.pro>"
__copyright__ = "2020 CKIP Lab"
__copyright__ = "2021 CKIP Lab"
__license__ = "GPL-3.0"

from typing import (
@@ -28,27 +28,28 @@ class CkipWordSegmenter(CkipTokenClassification):
Parameters
----------
level : ``str`` *optional*, defaults to 3, must be 1—3
    The model level. The higher the level is, the more accurate and slower the model is.
model_name : ``str`` *optional*, overwrites **level**
    The pretrained model name (e.g. ``'ckiplab/bert-base-chinese-ws'``).
device : ``int``, *optional*, defaults to -1,
model : ``str``, *optional*, defaults to ``"bert-base"``
    The pretrained model name provided by CKIP Transformers.
model_name : ``str``, *optional*, overwrites **model**
    The custom pretrained model name (e.g. ``'ckiplab/bert-base-chinese-ws'``).
device : ``int``, *optional*, defaults to -1
    Device ordinal for CPU/GPU support.
    Setting this to -1 will leverage CPU; a positive value will run the model on the associated CUDA device.
"""

_model_names = {
    1: "ckiplab/albert-tiny-chinese-ws",
    2: "ckiplab/albert-base-chinese-ws",
    3: "ckiplab/bert-base-chinese-ws",
    "albert-tiny": "ckiplab/albert-tiny-chinese-ws",
    "albert-base": "ckiplab/albert-base-chinese-ws",
    "bert-tiny": "ckiplab/bert-tiny-chinese-ws",
    "bert-base": "ckiplab/bert-base-chinese-ws",
}

def __init__(
    self,
    level: int = 3,
    model: str = "bert-base",
    **kwargs,
):
    model_name = kwargs.pop("model_name", self._get_model_name_from_level(level))
    model_name = kwargs.pop("model_name", self._get_model_name(model))
    super().__init__(model_name=model_name, **kwargs)

def __call__(
@@ -127,27 +128,28 @@ class CkipPosTagger(CkipTokenClassification):
Parameters
----------
level : ``str`` *optional*, defaults to 3, must be 1—3
    The model level. The higher the level is, the more accurate and slower the model is.
model_name : ``str`` *optional*, overwrites **level**
    The pretrained model name (e.g. ``'ckiplab/bert-base-chinese-pos'``).
device : ``int``, *optional*, defaults to -1,
model : ``str``, *optional*, defaults to ``"bert-base"``
    The pretrained model name provided by CKIP Transformers.
model_name : ``str``, *optional*, overwrites **model**
    The custom pretrained model name (e.g. ``'ckiplab/bert-base-chinese-pos'``).
device : ``int``, *optional*, defaults to -1
    Device ordinal for CPU/GPU support.
    Setting this to -1 will leverage CPU; a positive value will run the model on the associated CUDA device.
"""

_model_names = {
    1: "ckiplab/albert-tiny-chinese-pos",
    2: "ckiplab/albert-base-chinese-pos",
    3: "ckiplab/bert-base-chinese-pos",
    "albert-tiny": "ckiplab/albert-tiny-chinese-pos",
    "albert-base": "ckiplab/albert-base-chinese-pos",
    "bert-tiny": "ckiplab/bert-tiny-chinese-pos",
    "bert-base": "ckiplab/bert-base-chinese-pos",
}

def __init__(
    self,
    level: int = 3,
    model: str = "bert-base",
    **kwargs,
):
    model_name = kwargs.pop("model_name", self._get_model_name_from_level(level))
    model_name = kwargs.pop("model_name", self._get_model_name(model))
    super().__init__(model_name=model_name, **kwargs)

def __call__(
@@ -216,27 +218,28 @@ class CkipNerChunker(CkipTokenClassification):
Parameters
----------
level : ``str`` *optional*, defaults to 3, must be 1—3
    The model level. The higher the level is, the more accurate and slower the model is.
model_name : ``str`` *optional*, overwrites **level**
    The pretrained model name (e.g. ``'ckiplab/bert-base-chinese-ner'``).
device : ``int``, *optional*, defaults to -1,
model : ``str``, *optional*, defaults to ``"bert-base"``
    The pretrained model name provided by CKIP Transformers.
model_name : ``str``, *optional*, overwrites **model**
    The custom pretrained model name (e.g. ``'ckiplab/bert-base-chinese-ner'``).
device : ``int``, *optional*, defaults to -1
    Device ordinal for CPU/GPU support.
    Setting this to -1 will leverage CPU; a positive value will run the model on the associated CUDA device.
"""

_model_names = {
    1: "ckiplab/albert-tiny-chinese-ner",
    2: "ckiplab/albert-base-chinese-ner",
    3: "ckiplab/bert-base-chinese-ner",
    "albert-tiny": "ckiplab/albert-tiny-chinese-ner",
    "albert-base": "ckiplab/albert-base-chinese-ner",
    "bert-tiny": "ckiplab/bert-tiny-chinese-ner",
    "bert-base": "ckiplab/bert-base-chinese-ner",
}

def __init__(
    self,
    level: int = 3,
    model: str = "bert-base",
    **kwargs,
):
    model_name = kwargs.pop("model_name", self._get_model_name_from_level(level))
    model_name = kwargs.pop("model_name", self._get_model_name(model))
    super().__init__(model_name=model_name, **kwargs)

def __call__(
@@ -251,7 +254,7 @@ def __call__(
Parameters
----------
input_text : ``List[str]``
    The input sentences. Each sentence is a string or a list or string (words).
    The input sentences. Each sentence is a string.
use_delim : ``bool``, *optional*, defaults to False
    Segment sentences (internally) using ``delim_set``.
delim_set : ``str``, *optional*, defaults to ``',,。::;;!!??'``
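Taken together, a sketch of the constructor surface after this change (model aliases from the ``_model_names`` tables above; weights download on first use):

.. code-block:: python

   from ckip_transformers.nlp import CkipWordSegmenter, CkipPosTagger, CkipNerChunker

   # Any of the four aliases works for each driver; smaller models are faster,
   # larger ones more accurate (see the README performance tables).
   ws_driver = CkipWordSegmenter(model="bert-tiny")
   pos_driver = CkipPosTagger(model="albert-base")
   ner_driver = CkipNerChunker(model="bert-base")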
12 changes: 6 additions & 6 deletions ckip_transformers/nlp/util.py
@@ -6,7 +6,7 @@
"""

__author__ = "Mu Yang <http://muyang.pro>"
__copyright__ = "2020 CKIP Lab"
__copyright__ = "2021 CKIP Lab"
__license__ = "GPL-3.0"


@@ -51,7 +51,7 @@ class CkipTokenClassification(metaclass=ABCMeta):
    The pretrained model name (e.g. ``'ckiplab/bert-base-chinese-ws'``).
tokenizer_name : ``str``, *optional*, defaults to **model_name**
    The pretrained tokenizer name (e.g. ``'bert-base-chinese'``).
device : ``int``, *optional*, defaults to -1,
device : ``int``, *optional*, defaults to -1
    Device ordinal for CPU/GPU support.
    Setting this to -1 will leverage CPU; a positive value will run the model on the associated CUDA device.
"""
@@ -76,14 +76,14 @@ def __init__(
def _model_names(cls):
    return NotImplemented  # pragma: no cover

def _get_model_name_from_level(
def _get_model_name(
    self,
    level: int,
    model: str,
):
    try:
        model_name = self._model_names[level]
        model_name = self._model_names[model]
    except KeyError as exc:
        raise KeyError(f"Invalid level {level}") from exc
        raise KeyError(f"Invalid model {model}") from exc

    return model_name

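A brief sketch of how the renamed lookup behaves from the caller's side (aliases as listed in ``driver.py`` above; the invalid alias ``"bert-large"`` is a made-up example):

.. code-block:: python

   from ckip_transformers.nlp import CkipWordSegmenter

   # A known alias resolves to its Hugging Face model name,
   # e.g. "albert-tiny" -> "ckiplab/albert-tiny-chinese-ws".
   ws_driver = CkipWordSegmenter(model="albert-tiny")

   # An unknown alias raises KeyError("Invalid model ...") before any weights load.
   try:
       CkipWordSegmenter(model="bert-large")
   except KeyError as exc:
       print(exc)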
6 changes: 3 additions & 3 deletions example/example.py
@@ -12,11 +12,11 @@ def main():

    # Initialize drivers
    print("Initializing drivers ... WS")
    ws_driver = CkipWordSegmenter(level=3)
    ws_driver = CkipWordSegmenter(model="bert-base")
    print("Initializing drivers ... POS")
    pos_driver = CkipPosTagger(level=3)
    pos_driver = CkipPosTagger(model="bert-base")
    print("Initializing drivers ... NER")
    ner_driver = CkipNerChunker(level=3)
    ner_driver = CkipNerChunker(model="bert-base")
    print("Initializing drivers ... done")
    print()

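For reference, a minimal sketch of how the initialized drivers chain together (the call pattern mirrors the test script below; the sample sentence is illustrative):

.. code-block:: python

   text = ["傅達仁今將執行安樂死,卻突然爆出自己20年前遭緯來體育台封殺。"]

   ws = ws_driver(text)    # word segmentation on raw sentences
   pos = pos_driver(ws)    # POS tagging consumes the segmented output
   ner = ner_driver(text)  # NER runs on the raw sentences directly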
2 changes: 1 addition & 1 deletion setup.py
@@ -2,7 +2,7 @@
# -*- coding:utf-8 -*-

__author__ = "Mu Yang <http://muyang.pro>"
__copyright__ = "2020 CKIP Lab"
__copyright__ = "2021 CKIP Lab"
__license__ = "GPL-3.0"

from setuptools import setup, find_namespace_packages
2 changes: 1 addition & 1 deletion test/script/nlp/_base.py
@@ -2,7 +2,7 @@
# -*- coding:utf-8 -*-

__author__ = "Mu Yang <http://muyang.pro>"
__copyright__ = "2020 CKIP Lab"
__copyright__ = "2021 CKIP Lab"
__license__ = "GPL-3.0"

from ckip_transformers.nlp import *
8 changes: 4 additions & 4 deletions test/script/nlp/run.py
@@ -2,7 +2,7 @@
# -*- coding:utf-8 -*-

__author__ = "Mu Yang <http://muyang.pro>"
__copyright__ = "2020 CKIP Lab"
__copyright__ = "2021 CKIP Lab"
__license__ = "GPL-3.0"

from _base import *
@@ -11,7 +11,7 @@


def test_word_segmenter():
    nlp = CkipWordSegmenter(level=1)
    nlp = CkipWordSegmenter(model="albert-tiny")
    output_ws = nlp(text, show_progress=False)
    assert output_ws == ws

@@ -20,7 +20,7 @@ def test_word_segmenter():


def test_pos_tagger():
    nlp = CkipPosTagger(level=1)
    nlp = CkipPosTagger(model="albert-tiny")
    output_pos = nlp(ws, show_progress=False)
    assert output_pos == pos

@@ -29,7 +29,7 @@ def test_pos_tagger():


def test_ner_chunker():
    nlp = CkipNerChunker(level=1)
    nlp = CkipNerChunker(model="albert-tiny")
    output_ner = nlp(text, show_progress=False)
    output_ner = [[tuple(entity) for entity in sent] for sent in output_ner]
    assert output_ner == ner
