Merge branch 'release/0.3.0'
emfomy committed Jun 26, 2022
2 parents d3abf92 + b6cb39e commit e494c42
Showing 9 changed files with 82 additions and 77 deletions.
56 changes: 29 additions & 27 deletions README.rst
@@ -177,14 +177,15 @@ Model Performance
Model #Parameters Perplexity† WS (F1)‡ POS (ACC)‡ NER (F1)‡
================================ =========== =========== ======== ========== =========
ckiplab/albert-tiny-chinese 4M 4.80 96.66% 94.48% 71.17%
ckiplab/albert-base-chinese 10M 2.65 97.33% 95.30% 79.47%
ckiplab/albert-base-chinese 11M 2.65 97.33% 95.30% 79.47%
ckiplab/bert-tiny-chinese 12M 8.07 96.98% 95.11% 74.21%
ckiplab/bert-base-chinese 102M 1.88 97.60% 95.67% 81.18%
ckiplab/gpt2-base-chinese 102M 14.40 -- -- --
ckiplab/gpt2-base-chinese 102M 8.36 -- -- --
-------------------------------- ----------- ----------- -------- ---------- ---------

-------------------------------- ----------- ----------- -------- ---------- ---------
voidful/albert_chinese_tiny 4M 74.93 -- -- --
voidful/albert_chinese_base 10M 22.34 -- -- --
voidful/albert_chinese_base 11M 22.34 -- -- --
bert-base-chinese 102M 2.53 -- -- --
================================ =========== =========== ======== ========== =========

@@ -298,15 +299,15 @@ NLP Tools Usage
2. Load models
""""""""""""""

| We provide three levels (1–3) of drivers. Level 1 is the fastest, and level 3 (default) is the most accurate.
| 我們的工具分為三個等級(1—3)。等級一最快,等級三(預設值)最精準
| We provide several pretrained models for the NLP tools.
| 我們提供了一些適用於自然語言工具的預訓練的模型
.. code-block:: python

   # Initialize drivers
   ws_driver = CkipWordSegmenter(level=3)
   pos_driver = CkipPosTagger(level=3)
   ner_driver = CkipNerChunker(level=3)
   ws_driver = CkipWordSegmenter(model="bert-base")
   pos_driver = CkipPosTagger(model="bert-base")
   ner_driver = CkipNerChunker(model="bert-base")
| One may also load their own checkpoints using our drivers.
| 也可以運用我們的工具於自己訓練的模型上。
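| For instance, a minimal sketch (``model_name`` and ``device`` are documented in ``driver.py`` below; the checkpoint path is a hypothetical placeholder):

.. code-block:: python

   # Load your own fine-tuned checkpoint instead of a built-in model.
   # "path/to/your-ws-checkpoint" is a placeholder, not a real path.
   ws_driver = CkipWordSegmenter(model_name="path/to/your-ws-checkpoint")

   # device=-1 (the default) runs on CPU; a CUDA device id selects a GPU.
   pos_driver = CkipPosTagger(model="bert-base", device=0)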
@@ -354,7 +355,7 @@ NLP Tools Usage
| The POS driver will automatically segment the sentence internally using these characters ``',,。::;;!!??'`` while running the model. (The output sentences will be concatenated back.) You may set ``delim_set`` to any characters you want.
| You may set ``use_delim=False`` to disable this feature, or set ``use_delim=True`` in the WS and NER drivers to enable this feature.
| 詞性標記工具會自動用 ``',,。::;;!!??'`` 等字元在執行模型前切割句子(輸出的句子會自動接回)。可設定 ``delim_set`` 參數使用別的字元做切割。
| 另外可指定 ``use_delim=False`` 已停用此功能,或於斷詞、實體辨識時指定 ``use_delim=False`` 已啟用此功能。
| 另外可指定 ``use_delim=False`` 以停用此功能,或於斷詞、實體辨識時指定 ``use_delim=True`` 以啟用此功能。
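| A short sketch of these options (``use_delim`` and ``delim_set`` as documented in the drivers' ``__call__`` API; the delimiter choice is illustrative):

.. code-block:: python

   # POS: split only on the full stop instead of the default set.
   pos = pos_driver(ws, delim_set="。")

   # POS: disable internal splitting entirely.
   pos = pos_driver(ws, use_delim=False)

   # WS / NER: opt in to internal splitting.
   ws = ws_driver(text, use_delim=True)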
.. code-block:: python
@@ -429,18 +430,19 @@ NLP Tools Performance
CKIP Transformers v.s. Monpa & Jieba
""""""""""""""""""""""""""""""""""""

===== ======================== =========== ============= =============== ============
Level Tool WS (F1) POS (Acc) WS+POS (F1) NER (F1)
===== ======================== =========== ============= =============== ============
3 CKIP BERT Base **97.60%** **95.67%** **94.19%** **81.18%**
2 CKIP ALBERT Base 97.33% 95.30% 93.52% 79.47%
1 CKIP ALBERT Tiny 96.66% 94.48% 92.25% 71.17%
----- ------------------------ ----------- ------------- --------------- ------------
======================== =========== ============= =============== ============
Tool WS (F1) POS (Acc) WS+POS (F1) NER (F1)
======================== =========== ============= =============== ============
CKIP BERT Base **97.60%** **95.67%** **94.19%** **81.18%**
CKIP ALBERT Base 97.33% 95.30% 93.52% 79.47%
CKIP BERT Tiny 96.98% 95.08% 93.13% 74.20%
CKIP ALBERT Tiny 96.66% 94.48% 92.25% 71.17%
------------------------ ----------- ------------- --------------- ------------

----- ------------------------ ----------- ------------- --------------- ------------
-- Monpa† 92.58% -- 83.88% --
-- Jieba 81.18% -- -- --
===== ======================== =========== ============= =============== ============
------------------------ ----------- ------------- --------------- ------------
Monpa† 92.58% -- 83.88% --
Jieba 81.18% -- -- --
======================== =========== ============= =============== ============

| † Monpa provides only 3 types of tags in NER.
| † Monpa 的實體辨識僅提供三種標記而已。
@@ -451,12 +453,12 @@ CKIP Transformers v.s. CkipTagger
| The following results are tested on a different dataset.†
| 以下實驗在另一個資料集測試。†
===== ======================== =========== ============= =============== ============
Level Tool WS (F1) POS (Acc) WS+POS (F1) NER (F1)
===== ======================== =========== ============= =============== ============
3 CKIP BERT Base **97.84%** 96.46% **94.91%** **79.20%**
-- CkipTagger 97.33% **97.20%** 94.75% 77.87%
===== ======================== =========== ============= =============== ============
======================== =========== ============= =============== ============
Tool WS (F1) POS (Acc) WS+POS (F1) NER (F1)
======================== =========== ============= =============== ============
CKIP BERT Base **97.84%** 96.46% **94.91%** **79.20%**
CkipTagger 97.33% **97.20%** 94.75% 77.87%
======================== =========== ============= =============== ============

| † Here we retrained/tested our BERT model using the same dataset as CkipTagger.
| † 我們重新訓練/測試我們的 BERT 模型於跟 CkipTagger 相同的資料集。
@@ -466,7 +468,7 @@ License

|GPL-3.0|

Copyright (c) 2020 `CKIP Lab <https://ckip.iis.sinica.edu.tw>`__ under the `GPL-3.0 License <https://www.gnu.org/licenses/gpl-3.0.html>`__.
Copyright (c) 2021 `CKIP Lab <https://ckip.iis.sinica.edu.tw>`__ under the `GPL-3.0 License <https://www.gnu.org/licenses/gpl-3.0.html>`__.

.. |GPL-3.0| image:: https://www.gnu.org/graphics/gplv3-with-text-136x68.png
:target: https://www.gnu.org/licenses/gpl-3.0.html
4 changes: 2 additions & 2 deletions ckip_transformers/__init__.py
@@ -7,10 +7,10 @@

__author_name__ = "Mu Yang"
__author_email__ = "[email protected]"
__copyright__ = "2020 CKIP Lab"
__copyright__ = "2021 CKIP Lab"

__title__ = "CKIP Transformers"
__version__ = "0.2.8"
__version__ = "0.3.0"
__description__ = "CKIP Transformers"
__license__ = "GPL-3.0"

2 changes: 1 addition & 1 deletion ckip_transformers/nlp/__init__.py
@@ -6,7 +6,7 @@
"""

__author__ = "Mu Yang <http://muyang.pro>"
__copyright__ = "2020 CKIP Lab"
__copyright__ = "2021 CKIP Lab"
__license__ = "GPL-3.0"

from .driver import (
67 changes: 35 additions & 32 deletions ckip_transformers/nlp/driver.py
@@ -6,7 +6,7 @@
"""

__author__ = "Mu Yang <http://muyang.pro>"
__copyright__ = "2020 CKIP Lab"
__copyright__ = "2021 CKIP Lab"
__license__ = "GPL-3.0"

from typing import (
@@ -28,27 +28,28 @@ class CkipWordSegmenter(CkipTokenClassification):
Parameters
----------
level : ``str`` *optional*, defaults to 3, must be 1—3
    The model level. The higher the level is, the more accurate and slower the model is.
model_name : ``str`` *optional*, overwrites **level**
    The pretrained model name (e.g. ``'ckiplab/bert-base-chinese-ws'``).
device : ``int``, *optional*, defaults to -1,
model : ``str``, *optional*, defaults to ``"bert-base"``
    The pretrained model name provided by CKIP Transformers.
model_name : ``str``, *optional*, overwrites **model**
    The custom pretrained model name (e.g. ``'ckiplab/bert-base-chinese-ws'``).
device : ``int``, *optional*, defaults to -1
    Device ordinal for CPU/GPU support.
    Setting this to -1 will leverage CPU; a positive value will run the model on the associated CUDA device.
"""

_model_names = {
    1: "ckiplab/albert-tiny-chinese-ws",
    2: "ckiplab/albert-base-chinese-ws",
    3: "ckiplab/bert-base-chinese-ws",
    "albert-tiny": "ckiplab/albert-tiny-chinese-ws",
    "albert-base": "ckiplab/albert-base-chinese-ws",
    "bert-tiny": "ckiplab/bert-tiny-chinese-ws",
    "bert-base": "ckiplab/bert-base-chinese-ws",
}

def __init__(
    self,
    level: int = 3,
    model: str = "bert-base",
    **kwargs,
):
    model_name = kwargs.pop("model_name", self._get_model_name_from_level(level))
    model_name = kwargs.pop("model_name", self._get_model_name(model))
    super().__init__(model_name=model_name, **kwargs)

def __call__(
@@ -127,27 +128,28 @@ class CkipPosTagger(CkipTokenClassification):
Parameters
----------
level : ``str`` *optional*, defaults to 3, must be 1—3
    The model level. The higher the level is, the more accurate and slower the model is.
model_name : ``str`` *optional*, overwrites **level**
    The pretrained model name (e.g. ``'ckiplab/bert-base-chinese-pos'``).
device : ``int``, *optional*, defaults to -1,
model : ``str``, *optional*, defaults to ``"bert-base"``
    The pretrained model name provided by CKIP Transformers.
model_name : ``str``, *optional*, overwrites **model**
    The custom pretrained model name (e.g. ``'ckiplab/bert-base-chinese-pos'``).
device : ``int``, *optional*, defaults to -1
    Device ordinal for CPU/GPU support.
    Setting this to -1 will leverage CPU; a positive value will run the model on the associated CUDA device.
"""

_model_names = {
    1: "ckiplab/albert-tiny-chinese-pos",
    2: "ckiplab/albert-base-chinese-pos",
    3: "ckiplab/bert-base-chinese-pos",
    "albert-tiny": "ckiplab/albert-tiny-chinese-pos",
    "albert-base": "ckiplab/albert-base-chinese-pos",
    "bert-tiny": "ckiplab/bert-tiny-chinese-pos",
    "bert-base": "ckiplab/bert-base-chinese-pos",
}

def __init__(
    self,
    level: int = 3,
    model: str = "bert-base",
    **kwargs,
):
    model_name = kwargs.pop("model_name", self._get_model_name_from_level(level))
    model_name = kwargs.pop("model_name", self._get_model_name(model))
    super().__init__(model_name=model_name, **kwargs)

def __call__(
@@ -216,27 +218,28 @@ class CkipNerChunker(CkipTokenClassification):
Parameters
----------
level : ``str`` *optional*, defaults to 3, must be 1—3
    The model level. The higher the level is, the more accurate and slower the model is.
model_name : ``str`` *optional*, overwrites **level**
    The pretrained model name (e.g. ``'ckiplab/bert-base-chinese-ner'``).
device : ``int``, *optional*, defaults to -1,
model : ``str``, *optional*, defaults to ``"bert-base"``
    The pretrained model name provided by CKIP Transformers.
model_name : ``str``, *optional*, overwrites **model**
    The custom pretrained model name (e.g. ``'ckiplab/bert-base-chinese-ner'``).
device : ``int``, *optional*, defaults to -1
    Device ordinal for CPU/GPU support.
    Setting this to -1 will leverage CPU; a positive value will run the model on the associated CUDA device.
"""

_model_names = {
    1: "ckiplab/albert-tiny-chinese-ner",
    2: "ckiplab/albert-base-chinese-ner",
    3: "ckiplab/bert-base-chinese-ner",
    "albert-tiny": "ckiplab/albert-tiny-chinese-ner",
    "albert-base": "ckiplab/albert-base-chinese-ner",
    "bert-tiny": "ckiplab/bert-tiny-chinese-ner",
    "bert-base": "ckiplab/bert-base-chinese-ner",
}

def __init__(
    self,
    level: int = 3,
    model: str = "bert-base",
    **kwargs,
):
    model_name = kwargs.pop("model_name", self._get_model_name_from_level(level))
    model_name = kwargs.pop("model_name", self._get_model_name(model))
    super().__init__(model_name=model_name, **kwargs)

def __call__(
@@ -251,7 +254,7 @@ def __call__(
Parameters
----------
input_text : ``List[str]``
    The input sentences. Each sentence is a string or a list or string (words).
    The input sentences. Each sentence is a string.
use_delim : ``bool``, *optional*, defaults to False
    Segment sentences (internally) using ``delim_set``.
delim_set : ``str``, *optional*, defaults to ``',,。::;;!!??'``
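Taken together, a sketch of the constructor surface after this change (model aliases from the ``_model_names`` tables above; weights download on first use):

.. code-block:: python

   from ckip_transformers.nlp import CkipWordSegmenter, CkipPosTagger, CkipNerChunker

   # Any of the four aliases works for each driver; smaller models are faster,
   # larger ones more accurate (see the README performance tables).
   ws_driver = CkipWordSegmenter(model="bert-tiny")
   pos_driver = CkipPosTagger(model="albert-base")
   ner_driver = CkipNerChunker(model="bert-base")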
12 changes: 6 additions & 6 deletions ckip_transformers/nlp/util.py
@@ -6,7 +6,7 @@
"""

__author__ = "Mu Yang <http://muyang.pro>"
__copyright__ = "2020 CKIP Lab"
__copyright__ = "2021 CKIP Lab"
__license__ = "GPL-3.0"


@@ -51,7 +51,7 @@ class CkipTokenClassification(metaclass=ABCMeta):
    The pretrained model name (e.g. ``'ckiplab/bert-base-chinese-ws'``).
tokenizer_name : ``str``, *optional*, defaults to **model_name**
    The pretrained tokenizer name (e.g. ``'bert-base-chinese'``).
device : ``int``, *optional*, defaults to -1,
device : ``int``, *optional*, defaults to -1
    Device ordinal for CPU/GPU support.
    Setting this to -1 will leverage CPU; a positive value will run the model on the associated CUDA device.
"""
@@ -76,14 +76,14 @@ def __init__(
def _model_names(cls):
    return NotImplemented  # pragma: no cover

def _get_model_name_from_level(
def _get_model_name(
    self,
    level: int,
    model: str,
):
    try:
        model_name = self._model_names[level]
        model_name = self._model_names[model]
    except KeyError as exc:
        raise KeyError(f"Invalid level {level}") from exc
        raise KeyError(f"Invalid model {model}") from exc

    return model_name

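A brief sketch of how the renamed lookup behaves from the caller's side (aliases as listed in ``driver.py`` above; the invalid alias ``"bert-large"`` is a made-up example):

.. code-block:: python

   from ckip_transformers.nlp import CkipWordSegmenter

   # A known alias resolves to its Hugging Face model name,
   # e.g. "albert-tiny" -> "ckiplab/albert-tiny-chinese-ws".
   ws_driver = CkipWordSegmenter(model="albert-tiny")

   # An unknown alias raises KeyError("Invalid model ...") before any weights load.
   try:
       CkipWordSegmenter(model="bert-large")
   except KeyError as exc:
       print(exc)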
6 changes: 3 additions & 3 deletions example/example.py
@@ -12,11 +12,11 @@ def main():

    # Initialize drivers
    print("Initializing drivers ... WS")
    ws_driver = CkipWordSegmenter(level=3)
    ws_driver = CkipWordSegmenter(model="bert-base")
    print("Initializing drivers ... POS")
    pos_driver = CkipPosTagger(level=3)
    pos_driver = CkipPosTagger(model="bert-base")
    print("Initializing drivers ... NER")
    ner_driver = CkipNerChunker(level=3)
    ner_driver = CkipNerChunker(model="bert-base")
    print("Initializing drivers ... done")
    print()

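For reference, a minimal sketch of how the initialized drivers chain together (the call pattern mirrors the test script below; the sample sentence is illustrative):

.. code-block:: python

   text = ["傅達仁今將執行安樂死,卻突然爆出自己20年前遭緯來體育台封殺。"]

   ws = ws_driver(text)    # word segmentation on raw sentences
   pos = pos_driver(ws)    # POS tagging consumes the segmented output
   ner = ner_driver(text)  # NER runs on the raw sentences directly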
2 changes: 1 addition & 1 deletion setup.py
@@ -2,7 +2,7 @@
# -*- coding:utf-8 -*-

__author__ = "Mu Yang <http://muyang.pro>"
__copyright__ = "2020 CKIP Lab"
__copyright__ = "2021 CKIP Lab"
__license__ = "GPL-3.0"

from setuptools import setup, find_namespace_packages
2 changes: 1 addition & 1 deletion test/script/nlp/_base.py
@@ -2,7 +2,7 @@
# -*- coding:utf-8 -*-

__author__ = "Mu Yang <http://muyang.pro>"
__copyright__ = "2020 CKIP Lab"
__copyright__ = "2021 CKIP Lab"
__license__ = "GPL-3.0"

from ckip_transformers.nlp import *
8 changes: 4 additions & 4 deletions test/script/nlp/run.py
@@ -2,7 +2,7 @@
# -*- coding:utf-8 -*-

__author__ = "Mu Yang <http://muyang.pro>"
__copyright__ = "2020 CKIP Lab"
__copyright__ = "2021 CKIP Lab"
__license__ = "GPL-3.0"

from _base import *
@@ -11,7 +11,7 @@


def test_word_segmenter():
    nlp = CkipWordSegmenter(level=1)
    nlp = CkipWordSegmenter(model="albert-tiny")
    output_ws = nlp(text, show_progress=False)
    assert output_ws == ws

@@ -20,7 +20,7 @@ def test_word_segmenter():


def test_pos_tagger():
    nlp = CkipPosTagger(level=1)
    nlp = CkipPosTagger(model="albert-tiny")
    output_pos = nlp(ws, show_progress=False)
    assert output_pos == pos

@@ -29,7 +29,7 @@ def test_pos_tagger():


def test_ner_chunker():
    nlp = CkipNerChunker(level=1)
    nlp = CkipNerChunker(model="albert-tiny")
    output_ner = nlp(text, show_progress=False)
    output_ner = [[tuple(entity) for entity in sent] for sent in output_ner]
    assert output_ner == ner
