kanggo-sw/KoBART

This branch is 4 commits ahead of, 1 commit behind SKT-AI/KoBART:main.


🀣 KoBART

BART (Bidirectional and Auto-Regressive Transformers) is trained as a denoising autoencoder: noise is added to part of the input text, and the model learns to reconstruct the original. Korean BART (KoBART) is a Korean encoder-decoder language model trained on more than 40GB of Korean text using the Text Infilling noise function from the BART paper. We release the resulting KoBART-base model.

[Figure: BART architecture]
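As a rough illustration of the Text Infilling objective (this is not the actual training code; the real noising operates on subword tokens and samples span lengths from a Poisson(3) distribution), corrupting an input might look like:

```python
import random

def text_infilling(tokens, mask_token="<mask>", mask_ratio=0.3, seed=0):
    """Corrupt a token sequence BART-style: replace random contiguous
    spans with a single mask token. The model is trained to reconstruct
    the original, uncorrupted sequence from this input.
    Illustrative sketch only; real BART samples span lengths ~ Poisson(3)."""
    rng = random.Random(seed)
    out = []
    i = 0
    while i < len(tokens):
        if rng.random() < mask_ratio:
            span = rng.randint(1, 3)  # length of the span hidden behind one mask
            out.append(mask_token)
            i += span
        else:
            out.append(tokens[i])
            i += 1
    return out

corrupted = text_infilling("the quick brown fox jumps over the lazy dog".split())
```

Note that one mask token can hide a multi-token span, so the model must also predict how many tokens are missing.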

How to install

pip install git+https://github.com/kanggo-sw/KoBART#egg=kobart

Data

Data # of Sentences
Korean Wiki 5M
Other corpus 0.27B

In addition to Korean Wikipedia, a variety of data was used for training, including news, books, λͺ¨λ‘μ˜ λ§λ­‰μΉ˜ (Modu Corpus) v1.0 (dialogue, news, ...), and Blue House National Petition texts.

Tokenizer

The tokenizer was trained with the Character BPE tokenizer from the tokenizers package.
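For intuition, Character BPE starts from individual characters and repeatedly merges the most frequent adjacent pair of symbols; a toy sketch of the merge-learning loop (the tokenizers package implements this far more efficiently, with frequency cutoffs and special-token handling):

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Learn character-level BPE merges: repeatedly merge the most
    frequent adjacent symbol pair across the corpus.
    Toy illustration of what CharBPETokenizer training does."""
    corpus = [list(w) for w in words]  # start from individual characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in corpus:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break  # every word collapsed to a single symbol
        (a, b), _count = pairs.most_common(1)[0]
        merges.append(a + b)
        # apply the new merge everywhere in the corpus
        new_corpus = []
        for w in corpus:
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and w[i] == a and w[i + 1] == b:
                    out.append(a + b)
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            new_corpus.append(out)
        corpus = new_corpus
    return merges
```

The learned merge list, applied in order, is what turns raw characters into the subword units seen in the tokenizer output below.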

The vocabulary size is 30,000, and emoticons and emojis frequently used in dialogue, such as those below, were added to improve the tokenizer's handling of these tokens.

πŸ˜€, 😁, πŸ˜†, πŸ˜…, 🀣, .. , :-), :), -), (-:...

λ˜ν•œ <unused0> ~ <unused99>λ“±μ˜ λ―Έμ‚¬μš© 토큰을 μ •μ˜ν•΄, ν•„μš”ν•œ subtasks에 따라 자유둭게 μ •μ˜ν•΄ μ‚¬μš©ν•  수 있게 ν–ˆμŠ΅λ‹ˆλ‹€.

>>> from kobart import get_kobart_tokenizer
>>> kobart_tokenizer = get_kobart_tokenizer()
>>> kobart_tokenizer.tokenize("μ•ˆλ…•ν•˜μ„Έμš”. ν•œκ΅­μ–΄ BART μž…λ‹ˆλ‹€.🀣:)l^o")
['β–μ•ˆλ…•ν•˜', 'μ„Έμš”.', 'β–ν•œκ΅­μ–΄', '▁B', 'A', 'R', 'T', 'β–μž…', 'λ‹ˆλ‹€.', '🀣', ':)', 'l^o']

Model

Model        # of params  Type     # of layers  # of heads  ffn_dim  hidden_dims
KoBART-base  124M         Encoder  6            16          3072     768
                          Decoder  6            16          3072     768
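The 124M figure is consistent with the table: a back-of-the-envelope parameter count from the listed dimensions lands near 124M. The shared 30,000-token embedding matrix and the learned position embeddings of length ~1,026 per side are our assumptions (following Hugging Face's BART layout), not stated in the table:

```python
# Architecture numbers from the table above; max_pos is an assumption.
d, ffn, layers, vocab, max_pos = 768, 3072, 6, 30000, 1026

attn = 4 * (d * d + d)          # Q, K, V, output projections (+ biases)
ffwd = 2 * d * ffn + ffn + d    # two feed-forward linear layers (+ biases)
ln = 2 * d                      # one LayerNorm (weight + bias)

enc_layer = attn + ffwd + 2 * ln        # self-attention + FFN, 2 LayerNorms
dec_layer = 2 * attn + ffwd + 3 * ln    # adds cross-attention + its LayerNorm

# Shared token embeddings, per-side position embeddings, embedding LayerNorms.
embeddings = vocab * d + 2 * max_pos * d + 2 * ln

total = layers * (enc_layer + dec_layer) + embeddings
print(f"{total / 1e6:.1f}M")  # close to the reported 124M
```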
>>> from transformers import BartModel
>>> from kobart import get_pytorch_kobart_model, get_kobart_tokenizer
>>> kobart_tokenizer = get_kobart_tokenizer()
>>> model = BartModel.from_pretrained(get_pytorch_kobart_model())
>>> inputs = kobart_tokenizer(['μ•ˆλ…•ν•˜μ„Έμš”.'], return_tensors='pt')
>>> model(inputs['input_ids'])
Seq2SeqModelOutput(last_hidden_state=tensor([[[-0.4418, -4.3673,  3.2404,  ...,  5.8832,  4.0629,  3.5540],
         [-0.1316, -4.6446,  2.5955,  ...,  6.0093,  2.7467,  3.0007]]],
       grad_fn=<NativeLayerNormBackward>), past_key_values=((tensor([[[[-9.7980e-02, -6.6584e-01, -1.8089e+00,  ...,  9.6023e-01, -1.8818e-01, -1.3252e+00],

Performances

Classification or Regression

Model        NSMC (acc)  KorSTS (spearman)  Question Pair (acc)
KoBART-base  90.24       81.66              94.34

Summarization

  • μ—…λ°μ΄νŠΈ μ˜ˆμ • *

Demos

The demo above shows the result of summarizing a ZDNET article.

Examples

If you have an interesting example that uses KoBART, please send a PR!

Release

  • v0.5.1
    • guide default 'import statements'
  • v0.5
    • download large files from aws s3
  • v0.4
    • Update model binary
  • v0.3
    • ν† ν¬λ‚˜μ΄μ € λ²„κ·Έλ‘œ 인해 <unk> 토큰이 μ‚¬λΌμ§€λŠ” 이슈 ν•΄κ²°
  • v0.2
    • KoBART λͺ¨λΈ μ—…λ°μ΄νŠΈ(μ„œλΈŒν…ŒμŠ€νŠΈ sample efficientκ°€ 쒋아짐)
    • λͺ¨λ‘μ˜ λ§λ­‰μΉ˜ μ‚¬μš© 버전 λͺ…μ‹œ
    • downloder 버그 μˆ˜μ •
    • pip μ„€μΉ˜ 지원

Contacts

Please post KoBART-related issues here.

License

KoBART is released under a modified MIT license. Please comply with the license terms when using the model and code. The full license text is available in the LICENSE file.
