julia> tk = tokenizer(spm, "i love the julia language") # or tk = spm("i love the julia language")
 "▁julia"
 "▁language"

julia> subword = tokenizer(spm, "unfriendly")
2-element Array{String,1}:
 "▁un"
 "friendly"

julia> para = spm("Julia is a high-level, high-performance dynamic language for technical computing")
17-element Array{String,1}:
 "▁"
 "J"
Indices are usually used as input for deep learning models. The indices of the special tokens in ALBERT are given below:

1 ⇒ [PAD]
2 ⇒ [UNK]
3 ⇒ [CLS]
4 ⇒ [SEP]
5 ⇒ [MASK]
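As a sketch of what this token-to-index mapping implies, the lookup below reserves indices 1–5 for the special tokens and falls back to `[UNK]` for out-of-vocabulary pieces. The `special` and `vocab` dictionaries and the `lookup` helper are illustrative only, not part of the package's API:

```julia
# Toy illustration only: the real vocabulary is loaded from the ALBERT
# sentencepiece model file; these dictionaries are made up for the example.
special = Dict("[PAD]" => 1, "[UNK]" => 2, "[CLS]" => 3, "[SEP]" => 4, "[MASK]" => 5)
vocab   = Dict("▁un" => 100, "friendly" => 200)

# Unknown tokens fall back to [UNK] (index 2), as in most subword vocabularies.
lookup(tok) = get(special, tok, get(vocab, tok, special["[UNK]"]))

lookup.(["[CLS]", "▁un", "friendly", "[SEP]"])  # [3, 100, 200, 4]
```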

```julia
julia> ids_from_tokens(spm, tk)
4-element Array{Int64,1}:
  32
 340
 817

# we can also get sentences back from tokens
julia> sentence_from_tokens(tk)
"i love the julia language"

julia> sentence_from_tokens(subword)
"unfriendly"

julia> sentence_from_tokens(para)
"Julia is a high-level, high-performance dynamic language for technical computing"
```
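Conceptually, the reverse direction works by joining the subword pieces and turning the ▁ word markers back into spaces. A minimal sketch of this SentencePiece-style detokenization (the `detok` helper is hypothetical, not the package's implementation):

```julia
# Hypothetical sketch: concatenate the pieces, map the ▁ word marker
# back to a space, and strip the leading space before the first word.
detok(tokens) = lstrip(replace(join(tokens), '▁' => ' '))

detok(["▁un", "friendly"])                             # "unfriendly"
detok(["▁i", "▁love", "▁the", "▁julia", "▁language"])  # "i love the julia language"
```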
## Contributing