@@ -35,8 +35,8 @@ Currently this project is a work in progress and the code has not been verified yet.
pip install bert-pytorch
```
+ ## Quickstart
- ## Usage
**NOTICE: Your corpus should be prepared with two sentences per line, separated by a tab (\t); a short preparation sketch appears below, after the vocab-building command.**
```
Welcome to the \t the jungle \n
@@ -47,32 +47,16 @@ I can stay \t here all night \n
``` shell
bert-vocab -c data/corpus.small -o data/corpus.small.vocab
```
- ``` shell
- usage: bert-vocab [-h] -c CORPUS_PATH -o OUTPUT_PATH [-s VOCAB_SIZE]
-                   [-e ENCODING] [-m MIN_FREQ]
- ```
+
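As a concrete illustration of the tab-separated corpus format described in the NOTICE above, here is a minimal Python sketch that writes such a file. The sentence pairs and output path are the examples used on this page; the script itself is illustrative and not part of this package:

``` python
# Sketch: write a toy corpus in the "first sentence \t second sentence \n"
# format expected by bert-vocab and bert-dataset.
pairs = [
    ("Welcome to the", "the jungle"),
    ("I can stay", "here all night"),
]

with open("data/corpus.small", "w", encoding="utf-8") as f:
    for first, second in pairs:
        f.write(f"{first}\t{second}\n")  # one tab-separated pair per line
```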
### 2. Building BERT train dataset with your corpus
``` shell
bert-dataset -d data/corpus.small -v data/corpus.small.vocab -o data/dataset.small
```
- ``` shell
- usage: bert-dataset [-h] -v VOCAB_PATH -c CORPUS_PATH [-e ENCODING] -o
-                     OUTPUT_PATH [-w WORKERS]
- ```
-
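For orientation: BERT pre-training corrupts its input with the standard masking rule from the BERT paper (15% of tokens are selected; of those, 80% become [MASK], 10% are replaced by a random token, and 10% are left unchanged). A minimal sketch of that rule, using placeholder string tokens rather than this package's actual code:

``` python
import random

MASK_TOKEN = "[MASK]"  # placeholder; a real vocab uses integer ids

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """Standard BERT masking rule (sketch): select ~15% of positions;
    of those, 80% become [MASK], 10% a random token, 10% unchanged."""
    output, labels = [], []
    for token in tokens:
        if random.random() < mask_prob:
            labels.append(token)           # prediction target: original token
            roll = random.random()
            if roll < 0.8:
                output.append(MASK_TOKEN)  # 80%: replace with [MASK]
            elif roll < 0.9:
                output.append(random.choice(vocab))  # 10%: random token
            else:
                output.append(token)       # 10%: keep the original token
        else:
            output.append(token)
            labels.append(None)            # no loss at unmasked positions
    return output, labels
```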
### 3. Train your own BERT model
``` shell
bert -d data/dataset.small -v data/corpus.small.vocab -o output/
```
- ``` shell
- usage: bert [-h] -d TRAIN_DATASET [-t TEST_DATASET] -v VOCAB_PATH -o
-             OUTPUT_DIR [-hs HIDDEN] [-n LAYERS] [-a ATTN_HEADS] [-s SEQ_LEN]
-             [-b BATCH_SIZE] [-e EPOCHS] [-w NUM_WORKERS]
-             [--corpus_lines CORPUS_LINES] [--lr LR]
-             [--adam_weight_decay ADAM_WEIGHT_DECAY] [--adam_beta1 ADAM_BETA1]
-             [--adam_beta2 ADAM_BETA2] [--log_freq LOG_FREQ] [-c CUDA]
- ```
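All three console scripts are argparse-based CLIs (note the [-h] flag in their usage strings), so the full, up-to-date flag reference for each of them (hidden size, layers, attention heads, batch size, learning rate, and so on) can always be printed locally:

``` shell
bert-vocab --help
bert-dataset --help
bert --help
```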
## Language Model Pre-training
@@ -119,7 +103,6 @@ not directly captured by language modeling
2. For the other 50%, the next sentence is a randomly chosen, unrelated sentence (see the sketch below).
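A minimal sketch of this 50/50 next-sentence sampling, assuming the corpus has already been loaded as a list of (first, second) sentence pairs; the function name and structure are illustrative, not this package's actual code:

``` python
import random

def sample_next_sentence(pairs, index):
    """50/50 next-sentence sampling (sketch)."""
    first, second = pairs[index]
    if random.random() < 0.5:
        return first, second, 1  # true pair -> label "is next"
    # Otherwise swap in the second sentence of a random line; real code
    # would also make sure the random line is not the current one.
    other = random.randrange(len(pairs))
    return first, pairs[other][1], 0  # unrelated pair -> label "not next"
```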
-
## Author