Commit 45fe78e

Author: YoctoHan
feat(model): add new Aixcoder model implementation with tokenizer and validation
- Implemented the Aixcoder model architecture
- Added custom tokenization logic
- Completed initial validation tests (pre-training verification)
- Prepared for upcoming training and fine-tuning validation
- Documented usage in README for model reproduction
1 parent d9e0dbc commit 45fe78e

12 files changed: +2449 -0 lines changed

examples/aiXcoder-7B/full.yaml

Lines changed: 51 additions & 0 deletions
@@ -0,0 +1,51 @@
+### data
+train_dataset_type: erniekit
+eval_dataset_type: erniekit
+train_dataset_path: /workspace/pretrainning/data/pt/train_sft.jsonl
+train_dataset_prob: "1.0"
+eval_dataset_path: /workspace/pretrainning/data/pt/eval_sft.jsonl
+eval_dataset_prob: "1.0"
+max_seq_len: 8192
+num_samples_each_epoch: 6000000
+packing: false
+mix_strategy: concat
+
+### model
+model_name_or_path: /workspace/aiXcoder-7B
+attn_impl: flashmask
+
+### finetuning
+# base
+stage: SFT
+fine_tuning: full
+seed: 23
+do_train: true
+do_eval: true
+per_device_eval_batch_size: 1
+per_device_train_batch_size: 1
+num_train_epochs: 1
+max_steps: -1
+eval_steps: 100
+evaluation_strategy: steps
+save_steps: 100
+save_total_limit: 1
+save_strategy: steps
+logging_steps: 1
+gradient_accumulation_steps: 4
+logging_dir: /workspace/pretrainning/vdl_log
+output_dir: /workspace/pretrainning/checkpoints/aixcoder-7b-base-pd-converted_sft_ckpts
+disable_tqdm: true
+eval_accumulation_steps: 16
+
+# train
+warmup_steps: 20
+learning_rate: 1.0e-5
+
+# performance
+tensor_parallel_degree: 1
+pipeline_parallel_degree: 1
+sharding: stage2
+recompute: true
+bf16: true
+fp16_opt_level: O2
+unified_checkpoint: true
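
As a rough illustration of how the batch-related settings in this config combine, the Python sketch below estimates the number of samples and the upper bound on tokens consumed per optimizer step. It is not taken from the commit: the helper names and the 8-device `WORLD_SIZE` are hypothetical, and it assumes that every rank not used for tensor or pipeline parallelism acts as a data-parallel (or sharded data-parallel) replica.

```python
def data_parallel_degree(world_size: int, tensor_parallel: int, pipeline_parallel: int) -> int:
    """Ranks left over for data-parallel replicas after model parallelism."""
    model_parallel = tensor_parallel * pipeline_parallel
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by "
            f"tensor_parallel * pipeline_parallel = {model_parallel}"
        )
    return world_size // model_parallel


def samples_per_optimizer_step(per_device_batch: int, grad_accum: int, dp_degree: int) -> int:
    """Samples contributing to one weight update across all replicas."""
    return per_device_batch * grad_accum * dp_degree


# Values copied from the config above.
cfg = {
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 4,
    "tensor_parallel_degree": 1,
    "pipeline_parallel_degree": 1,
    "max_seq_len": 8192,
}

# Hypothetical device count; the commit does not state the hardware used.
WORLD_SIZE = 8

dp = data_parallel_degree(
    WORLD_SIZE, cfg["tensor_parallel_degree"], cfg["pipeline_parallel_degree"]
)
samples = samples_per_optimizer_step(
    cfg["per_device_train_batch_size"], cfg["gradient_accumulation_steps"], dp
)
# Upper bound only: it is reached when every sample fills max_seq_len.
max_tokens = samples * cfg["max_seq_len"]

print(f"data-parallel degree: {dp}")
print(f"samples per optimizer step: {samples}")
print(f"max tokens per optimizer step: {max_tokens}")
```

Under these assumptions the run would use 32 samples per optimizer step. With `packing: false`, short samples are presumably not concatenated, so the real token count per step would sit below the 262144-token bound.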
Lines changed: 54 additions & 0 deletions
@@ -0,0 +1,54 @@
+### data
+train_dataset_type: erniekit
+eval_dataset_type: erniekit
+train_dataset_path: /workspace/pretrainning/data/pt/train_sft.jsonl
+train_dataset_prob: "1.0"
+eval_dataset_path: /workspace/pretrainning/data/pt/eval_sft.jsonl
+eval_dataset_prob: "1.0"
+max_seq_len: 1024
+num_samples_each_epoch: 100
+packing: true
+mix_strategy: concat
+
+### model
+model_name_or_path: /workspace/aixcoder-7b-base-pd-converted
+convert_from_hf: false
+save_to_hf: false
+attn_impl: flashmask
+
+### finetuning
+# base
+stage: SFT
+fine_tuning: full
+seed: 23
+do_train: true
+do_eval: true
+per_device_eval_batch_size: 1
+per_device_train_batch_size: 1
+num_train_epochs: 1
+max_steps: -1
+eval_steps: 100
+evaluation_strategy: steps
+save_steps: 100
+save_total_limit: 1
+save_strategy: steps
+logging_steps: 1
+gradient_accumulation_steps: 4
+logging_dir: /workspace/pretrainning/vdl_log
+output_dir: /workspace/pretrainning/checkpoints/aixcoder-7b-base-pd-converted_sft_ckpts_parallel
+disable_tqdm: true
+eval_accumulation_steps: 16
+
+# train
+warmup_steps: 20
+learning_rate: 1.0e-5
+
+# performance
+tensor_parallel_degree: 8
+pipeline_parallel_degree: 1
+sequence_parallel: true
+sharding: stage1
+recompute: true
+bf16: true
+fp16_opt_level: O2
+unified_checkpoint: true
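
Because this config shares most of its keys with `examples/aiXcoder-7B/full.yaml`, it can help to see only what differs. The Python sketch below is one way to do that. It is illustrative rather than part of the commit: `diff_configs`, `MISSING`, and the label `parallel_cfg` are hypothetical names (this file's path is not shown in the page), and the dictionaries hold only the keys that differ plus two shared ones for contrast.

```python
MISSING = "<absent>"  # sentinel for a key that one config does not define


def diff_configs(left: dict, right: dict) -> dict:
    """Return {key: (left_value, right_value)} for every key whose values differ."""
    changed = {}
    for key in sorted(set(left) | set(right)):
        lval = left.get(key, MISSING)
        rval = right.get(key, MISSING)
        if lval != rval:
            changed[key] = (lval, rval)
    return changed


# Subset of examples/aiXcoder-7B/full.yaml, values copied from the diff.
full_cfg = {
    "max_seq_len": 8192,
    "num_samples_each_epoch": 6000000,
    "packing": False,
    "model_name_or_path": "/workspace/aiXcoder-7B",
    "tensor_parallel_degree": 1,
    "sharding": "stage2",
    "seed": 23,
    "learning_rate": 1.0e-5,
}

# Subset of the second config (file name not shown), values copied from the diff.
parallel_cfg = {
    "max_seq_len": 1024,
    "num_samples_each_epoch": 100,
    "packing": True,
    "model_name_or_path": "/workspace/aixcoder-7b-base-pd-converted",
    "convert_from_hf": False,
    "save_to_hf": False,
    "tensor_parallel_degree": 8,
    "sequence_parallel": True,
    "sharding": "stage1",
    "seed": 23,
    "learning_rate": 1.0e-5,
}

changes = diff_configs(full_cfg, parallel_cfg)
for key, (before, after) in changes.items():
    print(f"{key}: {before!r} -> {after!r}")
```

Read this way, the second config looks like a short smoke test of the 8-way tensor-parallel path (1024-token packed sequences, 100 samples per epoch) against an already converted checkpoint, although the commit itself does not state that intent.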
Lines changed: 140 additions & 0 deletions
@@ -0,0 +1,140 @@
+AIXCODER-7B MODEL LICENSE AGREEMENT
+
+aiXcoder-7B Version Release Date: 2024
+
+"Agreement" means the terms and conditions for use, reproduction, distribution and
+modification of the aiXcoder Materials set forth herein.
+
+"Documentation" means the specifications, manuals and documentation
+accompanying aiXcoder-7B distributed by aiXcoder at
+https://huggingface.co/aiXcoder/aixcoder-7b-base.
+
+"Licensee" or "you" means you, or your employer or any other person or entity (if
+you are entering into this Agreement on such person or entity's behalf), of the age
+required under applicable laws, rules or regulations to provide legal consent and that
+has legal authority to bind your employer or such other person or entity if you are
+entering in this Agreement on their behalf.
+
+"aiXcoder-7B" means the foundational large language models and software and
+algorithms, including machine-learning model code, trained model weights,
+inference-enabling code, training-enabling code, fine-tuning enabling code and other
+elements of the foregoing distributed by aiXcoder at
+https://huggingface.co/aiXcoder/aixcoder-7b-base.
+
+"aiXcoder Materials" means, collectively, aiXcoder's proprietary aiXcoder-7B and
+Documentation (and any portion thereof) made available under this Agreement.
+
+"aiXcoder" or "we" means aiXcoder and its affiliates.
+
+By using or distributing any portion or element of the aiXcoder Materials,
+you agree to be bound by this Agreement.
+
+1. License Rights and Redistribution.
+
+a. Grant of Rights for Academic Research Use. You are granted a non-exclusive,
+worldwide, non-transferable and royalty-free limited license under aiXcoder's
+intellectual property or other rights owned by aiXcoder embodied in the aiXcoder
+Materials to use, reproduce, distribute, copy, create derivative works of, and make
+modifications to the aiXcoder Materials solely for academic research purposes.
+
+b. Commercial Use. For commercial use of the aiXcoder Materials, you must
+apply for a commercial license by sending an email to [email protected].
+Commercial use without explicit written permission from aiXcoder is prohibited.
+
+c. Redistribution and Use.
+
+i. If you distribute or make the aiXcoder Materials, or any derivative works
+thereof, available to a third party, you shall provide a copy of this Agreement to such
+third party.
+
+ii. You must retain in all copies of the aiXcoder Materials that you
+distribute the following attribution notice within a "Notice" text file distributed as a
+part of such copies: "aiXcoder-7B is licensed under the aiXcoder Model License,
+Copyright (c) aiXcoder. All Rights Reserved."
+
+iii. Your use of the aiXcoder Materials must comply with applicable laws
+and regulations (including trade compliance laws and regulations).
+
+iv. You will not use the aiXcoder Materials or any output or results of the
+aiXcoder Materials to improve any other large language model (excluding aiXcoder-7B
+or derivative works thereof) without explicit permission.
+
+2. Restrictions.
+
+You will not, and will not permit, assist or cause any third party to:
+
+a. use, modify, copy, reproduce, create derivative works of, or distribute the
+aiXcoder Materials (or any derivative works thereof, works incorporating the aiXcoder
+Materials, or any data produced by the Software), in whole or in part, for (i) any
+commercial or production purposes without proper license, (ii) military purposes or in
+the service of nuclear technology, (iii) purposes of surveillance, including any research
+or development relating to surveillance, (iv) biometric processing without proper consent,
+(v) in any manner that infringes, misappropriates, or otherwise violates any third-party
+rights, or (vi) in any manner that violates any applicable law;
+
+b. alter or remove copyright and other proprietary notices which appear on or in
+the aiXcoder Materials;
+
+c. utilize any equipment, device, software, or other means to circumvent or remove
+any security or protection used by aiXcoder in connection with the Software, or to
+circumvent or remove any usage restrictions, or to enable functionality disabled by
+aiXcoder.
+
+3. Disclaimer of Warranty. UNLESS REQUIRED BY APPLICABLE LAW, THE
+AIXCODER MATERIALS AND ANY OUTPUT AND RESULTS THEREFROM ARE
+PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND,
+EITHER EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, ANY
+WARRANTIES OF TITLE, NON-INFRINGEMENT, MERCHANTABILITY, OR
+FITNESS FOR A PARTICULAR PURPOSE. YOU ARE SOLELY RESPONSIBLE
+FOR DETERMINING THE APPROPRIATENESS OF USING OR REDISTRIBUTING
+THE AIXCODER MATERIALS AND ASSUME ANY RISKS ASSOCIATED WITH YOUR
+USE OF THE AIXCODER MATERIALS AND ANY OUTPUT AND RESULTS.
+
+4. Limitation of Liability. IN NO EVENT WILL AIXCODER OR ITS AFFILIATES BE
+LIABLE UNDER ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, TORT,
+NEGLIGENCE, PRODUCTS LIABILITY, OR OTHERWISE, ARISING OUT OF THIS
+AGREEMENT, FOR ANY LOST PROFITS OR ANY INDIRECT, SPECIAL,
+CONSEQUENTIAL, INCIDENTAL, EXEMPLARY OR PUNITIVE DAMAGES, EVEN
+IF AIXCODER OR ITS AFFILIATES HAVE BEEN ADVISED OF THE POSSIBILITY OF
+ANY OF THE FOREGOING.
+
+5. Intellectual Property.
+
+a. No trademark licenses are granted under this Agreement, and in
+connection with the aiXcoder Materials, neither aiXcoder nor Licensee may use any name
+or mark owned by or associated with the other or any of its affiliates, except as
+required for reasonable and customary use in describing and redistributing the
+aiXcoder Materials.
+
+b. Subject to aiXcoder's ownership of aiXcoder Materials and derivatives made by or
+for aiXcoder, with respect to any derivative works and modifications of the aiXcoder
+Materials that are made by you, as between you and aiXcoder, you are and will be the
+owner of such derivative works and modifications.
+
+c. If you institute litigation or other proceedings against aiXcoder or any entity
+(including a cross-claim or counterclaim in a lawsuit) alleging that the aiXcoder
+Materials or aiXcoder-7B outputs or results, or any portion of any of the foregoing,
+constitutes infringement of intellectual property or other rights owned or licensable
+by you, then any licenses granted to you under this Agreement shall terminate as of
+the date such litigation or claim is filed or instituted. You will indemnify and hold
+harmless aiXcoder from and against any claim by any third party arising out of or related
+to your use or distribution of the aiXcoder Materials.
+
+6. Term and Termination. The term of this Agreement will commence upon your
+acceptance of this Agreement or access to the aiXcoder Materials and will continue in
+full force and effect until terminated in accordance with the terms and conditions
+herein. aiXcoder may terminate this Agreement if you are in breach of any term or
+condition of this Agreement. Upon termination of this Agreement, you shall delete
+and cease use of the aiXcoder Materials. Sections 3, 4, 5 and 7 shall survive the
+termination of this Agreement.
+
+7. Governing Law and Jurisdiction. This Agreement will be governed and
+construed under the laws of the People's Republic of China without regard to choice of
+law principles. The courts of China shall have jurisdiction of any dispute arising out of
+this Agreement.
+
+8. Contact Information. For commercial licensing inquiries or any questions regarding
+this Agreement, please contact: [email protected]
+
+9. Acknowledgments. We would like to thank all contributors to the open-source
+projects and datasets that made this work possible.