[ New features ] : add aiXcoder model implementation with tokenizer and validation #2902
Open: YoctoHan wants to merge 4 commits into PaddlePaddle:develop from YoctoHan:feat/aixcoder
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,51 @@ | ||
| ### data | ||
| train_dataset_type: erniekit | ||
| eval_dataset_type: erniekit | ||
| train_dataset_path: /workspace/pretrainning/data/pt/train_sft.jsonl | ||
| train_dataset_prob: "1.0" | ||
| eval_dataset_path: /workspace/pretrainning/data/pt/eval_sft.jsonl | ||
| eval_dataset_prob: "1.0" | ||
| max_seq_len: 8192 | ||
| num_samples_each_epoch: 6000000 | ||
| packing: false | ||
| mix_strategy: concat | ||
| | ||
| ### model | ||
| model_name_or_path: /workspace/aiXcoder-7B | ||
| attn_impl: flashmask | ||
| | ||
| ### finetuning | ||
| # base | ||
| stage: SFT | ||
| fine_tuning: full | ||
| seed: 23 | ||
| do_train: true | ||
| do_eval: true | ||
| per_device_eval_batch_size: 1 | ||
| per_device_train_batch_size: 1 | ||
| num_train_epochs: 1 | ||
| max_steps: -1 | ||
| eval_steps: 100 | ||
| evaluation_strategy: steps | ||
| save_steps: 100 | ||
| save_total_limit: 1 | ||
| save_strategy: steps | ||
| logging_steps: 1 | ||
| gradient_accumulation_steps: 4 | ||
| logging_dir: /workspace/pretrainning/vdl_log | ||
| output_dir: /workspace/pretrainning/checkpoints/aixcoder-7b-base-pd-converted_sft_ckpts | ||
| disable_tqdm: true | ||
| eval_accumulation_steps: 16 | ||
| | ||
| # train | ||
| warmup_steps: 20 | ||
| learning_rate: 1.0e-5 | ||
| | ||
| # performance | ||
| tensor_parallel_degree: 1 | ||
| pipeline_parallel_degree: 1 | ||
| sharding: stage2 | ||
| recompute: true | ||
| bf16: true | ||
| fp16_opt_level: O2 | ||
| unified_checkpoint: true |
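
For readers checking the `# train` block above: the config sets `warmup_steps: 20` and `learning_rate: 1.0e-5` but does not name a scheduler. The sketch below shows what those two values would mean under a plain linear warmup, which is an assumption and not something the PR states. The function name `linear_warmup_lr` is hypothetical, and whatever decay the trainer applies after warmup is not modelled.

```python
def linear_warmup_lr(step, base_lr=1.0e-5, warmup_steps=20):
    """Learning rate after `step` completed optimizer steps.

    Assumes linear warmup from 0 to base_lr (an assumption; the config
    above does not specify the schedule), then holds base_lr constant.
    """
    if step < 0:
        raise ValueError("step must be non-negative")
    if warmup_steps <= 0:
        # No warmup configured: use the base learning rate immediately.
        return base_lr
    # Fraction of warmup completed, capped at 1.0 once warmup is over.
    return base_lr * min(1.0, step / warmup_steps)


if __name__ == "__main__":
    for step in (0, 5, 10, 20, 100):
        print(step, linear_warmup_lr(step))
```

With `logging_steps: 1`, the logged learning rate for the first 20 steps is a quick way to see whether the trainer's actual schedule matches this assumption.
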
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,54 @@ | ||
| ### data | ||
| train_dataset_type: erniekit | ||
| eval_dataset_type: erniekit | ||
| train_dataset_path: /workspace/pretrainning/data/pt/train_sft.jsonl | ||
| train_dataset_prob: "1.0" | ||
| eval_dataset_path: /workspace/pretrainning/data/pt/eval_sft.jsonl | ||
| eval_dataset_prob: "1.0" | ||
| max_seq_len: 1024 | ||
| num_samples_each_epoch: 100 | ||
| packing: true | ||
| mix_strategy: concat | ||
| | ||
| ### model | ||
| model_name_or_path: /workspace/aixcoder-7b-base-pd-converted | ||
| convert_from_hf: false | ||
| save_to_hf: false | ||
| attn_impl: flashmask | ||
| | ||
| ### finetuning | ||
| # base | ||
| stage: SFT | ||
| fine_tuning: full | ||
| seed: 23 | ||
| do_train: true | ||
| do_eval: true | ||
| per_device_eval_batch_size: 1 | ||
| per_device_train_batch_size: 1 | ||
| num_train_epochs: 1 | ||
| max_steps: -1 | ||
| eval_steps: 100 | ||
| evaluation_strategy: steps | ||
| save_steps: 100 | ||
| save_total_limit: 1 | ||
| save_strategy: steps | ||
| logging_steps: 1 | ||
| gradient_accumulation_steps: 4 | ||
| logging_dir: /workspace/pretrainning/vdl_log | ||
| output_dir: /workspace/pretrainning/checkpoints/aixcoder-7b-base-pd-converted_sft_ckpts_parallel | ||
| disable_tqdm: true | ||
| eval_accumulation_steps: 16 | ||
| | ||
| # train | ||
| warmup_steps: 20 | ||
| learning_rate: 1.0e-5 | ||
| | ||
| # performance | ||
| tensor_parallel_degree: 8 | ||
| pipeline_parallel_degree: 1 | ||
| sequence_parallel: true | ||
| sharding: stage1 | ||
| recompute: true | ||
| bf16: true | ||
| fp16_opt_level: O2 | ||
| unified_checkpoint: true |
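
Both configs above set `per_device_train_batch_size: 1` and `gradient_accumulation_steps: 4`, yet they would not consume the same number of sequences per optimizer step, because `tensor_parallel_degree: 8` spends devices on splitting one model replica instead of adding replicas. The sketch below is a rough estimate only. It assumes a hypothetical 8-device node (the PR does not state the device count) and assumes every device not used for tensor or pipeline parallelism holds its own data replica, whether through data parallelism or sharding. The name `global_batch_size` is illustrative and not part of the PR.

```python
def global_batch_size(world_size, tensor_parallel_degree, pipeline_parallel_degree,
                      per_device_batch_size, gradient_accumulation_steps):
    """Estimate how many sequences are consumed per optimizer step."""
    # Devices needed to hold one copy of the model.
    model_parallel = tensor_parallel_degree * pipeline_parallel_degree
    if world_size % model_parallel != 0:
        raise ValueError("world_size must be divisible by tensor * pipeline degree")
    # Remaining factor: independent data replicas (data-parallel or sharding groups).
    data_replicas = world_size // model_parallel
    return data_replicas * per_device_batch_size * gradient_accumulation_steps


WORLD_SIZE = 8  # assumption: one 8-device node; not stated in the PR

# First config: tensor_parallel_degree 1, pipeline_parallel_degree 1.
single = global_batch_size(WORLD_SIZE, 1, 1, 1, 4)
# Second config: tensor_parallel_degree 8, pipeline_parallel_degree 1.
parallel = global_batch_size(WORLD_SIZE, 8, 1, 1, 4)

print(single, parallel)  # prints "32 4"
```

Under these assumptions the two runs differ by a factor of 8 in batch size at the same `learning_rate: 1.0e-5`, which is worth keeping in mind when comparing their loss curves.
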
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,140 @@ | ||
| AIXCODER-7B MODEL LICENSE AGREEMENT | ||
| | ||
| aiXcoder-7B Version Release Date: 2024 | ||
| | ||
| "Agreement" means the terms and conditions for use, reproduction, distribution and | ||
| modification of the aiXcoder Materials set forth herein. | ||
| | ||
| "Documentation" means the specifications, manuals and documentation | ||
| accompanying aiXcoder-7B distributed by aiXcoder at | ||
| https://huggingface.co/aiXcoder/aixcoder-7b-base. | ||
| | ||
| "Licensee" or "you" means you, or your employer or any other person or entity (if | ||
| you are entering into this Agreement on such person or entity's behalf), of the age | ||
| required under applicable laws, rules or regulations to provide legal consent and that | ||
| has legal authority to bind your employer or such other person or entity if you are | ||
| entering in this Agreement on their behalf. | ||
| | ||
| "aiXcoder-7B" means the foundational large language models and software and | ||
| algorithms, including machine-learning model code, trained model weights, | ||
| inference-enabling code, training-enabling code, fine-tuning enabling code and other | ||
| elements of the foregoing distributed by aiXcoder at | ||
| https://huggingface.co/aiXcoder/aixcoder-7b-base. | ||
| | ||
| "aiXcoder Materials" means, collectively, aiXcoder's proprietary aiXcoder-7B and | ||
| Documentation (and any portion thereof) made available under this Agreement. | ||
| | ||
| "aiXcoder" or "we" means aiXcoder and its affiliates. | ||
| | ||
| By using or distributing any portion or element of the aiXcoder Materials, | ||
| you agree to be bound by this Agreement. | ||
| | ||
| 1. License Rights and Redistribution. | ||
| | ||
| a. Grant of Rights for Academic Research Use. You are granted a non-exclusive, | ||
| worldwide, non-transferable and royalty-free limited license under aiXcoder's | ||
| intellectual property or other rights owned by aiXcoder embodied in the aiXcoder | ||
| Materials to use, reproduce, distribute, copy, create derivative works of, and make | ||
| modifications to the aiXcoder Materials solely for academic research purposes. | ||
| | ||
| b. Commercial Use. For commercial use of the aiXcoder Materials, you must | ||
| apply for a commercial license by sending an email to [email protected]. | ||
| Commercial use without explicit written permission from aiXcoder is prohibited. | ||
| | ||
| c. Redistribution and Use. | ||
| | ||
| i. If you distribute or make the aiXcoder Materials, or any derivative works | ||
| thereof, available to a third party, you shall provide a copy of this Agreement to such | ||
| third party. | ||
| | ||
| ii. You must retain in all copies of the aiXcoder Materials that you | ||
| distribute the following attribution notice within a "Notice" text file distributed as a | ||
| part of such copies: "aiXcoder-7B is licensed under the aiXcoder Model License, | ||
| Copyright (c) aiXcoder. All Rights Reserved." | ||
| | ||
| iii. Your use of the aiXcoder Materials must comply with applicable laws | ||
| and regulations (including trade compliance laws and regulations). | ||
| | ||
| iv. You will not use the aiXcoder Materials or any output or results of the | ||
| aiXcoder Materials to improve any other large language model (excluding aiXcoder-7B | ||
| or derivative works thereof) without explicit permission. | ||
| | ||
| 2. Restrictions. | ||
| | ||
| You will not, and will not permit, assist or cause any third party to: | ||
| | ||
| a. use, modify, copy, reproduce, create derivative works of, or distribute the | ||
| aiXcoder Materials (or any derivative works thereof, works incorporating the aiXcoder | ||
| Materials, or any data produced by the Software), in whole or in part, for (i) any | ||
| commercial or production purposes without proper license, (ii) military purposes or in | ||
| the service of nuclear technology, (iii) purposes of surveillance, including any research | ||
| or development relating to surveillance, (iv) biometric processing without proper consent, | ||
| (v) in any manner that infringes, misappropriates, or otherwise violates any third-party | ||
| rights, or (vi) in any manner that violates any applicable law; | ||
| | ||
| b. alter or remove copyright and other proprietary notices which appear on or in | ||
| the aiXcoder Materials; | ||
| | ||
| c. utilize any equipment, device, software, or other means to circumvent or remove | ||
| any security or protection used by aiXcoder in connection with the Software, or to | ||
| circumvent or remove any usage restrictions, or to enable functionality disabled by | ||
| aiXcoder. | ||
| | ||
| 3. Disclaimer of Warranty. UNLESS REQUIRED BY APPLICABLE LAW, THE | ||
| AIXCODER MATERIALS AND ANY OUTPUT AND RESULTS THEREFROM ARE | ||
| PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, | ||
| EITHER EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, ANY | ||
| WARRANTIES OF TITLE, NON-INFRINGEMENT, MERCHANTABILITY, OR | ||
| FITNESS FOR A PARTICULAR PURPOSE. YOU ARE SOLELY RESPONSIBLE | ||
| FOR DETERMINING THE APPROPRIATENESS OF USING OR REDISTRIBUTING | ||
| THE AIXCODER MATERIALS AND ASSUME ANY RISKS ASSOCIATED WITH YOUR | ||
| USE OF THE AIXCODER MATERIALS AND ANY OUTPUT AND RESULTS. | ||
| | ||
| 4. Limitation of Liability. IN NO EVENT WILL AIXCODER OR ITS AFFILIATES BE | ||
| LIABLE UNDER ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, TORT, | ||
| NEGLIGENCE, PRODUCTS LIABILITY, OR OTHERWISE, ARISING OUT OF THIS | ||
| AGREEMENT, FOR ANY LOST PROFITS OR ANY INDIRECT, SPECIAL, | ||
| CONSEQUENTIAL, INCIDENTAL, EXEMPLARY OR PUNITIVE DAMAGES, EVEN | ||
| IF AIXCODER OR ITS AFFILIATES HAVE BEEN ADVISED OF THE POSSIBILITY OF | ||
| ANY OF THE FOREGOING. | ||
| | ||
| 5. Intellectual Property. | ||
| | ||
| a. No trademark licenses are granted under this Agreement, and in | ||
| connection with the aiXcoder Materials, neither aiXcoder nor Licensee may use any name | ||
| or mark owned by or associated with the other or any of its affiliates, except as | ||
| required for reasonable and customary use in describing and redistributing the | ||
| aiXcoder Materials. | ||
| | ||
| b. Subject to aiXcoder's ownership of aiXcoder Materials and derivatives made by or | ||
| for aiXcoder, with respect to any derivative works and modifications of the aiXcoder | ||
| Materials that are made by you, as between you and aiXcoder, you are and will be the | ||
| owner of such derivative works and modifications. | ||
| | ||
| c. If you institute litigation or other proceedings against aiXcoder or any entity | ||
| (including a cross-claim or counterclaim in a lawsuit) alleging that the aiXcoder | ||
| Materials or aiXcoder-7B outputs or results, or any portion of any of the foregoing, | ||
| constitutes infringement of intellectual property or other rights owned or licensable | ||
| by you, then any licenses granted to you under this Agreement shall terminate as of | ||
| the date such litigation or claim is filed or instituted. You will indemnify and hold | ||
| harmless aiXcoder from and against any claim by any third party arising out of or related | ||
| to your use or distribution of the aiXcoder Materials. | ||
| | ||
| 6. Term and Termination. The term of this Agreement will commence upon your | ||
| acceptance of this Agreement or access to the aiXcoder Materials and will continue in | ||
| full force and effect until terminated in accordance with the terms and conditions | ||
| herein. aiXcoder may terminate this Agreement if you are in breach of any term or | ||
| condition of this Agreement. Upon termination of this Agreement, you shall delete | ||
| and cease use of the aiXcoder Materials. Sections 3, 4, 5 and 7 shall survive the | ||
| termination of this Agreement. | ||
| | ||
| 7. Governing Law and Jurisdiction. This Agreement will be governed and | ||
| construed under the laws of the People's Republic of China without regard to choice of | ||
| law principles. The courts of China shall have jurisdiction of any dispute arising out of | ||
| this Agreement. | ||
| | ||
| 8. Contact Information. For commercial licensing inquiries or any questions regarding | ||
| this Agreement, please contact: [email protected] | ||
| | ||
| 9. Acknowledgments. We would like to thank all contributors to the open-source | ||
| projects and datasets that made this work possible. |
The license file can be deleted.