Add `intermediate_size` to GPT-NeoX models #1212
base: main
Conversation
This looks good. Need to make it consistent with mamba and RWKV. We also need some TODO statements about revisiting this once we add swiglu. @jahatef is on it

Added support for mamba and RWKV, and added TODOs.

With these changes, is there a point to having separate Linear and LLaMA Linear definitions? At a glance, all the differences are configurable; the only distinction left is which default behavior is assumed when a setting is unspecified.

Refactored the activations and MLP layer to get rid of our redundant llama mlp class, and added some activation functions from https://arxiv.org/pdf/2002.05202.
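For context, the activation functions in that paper (GLU variants: ReGLU, GEGLU, SwiGLU, etc.) are all instances of the same gated two-projection pattern. A minimal sketch of the idea; the class and argument names here are illustrative, not the ones added in this PR:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GLUVariantMLP(nn.Module):
    """Gated MLP in the style of Shazeer (2020), arXiv:2002.05202. Sketch only."""

    def __init__(self, d_model: int, d_ff: int, activation=F.silu):
        super().__init__()
        # Two parallel input projections: one gated by the activation,
        # one passed through linearly, combined by elementwise product.
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)
        # F.silu -> SwiGLU, F.gelu -> GEGLU, F.relu -> ReGLU
        self.activation = activation

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(self.activation(self.w_gate(x)) * self.w_up(x))
```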
```diff
@@ -3,7 +3,7 @@
   # parallelism settings ( you will want to change these based on your cluster setup, ideally scheduling pipeline stages
   # across the node boundaries )
   "pipe_parallel_size": 1,
-  "model_parallel_size": 1,
+  "model_parallel_size": 2,
```
Let's not change the existing configs unless necessary
The llama config changes are fine, to be clear
```python
# set variables, mostly following mamba defaults
self.d_model = neox_args.hidden_size
self.d_state = 16  # state dimensions per channel
self.d_conv = 4  # convolution width
self.expand = 2  # linear projection expansion factors
self.d_inner = int(self.expand * self.d_model)
```
Removing this will cause a failure at https://github.com/EleutherAI/gpt-neox/pull/1212/files#diff-cf396efcb6001846b18513bfb48e1e2681f3d240e589bfef9ea17cbdfe6b1218R80
```python
    neox_args.d_inner = neox_args.intermediate_size
if neox_args.expansion_factor:
    self.expexpand = neox_args.expansion_factor
neox_args.d_inner = neox_args.intermediate_size
```
`neox_args.d_inner` is set both in the if-statement above, and also here?
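One way to resolve the duplication, as a sketch of the precedence the surrounding code seems to intend (`intermediate_size` wins, then `expansion_factor`, then the mamba default); this is not the PR's actual fix:

```python
# Sketch only: resolve d_inner through a single chain of fallbacks instead
# of assigning it twice. Attribute names follow the diff above.
if neox_args.intermediate_size:
    neox_args.d_inner = neox_args.intermediate_size
else:
    self.expand = neox_args.expansion_factor or 2  # 2 = mamba default expansion
    neox_args.d_inner = int(self.expand * neox_args.hidden_size)
```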
```python
@@ -247,11 +247,11 @@ def __init__(self, neox_args, layer_number):
    self.time_maa_k = nn.Parameter(1.0 - torch.pow(ddd, ratio_1_to_almost0))
    self.time_maa_r = nn.Parameter(1.0 - torch.pow(ddd, ratio_1_to_almost0))

    self.key = nn.Linear(neox_args.hidden_size, neox_args.dim_ffn, bias=False)
```
I prefer `ffn` to `ff`. Also, why are we swapping the ordering of `dim` and `ffn` in the first place?
```python
ff_dim = int(2 * neox_args.hidden_size * 4 / 3)
ff_dim = self.multiple_of * ((ff_dim + multiple_of - 1) // multiple_of)

self.w1 = mpu.ColumnParallelLinear(
```
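As a worked example of the LLaMA-style rounding above (values assumed for illustration, matching LLaMA-7B rather than anything in this PR):

```python
hidden_size, multiple_of = 4096, 256           # assumed example values
ff_dim = int(2 * hidden_size * 4 / 3)          # 8/3 * hidden_size = 10922
ff_dim = multiple_of * ((ff_dim + multiple_of - 1) // multiple_of)  # round up
print(ff_dim)                                  # 11008, LLaMA-7B's FFN width
```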
need to test that tensor parallelism and MoE are maintained
```diff
-    or self.num_experts > 1
-    and self.moe_type == "deepspeed"
-):
+if self.num_experts > 1 and self.moe_type == "deepspeed":
```
shouldn't we still be testing if this mlp is a swiglu?
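A hedged sketch of what keeping that check might look like; the `is_gated` attribute is hypothetical, and the real flag would depend on how the refactored MLP exposes its activation:

```python
# Hypothetical: keep handling the gated (SwiGLU-style) weight layout even
# when this is not a DeepSpeed MoE layer.
if self.is_gated or (self.num_experts > 1 and self.moe_type == "deepspeed"):
    ...
```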
The current implementation only allows setting `intermediate_size` for Llama models, but I would like to be able to change `intermediate_size` in GPT-NeoX models as well. I have tested this implementation with a quick training, inference, and conversion run, and it doesn't appear to introduce any bugs. I hope it helps!