
Add intermediate_size to GPT-NeoX models #1212

Open · wants to merge 7 commits into main

Conversation

@dtamayo-nlp commented May 10, 2024

The current implementation only allows setting intermediate_size for Llama models, but I would like to be able to change intermediate_size in GPT-NeoX models as well.

I have tested this implementation with a quick training run, inference, and conversion, and it does not seem to introduce any bugs. I hope it helps!
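For context, this is the kind of config the change is meant to enable; a hedged sketch in the usual GPT-NeoX config style, with illustrative values rather than anything taken from this PR:

# illustrative GPT-NeoX config fragment (hypothetical values)
"hidden_size": 2048,
"num_attention_heads": 16,
# previously only respected by the Llama-style MLP; with this PR it should also
# apply to the standard GPT-NeoX MLP (which otherwise defaults to a width of
# roughly 4 * hidden_size)
"intermediate_size": 6144,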

CLAassistant commented May 10, 2024

CLA assistant check
All committers have signed the CLA.

@Quentin-Anthony (Member)

This looks good. Need to make it consistent with mamba and RWKV. We also need some TODO statements about revisiting this once we add swiglu.

@jahatef is on it

@jahatef (Collaborator) commented Jun 19, 2024

Added support for mamba and RWKV, and added TODOs.

@StellaAthena (Member)

With these changes, is there any point in having separate Linear and LLaMA Linear definitions? At a glance, all the differences are configurable; the only remaining distinction is the default behavior assumed when nothing is specified.

@jahatef (Collaborator) commented Jul 25, 2024

Refactored the activations and the MLP layer to get rid of our redundant llama MLP class, and added some activation functions from https://arxiv.org/pdf/2002.05202.
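For reference, the activations in question are the GLU variants from that paper (ReGLU, GEGLU, SwiGLU); the snippet below is a minimal standalone sketch of those functions, not the PR's actual implementation:

import torch
import torch.nn.functional as F

# GLU-variant activations from https://arxiv.org/pdf/2002.05202, sketched for
# reference only. Each expects an input whose last dimension is 2 * ff_dim and
# splits it into a value half and a gate half.
def reglu(x: torch.Tensor) -> torch.Tensor:
    value, gate = x.chunk(2, dim=-1)
    return value * F.relu(gate)

def geglu(x: torch.Tensor) -> torch.Tensor:
    value, gate = x.chunk(2, dim=-1)
    return value * F.gelu(gate)

def swiglu(x: torch.Tensor) -> torch.Tensor:
    value, gate = x.chunk(2, dim=-1)
    return value * F.silu(gate)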

@@ -3,7 +3,7 @@
 # parallelism settings ( you will want to change these based on your cluster setup, ideally scheduling pipeline stages
 # across the node boundaries )
 "pipe_parallel_size": 1,
-"model_parallel_size": 1,
+"model_parallel_size": 2,
@Quentin-Anthony (Member) commented Aug 6, 2024

Let's not change the existing configs unless necessary

The llama config changes are fine, to be clear

# set variables, mostly following mamba defaults
self.d_model = neox_args.hidden_size
self.d_state = 16 # state dimensions per channel
self.d_conv = 4 # convolution width
self.expand = 2 # linear projection expansion factors
self.d_inner = int(self.expand * self.d_model)

neox_args.d_inner = neox_args.intermediate_size
if neox_args.expansion_factor:
    self.expand = neox_args.expansion_factor
neox_args.d_inner = neox_args.intermediate_size

neox_args.d_inner is set both in the if-statement above, and also here?
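One way to resolve the duplication being flagged here, sketched under the assumption that an explicit intermediate_size should take precedence over expansion_factor and the mamba default (not necessarily how the PR resolves it):

# hedged sketch: derive the inner dimension exactly once
if neox_args.intermediate_size:
    self.d_inner = neox_args.intermediate_size
else:
    if neox_args.expansion_factor:
        self.expand = neox_args.expansion_factor
    self.d_inner = int(self.expand * self.d_model)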

@@ -247,11 +247,11 @@ def __init__(self, neox_args, layer_number):
self.time_maa_k = nn.Parameter(1.0 - torch.pow(ddd, ratio_1_to_almost0))
self.time_maa_r = nn.Parameter(1.0 - torch.pow(ddd, ratio_1_to_almost0))

self.key = nn.Linear(neox_args.hidden_size, neox_args.dim_ffn, bias=False)

I prefer ffn to ff. Also, why are we swapping the ordering of dim and ffn in the first place?

ff_dim = int(2 * neox_args.hidden_size * 4 / 3)
ff_dim = self.multiple_of * ((ff_dim + multiple_of - 1) // multiple_of)

self.w1 = mpu.ColumnParallelLinear(

need to test that tensor parallelism and MoE are maintained
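As an aside on the ff_dim computation in the snippet above, a small worked example with hypothetical values, not taken from the PR:

# worked example of the round-up-to-a-multiple arithmetic above
hidden_size = 4096
multiple_of = 256
ff_dim = int(2 * hidden_size * 4 / 3)                               # 10922
ff_dim = multiple_of * ((ff_dim + multiple_of - 1) // multiple_of)  # 11008

This is the Llama-style convention of taking roughly 8/3 * hidden_size and rounding it up to the next multiple of multiple_of (11008 is the familiar Llama-7B intermediate size).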

-    or self.num_experts > 1
-    and self.moe_type == "deepspeed"
-):
+if self.num_experts > 1 and self.moe_type == "deepspeed":

shouldn't we still be testing if this mlp is a swiglu?
