I tried to reproduce the BoN result in the paper but could not. Has anyone reproduced it successfully? I list the steps I took to obtain my BoN result in this post.
First, let me show my BoN result.
This result should be compared with the single RM curves in Figure 3a of the paper. My BoN result is different from the paper's result in two ways: 1) the initial gold RM scores are different, and 2) the final gold RM scores are different.
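For readers less familiar with the setup, the quantity plotted in a BoN curve can be sketched as follows. This is a minimal illustration rather than the repository's pipeline: `bon_gold_score` and the toy scores are hypothetical, and it assumes every prompt has a list of candidate answers that were already scored by both the proxy RM and the gold RM.

```python
def bon_gold_score(proxy_scores, gold_scores, n):
    """Mean gold score of the answer the proxy RM ranks best among the first n.

    proxy_scores / gold_scores: one list of floats per prompt, aligned by index.
    """
    picked = []
    for proxy, gold in zip(proxy_scores, gold_scores):
        candidates = range(min(n, len(proxy)))
        best = max(candidates, key=lambda i: proxy[i])  # the proxy RM picks the winner
        picked.append(gold[best])  # the gold RM judges that pick
    return sum(picked) / len(picked)


# Toy data (made up): two prompts, three candidate answers each.
proxy = [[0.1, 0.9, 0.4], [0.7, 0.2, 0.8]]
gold = [[0.0, 0.5, 1.0], [0.3, 0.1, -0.2]]
curve = {n: bon_gold_score(proxy, gold, n) for n in (1, 2, 3)}
```

In this toy data the gold score rises from n=1 to n=2 and falls again at n=3, which is the overoptimisation shape the curves are meant to reveal. The point at n=1 corresponds to the starting policy, which is where the first discrepancy with the paper shows up.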
Here are the changes that I made to the codebase.
1. Add `residual_dropout_lima: false` to llm_optimization/configs/config_rm.yaml (line 49 in f8a9ae6).
2. Change the dataset in llm_optimization/configs/config_rm.yaml (line 54 in f8a9ae6) from `alpaca_farm_pref` to `custom_hf_pref`, because the paper used the latter dataset for RM training.
3. The RM training loss is invariant to offsets: shifting an RM's outputs by any constant does not change the RM loss. The trained RMs therefore end up with different offsets, which should be removed. In addition, the author suggested in "Clarification: No Centering / Scaling / Standardizing of Ensembles' Rewards? #12" that the centered rewards be further divided by the estimated standard deviation. I did this normalization in this repo, not in the open-assistant codebase as the author suggested; I think the two approaches are effectively the same. Specifically, I replaced the `get_reward` function in llm_optimization/src/reward_modeling/scoring/score.py (line 18 in f8a9ae6) with:
```python
# Relies on `math`, `torch`, and `MAX_LEN` already being available in score.py.
def get_reward(
    samples,
    reward_models,
    reward_tokenizer,
    reward_device,  # needed?
    batch_size,
    objective_function=None,
    weight=None,
    is_alpacafarm_rm=False,
    normalize_reward=True,
):
    if not isinstance(reward_models, list):
        reward_models = [reward_models]

    input = reward_tokenizer(
        samples,
        padding=True,
        truncation=True,
        max_length=MAX_LEN,
        return_tensors="pt",
    ).to(reward_device)

    all_rewards = []
    for reward_model in reward_models:
        out = []
        for i in range(math.ceil(len(samples) / batch_size)):
            batch_ixs = slice(i * batch_size, (i + 1) * batch_size)
            input_ids = input.input_ids[batch_ixs]
            attention_mask = input.attention_mask[batch_ixs]
            output = reward_model(input_ids, attention_mask)
            rewards = output.rewards if is_alpacafarm_rm else output.logits[:, 0]
            out.extend(rewards)
        all_rewards.append(torch.hstack(out))

    if len(all_rewards) == 1:
        all_rewards = all_rewards[0]
        # add normalization here (single RM)
        if normalize_reward:
            all_rewards = (all_rewards - reward_models[0].config.mean) / reward_models[0].config.std
        return all_rewards, torch.empty_like(all_rewards)

    # add normalization here (ensemble: each RM uses its own mean and std)
    if normalize_reward:
        for i in range(len(reward_models)):
            all_rewards[i] = (all_rewards[i] - reward_models[i].config.mean) / reward_models[i].config.std

    all_rewards = torch.stack(all_rewards, 0)
    var = torch.var(all_rewards, dim=0)
    if objective_function:
        all_rewards = objective_function(all_rewards, weight)
    return all_rewards, var
```
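The offset argument behind this normalization can be checked numerically. The sketch below is illustrative only: it assumes the RM is trained with the usual pairwise loss, -log sigmoid(r_chosen - r_rejected), and it estimates the mean and standard deviation from the toy scores themselves, whereas `get_reward` reads them from each model's config. The helper names are hypothetical.

```python
import math
import statistics


def pairwise_loss(chosen, rejected):
    """Mean of -log(sigmoid(r_chosen - r_rejected)) over preference pairs."""
    # log1p(exp(-d)) is an equivalent way to write -log(sigmoid(d)).
    return sum(math.log1p(math.exp(-(c - r))) for c, r in zip(chosen, rejected)) / len(chosen)


def standardize(scores):
    """Centre and scale scores, mirroring (r - mean) / std."""
    mean = statistics.fmean(scores)
    std = statistics.pstdev(scores)
    return [(s - mean) / std for s in scores]


chosen, rejected = [1.2, 0.3, 2.0], [0.4, 0.5, -1.0]
offset = 7.5

# The loss only sees score differences, so a constant shift leaves it unchanged.
loss_a = pairwise_loss(chosen, rejected)
loss_b = pairwise_loss([c + offset for c in chosen], [r + offset for r in rejected])

# Two RMs that differ only by an offset agree again after standardization.
rm_a = chosen + rejected
rm_b = [s + offset for s in rm_a]
z_a, z_b = standardize(rm_a), standardize(rm_b)
```

Because the loss cannot pin down the offset, raw scores from differently seeded RMs are not directly comparable, which is why each ensemble member is standardized with its own statistics before the scores are combined.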
4. In addition, RM scores should not be normalized while the RMs are being trained, so call `rewards, _ = get_reward(samples, model, tokenizer, model.device, batch_size=128, normalize_reward=False)` in llm_optimization/src/reward_modeling/training/trainer_rm.py (line 341 in f8a9ae6).
5. To avoid the error mentioned in #7, change `gold_labelled_generations.map(_truncate_answers)` to `gold_labelled_generations.map(_truncate_answers, batched=True, batch_size=10)` in llm_optimization/src/bon/run_bon_pipeline.py (line 77 in f8a9ae6).
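As background for change 5: with `batched=True`, the Hugging Face `datasets` `map` method passes the function a dict that maps each column name to a list of values, instead of a single row. The sketch below only illustrates that calling convention on a plain dict. The body of the real `_truncate_answers` and the error from #7 are not shown in this thread, so `truncate_answers_batched`, the `answer` column, and `max_chars` are all hypothetical.

```python
def truncate_answers_batched(batch, max_chars=20):
    """Hypothetical stand-in for a map function written for batched mode.

    `batch` maps column names to lists with one entry per row in the batch.
    """
    batch["answer"] = [a[:max_chars] for a in batch["answer"]]
    return batch


# Plain-dict stand-in for one batch of two rows.
batch = {"prompt": ["Q1", "Q2"], "answer": ["short", "x" * 50]}
out = truncate_answers_batched(dict(batch))
```

A function written this way indexes columns first and rows second, so it would misbehave if called row by row, and vice versa.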
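My steps to obtain the result:
Training RMs: Run `accelerate launch --config_file configs/accelerate_config.yaml src/reward_modeling/training/trainer_rm.py --configs defaults_rm rm-pythia-44m --rng_seed <seed>` five times, with `<seed>` set to 1, 2, 3, 4, and 5. The final summary for seed 1 is [output not preserved].
Then run `python src/bon/run_bon_pipeline.py models/rm-pythia-44m_seed{seed} --seeds 1,2,3,4,5 --ensembles`.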
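The five training runs can be scripted instead of typed by hand. The helper below is a small sketch that only builds the argument lists from the command quoted above; `rm_training_commands` is a hypothetical name, and actually launching the runs assumes the working directory is the repository root.

```python
def rm_training_commands(seeds=(1, 2, 3, 4, 5)):
    """Build one `accelerate launch` argument list per RM seed."""
    base = [
        "accelerate", "launch",
        "--config_file", "configs/accelerate_config.yaml",
        "src/reward_modeling/training/trainer_rm.py",
        "--configs", "defaults_rm", "rm-pythia-44m",
        "--rng_seed",
    ]
    return [base + [str(seed)] for seed in seeds]


commands = rm_training_commands()

# To launch the runs one after another:
# import subprocess
# for cmd in commands:
#     subprocess.run(cmd, check=True)
```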
I have added `residual_dropout_lima: false` to the config.
This was just an example, but I have changed the dataset to match the one used in our experiments.
I have added the batching parameters to the `.map` call, following several user reports of this issue. As for your results, one difference I can see is that we ensure the starting policy is at 0 gold reward; accounting for this should bring the initial and final gold scores of your curves closer to those in the paper. Beyond that, it is hard to say, because your set-up and parameters might differ slightly, but if you are struggling to see evidence of overoptimisation in your experiments, I would advise adding label noise.
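Both suggestions in this reply can be sketched in a few lines. This is one possible reading, not the authors' implementation: the thread does not say how the starting policy is set to 0 gold reward, so subtracting the starting point's gold score from the whole curve is an assumption, and the noise rate is left as a parameter because no value is given here. The function names are hypothetical.

```python
import random


def recenter_gold(curve):
    """Shift a gold-score curve so its first point (the starting policy) is 0."""
    start = curve[0]
    return [g - start for g in curve]


def add_label_noise(pairs, noise_rate, seed=0):
    """Swap (chosen, rejected) in a random fraction of preference pairs."""
    rng = random.Random(seed)
    return [(r, c) if rng.random() < noise_rate else (c, r) for c, r in pairs]


gold_curve = [0.35, 0.60, 0.72, 0.68]  # made-up gold scores for increasing n
centered = recenter_gold(gold_curve)

pairs = [("good answer", "bad answer")] * 4
unchanged = add_label_noise(pairs, noise_rate=0.0)
all_flipped = add_label_noise(pairs, noise_rate=1.0)
```

Recentering changes only the vertical position of the curve, not its shape, so it can explain a mismatch in absolute gold scores but not a missing overoptimisation trend. Label noise changes the RM training data itself.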