Ktakuya332C/simple-rlvr

simple-rlvr

A simple toy implementation of GRPO (Group Relative Policy Optimization) for LLMs.
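
As a rough illustration of the core idea behind GRPO, and not code taken from this repository, the sketch below scores each completion relative to the other completions sampled for the same prompt. The helper name `group_relative_advantages`, the `eps` value, and the choice of population standard deviation are assumptions made for this example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize the rewards of one prompt's completions within their group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation (an assumption)
    # eps keeps the division finite when every completion gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four completions sampled from the same prompt.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Completions that beat the group mean receive a positive advantage and the rest a negative one, so no separate value network is needed to form a baseline.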

Usage

$ git clone https://github.com/Ktakuya332C/simple-rlvr.git
$ cd simple-rlvr
$ poetry install
$ # The command below runs, but scores will not improve because the scale is too small
$ poetry run python -m rlvr.main \
  --model=sbintuitions/tiny-lm-chat \
  --num-rollout-workers=2 \
  --num-reference-workers=2 \
  --num-grpo-learners=2 \
  --batch-size-sync=16 \
  --batch-size-update=8 \
  --batch-size-backward=4 \
  --batch-size-rollout=2 \
  --batch-size-reference=2 \
  --num-generations=2 \
  --max-length=512 \
  --temperature=1.0
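
The README does not document how the batch-size flags relate to each other. One plausible reading, assumed here purely from the flag names, is that each synced batch of 16 samples is split into optimizer updates of 8 samples, and each update is accumulated from backward passes of 4 samples. The sketch below only illustrates that assumed nesting; `split_batches` is a hypothetical helper, not part of `rlvr`.

```python
def split_batches(items, size):
    """Yield consecutive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Values copied from the command above; their interpretation is assumed.
batch_size_sync, batch_size_update, batch_size_backward = 16, 8, 4

samples = list(range(batch_size_sync))  # stand-in for one synced batch
schedule = [
    list(split_batches(update_batch, batch_size_backward))
    for update_batch in split_batches(samples, batch_size_update)
]
# Each outer entry is one optimizer step; each inner list is one backward pass.
for step, micro_batches in enumerate(schedule):
    print(step, micro_batches)
```

Under this reading, each larger batch size should be a multiple of the next smaller one, which the values in the command above satisfy.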

Development

$ poetry run black .
$ poetry run pytest -xsvv tests

Note

  • You may need to set GLOO_SOCKET_IFNAME=lo0 to run this script on macOS.
  • This design is largely influenced by RLlib Flow.
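
For the first note, the variable has to be in the environment before any Gloo process group is created. Exporting it in the shell is the simplest route. As an alternative sketch, a launcher script could set it only on macOS; `gloo_env_overrides` is a hypothetical helper and is not part of this repository.

```python
import os
import platform


def gloo_env_overrides(system):
    """Return environment overrides for the given platform.system() value."""
    # "Darwin" is what platform.system() reports on macOS.
    return {"GLOO_SOCKET_IFNAME": "lo0"} if system == "Darwin" else {}


# Apply the overrides without clobbering a value the user already exported.
for key, value in gloo_env_overrides(platform.system()).items():
    os.environ.setdefault(key, value)
```

Using `setdefault` leaves an interface name the user has already chosen untouched.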

ToDo

  • missing eos penalty
  • bf16 (numpy does not support bf16)
  • FSDP (DDP works without GPUs, but FSDP does not)
  • vLLM integration (collective_rpc support is not yet released)
  • large scale experiments
  • tensor parallel
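
The first ToDo item, a missing-EOS penalty, usually means lowering the reward of completions that stop without emitting an end-of-sequence token, for example because they hit the length limit. Since it is not implemented here yet, the following is only a minimal sketch of the idea; the function name, the token ids, and the penalty value are assumptions made for illustration.

```python
def apply_missing_eos_penalty(rewards, completions, eos_token_id, penalty=1.0):
    """Subtract `penalty` from the reward of each completion lacking EOS."""
    return [
        reward - penalty if eos_token_id not in token_ids else reward
        for reward, token_ids in zip(rewards, completions)
    ]


# Hypothetical token ids; 2 plays the role of the EOS token here.
completions = [[5, 9, 2], [5, 9, 7]]
penalized = apply_missing_eos_penalty([1.0, 1.0], completions, eos_token_id=2)
print(penalized)
```

Only the second completion is penalized, because it never produces the EOS token.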
