A simple toy implementation of GRPO (Group Relative Policy Optimization) with LLMs.
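For context, the core of GRPO is to sample a group of completions per prompt and score each completion's reward relative to the group's mean and standard deviation, which removes the need for a learned value baseline. Below is a minimal NumPy sketch of that advantage computation (illustrative only; the function name and `eps` are assumptions, not this repo's API):

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-4) -> np.ndarray:
    """Group-relative advantages for one prompt's group of completions.

    rewards: shape (num_generations,), one scalar reward per completion
    sampled from the same prompt.
    """
    # Normalize against the group; eps avoids division by zero when all
    # completions in the group receive the same reward.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: four completions of one prompt scored 0/1 by a verifier.
print(grpo_advantages(np.array([1.0, 0.0, 0.0, 1.0])))  # ~[1, -1, -1, 1]
```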
```sh
$ git clone https://github.com/Ktakuya332C/simple-rlvr.git
$ cd simple-rlvr
$ poetry install
```
```sh
$ # The command below will run, but scores won't go up because the scale is too small
$ poetry run python -m rlvr.main \
    --model=sbintuitions/tiny-lm-chat \
    --num-rollout-workers=2 \
    --num-reference-workers=2 \
    --num-grpo-learners=2 \
    --batch-size-sync=16 \
    --batch-size-update=8 \
    --batch-size-backward=4 \
    --batch-size-rollout=2 \
    --batch-size-reference=2 \
    --num-generations=2 \
    --max-length=512 \
    --temperature=1.0
```
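One plausible reading of the batch-size flags is that they nest, from the number of samples gathered between weight syncs down to the per-backward-pass micro-batch used for gradient accumulation. The sketch below only illustrates that guessed nesting; the real semantics live in `rlvr.main`, so treat every comment here as an assumption:

```python
# Hypothetical interpretation of the flags above, not this repo's code.
BATCH_SIZE_SYNC = 16      # samples collected between weight syncs to rollout workers?
BATCH_SIZE_UPDATE = 8     # samples per optimizer step?
BATCH_SIZE_BACKWARD = 4   # samples per backward pass (gradient accumulation)?
NUM_GENERATIONS = 2       # completions per prompt, i.e. the GRPO group size

# If the nesting holds, each level should divide the one above it.
assert BATCH_SIZE_SYNC % BATCH_SIZE_UPDATE == 0
assert BATCH_SIZE_UPDATE % BATCH_SIZE_BACKWARD == 0
assert BATCH_SIZE_BACKWARD % NUM_GENERATIONS == 0
```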
```sh
$ poetry run black .
$ poetry run pytest -xsvv tests
```
- You may need to set `GLOO_SOCKET_IFNAME=lo0` to run this script on a Mac.
- This design is largely influenced by RLlib Flow.
Not supported yet:

- missing-EOS penalty (a possible shape is sketched below)
- bf16 (NumPy does not support bf16)
- FSDP (DDP works without GPUs, but FSDP does not)
- vLLM integration (`collective_rpc` support is not yet released)
- large-scale experiments
- tensor parallelism
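On the missing-EOS penalty item: the usual idea is to penalize completions that hit `--max-length` without ever emitting the EOS token, so the policy is not rewarded for truncated, unterminated text. A minimal sketch of one way to fold this into the rewards (the names, token id, and penalty value are all assumptions, not this repo's API):

```python
import numpy as np

EOS_TOKEN_ID = 2           # assumption: depends on the tokenizer
MISSING_EOS_PENALTY = 1.0  # assumption: a tunable hyperparameter

def apply_missing_eos_penalty(
    rewards: np.ndarray, token_ids: list[list[int]]
) -> np.ndarray:
    """Subtract a penalty from completions that never emitted EOS.

    rewards: shape (num_completions,), verifier rewards.
    token_ids: generated token ids for each completion.
    """
    penalized = rewards.copy()
    for i, ids in enumerate(token_ids):
        if EOS_TOKEN_ID not in ids:  # truncated at max length, no EOS
            penalized[i] -= MISSING_EOS_PENALTY
    return penalized

# Example: the second completion never produced EOS and gets penalized.
print(apply_missing_eos_penalty(np.array([1.0, 1.0]), [[5, 7, 2], [5, 7, 9]]))
```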