Specscheduler evaluation support code #1541

goliaro · 2024-11-15T15:59:01Z

Description of changes:

This PR does the following:

LLAMA 3 speculation support:
- Add support for LLAMA 3.1 and 3.2
- Benchmark performance of LLAMA-3.1-70B with small models: Zhuominc/Llama-3-330M, meta-llama/Llama-3.2-1B-Instruct, meta-llama/Llama-3.2-3B-Instruct, meta-llama/Llama-3.1-8B-Instruct (tl;dr meta-llama/Llama-3.2-1B-Instruct is the best)
- Add support for serving SSMs with TP_degree > 1
Make evaluation easier/faster to run:
- Add code to load all the weights in parallel, fixing the context issue discussed with the Legion team here
- Record a memory usage breakdown when passing --log-instance-creation. Add a script to debug insufficient-memory issues by device and task. See here.
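
The parallel weight loading idea can be illustrated with a minimal Python sketch. This is not the code in this PR (which works inside the FlexFlow/Legion runtime); it is a generic thread-pool version, and `load_all_weights`, `load_one`, and `max_workers` are hypothetical names chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def load_all_weights(names, load_one, max_workers=8):
    """Load every named weight concurrently and return {name: weight}.

    `load_one` is a hypothetical callable that reads one weight by name.
    """
    names = list(names)  # materialize so the names can be iterated twice
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, so zip pairs them correctly.
        return dict(zip(names, pool.map(load_one, names)))
```

Threads suit this sketch because weight loading is typically I/O bound; a loader that does heavy CPU work per tensor would need a different strategy.
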
Bug fixes:
- Fix the all-reduce deadlock by adding Legion barriers
- Detect EOS tokens produced in the middle of speculation (instead of only at the end), and stop early when the EOS token appears in the middle of the verified sequence. Previously, generation could continue until the max sequence length.
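
The EOS fix can be sketched as follows. This is an illustrative Python version under assumed names (`truncate_at_eos`, `verified_tokens`, `eos_token_ids`), not the request manager code changed in this PR: after verification, the accepted tokens are scanned from the start, and the sequence is cut at the first EOS token rather than only checking the last token.

```python
def truncate_at_eos(verified_tokens, eos_token_ids):
    """Cut a verified token sequence at the first EOS token.

    Returns (kept_tokens, finished). `finished` is True when an EOS token
    was found anywhere in the sequence, not only in the last position.
    """
    for i, tok in enumerate(verified_tokens):
        if tok in eos_token_ids:
            # Keep the EOS token itself and drop everything after it.
            return verified_tokens[: i + 1], True
    return verified_tokens, False
```

For example, with EOS id 2, the verified sequence `[5, 7, 2, 9, 9]` is cut to `[5, 7, 2]` and the request is marked finished.
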
Benchmarking:
- Add code to benchmark speculation accuracy and end-to-end performance for SpecInfer and incremental decoding with various SSMs and arrival rates.
- Plots available below:
  ttft_vs_arrival_rate.pdf
  queueing_time_vs_arrival_rate.pdf
  throughput_vs_tpot.pdf
  average_accepted_tokens.pdf
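
For readers unfamiliar with the plotted metrics, here is a minimal sketch of how they are commonly computed from per-request timestamps. The definitions below are the usual ones and are an assumption; the benchmarking scripts in this PR may define them differently. `RequestTrace`, its fields, and `summarize` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class RequestTrace:
    """Hypothetical per-request record; all times are in seconds."""
    arrival: float        # request enters the system
    scheduled: float      # request is first picked up for processing
    first_token: float    # first output token is produced
    finished: float       # last output token is produced
    output_tokens: int    # total number of generated tokens
    accepted_per_step: list = field(default_factory=list)  # accepted draft tokens per verification step


def summarize(trace):
    """Compute queueing time, TTFT, TPOT, and average accepted tokens."""
    # TPOT excludes the first token, whose latency is covered by TTFT.
    decode_tokens = max(trace.output_tokens - 1, 1)
    steps = trace.accepted_per_step
    return {
        "queueing_time": trace.scheduled - trace.arrival,
        "ttft": trace.first_token - trace.arrival,
        "tpot": (trace.finished - trace.first_token) / decode_tokens,
        "avg_accepted_tokens": sum(steps) / len(steps) if steps else 0.0,
    }
```

Under these definitions, TTFT includes queueing time, so both grow with the arrival rate once the system saturates.
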

Related Issues:

Linked Issues:

Issue #

Issues closed by this PR:

Closes #


sfc-gh-goliaro and others added 30 commits October 23, 2024 03:53

add suffix_decoding_code

f8e5352

less printf

51fee57

mistral, new sd script

53f5d14

update

7d6fb9c

fix: interleaving acc rate

f34eee2

update

78b0e0c

.

3026632

fix: minor

cde2fd1

init new suffix tree integration

3d82385

update

0a00694

fix

14dc4ff

add sd impl

eb07c36

update

07c0544

backup

da2fb99

backup

0652853

finish implementing new suffix decoder

edcc3ac

update

bf37429

update

89d8cf5

fix

d3f22b3

fix

727f363

backup

0da0aab

fix

0a1f543

update

52f1b12

metrics

3a396dd

update

9a44da3

add script

272d2f6

update

0b73fdd

update

dd578df

update

787982b

update

991d0cd

goliaro and others added 24 commits October 29, 2024 09:38

update

c052e06

update

84836b3

add script to benchmark incr dec

3a1b607

update

8bb2841

fix

563277f

update

04a4dc8

load weights in parallel

f5cce91

cleanup

0988bb4

memory debugging

80ffefc

fixup

bf48eec

support tp for draft model

30b3ac7

time to first token and queueing time

043cd40

warmup

66cb2a4

update incr dec script

e012e1e

update

366e7db

fix

58e0061

update

d1bbf1f

update

5b24e2d

add plot results

c94df85

add comments

4d9d4d3

Merge branch 'specscheduler' into specscheduler-evals

0ce2f14

update plots

f2937a6

remove suffix decoding stuff

333a795

fix

f0d8d69

goliaro marked this pull request as ready for review November 15, 2024 17:09

goliaro added 2 commits November 15, 2024 17:14

fix

3246967

fix

3a330ee

goliaro changed the base branch from specscheduler to specscheduler_eval November 15, 2024 17:16

goliaro merged commit b798385 into specscheduler_eval Nov 15, 2024
29 of 39 checks passed
