
# Lilith

Using the Lilith optimizer on nanoGPT, experimenting with learning rates and multiple schedulers.

DeepSeek step-based implementation -> link
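For context, the step schedule from the DeepSeek LLM paper holds the peak lr for most of training and then drops it in two discrete stages (to roughly 31.6% and then 10% of peak). Below is a minimal sketch of how the 8:1:1 / 2:4:4 partitions used in these tests could be wired up with PyTorch's `MultiStepLR`; `max_iters`, the stand-in model, and the decay factor are placeholder assumptions, not this repo's actual config:

```python
import torch
from torch.optim.lr_scheduler import MultiStepLR

# Sketch: DeepSeek-style multi-step LR. Stage boundaries are fractions of the
# total step count; 8:1:1 means the lr stays at its peak for the first 80% of
# steps, then drops at 80% and again at 90%. A per-milestone gamma of 0.316
# (~1/sqrt(10)) approximates the paper's ~31.6% -> ~10% decay.
max_iters = 5000                       # placeholder, not the actual run length
partitions = (0.8, 0.1, 0.1)           # swap in (0.2, 0.4, 0.4) for 2:4:4
milestones = [
    int(max_iters * partitions[0]),
    int(max_iters * (partitions[0] + partitions[1])),
]

model = torch.nn.Linear(10, 10)        # stand-in for the nanoGPT model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scheduler = MultiStepLR(optimizer, milestones=milestones, gamma=0.316)

for step in range(max_iters):
    optimizer.step()
    scheduler.step()
```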

## Running tests

### New Lilith versions

- Test 26: setting dropout to a low value (~0.01) is beneficial, but the loss descends linearly compared to the smooth Adam curves (config sketch after the screenshot below).

*(screenshot: 2024-02-22 12:12 PM)*
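For reference, dropout in nanoGPT is a single config field; the change behind this test would look roughly like this (a sketch against nanoGPT's `GPTConfig`, assuming the stock field names, with otherwise ordinary GPT-2-small-ish values rather than the exact run config):

```python
from model import GPTConfig, GPT  # nanoGPT's model.py

# Sketch: the only change for Test 26 is the low dropout value (~0.01).
config = GPTConfig(
    n_layer=12, n_head=12, n_embd=768,
    block_size=1024,
    dropout=0.01,  # low but non-zero, per the note above
)
model = GPT(config)
```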

- Test 25: DeepSeek scheduler with 2:4:4 and 8:1:1 partitions, at acc=1000. Also, these runs now finishing in about 1/4 the time is really useful!

*(screenshot: 2024-02-21 5:44 PM)*

- Test 24: graphs for acc=10, acc=50, and acc=1000. Acceleration helps boost training a little early on: slightly better curves and slightly lower loss, so it might be worth blowing this value up for a +1% boost. Here, Lilith runs with acceleration and batch size 48 come really close to an AdamW run with batch size 180; Lilith's train time was almost 5x faster (70ms per step vs 300ms per step), with 4x less memory for batches.

*(screenshot: 2024-02-21 4:22 PM)*

- Test 23: acceleration at 4 matches acc=2, and these values look like they match larger batch sizes. This optimizer is fire.

*(screenshot: 2024-02-21 2:31 PM)*

- Test 22: matched beta1_m to Adam's beta1 and beta_v near Adam's beta2, also trying acceleration set to 2. In the graph we have an overfit AdamW vs Lilith with acceleration=2, following the same path with less overfitting (constructor sketch below).

*(screenshot: 2024-02-21 2:07 PM)*
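A sketch of what constructing the optimizer for the Test 22 setup might look like. The keyword names (`beta1_m`, `beta_v`, `acceleration`) are taken from these notes, but the actual Lilith signature may differ, so treat the whole call as an assumption and check the linked implementation:

```python
import torch
from lilith import Lilith  # hypothetical import path for euclaise's Lilith

model = torch.nn.Linear(10, 10)  # stand-in for the nanoGPT model

# Sketch only: keyword names come from the notes above, not a verified API.
optimizer = Lilith(
    model.parameters(),
    lr=3e-4,
    beta1_m=0.9,     # matched to AdamW's beta1
    beta_v=0.95,     # near AdamW's beta2 (Karpathy's nanoGPT uses 0.95)
    acceleration=2,  # the value compared against the overfit AdamW run
)
```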

- Test 21: bs=600. My setup can't go past batch sizes of 600+ without OOMs. Lilith is about 10% faster here; interestingly, faster and equal or better?!

*(screenshot: 2024-02-21 10:46 AM)*

- Test 20: Adam can match Lilith at bs=180; testing bs=360 (yellow and orange).

*(screenshots: 2024-02-20 11:08 PM, 2024-02-20 11:43 PM)*

- Test 19: scaling batch size to 360 appears to have a similar effect so far, but better, which explains euclaise's tests; his bs was 1024.

*(screenshot: 2024-02-20 1:33 PM)*

- Test 18: scaling batch size to 180, lr 3e-4, cosine schedule. SOTA result by a margin; beats Adam?! It shows the same behaviour as AdamW on large batches, but better. Could this be the large-scale training optimizer?

*(screenshot: 2024-02-20 12:11 PM)*

- Test 17: using the DeepSeek step schedule again; first graph 2:4:4, second graph 8:1:1. 8:1:1 is a really successful schedule, achieving the same val loss as cosine AdamW.

*(screenshots: 2024-02-20 11:18 AM, 2024-02-20 11:28 AM)*

- Test 16: brand new version. The graphs were lost to corruption, but the new good lr is 3e-4.

- Test 15: trying the DeepSeek-based lr steps once again, 2:4:4 (first graph, lr 1e-4 due to numerical instability) and 8:1:1 (second graph, lr 8e-5). The first step change in 2:4:4 worked but it flatlined afterwards, so some progress on that end; the run with the DeepSeek partition values was much, much better, almost cosine-like.

*(screenshot: 2024-02-19 7:45 PM)*

*(screenshot: 2024-02-19 7:54 PM)*

- Test 14: set beta1 and beta2 to 0.95 and 0.98; slightly worse. A trial of 0.98 and 0.999999 was even worse, but good tuning might still give a +1% boost.

*(screenshot: 2024-02-19 6:55 PM)*

*(screenshot: 2024-02-19 7:16 PM)*

- Test 13: lr 8e-5 (initially 5e-5, but that was too low and barely moved the loss). 8e-5 appears to be an even better initial sweet spot than 1e-4, though it starts converging early.

*(graph)*

- Test 12: the same as Test 9, but testing batch_size and lowering iters for efficiency. Slightly above the SOTA run, but that's expected from larger batches: it trains on 1.2x more tokens than before in 1/3 the time (see the arithmetic sketch below). Lilith is scalable, just like AdamW.

*(graph)*
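A quick back-of-the-envelope for the "1.2x more tokens in 1/3 the time" claim: tokens seen scale with `iters * batch_size * block_size`, so fewer iterations at a larger batch can still cover more data. The numbers below are illustrative placeholders, not the exact run configs:

```python
# Tokens seen = iters * batch_size * block_size (times grad-accum steps,
# if any). Illustrative numbers only -- not the actual run configs.
block_size = 1024

baseline = 5000 * 12 * block_size      # more iters, small batch
larger_batch = 1667 * 43 * block_size  # ~1/3 the iters, bigger batch

print(larger_batch / baseline)         # ~1.19, i.e. roughly 1.2x the tokens
```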

- Test 11: changed ema_k from 0 to 1 for better numerical stability, using a cosine lr schedule, lr=1e-3.

- Note: numerical stability holds (no NaNs), but the loss is very volatile, literally unlearning.

- Test 10: using a triangular lr schedule; it literally doesn't want to work, just like the previous TLR spike. Going to stick with multistep or cosine (a triangular-schedule sketch follows below).

*(graph)*
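For reference, one way to get a triangular schedule in PyTorch is `CyclicLR` in `triangular` mode; whether this matches the TLR used in these tests is an assumption, and the bounds and cycle length are placeholders:

```python
import torch
from torch.optim.lr_scheduler import CyclicLR

# Sketch: triangular lr that ramps linearly up to max_lr and back down.
model = torch.nn.Linear(10, 10)  # stand-in for the nanoGPT model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scheduler = CyclicLR(
    optimizer,
    base_lr=1e-5, max_lr=1e-4,   # placeholder bounds
    step_size_up=500,            # placeholder half-cycle length
    mode="triangular",
    cycle_momentum=False,        # required for Adam-family optimizers
)

for step in range(2000):
    optimizer.step()
    scheduler.step()
```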

- Test 9: the orange curve is the new Lilith, lr=1e-4, cosine scheduler (sketch below). It literally matches AdamW for a while before flattening earlier, but the val losses match at ~1.47, so maybe it's just not as prone to overfitting?

*(graph)*
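The cosine schedule used throughout these runs can be written the way nanoGPT's `get_lr()` does it: a linear warmup into a cosine decay down to a floor. A minimal sketch, with `warmup_iters`, `lr_decay_iters`, and `min_lr` as placeholder values:

```python
import math

# Sketch of cosine LR with warmup, in the style of nanoGPT's get_lr().
learning_rate = 1e-4
min_lr = learning_rate / 10   # decay floor (placeholder)
warmup_iters = 100            # placeholder
lr_decay_iters = 5000         # placeholder

def get_lr(it: int) -> float:
    if it < warmup_iters:                 # linear warmup
        return learning_rate * it / warmup_iters
    if it > lr_decay_iters:               # past decay: hold at the floor
        return min_lr
    ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * ratio))  # goes 1 -> 0
    return min_lr + coeff * (learning_rate - min_lr)
```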

### Old Lilith versions

- Test 1: Lilith default params with cosine LR; AdamW with Karpathy's params, also cosine LR.

*(graph)*

- Test 2: Lilith with slight LR changes (lr 1e-2), using TLR; AdamW with Karpathy's params, cosine LR.

*(graph)*

- Test 3: Lilith lr 3e-4, using cosine lr; AdamW the same as before.

*(graph)*

- Test 4: current Lilith in blue, lr 1e-4, cosine lr.

*(graph)*

- Test 5: current Lilith in green, lr 5e-5, cosine lr. Too low; the model can't seem to get as low as AdamW.
- Further tests: try to reintroduce TLR, then try a DeepSeek-style stepwise lr.

*(graph)*

- Test 6: TLR reintroduction (pink) vs SOTA Lilith (blue) and AdamW (red), lr 1e-4. Didn't go well; TLR is too unstable. Will try the DeepSeek step-based lr later.

*(graph)*

- Test 7: using the DeepSeek-based lr (yellow), lr 1e-4, 20%/40%/40% partitions. Didn't do anything, but that may just be my unfamiliarity with the step-based version.

*(graph)*

- Test 8: using the same step partitions as the DeepSeek paper (teal line), lr 1e-4, 80%/10%/10% partitions. I need to fix it: the lr freaks out and goes to zero. But this optimizer doesn't seem to like the scheduler at all either; literally no change/drop in all cases.

*(graph)*