
Commit 8e85faf

committed
adding L27 slide deck for bonus session
1 parent 174b707 commit 8e85faf

File tree

2 files changed: +10 -14 lines changed


Lectures/W15-KVcahe-WMDP-Tools.pdf

12.4 MB
Binary file not shown.

_contents/S0-L27.md

Lines changed: 10 additions & 14 deletions
@@ -1,7 +1,7 @@
 ---
 layout: post
 title: Bonus session on KV Cache, Tooling and WMDP
-lecture:
+lecture: W15-KVcahe-WMDP-Tools
 lectureVersion: current
 extraContent:
 tags:
@@ -17,23 +17,11 @@ categories:

 ### KV Caching in LLM:

-+ Retentive Network: A Successor to Transformer for Large Language Models: https://arxiv.org/abs/2307.08621
-
-+ https://arxiv.org/abs/2305.13048 RWKV: Reinventing RNNs for the Transformer Era
-
 + grouped query attention: https://arxiv.org/pdf/2305.13245.pdf
 + Paged attention https://arxiv.org/pdf/2309.06180.pdf
 https://openreview.net/pdf?id=uNrFpDPMyo


-### Retentive Network: A Successor to Transformer for Large Language Models
-+ In this work, we propose Retentive Network (RetNet) as a foundation architecture for large language models, simultaneously achieving training parallelism, low-cost inference, and good performance. We theoretically derive the connection between recurrence and attention. Then we propose the retention mechanism for sequence modeling, which supports three computation paradigms, i.e., parallel, recurrent, and chunkwise recurrent. Specifically, the parallel representation allows for training parallelism. The recurrent representation enables low-cost $O(1)$ inference, which improves decoding throughput, latency, and GPU memory without sacrificing performance. The chunkwise recurrent representation…
-
-
-### RWKV: Reinventing RNNs for the Transformer Era
-+ Transformers have revolutionized almost all natural language processing (NLP) tasks but suffer from memory and computational complexity that scales quadratically with sequence length. In contrast, recurrent neural networks (RNNs) exhibit linear scaling in memory and computational requirements but struggle to match the same performance as Transformers due to limitations in parallelization and scalability. We propose a novel model architecture, Receptance Weighted Key Value (RWKV), that combines the efficient parallelizable training of transformers with the efficient inference of RNNs.
-Our approach leverages a linear attention mechanism and allows us to formulate the model as either a Transfor…
-

 ### The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning
 + Nathaniel Li, Alexander Pan, Anjali Gopal, Summer Yue, Daniel Berrios, Alice Gatti, Justin D. Li, Ann-Kathrin Dombrowski, Shashwat Goel, Long Phan, Gabriel Mukobi, Nathan Helm-Burger, Rassin Lababidi, Lennart Justen, Andrew B. Liu, Michael Chen, Isabelle Barrass, Oliver Zhang, Xiaoyuan Zhu, Rishub Tamirisa, Bhrugu Bharathi, Adam Khoja, Zhenqi Zhao, Ariel Herbert-Voss, Cort B. Breuer, Andy Zou, Mantas Mazeika, Zifan Wang, Palash Oswal, Weiran Liu, Adam A. Hunt, Justin Tienken-Harder, Kevin Y. Shih, Kemper Talley, John Guan, Russell Kaplan, Ian Steneker, David Campbell, Brad Jokubaitis, Alex Levinson, Jean Wang, William Qian, Kallol Krishna Karmakar, Steven Basart, Stephen Fitz, Mindy Levine, Ponnurangam Kumaraguru, Uday Tupakula, Vijay Varadharajan, Yan Shoshitaishvili, Jimmy Ba, Kevin M. Esvelt, Alexandr Wang, Dan Hendrycks
@@ -72,7 +60,7 @@ Our approach leverages a linear attention mechanism and allows us to formulate t



-## More readings
+## More readings

 ### Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond
 + Jingfeng Yang, Hongye Jin, Ruixiang Tang, Xiaotian Han, Qizhang Feng, Haoming Jiang, Bing Yin, Xia Hu
@@ -81,6 +69,14 @@ Our approach leverages a linear attention mechanism and allows us to formulate t
 + https://github.com/Mooler0410/LLMsPracticalGuide


+### Retentive Network: A Successor to Transformer for Large Language Models
++ In this work, we propose Retentive Network (RetNet) as a foundation architecture for large language models, simultaneously achieving training parallelism, low-cost inference, and good performance. We theoretically derive the connection between recurrence and attention. Then we propose the retention mechanism for sequence modeling, which supports three computation paradigms, i.e., parallel, recurrent, and chunkwise recurrent. Specifically, the parallel representation allows for training parallelism. The recurrent representation enables low-cost $O(1)$ inference, which improves decoding throughput, latency, and GPU memory without sacrificing performance. The chunkwise recurrent representation…
+
+
+### RWKV: Reinventing RNNs for the Transformer Era
++ Transformers have revolutionized almost all natural language processing (NLP) tasks but suffer from memory and computational complexity that scales quadratically with sequence length. In contrast, recurrent neural networks (RNNs) exhibit linear scaling in memory and computational requirements but struggle to match the same performance as Transformers due to limitations in parallelization and scalability. We propose a novel model architecture, Receptance Weighted Key Value (RWKV), that combines the efficient parallelizable training of transformers with the efficient inference of RNNs.
+Our approach leverages a linear attention mechanism and allows us to formulate the model as either a Transfor…
+

 <!--excerpt.start-->

0 commit comments
