Commit

Fix
zhuohan123 committed Jun 20, 2023
1 parent d5a1ee0 commit 7b069ed
Showing 6 changed files with 18 additions and 18 deletions.
10 changes: 5 additions & 5 deletions 2023/06/20/introduction/index.html
@@ -142,19 +142,19 @@ <h2 id="the-secret-sauce-pagedattention">The Secret Sauce: PagedAttention</h2>

<h2 id="the-silent-hero-behind-lmsys-vicuna-and-chatbot-arena">The Silent Hero Behind LMSYS Vicuna and Chatbot Arena</h2>

<p>This April, <a href="https://lmsys.org">LMSYS</a> developed the popular Vicuna chatbot models and made them publicly available. Since then, Vicuna has been served in <a href="https://arena.lmsys.org/">Chatbot Arena</a> for millions of users. Initially, LMSYS FastChat adopted an HF Transformers-based <a href="https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/model_worker.py">serving backend</a> to serve the chat demo. As the demo became more popular, the peak traffic ramped up several times, making the HF backend a significant bottleneck. The LMSYS and vLLM teams worked together and soon developed the FastChat-vLLM integration to use vLLM <a href="https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/vllm_worker.py">as the new backend</a> to support the growing demands (up to 7x more traffic). In an early <a href="https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/test_throughput.py">internal micro-benchmark</a> by LMSYS, the vLLM inference backend can <strong>achieve up to 30x higher throughput than an initial HF backend.</strong></p>
<p>This April, <a href="https://lmsys.org">LMSYS</a> developed the popular Vicuna chatbot models and made them publicly available. Since then, Vicuna has been served in <a href="https://arena.lmsys.org/">Chatbot Arena</a> for millions of users. Initially, LMSYS FastChat adopted an HF Transformers-based <a href="https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/model_worker.py">serving backend</a> to serve the chat demo. As the demo became more popular, the peak traffic ramped up several times, making the HF backend a significant bottleneck. The LMSYS and vLLM teams worked together and soon developed the FastChat-vLLM integration to use vLLM <a href="https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/vllm_worker.py">as the new backend</a> to support the growing demands (up to 5x more traffic). In an early <a href="https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/test_throughput.py">internal micro-benchmark</a> by LMSYS, the vLLM serving backend can <strong>achieve up to 30x higher throughput than an initial HF backend.</strong></p>

<p>Since mid-April, the most popular models, such as Vicuna, Koala, and LLaMA, have all been successfully served using the FastChat-vLLM integration. With FastChat as the chat frontend and vLLM as the inference backend, LMSYS is able to harness a limited number of university-sponsored GPU resources to serve Vicuna to millions of users with <em>high throughput</em> and <em>low latency</em>. LMSYS is expanding the use of vLLM to a wider range of models, including Databricks Dolly, LAION’s OpenAssistant, and Stability AI’s StableLM. The <a href="https://vllm.readthedocs.io/en/latest/models/supported_models.html">support for more models</a> is being developed and forthcoming.</p>
<p>Since mid-April, the most popular models, such as Vicuna, Koala, and LLaMA, have all been successfully served using the FastChat-vLLM integration. With FastChat as the multi-model chat serving frontend and vLLM as the inference backend, LMSYS is able to harness a limited number of university-sponsored GPUs to serve Vicuna to millions of users with <em>high throughput</em> and <em>low latency</em>. LMSYS is expanding the use of vLLM to a wider range of models, including Databricks Dolly, LAION’s OpenAssistant, and Stability AI’s StableLM. The <a href="https://vllm.readthedocs.io/en/latest/models/supported_models.html">support for more models</a> is being developed and forthcoming.</p>

<p align="center">
<picture>
<img src="assets/figures/lmsys_traffic.png" width="100%" />
</picture>
<br />
Chat sessions served by vLLM in the Chatbot Arena from April to May. Indeed, more than half of the chat sessions in Chatbot Arena use vLLM as the inference engine.
Requests served by the FastChat-vLLM integration in the Chatbot Arena from April to May. Indeed, more than half of the requests to Chatbot Arena use vLLM as the inference backend.
</p>

<p>This utilization of vLLM has also significantly reduced operational costs. With vLLM, LMSYS was able to cut the number of GPUs used for serving the above traffic by 50%. vLLM has been handling an average of 30K chat sessions daily and a peak of 60K, which is a clear demonstration of vLLM’s robustness.</p>
<p>This utilization of vLLM has also significantly reduced operational costs. With vLLM, LMSYS was able to cut the number of GPUs used for serving the above traffic by 50%. vLLM has been handling an average of 30K requests daily and a peak of 60K, which is a clear demonstration of vLLM’s robustness.</p>

<h2 id="get-started-with-vllm">Get started with vLLM</h2>

@@ -205,7 +205,7 @@ <h2 id="get-started-with-vllm">Get started with vLLM</h2>

<footer class="footer">
<small>
&copy; <time datetime="2023-06-21T03:02:00+08:00">2023</time>. vLLM Team. All rights reserved.
&copy; <time datetime="2023-06-21T04:02:18+08:00">2023</time>. vLLM Team. All rights reserved.
</small>
</footer>
</div>
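The hunks above touch the blog's "Get started with vLLM" section. As a rough illustration of what that section points to, here is a minimal sketch of offline generation with vLLM's Python API (the LLM and SamplingParams classes from the vllm package); the model name, prompts, and sampling settings are illustrative assumptions, not values taken from this commit.

```python
# Minimal sketch of offline batched generation with vLLM's Python API.
# The model name and sampling settings below are illustrative choices.
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The future of LLM serving is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# Load the model once; vLLM manages the KV cache with PagedAttention internally.
llm = LLM(model="lmsys/vicuna-7b-v1.3")

# Generate completions for all prompts in a single batched call.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```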
2 changes: 1 addition & 1 deletion 404.html
@@ -48,7 +48,7 @@ <h1 class="page-title">404: Page not found</h1>

<footer class="footer">
<small>
&copy; <time datetime="2023-06-21T03:02:00+08:00">2023</time>. vLLM Team. All rights reserved.
&copy; <time datetime="2023-06-21T04:02:18+08:00">2023</time>. vLLM Team. All rights reserved.
</small>
</footer>
</div>
2 changes: 1 addition & 1 deletion about/index.html
@@ -76,7 +76,7 @@ <h2 id="setup">Setup</h2>

<footer class="footer">
<small>
&copy; <time datetime="2023-06-21T03:02:00+08:00">2023</time>. vLLM Team. All rights reserved.
&copy; <time datetime="2023-06-21T04:02:18+08:00">2023</time>. vLLM Team. All rights reserved.
</small>
</footer>
</div>
2 changes: 1 addition & 1 deletion archive/index.html
@@ -55,7 +55,7 @@ <h2>June 2023</h2>

<footer class="footer">
<small>
&copy; <time datetime="2023-06-21T03:02:00+08:00">2023</time>. vLLM Team. All rights reserved.
&copy; <time datetime="2023-06-21T04:02:18+08:00">2023</time>. vLLM Team. All rights reserved.
</small>
</footer>
</div>
10 changes: 5 additions & 5 deletions atom.xml
@@ -4,7 +4,7 @@
<title></title>
<link href="/atom.xml" rel="self"/>
<link href="https://vllm.ai/"/>
<updated>2023-06-21T03:02:00+08:00</updated>
<updated>2023-06-21T04:02:18+08:00</updated>
<id>https://vllm.ai</id>
<author>
<name>vLLM Team</name>
@@ -116,19 +116,19 @@ Example generation process for a request that samples multiple outputs.

&lt;h2 id=&quot;the-silent-hero-behind-lmsys-vicuna-and-chatbot-arena&quot;&gt;The Silent Hero Behind LMSYS Vicuna and Chatbot Arena&lt;/h2&gt;

&lt;p&gt;This April, &lt;a href=&quot;https://lmsys.org&quot;&gt;LMSYS&lt;/a&gt; developed the popular Vicuna chatbot models and made them publicly available. Since then, Vicuna has been served in &lt;a href=&quot;https://arena.lmsys.org/&quot;&gt;Chatbot Arena&lt;/a&gt; for millions of users. Initially, LMSYS FastChat adopted an HF Transformers-based &lt;a href=&quot;https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/model_worker.py&quot;&gt;serving backend&lt;/a&gt; to serve the chat demo. As the demo became more popular, the peak traffic ramped up several times, making the HF backend a significant bottleneck. The LMSYS and vLLM teams worked together and soon developed the FastChat-vLLM integration to use vLLM &lt;a href=&quot;https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/vllm_worker.py&quot;&gt;as the new backend&lt;/a&gt; to support the growing demands (up to 7x more traffic). In an early &lt;a href=&quot;https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/test_throughput.py&quot;&gt;internal micro-benchmark&lt;/a&gt; by LMSYS, the vLLM inference backend can &lt;strong&gt;achieve up to 30x higher throughput than an initial HF backend.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;This April, &lt;a href=&quot;https://lmsys.org&quot;&gt;LMSYS&lt;/a&gt; developed the popular Vicuna chatbot models and made them publicly available. Since then, Vicuna has been served in &lt;a href=&quot;https://arena.lmsys.org/&quot;&gt;Chatbot Arena&lt;/a&gt; for millions of users. Initially, LMSYS FastChat adopted an HF Transformers-based &lt;a href=&quot;https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/model_worker.py&quot;&gt;serving backend&lt;/a&gt; to serve the chat demo. As the demo became more popular, the peak traffic ramped up several times, making the HF backend a significant bottleneck. The LMSYS and vLLM teams worked together and soon developed the FastChat-vLLM integration to use vLLM &lt;a href=&quot;https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/vllm_worker.py&quot;&gt;as the new backend&lt;/a&gt; to support the growing demands (up to 5x more traffic). In an early &lt;a href=&quot;https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/test_throughput.py&quot;&gt;internal micro-benchmark&lt;/a&gt; by LMSYS, the vLLM serving backend can &lt;strong&gt;achieve up to 30x higher throughput than an initial HF backend.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Since mid-April, the most popular models, such as Vicuna, Koala, and LLaMA, have all been successfully served using the FastChat-vLLM integration. With FastChat as the chat frontend and vLLM as the inference backend, LMSYS is able to harness a limited number of university-sponsored GPU resources to serve Vicuna to millions of users with &lt;em&gt;high throughput&lt;/em&gt; and &lt;em&gt;low latency&lt;/em&gt;. LMSYS is expanding the use of vLLM to a wider range of models, including Databricks Dolly, LAION’s OpenAssistant, and Stability AI’s StableLM. The &lt;a href=&quot;https://vllm.readthedocs.io/en/latest/models/supported_models.html&quot;&gt;support for more models&lt;/a&gt; is being developed and forthcoming.&lt;/p&gt;
&lt;p&gt;Since mid-April, the most popular models, such as Vicuna, Koala, and LLaMA, have all been successfully served using the FastChat-vLLM integration. With FastChat as the multi-model chat serving frontend and vLLM as the inference backend, LMSYS is able to harness a limited number of university-sponsored GPUs to serve Vicuna to millions of users with &lt;em&gt;high throughput&lt;/em&gt; and &lt;em&gt;low latency&lt;/em&gt;. LMSYS is expanding the use of vLLM to a wider range of models, including Databricks Dolly, LAION’s OpenAssistant, and Stability AI’s StableLM. The &lt;a href=&quot;https://vllm.readthedocs.io/en/latest/models/supported_models.html&quot;&gt;support for more models&lt;/a&gt; is being developed and forthcoming.&lt;/p&gt;

&lt;p align=&quot;center&quot;&gt;
&lt;picture&gt;
&lt;img src=&quot;assets/figures/lmsys_traffic.png&quot; width=&quot;100%&quot; /&gt;
&lt;/picture&gt;
&lt;br /&gt;
Chat sessions served by vLLM in the Chatbot Arena from April to May. Indeed, more than half of the chat sessions in Chatbot Arena use vLLM as the inference engine.
Requests served by the FastChat-vLLM integration in the Chatbot Arena from April to May. Indeed, more than half of the requests to Chatbot Arena use vLLM as the inference backend.
&lt;/p&gt;

&lt;p&gt;This utilization of vLLM has also significantly reduced operational costs. With vLLM, LMSYS was able to cut the number of GPUs used for serving the above traffic by 50%. vLLM has been handling an average of 30K chat sessions daily and a peak of 60K, which is a clear demonstration of vLLM’s robustness.&lt;/p&gt;
&lt;p&gt;This utilization of vLLM has also significantly reduced operational costs. With vLLM, LMSYS was able to cut the number of GPUs used for serving the above traffic by 50%. vLLM has been handling an average of 30K requests daily and a peak of 60K, which is a clear demonstration of vLLM’s robustness.&lt;/p&gt;

&lt;h2 id=&quot;get-started-with-vllm&quot;&gt;Get started with vLLM&lt;/h2&gt;

10 changes: 5 additions & 5 deletions index.html
@@ -140,19 +140,19 @@ <h2 id="the-secret-sauce-pagedattention">The Secret Sauce: PagedAttention</h2>

<h2 id="the-silent-hero-behind-lmsys-vicuna-and-chatbot-arena">The Silent Hero Behind LMSYS Vicuna and Chatbot Arena</h2>

<p>This April, <a href="https://lmsys.org">LMSYS</a> developed the popular Vicuna chatbot models and made them publicly available. Since then, Vicuna has been served in <a href="https://arena.lmsys.org/">Chatbot Arena</a> for millions of users. Initially, LMSYS FastChat adopted an HF Transformers-based <a href="https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/model_worker.py">serving backend</a> to serve the chat demo. As the demo became more popular, the peak traffic ramped up several times, making the HF backend a significant bottleneck. The LMSYS and vLLM teams worked together and soon developed the FastChat-vLLM integration to use vLLM <a href="https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/vllm_worker.py">as the new backend</a> to support the growing demands (up to 7x more traffic). In an early <a href="https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/test_throughput.py">internal micro-benchmark</a> by LMSYS, the vLLM inference backend can <strong>achieve up to 30x higher throughput than an initial HF backend.</strong></p>
<p>This April, <a href="https://lmsys.org">LMSYS</a> developed the popular Vicuna chatbot models and made them publicly available. Since then, Vicuna has been served in <a href="https://arena.lmsys.org/">Chatbot Arena</a> for millions of users. Initially, LMSYS FastChat adopted an HF Transformers-based <a href="https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/model_worker.py">serving backend</a> to serve the chat demo. As the demo became more popular, the peak traffic ramped up several times, making the HF backend a significant bottleneck. The LMSYS and vLLM teams worked together and soon developed the FastChat-vLLM integration to use vLLM <a href="https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/vllm_worker.py">as the new backend</a> to support the growing demands (up to 5x more traffic). In an early <a href="https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/test_throughput.py">internal micro-benchmark</a> by LMSYS, the vLLM serving backend can <strong>achieve up to 30x higher throughput than an initial HF backend.</strong></p>

<p>Since mid-April, the most popular models, such as Vicuna, Koala, and LLaMA, have all been successfully served using the FastChat-vLLM integration. With FastChat as the chat frontend and vLLM as the inference backend, LMSYS is able to harness a limited number of university-sponsored GPU resources to serve Vicuna to millions of users with <em>high throughput</em> and <em>low latency</em>. LMSYS is expanding the use of vLLM to a wider range of models, including Databricks Dolly, LAION’s OpenAssistant, and Stability AI’s StableLM. The <a href="https://vllm.readthedocs.io/en/latest/models/supported_models.html">support for more models</a> is being developed and forthcoming.</p>
<p>Since mid-April, the most popular models, such as Vicuna, Koala, and LLaMA, have all been successfully served using the FastChat-vLLM integration. With FastChat as the multi-model chat serving frontend and vLLM as the inference backend, LMSYS is able to harness a limited number of university-sponsored GPUs to serve Vicuna to millions of users with <em>high throughput</em> and <em>low latency</em>. LMSYS is expanding the use of vLLM to a wider range of models, including Databricks Dolly, LAION’s OpenAssistant, and Stability AI’s StableLM. The <a href="https://vllm.readthedocs.io/en/latest/models/supported_models.html">support for more models</a> is being developed and forthcoming.</p>

<p align="center">
<picture>
<img src="assets/figures/lmsys_traffic.png" width="100%" />
</picture>
<br />
Chat sessions served by vLLM in the Chatbot Arena from April to May. Indeed, more than half of the chat sessions in Chatbot Arena use vLLM as the inference engine.
Requests served by the FastChat-vLLM integration in the Chatbot Arena from April to May. Indeed, more than half of the requests to Chatbot Arena use vLLM as the inference backend.
</p>

<p>This utilization of vLLM has also significantly reduced operational costs. With vLLM, LMSYS was able to cut the number of GPUs used for serving the above traffic by 50%. vLLM has been handling an average of 30K chat sessions daily and a peak of 60K, which is a clear demonstration of vLLM’s robustness.</p>
<p>This utilization of vLLM has also significantly reduced operational costs. With vLLM, LMSYS was able to cut the number of GPUs used for serving the above traffic by 50%. vLLM has been handling an average of 30K requests daily and a peak of 60K, which is a clear demonstration of vLLM’s robustness.</p>

<h2 id="get-started-with-vllm">Get started with vLLM</h2>

@@ -211,7 +211,7 @@ <h2 id="get-started-with-vllm">Get started with vLLM</h2>

<footer class="footer">
<small>
&copy; <time datetime="2023-06-21T03:02:00+08:00">2023</time>. vLLM Team. All rights reserved.
&copy; <time datetime="2023-06-21T04:02:18+08:00">2023</time>. vLLM Team. All rights reserved.
</small>
</footer>
</div>
