
2022 (50 papers)

  1. Language Models (Mostly) Know What They Know, Saurav Kadavath,Tom Conerly,Amanda Askell,Tom Henighan,Dawn Drain,Ethan Perez,Nicholas Schiefer,Zac Hatfield-Dodds,Nova DasSarma,Eli Tran-Johnson,Scott Johnston,Sheer El-Showk,Andy Jones,Nelson Elhage,Tristan Hume,Anna Chen,Yuntao Bai,Sam Bowman,Stanislav Fort,Deep Ganguli,Danny Hernandez,Josh Jacobson,Jackson Kernion,Shauna Kravec,Liane Lovitt,Kamal Ndousse,Catherine Olsson,Sam Ringer,Dario Amodei,Tom Brown,Jack Clark,Nicholas Joseph,Ben Mann,Sam McCandlish,Chris Olah,Jared Kaplan, 11-07-2022

    Categories

    Computation and Language, Artificial Intelligence, Machine Learning

    Abstract

    We study whether language models can evaluate the validity of their own claims and predict which questions they will be able to answer correctly. We first show that larger models are well-calibrated on diverse multiple choice and true/false questions when they are provided in the right format. Thus we can approach self-evaluation on open-ended sampling tasks by asking models to first propose answers, and then to evaluate the probability "P(True)" that their answers are correct. We find encouraging performance, calibration, and scaling for P(True) on a diverse array of tasks. Performance at self-evaluation further improves when we allow models to consider many of their own samples before predicting the validity of one specific possibility. Next, we investigate whether models can be trained to predict "P(IK)", the probability that "I know" the answer to a question, without reference to any particular proposed answer. Models perform well at predicting P(IK) and partially generalize across tasks, though they struggle with calibration of P(IK) on new tasks. The predicted P(IK) probabilities also increase appropriately in the presence of relevant source materials in the context, and in the presence of hints towards the solution of mathematical word problems. We hope these observations lay the groundwork for training more honest models, and for investigating how honesty generalizes to cases where models are trained on objectives other than the imitation of human writing.

    Bullet Points

    • Studies whether language models can evaluate the validity of their own claims and predict which questions they will answer correctly

    • Larger models are well-calibrated on diverse multiple choice and true/false questions when provided in the right format

    • On open-ended sampling tasks, models can self-evaluate by first proposing answers and then estimating the probability P(True) that those answers are correct, with encouraging performance, calibration, and scaling

    • Models can be trained to predict "P(IK)", the probability that "I know" the answer to a question, without reference to any particular proposed answer

    • Predicted P(IK) probabilities increase appropriately in the presence of relevant source materials in the context and of hints towards the solution of mathematical word problems

    • These observations lay the groundwork for training more honest models and for investigating how honesty generalizes when models are trained on objectives other than the imitation of human writing (a minimal sketch of the P(True) recipe follows)
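
    The P(True) recipe described above fits in a few lines. Below is a minimal, hypothetical Python sketch, not the paper's code: `complete` and `token_probability` stand in for whatever LM sampling and next-token-probability APIs are available.

    ```python
    # Sketch of P(True) self-evaluation, assuming two hypothetical helpers:
    # complete(prompt) -> str samples a continuation, and
    # token_probability(prompt, token) -> float returns the model's
    # probability of emitting `token` next.

    def p_true(question, complete, token_probability):
        # Step 1: the model proposes an answer.
        answer = complete(f"Question: {question}\nAnswer:")

        # Step 2: the model grades its own proposal; P(True) is the
        # probability mass placed on the "True" option.
        grading_prompt = (
            f"Question: {question}\n"
            f"Proposed Answer: {answer}\n"
            "Is the proposed answer true or false?\n"
            "The proposed answer is:"
        )
        return answer, token_probability(grading_prompt, " True")
    ```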

  2. Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents, Wenlong Huang,Pieter Abbeel,Deepak Pathak,Igor Mordatch, 18-01-2022

    Categories

    Machine Learning, Artificial Intelligence, Computation and Language, Computer Vision, Robotics

    Abstract

  3. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, Jason Wei,Xuezhi Wang,Dale Schuurmans,Maarten Bosma,Brian Ichter,Fei Xia,Ed Chi,Quoc Le,Denny Zhou, 28-01-2022

    Categories

    Computation and Language, Artificial Intelligence

    Abstract

    We explore how generating a chain of thought -- a series of intermediate reasoning steps -- significantly improves the ability of large language models to perform complex reasoning. In particular, we show how such reasoning abilities emerge naturally in sufficiently large language models via a simple method called chain of thought prompting, where a few chain of thought demonstrations are provided as exemplars in prompting. Experiments on three large language models show that chain of thought prompting improves performance on a range of arithmetic, commonsense, and symbolic reasoning tasks. The empirical gains can be striking. For instance, prompting a 540B-parameter language model with just eight chain of thought exemplars achieves state of the art accuracy on the GSM8K benchmark of math word problems, surpassing even finetuned GPT-3 with a verifier.

    Bullet Points

    • Generating a chain of thought improves the ability of large language models to perform complex reasoning by providing a series of intermediate reasoning steps as exemplars in prompting

    • This method improves performance on a range of arithmetic, commonsense, and symbolic reasoning tasks

    • Experiments on three large language models show that chain-of-thought prompting achieves state-of-the-art accuracy on the GSM8K benchmark of math word problems, surpassing even finetuned GPT-3 with a verifier (a prompt-construction sketch follows)
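
    A chain-of-thought prompt is just a few worked examples whose answers spell out intermediate steps. A minimal sketch with one illustrative exemplar (the paper uses eight for GSM8K):

    ```python
    # One worked exemplar whose answer shows intermediate reasoning steps.
    COT_EXEMPLAR = (
        "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
        "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
        "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is "
        "6 tennis balls. 5 + 6 = 11. The answer is 11.\n"
    )

    def cot_prompt(question):
        # The demonstration teaches the model to emit its reasoning before
        # the final "The answer is ..." line.
        return f"{COT_EXEMPLAR}\nQ: {question}\nA:"
    ```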

  4. Training language models to follow instructions with human feedback, Long Ouyang,Jeff Wu,Xu Jiang,Diogo Almeida,Carroll L. Wainwright,Pamela Mishkin,Chong Zhang,Sandhini Agarwal,Katarina Slama,Alex Ray,John Schulman,Jacob Hilton,Fraser Kelton,Luke Miller,Maddie Simens,Amanda Askell,Peter Welinder,Paul Christiano,Jan Leike,Ryan Lowe, 04-03-2022

    Categories

    Computation and Language, Artificial Intelligence, Machine Learning

    Abstract

    Making language models bigger does not inherently make them better at following a user's intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users. In this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback. Starting with a set of labeler-written prompts and prompts submitted through the OpenAI API, we collect a dataset of labeler demonstrations of the desired model behavior, which we use to fine-tune GPT-3 using supervised learning. We then collect a dataset of rankings of model outputs, which we use to further fine-tune this supervised model using reinforcement learning from human feedback. We call the resulting models InstructGPT. In human evaluations on our prompt distribution, outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters. Moreover, InstructGPT models show improvements in truthfulness and reductions in toxic output generation while having minimal performance regressions on public NLP datasets. Even though InstructGPT still makes simple mistakes, our results show that fine-tuning with human feedback is a promising direction for aligning language models with human intent.

    Bullet Points

    • Fine-tuning with human feedback is a promising direction for aligning language models with user intent on a wide range of tasks, since large language models can generate untruthful, toxic, or simply unhelpful outputs

    • In this paper, we use labeler-written prompts and prompts submitted through the OpenAI API to fine-tune GPT-3 using supervised learning and ranking of model outputs to further refine this supervised model using reinforcement learning from human feedback

    • InstructGPT models show improvements in truthfulness and reductions in toxic output generation while having minimal performance regressions on public NLP datasets.
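
    The middle stage of the pipeline above trains a reward model on human rankings of model outputs. A common formulation of that step is a pairwise preference loss; here is a minimal PyTorch sketch of that generic formulation, not OpenAI's code:

    ```python
    import torch
    import torch.nn.functional as F

    def pairwise_reward_loss(r_chosen, r_rejected):
        # r_chosen / r_rejected: scalar rewards the reward model assigns to
        # the human-preferred and dispreferred completions of the same prompt.
        # Loss: -log sigmoid(r_chosen - r_rejected), averaged over the batch.
        return -F.logsigmoid(r_chosen - r_rejected).mean()

    # Dummy batch of three comparisons:
    loss = pairwise_reward_loss(torch.tensor([1.2, 0.3, 0.8]),
                                torch.tensor([0.1, 0.9, -0.4]))
    ```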

  5. Self-Consistency Improves Chain of Thought Reasoning in Language Models, Xuezhi Wang,Jason Wei,Dale Schuurmans,Quoc Le,Ed Chi,Sharan Narang,Aakanksha Chowdhery,Denny Zhou, 21-03-2022

    Categories

    Computation and Language, Artificial Intelligence

    Abstract

    Chain-of-thought prompting combined with pre-trained large language models has achieved encouraging results on complex reasoning tasks. In this paper, we propose a new decoding strategy, self-consistency, to replace the naive greedy decoding used in chain-of-thought prompting. It first samples a diverse set of reasoning paths instead of only taking the greedy one, and then selects the most consistent answer by marginalizing out the sampled reasoning paths. Self-consistency leverages the intuition that a complex reasoning problem typically admits multiple different ways of thinking leading to its unique correct answer. Our extensive empirical evaluation shows that self-consistency boosts the performance of chain-of-thought prompting with a striking margin on a range of popular arithmetic and commonsense reasoning benchmarks, including GSM8K (+17.9%), SVAMP (+11.0%), AQuA (+12.2%), StrategyQA (+6.4%) and ARC-challenge (+3.9%).

    Bullet Points

    • The paper proposes a new decoding strategy, self-consistency, to replace naive greedy decoding used in chain-of-thought prompting

    • It first samples a diverse set of reasoning paths and selects the most consistent answer by marginalizing out the sampled reasoning paths

    • Self-consistency leverages the intuition that a complex reasoning problem typically admits multiple different ways of thinking leading to its unique correct answer

    • This boosts the performance of chain-of-thought prompting by a striking margin on popular arithmetic and commonsense reasoning benchmarks, including GSM8K (+17.9%), SVAMP (+11.0%), AQuA (+12.2%), StrategyQA (+6.4%) and ARC-challenge (+3.9%); a decoding sketch follows
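
    The decoding strategy above reduces to sample-then-vote. A minimal sketch, assuming a hypothetical `sample` (temperature-sampled chain-of-thought completion) and `extract_answer` (parses the final answer from a completion):

    ```python
    from collections import Counter

    def self_consistent_answer(question, sample, extract_answer, n_paths=40):
        # Sample diverse reasoning paths instead of one greedy decode, keep
        # only each path's final answer, and marginalize over the paths by
        # taking the most frequent answer.
        answers = [extract_answer(sample(question)) for _ in range(n_paths)]
        return Counter(answers).most_common(1)[0][0]
    ```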

  6. Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language, Andy Zeng,Maria Attarian,Brian Ichter,Krzysztof Choromanski,Adrian Wong,Stefan Welker,Federico Tombari,Aveek Purohit,Michael Ryoo,Vikas Sindhwani,Johnny Lee,Vincent Vanhoucke,Pete Florence, 01-04-2022

    Categories

    Computer Vision, Artificial Intelligence, Computation and Language, Machine Learning

    Abstract

    Large pretrained (e.g., "foundation") models exhibit distinct capabilities depending on the domain of data they are trained on. While these domains are generic, they may only barely overlap. For example, visual-language models (VLMs) are trained on Internet-scale image captions, but large language models (LMs) are further trained on Internet-scale text with no images (e.g., spreadsheets, SAT questions, code). As a result, these models store different forms of commonsense knowledge across different domains. In this work, we show that this diversity is symbiotic, and can be leveraged through Socratic Models (SMs): a modular framework in which multiple pretrained models may be composed zero-shot i.e., via multimodal-informed prompting, to exchange information with each other and capture new multimodal capabilities, without requiring finetuning. With minimal engineering, SMs are not only competitive with state-of-the-art zero-shot image captioning and video-to-text retrieval, but also enable new applications such as (i) answering free-form questions about egocentric video, (ii) engaging in multimodal assistive dialogue with people (e.g., for cooking recipes) by interfacing with external APIs and databases (e.g., web search), and (iii) robot perception and planning.

    Bullet Points

    • Large pretrained models exhibit distinct capabilities depending on the domain of data they are trained on

    • These models store different forms of commonsense knowledge across different domains and can be leveraged through Socratic Models (SMs), a modular framework that can be composed zero-shot via multimodal-informed prompting to exchange information with each other and capture new multimodal capabilities without requiring finetuning

    • SMs enable new applications such as answering free-form questions about egocentric video, engaging in multimodal assistive dialogue with people by interfacing with external APIs and databases, and robot perception and planning.
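
    The zero-shot composition can be pictured with a two-model sketch; `vlm_caption` and `lm_complete` are hypothetical stand-ins for a pretrained VLM and LM that exchange information purely through prompts:

    ```python
    def answer_about_image(image, question, vlm_caption, lm_complete):
        # The VLM renders the image into language; the LM reasons over it.
        # No finetuning is involved: the models exchange information only
        # through multimodal-informed prompting.
        caption = vlm_caption(image)  # e.g. "a person slicing onions"
        prompt = (
            f"Visual context: {caption}\n"
            f"Question: {question}\n"
            "Answer:"
        )
        return lm_complete(prompt)
    ```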

  7. Do As I Can, Not As I Say: Grounding Language in Robotic Affordances, Michael Ahn,Anthony Brohan,Noah Brown,Yevgen Chebotar,Omar Cortes,Byron David,Chelsea Finn,Chuyuan Fu,Keerthana Gopalakrishnan,Karol Hausman,Alex Herzog,Daniel Ho,Jasmine Hsu,Julian Ibarz,Brian Ichter,Alex Irpan,Eric Jang,Rosario Jauregui Ruano,Kyle Jeffrey,Sally Jesmonth,Nikhil J Joshi,Ryan Julian,Dmitry Kalashnikov,Yuheng Kuang,Kuang-Huei Lee,Sergey Levine,Yao Lu,Linda Luu,Carolina Parada,Peter Pastor,Jornell Quiambao,Kanishka Rao,Jarek Rettinghouse,Diego Reyes,Pierre Sermanet,Nicolas Sievers,Clayton Tan,Alexander Toshev,Vincent Vanhoucke,Fei Xia,Ted Xiao,Peng Xu,Sichun Xu,Mengyuan Yan,Andy Zeng, 04-04-2022

    Categories

    Robotics, Computation and Language, Machine Learning

    Abstract

    Bullet Points

  8. Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback, Yuntao Bai,Andy Jones,Kamal Ndousse,Amanda Askell,Anna Chen,Nova DasSarma,Dawn Drain,Stanislav Fort,Deep Ganguli,Tom Henighan,Nicholas Joseph,Saurav Kadavath,Jackson Kernion,Tom Conerly,Sheer El-Showk,Nelson Elhage,Zac Hatfield-Dodds,Danny Hernandez,Tristan Hume,Scott Johnston,Shauna Kravec,Liane Lovitt,Neel Nanda,Catherine Olsson,Dario Amodei,Tom Brown,Jack Clark,Sam McCandlish,Chris Olah,Ben Mann,Jared Kaplan, 12-04-2022

    Categories

    Computation and Language, Machine Learning

    Abstract

    We apply preference modeling and reinforcement learning from human feedback (RLHF) to finetune language models to act as helpful and harmless assistants. We find this alignment training improves performance on almost all NLP evaluations, and is fully compatible with training for specialized skills such as python coding and summarization. We explore an iterated online mode of training, where preference models and RL policies are updated on a weekly cadence with fresh human feedback data, efficiently improving our datasets and models. Finally, we investigate the robustness of RLHF training, and identify a roughly linear relation between the RL reward and the square root of the KL divergence between the policy and its initialization. Alongside our main results, we perform peripheral analyses on calibration, competing objectives, and the use of OOD detection, compare our models with human writers, and provide samples from our models using prompts appearing in recent related work.

    Bullet Points

    • We use preference modeling and reinforcement learning from human feedback (RLHF) to finetune language models to act as helpful and harmless assistants

    • This alignment training improves performance on almost all NLP evaluations and is compatible with training for specialized skills

    • We explore an iterated online mode of training where preference models and RL policies are updated on a weekly cadence with fresh human feedback data

    • We investigate the robustness of RLHF training and identify a roughly linear relation between the RL reward and the square root of the KL divergence between the policy and its initialization

    • Additionally, we perform peripheral analyses on calibration, competing objectives, and the use of OOD detection, compare our models with human writers, and provide samples from our models using prompts appearing in recent related work.
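
    The KL divergence that the reward scales with is typically also used as a training-time penalty in RLHF. A minimal sketch of a KL-regularized reward under assumed names (a generic formulation, not the paper's code):

    ```python
    def kl_regularized_reward(reward_pm, logp_policy, logp_init, beta=0.02):
        # Preference-model reward minus a penalty that keeps the policy close
        # to its initialization. logp_policy - logp_init is a single-sample
        # estimate of KL(policy || init); the paper's empirical finding is
        # that the learned reward grows roughly like sqrt(KL) during training.
        kl_estimate = logp_policy - logp_init
        return reward_pm - beta * kl_estimate
    ```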

  9. Inferring Implicit Relations in Complex Questions with Language Models, Uri Katz,Mor Geva,Jonathan Berant, 28-04-2022

    Categories

    Computation and Language

    Abstract

    A prominent challenge for modern language understanding systems is the ability to answer implicit reasoning questions, where the required reasoning steps for answering the question are not mentioned in the text explicitly. In this work, we investigate why current models struggle with implicit reasoning question answering (QA) tasks, by decoupling inference of reasoning steps from their execution. We define a new task of implicit relation inference and construct a benchmark, IMPLICITRELATIONS, where given a question, a model should output a list of concept-relation pairs, where the relations describe the implicit reasoning steps required for answering the question. Using IMPLICITRELATIONS, we evaluate models from the GPT-3 family and find that, while these models struggle on the implicit reasoning QA task, they often succeed at inferring implicit relations. This suggests that the challenge in implicit reasoning questions does not stem from the need to plan a reasoning strategy alone, but to do it while also retrieving and reasoning over relevant information.

    Bullet Points

    • The work investigates why current models struggle with implicit reasoning question answering (QA) tasks by decoupling inference of reasoning steps from their execution

    • We define a new task of implicit relation inference and construct a benchmark, IMPLICITRELATIONS, where given a question, a model should output a list of concept-relation pairs, where the relations describe the implicit reasoning steps required for answering the question

    • We evaluate models from the GPT-3 family and find that they often succeed at inferring implicit relations

    • This suggests that the challenge in implicit reasoning questions does not stem from planning a reasoning strategy alone, but rather to do it while retrieving and reasoning over relevant information.

  10. The Unreliability of Explanations in Few-shot Prompting for Textual Reasoning, Xi Ye,Greg Durrett, 06-05-2022

    Categories

    Computation and Language

    Abstract

    We further show that explanations generated by the LLMs may not entail the models' predictions nor be factually grounded in the input, even on simple tasks with extractive explanations. However, these flawed explanations can still be useful as a way to verify LLMs' predictions post-hoc. Through analysis in our three settings, we show that explanations judged by humans to be good--logically consistent with the input and the prediction--more likely cooccur with accurate predictions. Following these observations, we train calibrators using automatically extracted scores that assess the reliability of explanations, allowing us to improve performance post-hoc across all of our datasets.

    Bullet Points

    • LLM explanations may not be factually grounded in the input, even on simple tasks with extractive explanations

    • However, flawed explanations can still be useful as a way to verify LLMs' predictions post-hoc

    • Explanations judged by humans to be good (logically consistent with the input and the prediction) are more likely to co-occur with accurate predictions

    • We train calibrators using automatically extracted scores that assess the reliability of explanations, allowing us to improve performance post-hoc across all of our datasets
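
    The calibrator step can be sketched with scikit-learn; the single explanation-reliability feature and the toy data below are illustrative assumptions, not the paper's features:

    ```python
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Each row: one automatically extracted score of how well the generated
    # explanation is grounded in the input; label: was the prediction correct?
    X_train = np.array([[0.9], [0.2], [0.7], [0.1]])
    y_train = np.array([1, 0, 1, 0])

    calibrator = LogisticRegression().fit(X_train, y_train)
    p_correct = calibrator.predict_proba(np.array([[0.8]]))[:, 1]
    ```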

  11. UL2: Unifying Language Learning Paradigms, Yi Tay,Mostafa Dehghani,Vinh Q. Tran,Xavier Garcia,Jason Wei,Xuezhi Wang,Hyung Won Chung,Siamak Shakeri,Dara Bahri,Tal Schuster,Huaixiu Steven Zheng,Denny Zhou,Neil Houlsby,Donald Metzler, 10-05-2022

    Categories

    Computation and Language

    Abstract

    Existing pre-trained models are generally geared towards a particular class of problems. To date, there seems to be still no consensus on what the right architecture and pre-training setup should be. This paper presents a unified framework for pre-training models that are universally effective across datasets and setups. We begin by disentangling architectural archetypes with pre-training objectives -- two concepts that are commonly conflated. Next, we present a generalized & unified perspective for self-supervision in NLP and show how different pre-training objectives can be cast as one another and how interpolating between different objectives can be effective. We then propose Mixture-of-Denoisers (MoD), a pre-training objective that combines diverse pre-training paradigms together. We furthermore introduce a notion of mode switching, wherein downstream fine-tuning is associated with specific pre-training schemes. We conduct extensive ablative experiments to compare multiple pre-training objectives and find that our method pushes the Pareto-frontier by outperforming T5 & GPT-like models across multiple diverse setups. By scaling our model up to 20B parameters, we achieve SOTA performance on 50 well-established supervised finetuning based NLP tasks. Our model also achieves strong results at in-context learning, outperforming 175B GPT-3 on zero-shot SuperGLUE and tripling the performance of T5-XXL on one-shot summarization. On 0-shot MMLU, UL2 20B outperforms T0 and T5 models. UL2 20B also works well with chain-of-thought prompting and reasoning, making it an appealing choice for research into reasoning at a small to medium scale of 20B parameters. Finally, we apply FLAN instruction tuning to the UL2 20B model, achieving MMLU and Big-Bench scores competitive to FLAN-PaLM 62B. We release Flax-based T5X checkpoints for the UL2 20B & Flan-UL2 20B.

    Bullet Points

    • The paper presents a unified framework for pre-trained models that are universally effective across datasets and setups

    • It disentangles architectural archetypes with pre-training objectives and proposes Mixture-of-Denoisers (MoD) and mode switching

    • The model achieves SOTA performance on 50 well-established supervised finetuning based NLP tasks and achieves strong results at in-context learning, outperforming 175B GPT-3 on zero-shot SuperGLUE and tripling the performance of T5-XXL on one-shot summarization

    • On 0-shot MMLU, UL2 20B outperforms T0 and T5 models, and works well with chain-of-thought prompting and reasoning, making it an appealing choice for research into reasoning at a small to medium scale of 20B parameters

    • Flax-based T5X checkpoints are released for the UL2 20B and Flan-UL2 20B models.

  12. A Generalist Agent, Scott Reed,Konrad Zolna,Emilio Parisotto,Sergio Gomez Colmenarejo,Alexander Novikov,Gabriel Barth-Maron,Mai Gimenez,Yury Sulsky,Jackie Kay,Jost Tobias Springenberg,Tom Eccles,Jake Bruce,Ali Razavi,Ashley Edwards,Nicolas Heess,Yutian Chen,Raia Hadsell,Oriol Vinyals,Mahyar Bordbar,Nando de Freitas, 12-05-2022

    Categories

    Artificial Intelligence, Computation and Language, Machine Learning, Robotics

    Abstract

    Inspired by progress in large-scale language modeling, we apply a similar approach towards building a single generalist agent beyond the realm of text outputs. The agent, which we refer to as Gato, works as a multi-modal, multi-task, multi-embodiment generalist policy. The same network with the same weights can play Atari, caption images, chat, stack blocks with a real robot arm and much more, deciding based on its context whether to output text, joint torques, button presses, or other tokens. In this report we describe the model and the data, and document the current capabilities of Gato.

    Bullet Points

    • Gato is a multi-modal, multitask, multi-embodiment generalist agent that works beyond text outputs

    • It can play Atari, caption images, chat, and stack blocks with a real robot arm, deciding based on its context whether to output text, joint torques, button presses, or other tokens

    • The same network with the same weights is used across all of these tasks.

  13. Selection-Inference: Exploiting Large Language Models for Interpretable Logical Reasoning, Antonia Creswell,Murray Shanahan,Irina Higgins, 19-05-2022

    Categories

    Artificial Intelligence, Computation and Language

    Abstract

    Large language models (LLMs) have been shown to be capable of impressive few-shot generalisation to new tasks. However, they still tend to perform poorly on multi-step logical reasoning problems. Here we carry out a comprehensive evaluation of LLMs on 50 tasks that probe different aspects of logical reasoning. We show that language models tend to perform fairly well at single step inference or entailment tasks, but struggle to chain together multiple reasoning steps to solve more complex problems. In light of this, we propose a Selection-Inference (SI) framework that exploits pre-trained LLMs as general processing modules, and alternates between selection and inference to generate a series of interpretable, causal reasoning steps leading to the final answer. We show that a 7B parameter LLM used within the SI framework in a 5-shot generalisation setting, with no fine-tuning, yields a performance improvement of over 100% compared to an equivalent vanilla baseline on a suite of 10 logical reasoning tasks. The same model in the same setting even outperforms a significantly larger 280B parameter baseline on the same suite of tasks. Moreover, answers produced by the SI framework are accompanied by a causal natural-language-based reasoning trace, which has important implications for the safety and trustworthiness of the system.

    Bullet Points

    • LLMs can generalize to new tasks, but perform poorly on multi-step logical reasoning problems

    • A Selection-Inference framework is proposed that uses pre-trained LLMs as general processing modules and alternates between selection and inference to generate a series of interpretable, causal reasoning steps leading to the final answer

    • A 7B parameter LLM used within the SI framework in a 5-shot generalisation setting, with no fine-tuning, yields a performance improvement of over 100% compared to an equivalent vanilla baseline on a suite of 10 logical reasoning tasks, and even outperforms a significantly larger 280B parameter baseline on the same suite

    • The SI framework provides a causal natural-language-based reasoning trace that has important implications for the safety and trustworthiness of the system.
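
    The selection/inference alternation described above can be sketched as a short loop; `select` and `infer` are hypothetical wrappers around the same frozen LLM with different prompts:

    ```python
    def selection_inference(question, facts, select, infer, n_steps=3):
        # Alternate: a selection step picks the facts relevant to the next
        # deduction, and an inference step derives one new fact from them.
        # Each derived fact is appended to the context, so the final answer
        # comes with an interpretable trace of reasoning steps.
        for _ in range(n_steps):
            selected = select(question, facts)  # LLM picks supporting facts
            new_fact = infer(selected)          # LLM derives one new fact
            facts = facts + [new_fact]
        return facts[-1]
    ```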

  14. Least-to-Most Prompting Enables Complex Reasoning in Large Language Models, Denny Zhou,Nathanael Schärli,Le Hou,Jason Wei,Nathan Scales,Xuezhi Wang,Dale Schuurmans,Claire Cui,Olivier Bousquet,Quoc Le,Ed Chi, 21-05-2022

    Categories

    Artificial Intelligence, Computation and Language

    Abstract

    Chain-of-thought prompting has demonstrated remarkable performance on various natural language reasoning tasks. However, it tends to perform poorly on tasks that require solving problems harder than the exemplars shown in the prompts. To overcome this challenge of easy-to-hard generalization, we propose a novel prompting strategy, least-to-most prompting. The key idea in this strategy is to break down a complex problem into a series of simpler subproblems and then solve them in sequence. Solving each subproblem is facilitated by the answers to previously solved subproblems. Our experimental results on tasks related to symbolic manipulation, compositional generalization, and math reasoning reveal that least-to-most prompting is capable of generalizing to more difficult problems than those seen in the prompts. A notable finding is that when the GPT-3 code-davinci-002 model is used with least-to-most prompting, it can solve the compositional generalization benchmark SCAN in any split (including length split) with an accuracy of at least 99% using just 14 exemplars, compared to only 16% accuracy with chain-of-thought prompting. This is particularly noteworthy because neural-symbolic models in the literature that specialize in solving SCAN are trained on the entire training set containing over 15,000 examples. We have included prompts for all the tasks in the Appendix.

    Bullet Points

    • The proposed prompting strategy is least-to-most prompting, which breaks down a complex problem into simpler subproblems and then solves them in sequence

    • It is capable of generalizing to more difficult problems than those seen in the prompts

    • The GPT-3 code-davinci-002 model can solve the compositional generalization benchmark SCAN in any split (including length split) with an accuracy of at least 99% using just 14 exemplars, compared to only 16% accuracy with chain-of-thought prompting

    • Neural-symbolic models in the literature that specialize in solving SCAN are trained on the entire training set containing over 15,000 examples.
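
    Least-to-most prompting is two LM calls wired in sequence. A minimal sketch, assuming hypothetical helpers `decompose` (prompts the LM to list subproblems) and `solve` (prompts it for one answer):

    ```python
    def least_to_most(question, decompose, solve):
        # Stage 1: break the problem into a sequence of easier subproblems.
        subproblems = decompose(question)

        # Stage 2: solve them in order, feeding each answer back into the
        # context so later subproblems can build on earlier solutions.
        context, answer = "", ""
        for sub in subproblems:
            answer = solve(f"{context}Q: {sub}\nA:")
            context += f"Q: {sub}\nA: {answer}\n"
        return answer
    ```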

  15. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness, Tri Dao,Daniel Y. Fu,Stefano Ermon,Atri Rudra,Christopher Ré, 27-05-2022

    Categories

    Machine Learning

    Abstract

    Transformers are slow and memory-hungry on long sequences, since the time and memory complexity of self-attention are quadratic in sequence length. Approximate attention methods have attempted to address this problem by trading off model quality to reduce the compute complexity, but often do not achieve wall-clock speedup. We argue that a missing principle is making attention algorithms IO-aware -- accounting for reads and writes between levels of GPU memory. We propose FlashAttention, an IO-aware exact attention algorithm that uses tiling to reduce the number of memory reads/writes between GPU high bandwidth memory (HBM) and GPU on-chip SRAM. We analyze the IO complexity of FlashAttention, showing that it requires fewer HBM accesses than standard attention, and is optimal for a range of SRAM sizes. We also extend FlashAttention to block-sparse attention, yielding an approximate attention algorithm that is faster than any existing approximate attention method. FlashAttention trains Transformers faster than existing baselines: 15% end-to-end wall-clock speedup on BERT-large (seq. length 512) compared to the MLPerf 1.1 training speed record, 3$\times$ speedup on GPT-2 (seq. length 1K), and 2.4$\times$ speedup on long-range arena (seq. length 1K-4K). FlashAttention and block-sparse FlashAttention enable longer context in Transformers, yielding higher quality models (0.7 better perplexity on GPT-2 and 6.4 points of lift on long-document classification) and entirely new capabilities: the first Transformers to achieve better-than-chance performance on the Path-X challenge (seq. length 16K, 61.4% accuracy) and Path-256 (seq. length 64K, 63.1% accuracy).

    Bullet Points

    • Transformers are slow and memory-hungry on long sequences due to the quadratic time and memory complexity of self-attention

    • Approximate attention methods often do not achieve wall-clock speedup

    • We propose FlashAttention, an IO-aware exact attention algorithm that uses tiling to reduce the number of memory reads/writes between GPU high bandwidth memory (HBM) and GPU on-chip SRAM

    • The algorithm requires fewer HBM accesses than standard attention, and it is optimal for a range of SRAM sizes

    • We analyze the IO complexity of FlashAttention and extend it to block-sparse attention, yielding an approximate attention algorithm faster than any existing approximate attention method

    • FlashAttention enables the first Transformers to achieve better-than-chance performance on the Path-X (seq. length 16K) and Path-256 (seq. length 64K) challenges; an online-softmax sketch follows
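
    The tiling rests on the online softmax: attention can be accumulated block by block without materializing the full score matrix. A numerically equivalent but unoptimized NumPy sketch for a single query, omitting the 1/sqrt(d) scaling (the real kernel also tiles queries and stages blocks in SRAM):

    ```python
    import numpy as np

    def tiled_attention(q, K, V, block=128):
        m, denom = -np.inf, 0.0        # running max and softmax denominator
        out = np.zeros(V.shape[1])     # running weighted sum of values
        for i in range(0, K.shape[0], block):
            s = K[i:i + block] @ q     # attention logits for this block
            m_new = max(m, s.max())
            scale = np.exp(m - m_new)  # rescale previously accumulated parts
            p = np.exp(s - m_new)
            denom = denom * scale + p.sum()
            out = out * scale + p @ V[i:i + block]
            m = m_new
        return out / denom             # equals softmax(K @ q) @ V

    # Sanity check against the reference computation:
    rng = np.random.default_rng(0)
    q, K, V = rng.normal(size=64), rng.normal(size=(1000, 64)), rng.normal(size=(1000, 64))
    w = np.exp(K @ q - (K @ q).max())
    assert np.allclose(tiled_attention(q, K, V), (w / w.sum()) @ V)
    ```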

  16. Making Large Language Models Better Reasoners with Step-Aware Verifier, Yifei Li,Zeqi Lin,Shizhuo Zhang,Qiang Fu,Bei Chen,Jian-Guang Lou,Weizhu Chen, 06-06-2022

    Categories

    Computation and Language, Artificial Intelligence

    Abstract

    Few-shot learning is a challenging task that requires language models to generalize from limited examples. Large language models like GPT-3 and PaLM have made impressive progress in this area, but they still face difficulties in reasoning tasks such as GSM8K, a benchmark for arithmetic problems. To improve their reasoning skills, previous work has proposed to guide the language model with prompts that elicit a series of reasoning steps before giving the final answer, achieving a significant improvement on GSM8K from 17.9% to 58.1% in problem-solving rate. In this paper, we present DIVERSE (Diverse Verifier on Reasoning Step), a novel approach that further enhances the reasoning capability of language models. DIVERSE has three main components: first, it generates diverse prompts to explore different reasoning paths for the same question; second, it uses a verifier to filter out incorrect answers based on a weighted voting scheme; and third, it verifies each reasoning step individually instead of the whole chain. We evaluate DIVERSE on the latest language model code-davinci-002 and show that it achieves new state-of-the-art results on six of eight reasoning benchmarks (e.g., GSM8K 74.4% to 83.2%).

    Bullet Points

    • DIVERSE is a novel approach that enhances the reasoning capabilities of language models by generating diverse prompts, using a verifier to filter out incorrect answers based on a weighted voting scheme, and verifying each reasoning step individually instead of the whole chain

    • It achieves new state-of-the-art results on six out of eight reasoning benchmarks, including GSM8K.

  17. From Human Days to Machine Seconds: Automatically Answering and Generating Machine Learning Final Exams, Iddo Drori,Sarah J. Zhang,Reece Shuttleworth,Sarah Zhang,Keith Tyser,Zad Chin,Pedro Lantigua,Saisamrit Surbehera,Gregory Hunter,Derek Austin,Leonard Tang,Yann Hicke,Sage Simhon,Sathwik Karnik,Darnell Granberry,Madeleine Udell, 11-06-2022

    Categories

    Machine Learning

    Abstract

    A final exam in machine learning at a top institution such as MIT, Harvard, or Cornell typically takes faculty days to write, and students hours to solve. We demonstrate that large language models pass machine learning finals at a human level, on finals available online after the models were trained, and automatically generate new human-quality final exam questions in seconds. Previous work has developed program synthesis and few-shot learning methods to solve university-level problem set questions in mathematics and STEM courses. In this work, we develop and compare methods that solve final exams, which differ from problem sets in several ways: the questions are longer, have multiple parts, are more complicated, and span a broader set of topics. We curate a dataset and benchmark of questions from machine learning final exams available online and code for answering these questions and generating new questions. We show how to generate new questions from other questions and course notes. For reproducibility and future research on this final exam benchmark, we use automatic checkers for multiple-choice, numeric, and questions with expression answers. We perform ablation studies comparing zero-shot learning with few-shot learning and chain-of-thought prompting using GPT-3, OPT, Codex, and ChatGPT across machine learning topics and find that few-shot learning methods perform best. We highlight the transformative potential of language models to streamline the writing and solution of large-scale assessments, significantly reducing the workload from human days to mere machine seconds. Our results suggest that rather than banning large language models such as ChatGPT in class, instructors should teach students to harness them by asking students meta-questions about correctness, completeness, and originality of the responses generated, encouraging critical thinking in academic studies.

    Bullet Points

    • The study demonstrates that large language models pass machine learning finals at a human level and automatically generate new human-quality final exam questions in seconds

    • Previous work has developed program synthesis and few-shot learning methods to solve university-level problem set questions in mathematics and STEM courses

    • We curate a dataset and benchmark of questions and generate new questions from other questions and course notes

    • For reproducibility and future research, we use automatic checkers for multiple-choice, numeric, and expression-answer questions

    • Ablation studies compare zero-shot learning with few-shot learning and chain-of-thought prompting using GPT-3, OPT, Codex, and ChatGPT across machine learning topics

    • Few-shot learning performs best; rather than banning large language models, instructors should teach students to ask meta-questions about the correctness, completeness, and originality of generated responses, encouraging critical thinking in academic studies.

  18. Emergent Abilities of Large Language Models, Jason Wei,Yi Tay,Rishi Bommasani,Colin Raffel,Barret Zoph,Sebastian Borgeaud,Dani Yogatama,Maarten Bosma,Denny Zhou,Donald Metzler,Ed H. Chi,Tatsunori Hashimoto,Oriol Vinyals,Percy Liang,Jeff Dean,William Fedus, 15-06-2022

    Categories

    Computation and Language

    Abstract

    Scaling up language models has been shown to predictably improve performance and sample efficiency on a wide range of downstream tasks. This paper instead discusses an unpredictable phenomenon that we refer to as emergent abilities of large language models. We consider an ability to be emergent if it is not present in smaller models but is present in larger models. Thus, emergent abilities cannot be predicted simply by extrapolating the performance of smaller models. The existence of such emergence implies that additional scaling could further expand the range of capabilities of language models.

    Bullet Points

    • Scaling up language models has been shown to improve performance and sample efficiency on downstream tasks

    • However, the paper discusses an unpredictable phenomenon called emergent abilities of large language models, which cannot be predicted by extrapolating the performance of smaller models

    • Additional scaling could further expand the range of capabilities of language models.

  19. MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge, Linxi Fan,Guanzhi Wang,Yunfan Jiang,Ajay Mandlekar,Yuncong Yang,Haoyi Zhu,Andrew Tang,De-An Huang,Yuke Zhu,Anima Anandkumar, 17-06-2022

    Categories

    Machine Learning, Artificial Intelligence, Computation and Language, Computer Vision

    Abstract

    Bullet Points

  20. Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos, Bowen Baker,Ilge Akkaya,Peter Zhokhov,Joost Huizinga,Jie Tang,Adrien Ecoffet,Brandon Houghton,Raul Sampedro,Jeff Clune, 23-06-2022

    Categories

    Machine Learning, Artificial Intelligence

    Abstract

    Pretraining on noisy, internet-scale datasets has been heavily studied as a technique for training models with broad, general capabilities for text, images, and other modalities. However, for many sequential decision domains such as robotics, video games, and computer use, publicly available data does not contain the labels required to train behavioral priors in the same way. We extend the internet-scale pretraining paradigm to sequential decision domains through semi-supervised imitation learning wherein agents learn to act by watching online unlabeled videos. Specifically, we show that with a small amount of labeled data we can train an inverse dynamics model accurate enough to label a huge unlabeled source of online data -- here, online videos of people playing Minecraft -- from which we can then train a general behavioral prior. Despite using the native human interface (mouse and keyboard at 20Hz), we show that this behavioral prior has nontrivial zero-shot capabilities and that it can be fine-tuned, with both imitation learning and reinforcement learning, to hard-exploration tasks that are impossible to learn from scratch via reinforcement learning. For many tasks our models exhibit human-level performance, and we are the first to report computer agents that can craft diamond tools, which can take proficient humans upwards of 20 minutes (24,000 environment actions) of gameplay to accomplish.

    Bullet Points

    • Pretraining on noisy internet-scale datasets has been extensively studied for training models with broad capabilities for text, images, and other modalities

    • However, for sequential decision domains such as robotics, video games, and computer use, publicly available data does not contain the labels required to train behavioral priors in the same way

    • Semi-supervised imitation learning involves training agents to act by watching online unlabeled videos

    • This behavioral prior has nontrivial zero-shot capabilities and can be fine-tuned with both imitation learning and reinforcement learning to hard-exploration tasks that are impossible to learn from scratch via reinforcement learning

    • Our models exhibit human-level performance on many tasks and are the first reported computer agents that can craft diamond tools, which can take proficient humans upwards of 20 minutes (24,000 environment actions) of gameplay to accomplish.

  21. LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action, Dhruv Shah,Blazej Osinski,Brian Ichter,Sergey Levine, 10-07-2022

    Categories

    Robotics, Artificial Intelligence, Computation and Language, Machine Learning

    Abstract

  22. Inner Monologue: Embodied Reasoning through Planning with Language Models, Wenlong Huang,Fei Xia,Ted Xiao,Harris Chan,Jacky Liang,Pete Florence,Andy Zeng,Jonathan Tompson,Igor Mordatch,Yevgen Chebotar,Pierre Sermanet,Noah Brown,Tomas Jackson,Linda Luu,Sergey Levine,Karol Hausman,Brian Ichter, 12-07-2022

    Categories

    Robotics, Artificial Intelligence, Computation and Language, Computer Vision, Machine Learning

    Abstract

    Recent works have shown how the reasoning capabilities of Large Language Models (LLMs) can be applied to domains beyond natural language processing, such as planning and interaction for robots. These embodied problems require an agent to understand many semantic aspects of the world: the repertoire of skills available, how these skills influence the world, and how changes to the world map back to the language. LLMs planning in embodied environments need to consider not just what skills to do, but also how and when to do them - answers that change over time in response to the agent's own choices. In this work, we investigate to what extent LLMs used in such embodied contexts can reason over sources of feedback provided through natural language, without any additional training. We propose that by leveraging environment feedback, LLMs are able to form an inner monologue that allows them to more richly process and plan in robotic control scenarios. We investigate a variety of sources of feedback, such as success detection, scene description, and human interaction. We find that closed-loop language feedback significantly improves high-level instruction completion on three domains, including simulated and real table top rearrangement tasks and long-horizon mobile manipulation tasks in a kitchen environment in the real world.

    Bullet Points

    • LLMs can be applied to domains beyond natural language processing, such as planning and interaction for robots

    • They can reason over sources of feedback provided through natural language without any additional training

    • Closed-loop language feedback improves high-level instruction completion on three domains, including simulated and real table top rearrangement tasks and long-horizon mobile manipulation tasks in the real world.
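
    The closed loop can be sketched as a prompt that grows with feedback; `plan_step`, `execute`, and `describe_scene` are hypothetical stand-ins for the LLM planner, the robot's skill execution, and the perception/success-detection models:

    ```python
    def inner_monologue(goal, plan_step, execute, describe_scene, max_steps=10):
        # After every action, textual feedback (success detection, scene
        # description) is appended to the prompt so the LLM can replan.
        history = f"Goal: {goal}\n"
        for _ in range(max_steps):
            action = plan_step(history)   # LLM proposes the next skill
            if action == "done":
                break
            success = execute(action)     # robot attempts the skill
            history += (f"Robot action: {action}\n"
                        f"Success: {success}\n"
                        f"Scene: {describe_scene()}\n")
        return history
    ```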

  23. BlenderBot 3: a deployed conversational agent that continually learns to responsibly engage, Kurt Shuster,Jing Xu,Mojtaba Komeili,Da Ju,Eric Michael Smith,Stephen Roller,Megan Ung,Moya Chen,Kushal Arora,Joshua Lane,Morteza Behrooz,William Ngan,Spencer Poff,Naman Goyal,Arthur Szlam,Y-Lan Boureau,Melanie Kambadur,Jason Weston, 05-08-2022

    Categories

    Computation and Language, Artificial Intelligence

    Abstract

    We present BlenderBot 3, a 175B parameter dialogue model capable of open-domain conversation with access to the internet and a long-term memory, and having been trained on a large number of user defined tasks. We release both the model weights and code, and have also deployed the model on a public web page to interact with organic users. This technical report describes how the model was built (architecture, model and training scheme), and details of its deployment, including safety mechanisms. Human evaluations show its superiority to existing open-domain dialogue agents, including its predecessors (Roller et al., 2021; Komeili et al., 2022). Finally, we detail our plan for continual learning using the data collected from deployment, which will also be publicly released. The goal of this research program is thus to enable the community to study ever-improving responsible agents that learn through interaction.

    Bullet Points

    • BlenderBot 3 is a 175B parameter dialogue model capable of open-domain conversation with access to the internet and a long-term memory

    • It has been trained on a large number of user defined tasks and has been released with both model weights and code

    • The model has been deployed on a public web page to interact with organic users

    • The technical report describes the model's architecture, model and training scheme, and its deployment, including safety mechanisms

    • Human evaluations show its superiority to existing open-domain dialogue agents, including its predecessors

    • The plan for continual learning using the data collected from deployment will also be publicly released

    • The goal is to enable the community to study ever-improving responsible agents that learn through interaction.

  24. Text and Patterns: For Effective Chain of Thought, It Takes Two to Tango, Aman Madaan,Amir Yazdanbakhsh, 16-09-2022

    Categories

    Computation and Language, Artificial Intelligence, Machine Learning

    Abstract

    The past decade has witnessed dramatic gains in natural language processing and an unprecedented scaling of large language models. These developments have been accelerated by the advent of few-shot techniques such as chain of thought (CoT) prompting. Specifically, CoT pushes the performance of large language models in a few-shot setup by augmenting the prompts with intermediate steps. Despite impressive results across various tasks, the reasons behind their success have not been explored. This work uses counterfactual prompting to develop a deeper understanding of CoT-based few-shot prompting mechanisms in large language models. We first systematically identify and define the key components of a prompt: symbols, patterns, and text. Then, we devise and conduct an exhaustive set of experiments across four different tasks, by querying the model with counterfactual prompts where only one of these components is altered. Our experiments across three models (PaLM, GPT-3, and CODEX) reveal several surprising findings and brings into question the conventional wisdom around few-shot prompting. First, the presence of factual patterns in a prompt is practically immaterial to the success of CoT. Second, our results conclude that the primary role of intermediate steps may not be to facilitate learning how to solve a task. The intermediate steps are rather a beacon for the model to realize what symbols to replicate in the output to form a factual answer. Further, text imbues patterns with commonsense knowledge and meaning. Our empirical and qualitative analysis reveals that a symbiotic relationship between text and patterns explains the success of few-shot prompting: text helps extract commonsense from the question to help patterns, and patterns enforce task understanding and direct text generation.

    Bullet Points

    • Few-shot techniques such as chain of thought (CoT) prompting have accelerated natural language processing and increased the performance of large language models

    • However, the reasons behind their success have not been explored

    • This study uses counterfactual prompting to develop a deeper understanding of CoT-based few-shot prompting mechanisms in large language models by identifying and defining the key components of a prompt: symbols, patterns, and text

    • The experiments reveal the surprising finding that the presence of factual patterns in a prompt is practically immaterial to CoT's success

    • The primary role of intermediate steps may not be to facilitate learning how to solve a task, but to act as a beacon showing the model which symbols to replicate in the output to form a factual answer

    • A symbiotic relationship between text and patterns explains the success of few-shot prompting

    • Text helps extract commonsense from the question to help patterns, and patterns enforce task understanding and direct text generation.

  25. Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering, Pan Lu,Swaroop Mishra,Tony Xia,Liang Qiu,Kai-Wei Chang,Song-Chun Zhu,Oyvind Tafjord,Peter Clark,Ashwin Kalyan, 20-09-2022

    Categories

    Computation and Language, Artificial Intelligence, Computer Vision, Machine Learning, Multimedia

    Abstract

    Bullet Points

  26. ProgPrompt: Generating Situated Robot Task Plans using Large Language Models, Ishika Singh,Valts Blukis,Arsalan Mousavian,Ankit Goyal,Danfei Xu,Jonathan Tremblay,Dieter Fox,Jesse Thomason,Animesh Garg, 22-09-2022

    Categories

    Robotics, Artificial Intelligence, Computation and Language, Machine Learning

    Abstract

  27. Promptagator: Few-shot Dense Retrieval From 8 Examples, Zhuyun Dai,Vincent Y. Zhao,Ji Ma,Yi Luan,Jianmo Ni,Jing Lu,Anton Bakalov,Kelvin Guu,Keith B. Hall,Ming-Wei Chang, 23-09-2022

    Categories

    Computation and Language, Information Retrieval

    Abstract

    Much recent research on information retrieval has focused on how to transfer from one task (typically with abundant supervised data) to various other tasks where supervision is limited, with the implicit assumption that it is possible to generalize from one task to all the rest. However, this overlooks the fact that there are many diverse and unique retrieval tasks, each targeting different search intents, queries, and search domains. In this paper, we suggest to work on Few-shot Dense Retrieval, a setting where each task comes with a short description and a few examples. To amplify the power of a few examples, we propose Prompt-base Query Generation for Retriever (Promptagator), which leverages large language models (LLM) as a few-shot query generator, and creates task-specific retrievers based on the generated data. Powered by LLM's generalization ability, Promptagator makes it possible to create task-specific end-to-end retrievers solely based on a few examples, without using Natural Questions or MS MARCO to train question generators or dual encoders. Surprisingly, LLM prompting with no more than 8 examples allows dual encoders to outperform heavily engineered models trained on MS MARCO like ColBERT v2 by more than 1.2 nDCG on average on 11 retrieval sets. Further training standard-size re-rankers using the same generated data yields another 5.0 point nDCG improvement. Our studies determine that query generation can be far more effective than previously observed, especially when a small amount of task-specific knowledge is given.

    Bullet Points

    • The paper proposes Prompt-base Query Generation for Retriever (Promptagator), which uses large language models (LLM) as a few-shot query generator to create task-specific end-to-end retrievers based on the generated data

    • LLM prompting with no more than 8 examples allows dual encoders to outperform heavily engineered models trained on MS MARCO like ColBERT v2 by more than 1.2 nDCG on average on 11 retrieval sets, and further training standard-size re-rankers using the same generated data yields another 5.0 point nDCG improvement.
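
    The few-shot query generation at the heart of Promptagator can be sketched as one prompt template; `complete` is a hypothetical LM call, and the resulting synthetic (query, document) pairs are what the task-specific retriever is trained on:

    ```python
    def generate_training_queries(task_description, examples, documents, complete):
        # examples: up to 8 (query, document) pairs for the target task.
        shots = "\n\n".join(f"Document: {d}\nQuery: {q}" for q, d in examples)
        pairs = []
        for doc in documents:  # unlabeled in-domain documents
            prompt = f"{task_description}\n\n{shots}\n\nDocument: {doc}\nQuery:"
            pairs.append((complete(prompt), doc))  # synthetic training pair
        return pairs
    ```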

  28. Ask Me Anything: A simple strategy for prompting language models, Simran Arora,Avanika Narayan,Mayee F. Chen,Laurel Orr,Neel Guha,Kush Bhatia,Ines Chami,Frederic Sala,Christopher Ré, 05-10-2022

    Categories

    Computation and Language

    Abstract

  29. Language Models are Multilingual Chain-of-Thought Reasoners, Freda Shi,Mirac Suzgun,Markus Freitag,Xuezhi Wang,Suraj Srivats,Soroush Vosoughi,Hyung Won Chung,Yi Tay,Sebastian Ruder,Denny Zhou,Dipanjan Das,Jason Wei, 06-10-2022

    Categories

    Computation and Language, Artificial Intelligence, Machine Learning

    Abstract

    Bullet Points

  30. ReAct: Synergizing Reasoning and Acting in Language Models, Shunyu Yao,Jeffrey Zhao,Dian Yu,Nan Du,Izhak Shafran,Karthik Narasimhan,Yuan Cao, 06-10-2022

    Categories

    Computation and Language, Artificial Intelligence, Machine Learning

    Abstract

  31. Automatic Chain of Thought Prompting in Large Language Models, Zhuosheng Zhang,Aston Zhang,Mu Li,Alex Smola, 07-10-2022

    Categories

    Computation and Language, Artificial Intelligence

    Abstract

  32. Interactive Language: Talking to Robots in Real Time, Corey Lynch,Ayzaan Wahid,Jonathan Tompson,Tianli Ding,James Betker,Robert Baruch,Travis Armstrong,Pete Florence, 12-10-2022

    Categories

    Robotics, Artificial Intelligence, Machine Learning

    Abstract

    Bullet Points

  33. Language Models of Code are Few-Shot Commonsense Learners, Aman Madaan,Shuyan Zhou,Uri Alon,Yiming Yang,Graham Neubig, 13-10-2022

    Categories

    Computation and Language, Machine Learning

    Abstract

    We address the general task of structured commonsense reasoning: given a natural language input, the goal is to generate a graph such as an event -- or a reasoning-graph. To employ large language models (LMs) for this task, existing approaches "serialize" the output graph as a flat list of nodes and edges. Although feasible, these serialized graphs strongly deviate from the natural language corpora that LMs were pre-trained on, hindering LMs from generating them correctly. In this paper, we show that when we instead frame structured commonsense reasoning tasks as code generation tasks, pre-trained LMs of code are better structured commonsense reasoners than LMs of natural language, even when the downstream task does not involve source code at all. We demonstrate our approach across three diverse structured commonsense reasoning tasks. In all these natural language tasks, we show that using our approach, a code generation LM (CODEX) outperforms natural-LMs that are fine-tuned on the target task (e.g., T5) and other strong LMs such as GPT-3 in the few-shot setting.

    Bullet Points

    • The paper proposes a new approach to structured commonsense reasoning that uses pre-trained LMs of code to generate a graph using natural language input

    • Using this approach, a code generation LM (CODEX) outperforms natural-language LMs fine-tuned on the target task (e.g., T5) and other strong LMs such as GPT-3 in the few-shot setting, even when the downstream task does not involve source code at all (a sketch of the code-as-graph framing follows).
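
    The code-as-graph framing can be pictured with a toy target output; the `Event` class and the recipe example are illustrative assumptions, not the paper's exact schema:

    ```python
    # Instead of serializing the graph as a flat list of nodes and edges,
    # the target output is Python code, which code-LMs saw during
    # pre-training far more often than serialized graphs.

    class Event:
        def __init__(self, description):
            self.description = description
            self.next = []            # outgoing edges

        def then(self, other):
            self.next.append(other)   # edge: this step precedes `other`
            return other

    # Target completion for the task "bake a cake", expressed as code:
    mix = Event("mix the ingredients")
    pour = Event("pour batter into a pan")
    mix.then(pour).then(Event("bake for 30 minutes"))
    ```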

  34. Crosslingual Generalization through Multitask Finetuning, Niklas Muennighoff,Thomas Wang,Lintang Sutawika,Adam Roberts,Stella Biderman,Teven Le Scao,M Saiful Bari,Sheng Shen,Zheng-Xin Yong,Hailey Schoelkopf,Xiangru Tang,Dragomir Radev,Alham Fikri Aji,Khalid Almubarak,Samuel Albanie,Zaid Alyafeai,Albert Webson,Edward Raff,Colin Raffel, 03-11-2022

    Categories

    Computation and Language, Artificial Intelligence, Machine Learning

    Abstract

    Bullet Points

  35. Large Language Models Are Human-Level Prompt Engineers, Yongchao Zhou,Andrei Ioan Muresanu,Ziwen Han,Keiran Paster,Silviu Pitis,Harris Chan,Jimmy Ba, 03-11-2022

    Categories

    Machine Learning, Artificial Intelligence, Computation and Language

    Abstract

    Bullet Points

  36. Ignore Previous Prompt: Attack Techniques For Language Models, Fábio Perez,Ian Ribeiro, 17-11-2022

    Categories

    Computation and Language, Artificial Intelligence

    Abstract

    Bullet Points

  37. Holistic Evaluation of Language Models, Percy Liang,Rishi Bommasani,Tony Lee,Dimitris Tsipras,Dilara Soylu,Michihiro Yasunaga,Yian Zhang,Deepak Narayanan,Yuhuai Wu,Ananya Kumar,Benjamin Newman,Binhang Yuan,Bobby Yan,Ce Zhang,Christian Cosgrove,Christopher D. Manning,Christopher Ré,Diana Acosta-Navas,Drew A. Hudson,Eric Zelikman,Esin Durmus,Faisal Ladhak,Frieda Rong,Hongyu Ren,Huaxiu Yao,Jue Wang,Keshav Santhanam,Laurel Orr,Lucia Zheng,Mert Yuksekgonul,Mirac Suzgun,Nathan Kim,Neel Guha,Niladri Chatterji,Omar Khattab,Peter Henderson,Qian Huang,Ryan Chi,Sang Michael Xie,Shibani Santurkar,Surya Ganguli,Tatsunori Hashimoto,Thomas Icard,Tianyi Zhang,Vishrav Chaudhary,William Wang,Xuechen Li,Yifan Mai,Yuhui Zhang,Yuta Koreeda, 16-11-2022

    Categories

    Computation and Language, Artificial Intelligence, Machine Learning

    Abstract

    Language models (LMs) are becoming the foundation for almost all major language technologies, but their capabilities, limitations, and risks are not well understood. We present Holistic Evaluation of Language Models (HELM) to improve the transparency of language models. First, we taxonomize the vast space of potential scenarios (i.e. use cases) and metrics (i.e. desiderata) that are of interest for LMs. Then we select a broad subset based on coverage and feasibility, noting what's missing or underrepresented (e.g. question answering for neglected English dialects, metrics for trustworthiness). Second, we adopt a multi-metric approach: We measure 7 metrics (accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency) for each of 16 core scenarios when possible (87.5% of the time). This ensures metrics beyond accuracy don't fall to the wayside, and that trade-offs are clearly exposed. We also perform 7 targeted evaluations, based on 26 targeted scenarios, to analyze specific aspects (e.g. reasoning, disinformation). Third, we conduct a large-scale evaluation of 30 prominent language models (spanning open, limited-access, and closed models) on all 42 scenarios, 21 of which were not previously used in mainstream LM evaluation. Prior to HELM, models on average were evaluated on just 17.9% of the core HELM scenarios, with some prominent models not sharing a single scenario in common. We improve this to 96.0%: now all 30 models have been densely benchmarked on the same core scenarios and metrics under standardized conditions. Our evaluation surfaces 25 top-level findings. For full transparency, we release all raw model prompts and completions publicly for further analysis, as well as a general modular toolkit. We intend for HELM to be a living benchmark for the community, continuously updated with new scenarios, metrics, and models.

    Bullet Points

    • Holistic Evaluation of Language Models (HELM) aims to improve the transparency of language models by first taxonomizing the space of potential scenarios (use cases) and metrics (desiderata), then selecting a broad subset based on coverage and feasibility

    • It measures 7 metrics (accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency) for each of 16 core scenarios when possible, performs 7 targeted evaluations, and benchmarks 30 prominent language models on all 42 scenarios, 21 of which were not previously used in mainstream LM evaluation

    • The evaluation surfaces 25 top-level findings; all raw model prompts and completions are released publicly for further analysis, along with a general modular toolkit (a minimal sketch of the scenario-by-metric grid follows)
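
    A minimal sketch of the scenario-by-metric grid HELM describes: every model is scored on every (scenario, metric) cell where measurement is feasible, so metrics beyond accuracy are reported alongside it rather than dropped. Model, scenario, and metric names here are hypothetical placeholders; the released HELM toolkit is the authoritative implementation.

    ```python
    from typing import Callable, Dict, List, Tuple

    Metric = Callable[[str, str], float]  # (prediction, reference) -> score


    def evaluate_grid(
        models: Dict[str, Callable[[str], str]],      # name -> prompt-to-completion fn
        scenarios: Dict[str, List[Tuple[str, str]]],  # name -> [(prompt, reference)]
        metrics: Dict[str, Metric],                   # e.g. accuracy, toxicity, ...
    ) -> Dict[Tuple[str, str, str], float]:
        """Mean score for every (model, scenario, metric) cell."""
        results = {}
        for model_name, model in models.items():
            for scenario_name, examples in scenarios.items():
                preds = [(model(prompt), ref) for prompt, ref in examples]
                for metric_name, metric in metrics.items():
                    scores = [metric(p, r) for p, r in preds]
                    results[(model_name, scenario_name, metric_name)] = (
                        sum(scores) / len(scores)
                    )
        return results
    ```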

  38. Galactica: A Large Language Model for Science, Ross Taylor,Marcin Kardas,Guillem Cucurull,Thomas Scialom,Anthony Hartshorn,Elvis Saravia,Andrew Poulton,Viktor Kerkez,Robert Stojnic, 16-11-2022

    Categories

    Computation and Language, Machine Learning

    Abstract

    Information overload is a major obstacle to scientific progress. The explosive growth in scientific literature and data has made it ever harder to discover useful insights in a large mass of information. Today scientific knowledge is accessed through search engines, but they are unable to organize scientific knowledge alone. In this paper we introduce Galactica: a large language model that can store, combine and reason about scientific knowledge. We train on a large scientific corpus of papers, reference material, knowledge bases and many other sources. We outperform existing models on a range of scientific tasks. On technical knowledge probes such as LaTeX equations, Galactica outperforms the latest GPT-3 by 68.2% versus 49.0%. Galactica also performs well on reasoning, outperforming Chinchilla on mathematical MMLU by 41.3% to 35.7%, and PaLM 540B on MATH with a score of 20.4% versus 8.8%. It also sets a new state-of-the-art on downstream tasks such as PubMedQA and MedMCQA dev of 77.6% and 52.9%. And despite not being trained on a general corpus, Galactica outperforms BLOOM and OPT-175B on BIG-bench. We believe these results demonstrate the potential for language models as a new interface for science. We open source the model for the benefit of the scientific community.

    Bullet Points

    • The paper introduces Galactica, a large language model that can store, combine, and reason about scientific knowledge

    • Galactica outperforms existing models on a range of scientific tasks and sets a new state of the art on downstream tasks such as PubMedQA and MedMCQA dev; despite not being trained on a general corpus, it also outperforms BLOOM and OPT-175B on BIG-bench

    • The authors open-source the model for the benefit of the scientific community.

  39. PAL: Program-aided Language Models, Luyu Gao,Aman Madaan,Shuyan Zhou,Uri Alon,Pengfei Liu,Yiming Yang,Jamie Callan,Graham Neubig, 18-11-2022

    Categories

    Computation and Language, Artificial Intelligence

    Abstract

    Bullet Points

  40. Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks, Wenhu Chen,Xueguang Ma,Xinyi Wang,William W. Cohen, 22-11-2022

    Categories

    Computation and Language, Artificial Intelligence

    Abstract

    Bullet Points

  41. LLM-Planner: Few-Shot Grounded Planning for Embodied Agents with Large Language Models, Chan Hee Song,Jiaman Wu,Clayton Washington,Brian M. Sadler,Wei-Lun Chao,Yu Su, 08-12-2022

    Categories

    Artificial Intelligence, Computation and Language, Computer Vision, Machine Learning, Robotics

    Abstract

    Bullet Points

  42. Constitutional AI: Harmlessness from AI Feedback, Yuntao Bai,Saurav Kadavath,Sandipan Kundu,Amanda Askell,Jackson Kernion,Andy Jones,Anna Chen,Anna Goldie,Azalia Mirhoseini,Cameron McKinnon,Carol Chen,Catherine Olsson,Christopher Olah,Danny Hernandez,Dawn Drain,Deep Ganguli,Dustin Li,Eli Tran-Johnson,Ethan Perez,Jamie Kerr,Jared Mueller,Jeffrey Ladish,Joshua Landau,Kamal Ndousse,Kamile Lukosuite,Liane Lovitt,Michael Sellitto,Nelson Elhage,Nicholas Schiefer,Noemi Mercado,Nova DasSarma,Robert Lasenby,Robin Larson,Sam Ringer,Scott Johnston,Shauna Kravec,Sheer El Showk,Stanislav Fort,Tamera Lanham,Timothy Telleen-Lawton,Tom Conerly,Tom Henighan,Tristan Hume,Samuel R. Bowman,Zac Hatfield-Dodds,Ben Mann,Dario Amodei,Nicholas Joseph,Sam McCandlish,Tom Brown,Jared Kaplan, 15-12-2022

    Categories

    Computation and Language, Artificial Intelligence

    Abstract

    As AI systems become more capable, we would like to enlist their help to supervise other AIs. We experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, and so we refer to the method as 'Constitutional AI'. The process involves both a supervised learning and a reinforcement learning phase. In the supervised phase we sample from an initial model, then generate self-critiques and revisions, and then finetune the original model on revised responses. In the RL phase, we sample from the finetuned model, use a model to evaluate which of the two samples is better, and then train a preference model from this dataset of AI preferences. We then train with RL using the preference model as the reward signal, i.e. we use 'RL from AI Feedback' (RLAIF). As a result we are able to train a harmless but non-evasive AI assistant that engages with harmful queries by explaining its objections to them. Both the SL and RL methods can leverage chain-of-thought style reasoning to improve the human-judged performance and transparency of AI decision making. These methods make it possible to control AI behavior more precisely and with far fewer human labels.

    Bullet Points

    • Constitutional AI trains a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs; the only human oversight is a list of rules or principles

    • The supervised phase samples from an initial model, generates self-critiques and revisions, and finetunes the original model on the revised responses

    • The RL phase samples pairs from the finetuned model, uses a model to judge which sample is better, trains a preference model on this dataset of AI preferences, and then trains with RL using the preference model as the reward signal ('RL from AI Feedback', RLAIF)

    • Both the SL and RL methods can leverage chain-of-thought style reasoning, yielding a harmless but non-evasive assistant and more precise control of AI behavior with far fewer human labels (a minimal sketch of the loop follows)
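
    A minimal sketch of the two-phase recipe the abstract describes. The helpers `sample`, `finetune`, `compare`, `train_preference_model`, and `rl_train` are assumed placeholders standing in for real model operations, not the paper's API; the sketch only illustrates the control flow.

    ```python
    from typing import Callable, List, Tuple


    def sl_phase(model, prompts: List[str], constitution: List[str],
                 sample: Callable, finetune: Callable):
        """Supervised phase: sample, self-critique against the constitution,
        revise, then finetune the original model on the revised responses."""
        revised: List[Tuple[str, str]] = []
        for prompt in prompts:
            draft = sample(model, prompt)
            critique = sample(model, f"Critique this reply using the principles "
                                     f"{constitution}:\n{draft}")
            revision = sample(model, f"Rewrite the reply to address the critique:\n"
                                     f"{critique}\n{draft}")
            revised.append((prompt, revision))
        return finetune(model, revised)


    def rl_phase(sl_model, prompts: List[str], sample: Callable, compare: Callable,
                 train_preference_model: Callable, rl_train: Callable):
        """RL phase (RLAIF): AI-labelled preferences become the reward signal."""
        preferences = []
        for prompt in prompts:
            a, b = sample(sl_model, prompt), sample(sl_model, prompt)
            preferences.append((prompt, a, b, compare(sl_model, prompt, a, b)))
        reward_model = train_preference_model(preferences)  # AI preferences only
        return rl_train(sl_model, reward_model, prompts)
    ```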

  43. Reasoning with Language Model Prompting: A Survey, Shuofei Qiao,Yixin Ou,Ningyu Zhang,Xiang Chen,Yunzhi Yao,Shumin Deng,Chuanqi Tan,Fei Huang,Huajun Chen, 19-12-2022

    Categories

    Computation and Language, Artificial Intelligence, Computer Vision, Information Retrieval, Machine Learning

    Abstract

    Bullet Points

  44. KronA: Parameter Efficient Tuning with Kronecker Adapter, Ali Edalati,Marzieh Tahaei,Ivan Kobyzev,Vahid Partovi Nia,James J. Clark,Mehdi Rezagholizadeh, 20-12-2022

    Categories

    Computation and Language

    Abstract

    Fine-tuning a Pre-trained Language Model (PLM) on a specific downstream task has been a well-known paradigm in Natural Language Processing. However, with the ever-growing size of PLMs, training the entire model on several downstream tasks becomes very expensive and resource-hungry. Recently, different Parameter Efficient Tuning (PET) techniques are proposed to improve the efficiency of fine-tuning PLMs. One popular category of PET methods is the low-rank adaptation methods which insert learnable truncated SVD modules into the original model either sequentially or in parallel. However, low-rank decomposition suffers from limited representation power. In this work, we address this problem using the Kronecker product instead of the low-rank representation. We introduce KronA, a Kronecker product-based adapter module for efficient fine-tuning of Transformer-based PLMs. We apply the proposed methods for fine-tuning T5 on the GLUE benchmark to show that incorporating the Kronecker-based modules can outperform state-of-the-art PET methods.

    Bullet Points

    • To improve the efficiency of fine-tuning a Pre-trained Language Model (PLM) on a specific downstream task, different Parameter Efficient Tuning (PET) techniques have been proposed

    • One popular category of PET methods is low-rank adaptation, which inserts learnable truncated SVD modules into the original model either sequentially or in parallel

    • However, low-rank decomposition suffers from limited representation power

    • To address this problem, the paper introduces KronA, a Kronecker product-based adapter module for efficient fine-tuning of Transformer-based PLMs

    • Fine-tuning T5 on the GLUE benchmark shows that incorporating the Kronecker-based modules can outperform state-of-the-art PET methods (a minimal adapter sketch follows)
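
    A minimal PyTorch sketch of a Kronecker-product adapter, assuming the tuned update takes the form ΔW = A ⊗ B added to a frozen linear layer; factor shapes, initialization, and placement are illustrative assumptions, not the paper's exact configuration.

    ```python
    import torch
    import torch.nn as nn


    class KroneckerAdapter(nn.Module):
        """Wraps a frozen nn.Linear and adds a trainable Kronecker-factored update."""

        def __init__(self, base: nn.Linear, a_out: int, a_in: int,
                     scale: float = 1.0):
            super().__init__()
            d_out, d_in = base.out_features, base.in_features
            assert d_out % a_out == 0 and d_in % a_in == 0
            self.base = base
            for p in self.base.parameters():   # pretrained weights stay frozen
                p.requires_grad = False
            # Two small factors whose Kronecker product matches the base weight shape.
            self.A = nn.Parameter(torch.zeros(a_out, a_in))  # zero init: no-op at start
            self.B = nn.Parameter(torch.randn(d_out // a_out, d_in // a_in) * 0.01)
            self.scale = scale

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            delta_w = torch.kron(self.A, self.B)             # shape (d_out, d_in)
            return self.base(x) + self.scale * nn.functional.linear(x, delta_w)
    ```

    Unlike a truncated-SVD update, `torch.kron(A, B)` can reach full rank even when both factors are small, which is the representational argument the abstract makes against purely low-rank adapters.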

  45. Large Language Models Are Reasoning Teachers, Namgyu Ho,Laura Schmid,Se-Young Yun, 20-12-2022

    Categories

    Computation and Language, Artificial Intelligence, Machine Learning

    Abstract

    Bullet Points

  46. Self-Instruct: Aligning Language Models with Self-Generated Instructions, Yizhong Wang,Yeganeh Kordi,Swaroop Mishra,Alisa Liu,Noah A. Smith,Daniel Khashabi,Hannaneh Hajishirzi, 20-12-2022

    Categories

    Computation and Language, Artificial Intelligence

    Abstract

    Bullet Points

  47. Towards Reasoning in Large Language Models: A Survey, Jie Huang,Kevin Chen-Chuan Chang, 20-12-2022

    Categories

    Computation and Language, Artificial Intelligence

    Abstract

    Reasoning is a fundamental aspect of human intelligence that plays a crucial role in activities such as problem solving, decision making, and critical thinking. In recent years, large language models (LLMs) have made significant progress in natural language processing, and there is observation that these models may exhibit reasoning abilities when they are sufficiently large. However, it is not yet clear to what extent LLMs are capable of reasoning. This paper provides a comprehensive overview of the current state of knowledge on reasoning in LLMs, including techniques for improving and eliciting reasoning in these models, methods and benchmarks for evaluating reasoning abilities, findings and implications of previous research in this field, and suggestions on future directions. Our aim is to provide a detailed and up-to-date review of this topic and stimulate meaningful discussion and future work.

    Bullet Points

    • The paper provides a comprehensive overview of the current state of knowledge on reasoning in LLMs, including techniques for improving and eliciting reasoning, methods and benchmarks for evaluating reasoning abilities, findings and implications of previous research in this field, and suggestions for future work.

  48. Towards Understanding Chain-of-Thought Prompting: An Empirical Study of What Matters, Boshi Wang,Sewon Min,Xiang Deng,Jiaming Shen,You Wu,Luke Zettlemoyer,Huan Sun, 20-12-2022

    Categories

    Computation and Language

    Abstract

    Chain-of-Thought (CoT) prompting can dramatically improve the multi-step reasoning abilities of large language models (LLMs). CoT explicitly encourages the LLM to generate intermediate rationales for solving a problem, by providing a series of reasoning steps in the demonstrations. Despite its success, there is still little understanding of what makes CoT prompting effective and which aspects of the demonstrated reasoning steps contribute to its performance. In this paper, we show that CoT reasoning is possible even with invalid demonstrations - prompting with invalid reasoning steps can achieve over 80-90% of the performance obtained using CoT under various metrics, while still generating coherent lines of reasoning during inference. Further experiments show that other aspects of the rationales, such as being relevant to the query and correctly ordering the reasoning steps, are much more important for effective CoT reasoning. Overall, these findings both deepen our understanding of CoT prompting, and open up new questions regarding LLMs' capability to learn to reason in context.

    Bullet Points

    • CoT prompting can improve LLMs' multi-step reasoning abilities by encouraging them to generate intermediate rationales for solving a problem by providing a series of reasoning steps in the demonstrations

    • However, there is still a lack of understanding of what makes CoT effective and which aspects of the demonstrated reasoning steps contribute to its performance

    • In this paper, we show that prompting with invalid reasoning steps can achieve over 80-90% of the performance obtained using CoT under various metrics, while still generating coherent lines of reasoning during inference

    • Further experiments show that other aspects, such as being relevant to the query and correctly ordering the reasoning steps, are much more important for effective CoT reasoning

    • These findings deepen our understanding of CoT prompting and open up new questions about LLMs' capability to learn to reason in context (an illustrative demonstration pair follows).
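
    An illustrative, hypothetical demonstration pair for the ablation described above: the invalid variant preserves a query-relevant, correctly ordered rationale on the surface, but its intermediate arithmetic is deliberately wrong.

    ```python
    # A standard chain-of-thought demonstration with a valid rationale.
    valid_demo = (
        "Q: Roger has 5 balls and buys 2 cans with 3 balls each. How many balls now?\n"
        "A: Roger starts with 5 balls. 2 cans of 3 balls is 6 balls. 5 + 6 = 11. "
        "The answer is 11."
    )

    # Same question and final answer, but the intermediate steps are invalid.
    invalid_demo = (
        "Q: Roger has 5 balls and buys 2 cans with 3 balls each. How many balls now?\n"
        "A: Roger starts with 5 balls. 2 cans of 3 balls is 5 balls. 5 + 5 = 11. "
        "The answer is 11."
    )

    question = "Q: A baker had 23 cupcakes and sold 9. How many are left?\nA:"

    # The paper's finding: prompting with the invalid demonstration still recovers
    # most of the benefit of valid chain-of-thought prompting.
    prompt_with_invalid_cot = invalid_demo + "\n\n" + question
    ```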

  49. OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization, Srinivasan Iyer,Xi Victoria Lin,Ramakanth Pasunuru,Todor Mihaylov,Daniel Simig,Ping Yu,Kurt Shuster,Tianlu Wang,Qing Liu,Punit Singh Koura,Xian Li,Brian O'Horo,Gabriel Pereyra,Jeff Wang,Christopher Dewan,Asli Celikyilmaz,Luke Zettlemoyer,Ves Stoyanov, 22-12-2022

    Categories

    Computation and Language

    Abstract

    Recent work has shown that fine-tuning large pre-trained language models on a collection of tasks described via instructions, a.k.a. instruction-tuning, improves their zero and few-shot generalization to unseen tasks. However, there is a limited understanding of the performance trade-offs of different decisions made during the instruction-tuning process. These decisions include the scale and diversity of the instruction-tuning benchmark, different task sampling strategies, fine-tuning with and without demonstrations, training using specialized datasets for reasoning and dialogue, and finally, the fine-tuning objectives themselves. In this paper, we characterize the effect of instruction-tuning decisions on downstream task performance when scaling both model and benchmark sizes. To this end, we create OPT-IML Bench: a large benchmark for Instruction Meta-Learning (IML) of 2000 NLP tasks consolidated into task categories from 8 existing benchmarks, and prepare an evaluation framework to measure three types of model generalizations: to tasks from fully held-out categories, to held-out tasks from seen categories, and to held-out instances from seen tasks. Through the lens of this framework, we first present insights about instruction-tuning decisions as applied to OPT-30B and further exploit these insights to train OPT-IML 30B and 175B, which are instruction-tuned versions of OPT. OPT-IML demonstrates all three generalization abilities at both scales on four different evaluation benchmarks with diverse tasks and input formats -- PromptSource, FLAN, Super-NaturalInstructions, and UnifiedSKG. Not only does it significantly outperform OPT on all benchmarks but is also highly competitive with existing models fine-tuned on each specific benchmark. We release OPT-IML at both scales, together with the OPT-IML Bench evaluation framework.

    Bullet Points

    • The paper discusses how instruction-tuning large pre-trained language models improves their zero- and few-shot generalization to unseen tasks

    • However, there is limited understanding of the performance trade-offs of different decisions made during the instruction-tuning process

    • To address this, the authors create OPT-IML Bench, a large benchmark for Instruction Meta-Learning of 2000 NLP tasks consolidated into task categories from 8 existing benchmarks, and prepare an evaluation framework to measure three types of model generalization: to tasks from fully held-out categories, to held-out tasks from seen categories, and to held-out instances from seen tasks (a sketch of these three settings follows below)

    • OPT-IML demonstrates all three generalization abilities at both scales on four evaluation benchmarks with diverse tasks and input formats: PromptSource, FLAN, Super-NaturalInstructions, and UnifiedSKG

    • The authors release OPT-IML at both 30B and 175B scales, together with the OPT-IML Bench evaluation framework
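
    A hypothetical sketch of the three generalization settings the abstract measures; the task and category names are invented placeholders, not the contents of OPT-IML Bench.

    ```python
    # Tasks grouped into categories (names are illustrative only).
    benchmark = {
        "qa": ["squad", "triviaqa", "webq"],
        "summarization": ["cnn_dm", "xsum"],
        "dialogue": ["wizard", "empathetic"],
    }

    # 1. Fully held-out categories: no task from the category is seen in training.
    heldout_categories = {"dialogue"}

    # 2. Held-out tasks from seen categories: the category is seen, the task is not.
    heldout_tasks = {"qa": ["webq"]}

    # 3. Held-out instances from seen tasks: split within each remaining task.
    training_tasks = [
        task
        for category, tasks in benchmark.items()
        if category not in heldout_categories
        for task in tasks
        if task not in heldout_tasks.get(category, [])
    ]
    # Evaluate on: (a) all dialogue tasks, (b) webq, and (c) held-out
    # instances of every task in training_tasks.
    ```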

  50. Cramming: Training a Language Model on a Single GPU in One Day, Jonas Geiping,Tom Goldstein, 28-12-2022

    Categories

    Computation and Language, Machine Learning

    Abstract

    We investigate the downstream performance achievable with a transformer-based language model trained completely from scratch with masked language modeling for a single day on a single consumer GPU. Aside from re-analyzing nearly all components of the pretraining pipeline for this scenario and providing a modified pipeline with performance close to BERT, we investigate why scaling down is hard, and which modifications actually improve performance in this scenario. We provide evidence that even in this constrained setting, performance closely follows scaling laws observed in large-compute settings. Through the lens of scaling laws, we categorize a range of recent improvements to training and architecture and discuss their merit and practical applicability (or lack thereof) for the limited compute setting.

    Bullet Points

    • The paper investigates the downstream performance achievable with a transformer-based language model trained completely from scratch with masked language modeling for a single day on a single consumer GPU

    • It re-analyzes nearly all components of the pretraining pipeline for this scenario, provides a modified pipeline with performance close to BERT, and investigates why scaling down is hard and which modifications actually improve performance

    • Even in this constrained setting, performance closely follows scaling laws observed in large-compute settings

    • Through the lens of scaling laws, recent improvements to training and architecture are categorized and their merit and practical applicability for the limited compute setting discussed (a minimal MLM sketch follows)
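
    A minimal sketch of the masked-language-modeling objective the paper trains from scratch. The model is assumed to be a generic encoder mapping token ids to per-token vocabulary logits; the 15% masking ratio is the conventional choice, and BERT's 80/10/10 replacement rule is omitted for brevity.

    ```python
    import torch
    import torch.nn.functional as F


    def mlm_step(model, batch: torch.Tensor, vocab_size: int,
                 mask_id: int, mask_prob: float = 0.15) -> torch.Tensor:
        """One MLM training step: mask ~15% of tokens and predict only those."""
        labels = batch.clone()
        mask = torch.rand(batch.shape, device=batch.device) < mask_prob
        labels[~mask] = -100              # loss ignores unmasked positions
        inputs = batch.clone()
        inputs[mask] = mask_id            # replace chosen tokens with [MASK]
        logits = model(inputs)            # (batch, seq_len, vocab_size)
        return F.cross_entropy(logits.reshape(-1, vocab_size),
                               labels.reshape(-1), ignore_index=-100)
    ```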