diff --git a/_contents/S0-L06.md b/_contents/S0-L06.md index 38052ee7..d3347209 100755 --- a/_contents/S0-L06.md +++ b/_contents/S0-L06.md @@ -2,7 +2,7 @@ layout: post title: Open Source LLM - Mistral Data preparation lecture: S0-Intro -lectureVersion: next +lectureVersion: current extraContent: notes: team-4 video: team-6 @@ -20,19 +20,25 @@ In this session, our readings cover: ## More Readings: -### - Llama 2: Open Foundation and Fine-Tuned Chat Models -+ In this work, we develop and release Llama 2, a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters. Our fine-tuned LLMs, called Llama 2-Chat, are optimized for dialogue use cases. Our models outperform open-source chat models on most benchmarks we tested, and based on our human evaluations for helpfulness and safety, may be a suitable substitute for closed-source models. We provide a detailed description of our approach to fine-tuning and safety improvements of Llama 2-Chat in order to enable the community to build on our work and contribute to the responsible development of LLMs. -### The Pile: An 800GB Dataset of Diverse Text for Language Modeling - + https://arxiv.org/abs/2101.00027 - + Recent work has demonstrated that increased training dataset diversity improves general cross-domain knowledge and downstream generalization capability for large-scale language models. With this in mind, we present \textit{the Pile}: an 825 GiB English text corpus targeted at training large-scale language models. The Pile is constructed from 22 diverse high-quality subsets -- both existing and newly constructed -- many of which derive from academic or professional sources. Our evaluation of the untuned performance of GPT-2 and GPT-3 on the Pile shows that these models struggle on many of its components, such as academic writing. Conversely, models trained on the Pile improve significantly over both Raw CC and CC-100 on all components of the Pile, while improving performance on downstream evaluations. Through an in-depth exploratory analysis, we document potentially concerning aspects of the data for prospective users. We make publicly available the code used in its construction. - - - +### OLMo: Accelerating the Science of Language Models ++ https://arxiv.org/abs/2402.00838 +Language models (LMs) have become ubiquitous in both NLP research and in commercial product offerings. As their commercial importance has surged, the most powerful models have become closed off, gated behind proprietary interfaces, with important details of their training data, architectures, and development undisclosed. Given the importance of these details in scientifically studying these models, including their biases and potential risks, we believe it is essential for the research community to have access to powerful, truly open LMs. To this end, this technical report details the first release of OLMo, a state-of-the-art, truly Open Language Model and its framework to build and study the science of language modeling. Unlike most prior efforts that have only released model weights and inference code, we release OLMo and the whole framework, including training data and training and evaluation code. We hope this release will empower and strengthen the open research community and inspire a new wave of innovation. ### Mixtral of Experts + https://arxiv.org/abs/2401.04088 + We introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. 
Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e. experts). For every token, at each layer, a router network selects two experts to process the current state and combine their outputs. Even though each token only sees two experts, the selected experts can be different at each timestep. As a result, each token has access to 47B parameters, but only uses 13B active parameters during inference. Mixtral was trained with a context size of 32k tokens and it outperforms or matches Llama 2 70B and GPT-3.5 across all evaluated benchmarks. In particular, Mixtral vastly outperforms Llama 2 70B on mathematics, code generation, and multilingual benchmarks. We also provide a model fine-tuned to follow instructions, Mixtral 8x7B - Instruct, that surpasses GPT-3.5 Turbo, Claude-2.1, Gemini Pro, and Llama 2 70B - chat model on human benchmarks. Both the base and instruct models are released under the Apache 2.0 license. + + +### - Llama 2: Open Foundation and Fine-Tuned Chat Models ++ In this work, we develop and release Llama 2, a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters. Our fine-tuned LLMs, called Llama 2-Chat, are optimized for dialogue use cases. Our models outperform open-source chat models on most benchmarks we tested, and based on our human evaluations for helpfulness and safety, may be a suitable substitute for closed-source models. We provide a detailed description of our approach to fine-tuning and safety improvements of Llama 2-Chat in order to enable the community to build on our work and contribute to the responsible development of LLMs. + +### The Pile: An 800GB Dataset of Diverse Text for Language Modeling + + https://arxiv.org/abs/2101.00027 + + Recent work has demonstrated that increased training dataset diversity improves general cross-domain knowledge and downstream generalization capability for large-scale language models. With this in mind, we present \textit{the Pile}: an 825 GiB English text corpus targeted at training large-scale language models. The Pile is constructed from 22 diverse high-quality subsets -- both existing and newly constructed -- many of which derive from academic or professional sources. Our evaluation of the untuned performance of GPT-2 and GPT-3 on the Pile shows that these models struggle on many of its components, such as academic writing. Conversely, models trained on the Pile improve significantly over both Raw CC and CC-100 on all components of the Pile, while improving performance on downstream evaluations. Through an in-depth exploratory analysis, we document potentially concerning aspects of the data for prospective users. We make publicly available the code used in its construction. + + + \ No newline at end of file diff --git a/_contents/S0-L07.md b/_contents/S0-L07.md index c41bf48c..ac992047 100755 --- a/_contents/S0-L07.md +++ b/_contents/S0-L07.md @@ -23,17 +23,23 @@ In this session, our readings cover: ## More Readings: +### Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs through a Global Scale Prompt Hacking Competition ++ https://arxiv.org/abs/2311.16119 ++ Large Language Models (LLMs) are deployed in interactive contexts with direct user engagement, such as chatbots and writing assistants. 
These deployments are vulnerable to prompt injection and jailbreaking (collectively, prompt hacking), in which models are manipulated to ignore their original instructions and follow potentially malicious ones. Although widely acknowledged as a significant security threat, there is a dearth of large-scale resources and quantitative studies on prompt hacking. To address this lacuna, we launch a global prompt hacking competition, which allows for free-form human input attacks. We elicit 600K+ adversarial prompts against three state-of-the-art LLMs. We describe the dataset, which empirically verifies that current LLMs can indeed be manipulated via prompt hacking. We also present a comprehensive taxonomical ontology of the types of adversarial prompts. + + +### Even More: ### ACL 2024 Tutorial: Vulnerabilities of Large Language Models to Adversarial Attacks -### https://llm-vulnerability.github.io/ ++ https://llm-vulnerability.github.io/ ### Generative AI and ChatGPT: Applications, challenges, and AI-human collaboration + https://www.tandfonline.com/doi/full/10.1080/15228053.2023.2233814 - + + -### https://huggingface.co/blog?tag=ethics ++ https://huggingface.co/blog?tag=ethics + https://huggingface.co/blog/ethics-diffusers + https://huggingface.co/blog/model-cards + https://huggingface.co/blog/us-national-ai-research-resource @@ -43,9 +49,7 @@ In this session, our readings cover: + https://www.nist.gov/itl/ai-risk-management-framework + https://airc.nist.gov/AI_RMF_Knowledge_Base/Playbook + https://airc.nist.gov/AI_RMF_Knowledge_Base/Roadmap - + - -### EU AI Act / GDPR + + EU AI Act / GDPR diff --git a/_contents/S0-L08.md b/_contents/S0-L08.md index 45ae96e8..b1e890b5 100755 --- a/_contents/S0-L08.md +++ b/_contents/S0-L08.md @@ -20,9 +20,19 @@ In this session, our readings cover: ## More Readings: +### Copyright Plug-in Market for The Text-to-Image Copyright Protection ++ https://openreview.net/forum?id=pSf8rrn49H + +### Audio Deepfake Detection: A Survey ++ https://arxiv.org/abs/2308.14970 ++ Audio deepfake detection is an emerging active topic. A growing number of literatures have aimed to study deepfake detection algorithms and achieved effective performance, the problem of which is far from being solved. Although there are some review literatures, there has been no comprehensive survey that provides researchers with a systematic overview of these developments with a unified evaluation. Accordingly, in this survey paper, we first highlight the key differences across various types of deepfake audio, then outline and analyse competitions, datasets, features, classifications, and evaluation of state-of-the-art approaches. For each aspect, the basic techniques, advanced developments and major challenges are discussed. In addition, we perform a unified comparison of representative features and classifiers on ASVspoof 2021, ADD 2023 and In-the-Wild datasets for audio deepfake detection, respectively. The survey shows that future research should address the lack of large scale datasets in the wild, poor generalization of existing detection methods to unknown fake attacks, as well as interpretability of detection results. 
+ ### Membership Inference Attacks against Language Models via Neighbourhood Comparison https://aclanthology.org/2023.findings-acl.719/ -### Copyright Plug-in Market for The Text-to-Image Copyright Protection - https://openreview.net/forum?id=pSf8rrn49H + +### Deepfake Taylor Swift event: ++ https://www.cbsnews.com/news/taylor-swift-artificial-intellignence-ai-4chan/ + + diff --git a/_contents/S0-L09.md b/_contents/S0-L09.md index c595fa84..66c59767 100755 --- a/_contents/S0-L09.md +++ b/_contents/S0-L09.md @@ -14,20 +14,26 @@ In this session, our readings cover: ## Required Readings: +### Are Large Pre-Trained Language Models Leaking Your Personal Information? ++ https://arxiv.org/abs/2205.12628 ++ Jie Huang, Hanyin Shao, Kevin Chen-Chuan Chang +Are Large Pre-Trained Language Models Leaking Your Personal Information? In this paper, we analyze whether Pre-Trained Language Models (PLMs) are prone to leaking personal information. Specifically, we query PLMs for email addresses with contexts of the email address or prompts containing the owner's name. We find that PLMs do leak personal information due to memorization. However, since the models are weak at association, the risk of specific personal information being extracted by attackers is low. We hope this work could help the community to better understand the privacy risk of PLMs and bring new insights to make PLMs safe. + ### Privacy Risks of General-Purpose Language Models + https://ieeexplore.ieee.org/abstract/document/9152761 - ++ We find the text embeddings from general-purpose language models would capture much sensitive information from the plain text. Once being accessed by the adversary, the embeddings can be reverse-engineered to disclose sensitive information of the victims for further harassment. Although such a privacy risk can impose a real threat to the future leverage of these promising NLP tools, there are neither published attacks nor systematic evaluations by far for the mainstream industry-level language models. To bridge this gap, we present the first systematic study on the privacy risks of 8 state-of-the-art language models with 4 diverse case studies. By constructing 2 novel attack classes, our study demonstrates the aforementioned privacy risks do exist and can impose practical threats to the application of general-purpose language models on sensitive data covering identity, genome, healthcare and location. For example, we show the adversary with nearly no prior knowledge can achieve about 75% accuracy when inferring the precise disease site from Bert embeddings of patients’ medical descriptions. As possible countermeasures, we propose 4 different defenses (via rounding, different... ## More Readings: +### Privacy in Large Language Models: Attacks, Defenses and Future Directions ++ https://arxiv.org/abs/2310.10383 ++ The advancement of large language models (LLMs) has significantly enhanced the ability to effectively tackle various downstream NLP tasks and unify these tasks into generative pipelines. On the one hand, powerful language models, trained on massive textual data, have brought unparalleled accessibility and usability for both models and users. On the other hand, unrestricted access to these models can also introduce potential malicious and unintentional privacy risks. Despite ongoing efforts to address the safety and privacy concerns associated with LLMs, the problem remains unresolved. 
In this paper, we provide a comprehensive analysis of the current privacy attacks targeting LLMs and categorize them according to the adversary's assumed capabilities to shed light on the potential vulnerabilities present in LLMs. Then, we present a detailed overview of prominent defense strategies that have been developed to counter these privacy attacks. Beyond existing works, we identify upcoming privacy concerns as LLMs evolve. Lastly, we point out several potential avenues for future exploration. + ### ProPILE: Probing Privacy Leakage in Large Language Models + https://arxiv.org/abs/2307.01881 + Siwon Kim, Sangdoo Yun, Hwaran Lee, Martin Gubri, Sungroh Yoon, Seong Joon Oh The rapid advancement and widespread use of large language models (LLMs) have raised significant concerns regarding the potential leakage of personally identifiable information (PII). These models are often trained on vast quantities of web-collected data, which may inadvertently include sensitive personal data. This paper presents ProPILE, a novel probing tool designed to empower data subjects, or the owners of the PII, with awareness of potential PII leakage in LLM-based services. ProPILE lets data subjects formulate prompts based on their own PII to evaluate the level of privacy intrusion in LLMs. We demonstrate its application on the OPT-1.3B model trained on the publicly available Pile dataset. We show how hypothetical data subjects may assess the likelihood of their PII being included in the Pile dataset being revealed. ProPILE can also be leveraged by LLM service providers to effectively evaluate their own levels of PII leakage with more powerful prompts specifically tuned for their in-house models. This tool represents a pioneering step towards empowering the data subjects for their awareness and control over their own data on the web. -### Are Large Pre-Trained Language Models Leaking Your Personal Information? -+ https://arxiv.org/abs/2205.12628 -+ Jie Huang, Hanyin Shao, Kevin Chen-Chuan Chang -Are Large Pre-Trained Language Models Leaking Your Personal Information? In this paper, we analyze whether Pre-Trained Language Models (PLMs) are prone to leaking personal information. Specifically, we query PLMs for email addresses with contexts of the email address or prompts containing the owner's name. We find that PLMs do leak personal information due to memorization. However, since the models are weak at association, the risk of specific personal information being extracted by attackers is low. We hope this work could help the community to better understand the privacy risk of PLMs and bring new insights to make PLMs safe. + diff --git a/_contents/S0-L11.md b/_contents/S0-L11.md index c57bf1f3..4b6a45aa 100755 --- a/_contents/S0-L11.md +++ b/_contents/S0-L11.md @@ -15,11 +15,14 @@ In this session, our readings cover: ## Required Readings: -### A Survey of Safety and Trustworthiness of Large Language Models through the Lens of Verification and Validation - + https://arxiv.org/abs/2305.11391 - + https://huggingface.co/blog/red-teaming +### Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! 
+ + https://arxiv.org/abs/2310.03693 +### Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training ++ https://www.anthropic.com/news/sleeper-agents-training-deceptive-llms-that-persist-through-safety-training ++ Humans are capable of strategically deceptive behavior: behaving helpfully in most situations, but then behaving very differently in order to pursue alternative objectives when given the opportunity. If an AI system learned such a deceptive strategy, could we detect it and remove it using current state-of-the-art safety training techniques? To study this question, we construct proof-of-concept examples of deceptive behavior in large language models (LLMs). For example, we train models that write secure code when the prompt states that the year is 2023, but insert exploitable code when the stated year is 2024. We find that such backdoor behavior can be made persistent, so that it is not removed by standard safety training techniques, including supervised fine-tuning, reinforcement learning, and adversarial training (eliciting unsafe behavior and then training to remove it). The backdoor behavior is most persistent in the largest models and in models trained to produce chain-of-thought reasoning about deceiving the training process, with the persistence remaining even when the chain-of-thought is distilled away. Furthermore, rather than removing backdoors, we find that adversarial training can teach models to better recognize their backdoor triggers, effectively hiding the unsafe behavior. Our results suggest that, once a model exhibits deceptive behavior, standard techniques could fail to remove such deception and create a false impression of safety. + ## More Readings: ### SafeText: A Benchmark for Exploring Physical Safety in Language Models @@ -29,6 +32,12 @@ In this session, our readings cover: ### ToxicChat: Unveiling Hidden Challenges of Toxicity Detection in Real-World User-AI Conversation / EMNLP2023 + +### A Survey of Safety and Trustworthiness of Large Language Models through the Lens of Verification and Validation + + https://arxiv.org/abs/2305.11391 + + https://huggingface.co/blog/red-teaming + + ### Lessons learned on language model safety and misuse + https://openai.com/research/language-model-safety-and-misuse @@ -37,6 +46,7 @@ In this session, our readings cover: + ### Tracing Model Outputs to the Training Data + https://www.anthropic.com/news/influence-functions diff --git a/_contents/S0-L12.md b/_contents/S0-L12.md index aff7602e..9811791b 100755 --- a/_contents/S0-L12.md +++ b/_contents/S0-L12.md @@ -15,18 +15,17 @@ In this session, our readings cover: ## Required Readings: -### Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! - + https://arxiv.org/abs/2310.03693 +### Cheating Suffix: Targeted Attack to Text-To-Image Diffusion Models with Multi-Modal Priors ++ Dingcheng Yang, Yang Bai, Xiaojun Jia, Yang Liu, Xiaochun Cao, Wenjian Yu ++ Diffusion models have been widely deployed in various image generation tasks, demonstrating an extraordinary connection between image and text modalities. However, they face challenges of being maliciously exploited to generate harmful or sensitive images by appending a specific suffix to the original prompt. Existing works mainly focus on using single-modal information to conduct attacks, which fails to utilize multi-modal features and results in less than satisfactory performance. Integrating multi-modal priors (MMP), i.e. 
both text and image features, we propose a targeted attack method named MMP-Attack in this work. Specifically, the goal of MMP-Attack is to add a target object into the image content while simultaneously removing the original object. The MMP-Attack shows a notable advantage over existing works with superior universality and transferability, which can effectively attack commercial text-to-image (T2I) models such as DALL-E 3. To the best of our knowledge, this marks the first successful attempt of transfer-based attack to commercial T2I models. Our code is publicly available at .... -### Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training -+ https://www.anthropic.com/news/sleeper-agents-training-deceptive-llms-that-persist-through-safety-training -+ Humans are capable of strategically deceptive behavior: behaving helpfully in most situations, but then behaving very differently in order to pursue alternative objectives when given the opportunity. If an AI system learned such a deceptive strategy, could we detect it and remove it using current state-of-the-art safety training techniques? To study this question, we construct proof-of-concept examples of deceptive behavior in large language models (LLMs). For example, we train models that write secure code when the prompt states that the year is 2023, but insert exploitable code when the stated year is 2024. We find that such backdoor behavior can be made persistent, so that it is not removed by standard safety training techniques, including supervised fine-tuning, reinforcement learning, and adversarial training (eliciting unsafe behavior and then training to remove it). The backdoor behavior is most persistent in the largest models and in models trained to produce chain-of-thought reasoning about deceiving the training process, with the persistence remaining even when the chain-of-thought is distilled away. Furthermore, rather than removing backdoors, we find that adversarial training can teach models to better recognize their backdoor triggers, effectively hiding the unsafe behavior. Our results suggest that, once a model exhibits deceptive behavior, standard techniques could fail to remove such deception and create a false impression of safety. +### A Pilot Study of Query-Free Adversarial Attack against Stable Diffusion ++ https://ieeexplore.ieee.org/document/10208563 ++ Despite the record-breaking performance in Text-to-Image (T2I) generation by Stable Diffusion, less research attention is paid to its adversarial robustness. In this work, we study the problem of adversarial attack generation for Stable Diffusion and ask if an adversarial text prompt can be obtained even in the absence of end-to-end model queries. We call the resulting problem ‘query-free attack generation’. To resolve this problem, we show that the vulnerability of T2I models is rooted in the lack of robustness of text encoders, e.g., the CLIP text encoder used for attacking Stable Diffusion. Based on such insight, we propose both untargeted and targeted query-free attacks, where the former is built on the most influential dimensions in the text embedding space, which we call steerable key dimensions. By leveraging the proposed attacks, we empirically show that only a five-character perturbation to the text prompt is able to cause the significant content shift of synthesized images using Stable Diffusion. 
Moreover, we show that the proposed target attack can precisely steer the diffusion model to scrub the targeted image content without causing much change in untargeted image content. -## More Readings: -### Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned - - https://arxiv.org/abs/2209.07858 +## More Readings: ### GOAT-Bench: Safety Insights to Large Multimodal Models through Meme-Based Social Abuse @@ -37,5 +36,8 @@ In this session, our readings cover: +### Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned + - https://arxiv.org/abs/2209.07858 + diff --git a/_contents/S0-L13.md b/_contents/S0-L13.md index 3d731396..5a75ba5b 100755 --- a/_contents/S0-L13.md +++ b/_contents/S0-L13.md @@ -17,9 +17,17 @@ In this session, our readings cover: ### Managing Existential Risk from AI without Undercutting Innovation + https://www.csis.org/analysis/managing-existential-risk-ai-without-undercutting-innovation +### OpenAI on LLM generated bio-x-risk ++ Building an early warning system for LLM-aided biological threat creation ++ https://openai.com/research/building-an-early-warning-system-for-llm-aided-biological-threat-creation ## More Readings: +### A misleading open letter about sci-fi AI dangers ignores the real risks + https://www.aisnakeoil.com/p/a-misleading-open-letter-about-sci + +### Evaluating social and ethical risks from generative AI + + https://deepmind.google/discover/blog/evaluating-social-and-ethical-risks-from-generative-ai/ ### Emergent autonomous scientific research capabilities of large language models + https://arxiv.org/abs/2304.05332 @@ -29,8 +37,3 @@ In this session, our readings cover: ### On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? + https://dl.acm.org/doi/10.1145/3442188.3445922 -### A misleading open letter about sci-fi AI dangers ignores the real risks - https://www.aisnakeoil.com/p/a-misleading-open-letter-about-sci - -### Evaluating social and ethical risks from generative AI - + https://deepmind.google/discover/blog/evaluating-social-and-ethical-risks-from-generative-ai/ \ No newline at end of file diff --git a/_contents/S0-L14.md b/_contents/S0-L14.md index 1090c632..2c5091af 100755 --- a/_contents/S0-L14.md +++ b/_contents/S0-L14.md @@ -21,10 +21,11 @@ In this session, our readings cover: ## More Readings: -### Making Retrieval Augmented Generation Fast - + https://www.pinecone.io/learn/fast-retrieval-augmented-generation/ - ### LlamaIndex +### LlamaIndex + https://docs.llamaindex.ai/en/stable/ LlamaIndex supports Retrieval-Augmented Generation (RAG). Instead of asking LLM to generate an answer immediately, LlamaIndex: retrieves information from your data sources first, / adds it to your question as context, and / asks the LLM to answer based on the enriched prompt. + +### Making Retrieval Augmented Generation Fast + + https://www.pinecone.io/learn/fast-retrieval-augmented-generation/
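
Below is a minimal sketch of the retrieve-then-answer flow described in the LlamaIndex entry above: load your own data, build an index, retrieve relevant chunks, and let the LLM answer from the enriched prompt. It assumes `llama-index >= 0.10` with its default OpenAI backend (so `OPENAI_API_KEY` must be set); the `./data` directory, the `similarity_top_k` value, and the example question are illustrative placeholders, not part of the course materials.

```python
# Minimal RAG sketch with LlamaIndex (assumptions: llama-index >= 0.10,
# OPENAI_API_KEY set, a local ./data folder of documents -- all illustrative).
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# 1) Load and index your own data sources.
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)

# 2) Retrieve relevant chunks, add them to the question as context,
#    and ask the LLM to answer based on the enriched prompt.
query_engine = index.as_query_engine(similarity_top_k=3)
response = query_engine.query("What does the reading say about prompt hacking?")
print(response)
```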