- Recap on Deep Learning & basic NLP (slides / lab session)
- Tokenization (slides / lab session)
- Language Modeling (slides / lab session)
- NLP without 2048 GPUs (slides / lab session)
- Language Models at Inference Time (slides / lab session)
- Handling the Risks of Language Models (slides / lab session)
- Advanced NLP tasks (slides / lab session)
- Domain-specific NLP (slides / lab session)
- Multilingual NLP (slides / lab session)
- Multimodal NLP (slides / lab session)
The evaluation consists of a team project (3-5 people). The choice of subject is free but must follow some basic rules:
- Obviously, the project must be strongly related to NLP, and especially to the notions we will cover in the course
- You may only use open-source LLMs that you serve yourself. In other words, no APIs / ChatGPT-like services may be used, except for a final comparison with your model.
- You must identify and address a challenging problem (e.g. not just "can an LLM do X?", but "can an LLM that runs on a CPU do X?" or "can I make an LLM better at X?")
- It must be reasonably doable: you will not be able to fine-tune (or even use) a 405B-parameter model, or to train a model from scratch. That's fine: there are plenty of smaller models that should be good enough, like the Pythia models, TinyLlama, the 1B-parameter OLMo, or the small models from the Llama 3.2 suite (see the sketch right after this list for how to run one locally).
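For reference, here is a minimal sketch of what "serving an open-source LLM yourself" can look like, using the Hugging Face `transformers` library with the `EleutherAI/pythia-1b` checkpoint as an arbitrary example (any of the small models listed above can be substituted):

```python
# Minimal sketch: running a small open-source LM locally, with no external API.
# The model id is only an example; swap in TinyLlama, OLMo 1B, Llama 3.2 1B, etc.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EleutherAI/pythia-1b"  # ~1B parameters, small enough for a laptop CPU

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "Natural language processing is"
inputs = tokenizer(prompt, return_tensors="pt")

# Greedy decoding of a short continuation; adjust max_new_tokens and the
# decoding parameters to fit your own project.
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```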
⏰ The project has three deadlines:
- Project announcement (before 25/10/24): send an email to [email protected], with [email protected] and [email protected] in cc, explaining:
  - The team members (also cc'ed)
  - A rough description of the project (it can change later on)
- Project proposal (25% of the final grade, before 15/11/24): following this template, produce a project proposal explaining your first attempts (e.g. an alpha version), how they failed or succeeded, and what you plan to do before the delivery.
- Project delivery (75% of the final grade, 13/12/24): delivery of a GitHub repo with an explanatory README, plus an oral presentation on December 13th
- A Vocabulary-Free Multilingual Neural Tokenizer for End-to-End Task Learning (https://arxiv.org/abs/2204.10815)
- BPE-Dropout: Simple and Effective Subword Regularization (https://aclanthology.org/2020.acl-main.170/)
- FOCUS: Effective Embedding Initialization for Monolingual Specialization of Multilingual Models (https://aclanthology.org/2023.emnlp-main.829/)
- Efficient Streaming Language Models with Attention Sinks (https://arxiv.org/abs/2309.17453)
- Lookahead decoding (https://lmsys.org/blog/2023-11-21-lookahead-decoding/)
- Efficient Memory Management for Large Language Model Serving with PagedAttention (https://arxiv.org/pdf/2309.06180.pdf)
- Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (https://arxiv.org/abs/2201.11903)
- Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters (https://arxiv.org/abs/2408.03314v1)
- Detecting Pretraining Data from Large Language Models (https://arxiv.org/abs/2310.16789)
- Proving Test Set Contamination in Black Box Language Models (https://arxiv.org/abs/2310.17623)
- Mamba: Linear-Time Sequence Modeling with Selective State Spaces (https://arxiv.org/abs/2312.00752)
- Null It Out: Guarding Protected Attributes by Iterative Nullspace Projection (https://aclanthology.org/2020.acl-main.647/)
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model (https://arxiv.org/abs/2305.18290)
- Text Embeddings Reveal (Almost) As Much As Text (https://arxiv.org/abs/2310.06816)