This project documents my journey into the depths of Large Language Models (LLMs). I felt there was no better way to gain a real understanding of them than hands-on experience. I bought the book Build a Large Language Model (From Scratch) by Sebastian Raschka and set out to build a GPT-2 medium model from scratch, making sure to understand every step along the way.
Disclaimer: Parts of this project draw heavily on the explanations and code provided in the book and its accompanying GitHub repository.
Project Goal: My objective is to construct a GPT-2 medium model, initialize it with pre-trained weights from OpenAI, and then fine-tune it for a specific purpose: answering questions about the books I have read.
With the rise of LLMs, it seems everyone is eager to see them as the solution to every problem. It reminds me of the saying, "to a man with a hammer, everything looks like a nail." While LLMs are undeniably powerful, it's important to remember that other approaches, like XGBoost for classification tasks, are often the better choice. Still, my curiosity (and perhaps a bit of FOMO) led me to explore the depths of LLMs.
In my view, machine learning is an invention, while LLMs are more akin to discoveries. We're still unraveling their full potential, much like the early days of understanding electricity after it was harnessed. Think of it this way: engineering gave us the invention of electrical systems, and that invention led directly to the discovery of electromagnetic waves, opening up a whole new world of possibilities, like radio and wireless communication. Similarly, LLMs continue to surprise us with emergent behaviors. Early on, their unexpected ability to translate languages, despite being trained primarily for next-word prediction, hinted at their hidden potential. More recently, we've seen the emergence of capabilities like chain-of-thought reasoning, where models can break down complex problems and explain their steps. These unexpected capabilities highlight the fact that we're still just scratching the surface of what LLMs can do.
This project explores the capabilities of a GPT-2 medium model (355 million parameters), which has a 24-layer transformer architecture with 16 attention heads in each layer. Although larger models are available, the hardware constraints of my MacBook Air M1 make this a practical starting point for experimentation.
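To make the 355 million figure concrete, here is a rough sketch that counts the parameters from the model's dimensions. Only the layer count and head count come from the description above. The vocabulary size (50,257), context length (1,024), embedding width (1,024), the 4x feed-forward expansion, bias terms on every linear layer, and an output layer that shares its weights with the token embedding are assumptions based on the published GPT-2 medium configuration. The names `GPT2_MEDIUM` and `count_parameters` are made up for this illustration.

```python
# Assumed GPT-2 medium hyperparameters. Only n_layers and n_heads are stated
# in the text; the other values are assumptions from the published config.
GPT2_MEDIUM = {
    "vocab_size": 50257,      # assumption: GPT-2 BPE vocabulary size
    "context_length": 1024,   # assumption: maximum sequence length
    "emb_dim": 1024,          # assumption: embedding width for the medium model
    "n_layers": 24,
    "n_heads": 16,
}


def count_parameters(cfg):
    """Count trainable parameters, assuming the output layer reuses the
    token embedding matrix (weight tying) and all linear layers have biases."""
    d = cfg["emb_dim"]
    tok_emb = cfg["vocab_size"] * d          # token embedding table
    pos_emb = cfg["context_length"] * d      # learned positional embeddings
    # Attention: query, key, value projections plus the output projection.
    attn = 3 * (d * d + d) + (d * d + d)
    # Feed-forward: expand to 4*d, then project back to d.
    mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)
    # Two layer norms per block, each with a scale and a shift vector.
    norms = 2 * 2 * d
    per_layer = attn + mlp + norms
    final_norm = 2 * d                       # layer norm after the last block
    return tok_emb + pos_emb + cfg["n_layers"] * per_layer + final_norm


total = count_parameters(GPT2_MEDIUM)
head_dim = GPT2_MEDIUM["emb_dim"] // GPT2_MEDIUM["n_heads"]
print(f"{total:,} parameters, {head_dim} dimensions per attention head")
```

Under these assumptions the count comes to 354,823,168, which rounds to the 355 million quoted above. The number of heads does not change the total; it only determines how the 1,024-wide embedding is split, here into 16 heads of 64 dimensions each. An implementation that keeps a separate, untied output layer would report an additional vocab_size × emb_dim parameters.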
This is just the beginning of my exploration. My next steps involve fine-tuning the model on a question-answering dataset and further specializing it with the contents of books I've read. This will allow me to query the model about those books and delve deeper into its understanding of the text.
I highly recommend the book Build a Large Language Model (From Scratch) to anyone who wants to dive deep into the world of LLMs. It's well-balanced, with roughly one-third text, one-third figures, and one-third code, making it an accessible read. The author meticulously breaks down each step and makes sure curious readers have all the information they need. For instance, there's even an appendix covering PyTorch for those who need it.