Accept this assignment: GitHub Classroom Link
Due: March 9, 2026 at 11:59 PM EST
Click the link above to create your private repository for this assignment. Complete your work in Google Colab, then push your notebook, writeup, and presentation to the repository before the deadline.
The final project is your opportunity to synthesize everything you've learned throughout this course into an ambitious, cutting-edge project that showcases your mastery of large language models and conversational AI. This is a capstone experience where you'll tackle a challenging problem that would be nearly impossible in a traditional course—but is achievable with the help of modern GenAI tools like Claude, ChatGPT, and GitHub Copilot.
Unlike the structured problem sets, this project is open-ended and research-oriented. You have the creative freedom to explore novel applications, replicate and extend published research, build innovative systems, or conduct rigorous evaluations of LLM capabilities. The goal is to produce work that is intellectually sophisticated, technically ambitious, and potentially publishable or deployable in the real world.
This is your chance to:
- Explore a topic you're genuinely passionate about
- Push the boundaries of what you thought possible
- Create something meaningful that extends beyond the classroom
- Build a portfolio piece that demonstrates your AI/ML capabilities
- Potentially contribute to ongoing research in the field
You will work in teams of 2-3 students over 4 weeks (Weeks 7-10), leveraging your combined skills and interests to tackle problems that would be challenging for any individual. The collaborative nature of this project mirrors real-world research and industry practices.
By completing this project, you will:
- Synthesize course concepts: Integrate knowledge from all assignments—from rule-based systems (ELIZA) to modern transformers (GPT), from embeddings to fine-tuning.
- Design and execute research: Formulate research questions, design methodologies, implement solutions, and critically evaluate results.
- Master LLM tools and frameworks: Work extensively with Hugging Face, OpenAI/Anthropic APIs, PyTorch, and other modern ML frameworks.
- Leverage GenAI for ambitious projects: Use AI coding assistants to implement complex systems that would traditionally take months.
- Communicate technical work: Present findings clearly through code, visualizations, presentations, and written reports.
- Think critically about AI: Evaluate limitations, biases, ethical implications, and societal impacts of your work.
- Collaborate effectively: Work in a team to divide tasks, integrate components, and produce cohesive results.
Your project should represent approximately 4 weeks of focused work for a team of 2-3 students (Weeks 7-10). This translates to roughly 15-20 hours per person per week, though the actual time investment may vary based on your background and the project's complexity.
With GenAI assistance, you can tackle projects that would have been impossible just a few years ago. However, be realistic about scope:
Appropriate scale:
- Fine-tuning a model on a custom dataset for a specific task
- Building a multi-component system (e.g., RAG with custom embeddings and retrieval)
- Replicating a recent paper and extending it with new experiments
- Creating a novel evaluation framework for LLM capabilities
- Developing an agent-based system with multiple tools and reasoning components
Too small:
- Simple prompt engineering exercises
- Basic API wrapper applications
- Straightforward classification tasks using existing models
- Tutorials or literature reviews without implementation
Too ambitious (probably):
- Training a foundation model from scratch
- Building a production-scale application with full frontend/backend
- Projects requiring weeks of data collection
- Multiple independent research questions that should be separate projects
A good rule of thumb: your project should involve at least 3-4 substantial technical components (e.g., custom data processing + model fine-tuning + evaluation framework + analysis), where each component requires thoughtful design and implementation.
Your project should fall into one or more of the following categories:
Design and implement a new application that leverages LLMs in creative ways. Focus on solving a real problem or creating something genuinely useful.
Examples:
- Domain-specific assistants (medical diagnosis helper, legal research tool, educational tutor)
- Creative tools (story co-writer, game master, music lyric generator)
- Accessibility applications (sign language translation, simplified text generation)
- Professional tools (code documentation generator, literature review assistant)
Reproduce results from a recent research paper, then extend the work with new experiments, datasets, or analyses.
Examples:
- Replicate a fine-tuning approach from a paper, then test on new domains
- Reproduce an evaluation study, then extend with additional models or metrics
- Implement a published architecture, then ablate components to understand contributions
Explore modifications to existing architectures or novel training procedures. This is advanced but achievable with GenAI help.
Examples:
- Modified attention mechanisms for specific tasks
- Novel fine-tuning approaches (e.g., parameter-efficient methods)
- Hybrid architectures combining different model types
- Custom loss functions or training objectives
Conduct rigorous empirical studies of LLM capabilities, limitations, or behaviors.
Examples:
- Systematic evaluation of reasoning abilities across multiple models
- Bias analysis in specific domains or demographic groups
- Consistency and reliability testing under various conditions
- Comparative studies of different prompting strategies or model families
Build systems that integrate multiple modalities (text, image, audio) or implement autonomous agents with reasoning and tool use.
Examples:
- Vision-language systems for image description or visual question answering
- Agents that can plan, use tools, and complete complex tasks
- Multi-agent systems with specialized roles and communication
- Cross-modal retrieval or generation systems
Investigate critical issues around AI safety, bias, fairness, or societal impact.
Examples:
- Jailbreaking detection and mitigation strategies
- Bias measurement and debiasing techniques
- Factuality and hallucination analysis
- Privacy and security considerations in LLM applications
Here are 20 diverse project ideas to inspire you. These span different difficulty levels and interest areas. Feel free to use these as starting points, combine elements, or create something entirely new.
-
Domain-Specific RAG System: Build a retrieval-augmented generation system for a specialized domain (e.g., medical literature, legal documents, historical archives). Implement custom embeddings, retrieval strategies, and evaluate against baselines.
-
Conversational Search Engine: Create a multi-turn conversational search system that maintains context, asks clarifying questions, and provides source citations. Compare different retrieval and reranking strategies.
-
Hybrid Retrieval Architecture: Combine dense embeddings, sparse retrieval (BM25), and knowledge graphs for enhanced retrieval. Systematically evaluate each component's contribution.
-
Low-Resource Language Adaptation: Fine-tune an LLM for a low-resource language or dialect using limited data. Explore techniques like cross-lingual transfer, data augmentation, and parameter-efficient fine-tuning.
-
Scientific Writing Assistant: Fine-tune a model to help researchers write better papers by learning from high-quality publications in a specific field. Implement style transfer and citation generation.
-
Personalized Conversation Models: Develop methods to adapt dialogue models to individual user preferences and communication styles while preserving helpfulness and safety.
-
Multi-Turn Reasoning Dialogues: Build a conversational system that can engage in extended reasoning tasks (math problems, logic puzzles, strategic planning) while explaining its thought process.
-
Empathetic Support Chatbot: Create a psychologically-informed chatbot for emotional support or mental health check-ins. Evaluate empathy, safety, and effectiveness using human evaluation and automated metrics.
-
Socratic Tutoring System: Implement an educational assistant that uses Socratic questioning to guide students through problem-solving rather than providing direct answers.
-
Code Documentation Generator: Build a system that generates comprehensive documentation for code repositories, including docstrings, README files, and tutorials. Test on diverse codebases and programming languages.
-
Bug Detection and Repair: Develop an LLM-based system for finding and fixing bugs in code. Create a benchmark and compare against existing tools.
-
Technical Interview Practice System: Create an interactive system that conducts realistic technical interviews, provides feedback, and adapts difficulty based on performance.
-
Interactive Fiction Engine: Build a system for collaborative storytelling where the model maintains narrative coherence, character consistency, and player agency across long interactions.
-
Cross-Cultural Communication Assistant: Develop a tool that helps bridge cultural communication gaps by explaining context, suggesting phrasing, and highlighting potential misunderstandings.
-
Historical Dialogue Simulation: Create historically-accurate conversational agents representing figures from different time periods, trained on period-appropriate texts and evaluated for authenticity.
-
LLMs as Cognitive Models: Systematically evaluate whether LLMs exhibit human-like cognitive biases, heuristics, and reasoning patterns. Compare model behavior to psychological literature.
-
Language Acquisition Simulation: Investigate how LLMs "learn" linguistic structures by analyzing learning trajectories, comparing to child language acquisition data.
-
Theory of Mind in LLMs: Design experiments to test whether models exhibit theory of mind capabilities—understanding beliefs, intentions, and mental states of others.
-
Mechanistic Interpretability Study: Use techniques like attention visualization, activation probing, and causal interventions to understand how models perform specific tasks (e.g., factual recall, arithmetic, syntax).
-
Failure Mode Taxonomy: Systematically categorize and analyze LLM failures across multiple models and tasks. Develop predictors for when failures occur and potential mitigations.
-
Multi-Step Planning Agent: Build an agent that can decompose complex goals, create plans, execute actions with tools, and adapt based on feedback. Test on challenging benchmarks.
-
Scientific Hypothesis Generator: Create a system that reads research papers, identifies gaps, and proposes testable hypotheses. Evaluate novelty and plausibility with domain experts.
Your final project must include four key deliverables:
Submit a 1-2 page proposal that includes:
- Title and team members: Who is working on this?
- Motivation: What problem are you solving? Why does it matter?
- Research questions: What specific questions will you investigate?
- Approach: What methods, models, and datasets will you use?
- Expected outcomes: What do you hope to achieve or demonstrate?
- Timeline: How will you allocate work over the project period?
- Division of labor: How will team members split responsibilities?
The proposal helps you clarify your ideas and allows us to provide early feedback to ensure your project is appropriately scoped.
Submit a well-documented Jupyter notebook that:
- Runs completely in Google Colaboratory without errors
- Includes clear markdown explanations of your approach, decisions, and findings
- Contains all code necessary to reproduce your results
- Automatically downloads/loads any required datasets, models, or resources
- Follows best practices: modular code, informative comments, reproducible results
- Includes visualizations, tables, and figures to illustrate key findings
- Provides clear instructions for running experiments or using your system
Code quality matters: Your notebook should be something you'd be proud to share publicly or include in a portfolio.
Create a 10-12 minute video presentation and present it in class:
- Video submission: Upload to YouTube (unlisted is fine) and include the link in your writeup
- In-class component: Be prepared to present your video and lead a 5-10 minute discussion/Q&A
- Content: Cover motivation, approach, key results, insights, and limitations
- Visuals: Use slides, demos, visualizations—make it engaging and clear
- Team participation: All team members should contribute to the presentation
The presentation should be accessible to your classmates—assume they're smart but may not know the specific technical details of your domain.
Submit a 2-5 page writeup (not including references or appendices) structured as:
Introduction (0.5-1 page)
- Motivation and background
- Research questions or goals
- Why this problem is interesting/important
Approach (1-2 pages)
- Methods, models, datasets used
- Technical details of your implementation
- Design decisions and justifications
Results (1-1.5 pages)
- Key findings, with figures and tables
- Quantitative metrics and qualitative observations
- Comparison to baselines or related work
Discussion (0.5-1 page)
- Interpretation of results
- Limitations and potential improvements
- Broader implications
- Future directions
References
- Cite relevant papers, tools, datasets
- Follow a consistent citation format (APA, IEEE, etc.)
Write clearly and concisely. Think of this as a mini research paper suitable for a workshop or course anthology.
- Form teams of 2-3 students
- Brainstorm project ideas
- Review recent papers and existing systems for inspiration
- Discuss with instructor or peers to refine ideas
- Deliverable: Team formed, preliminary idea identified
- Develop detailed project proposal (1-2 pages)
- Identify datasets, models, and tools needed
- Create timeline and divide initial responsibilities
- Submit proposal and receive feedback
- Begin implementation: setup environment, load data, test baseline approaches
- Deliverable: Proposal submitted, initial codebase, preliminary experiments
- Implement main technical components
- Run experiments and collect results
- Iterate based on findings—adjust approaches as needed
- Begin drafting writeup and presentation materials
- Deliverable: Major progress on implementation, preliminary results
- Finalize code and ensure reproducibility
- Complete all experiments and analysis
- Finish writeup and presentation video
- Present to class and engage in discussions
- Deliverable: All final materials submitted, presentation delivered
Your project will be evaluated holistically, with the final grade broken down as:
- Clarity: Is the project clearly defined?
- Feasibility: Is the scope appropriate for the timeline?
- Significance: Does it address an interesting problem?
- Technical quality: Is the code well-written, documented, and reproducible?
- Sophistication: Does it demonstrate mastery of course concepts?
- Functionality: Does it work as intended?
- Reproducibility: Can results be replicated in Colab?
- Rigor: Are experiments well-designed and thorough?
- Insights: Do results provide meaningful insights?
- Evaluation: Are appropriate metrics and baselines used?
- Critical thinking: Are limitations and interpretations discussed thoughtfully?
- Clarity: Is the presentation clear and well-organized?
- Engagement: Is it interesting and accessible?
- Completeness: Does it cover all key aspects?
- Participation: Did all team members contribute?
- Writing quality: Is it clear, concise, and well-structured?
- Completeness: Does it cover all required sections?
- Depth: Does it provide sufficient detail and analysis?
- References: Are sources properly cited?
- Distribution of work: Did all team members contribute meaningfully?
- Integration: Are components well-integrated into a cohesive whole?
- Process: Did the team work effectively together?
Note: Exceptional projects that go above and beyond expectations may receive bonus points. Projects that demonstrate novel insights, publishable quality, or significant real-world applicability will be highlighted.
Review all assignments and lectures—they contain techniques, tools, and insights directly applicable to your project:
- Assignment 1 (ELIZA): Rule-based systems, pattern matching, conversation management
- Assignment 2 (SPAM): Classification, evaluation metrics, text preprocessing
- Assignment 3 (Wikipedia): Embeddings (LDA, BERT, GPT-2, Llama), clustering, dimensionality reduction (PCA, UMAP)
- Assignment 4 (Customer Service): Dialogue systems, context management, response generation
- Assignment 5 (GPT): Model architectures, training from scratch, transformer mechanics
- Hugging Face Transformers: Pre-trained models, tokenizers, fine-tuning pipelines
- Hugging Face Datasets: Access to thousands of datasets
- OpenAI API / Anthropic API: Access to GPT-4, Claude, and other commercial models
- LangChain: Building LLM applications, chains, and agents
- FAISS / Pinecone: Vector databases for retrieval systems
- PyTorch / TensorFlow: Deep learning frameworks
- Weights & Biases / TensorBoard: Experiment tracking and visualization
- Ollama: Running open-source models locally
- Hugging Face Datasets
- Papers with Code Datasets
- Google Dataset Search
- Kaggle Datasets
- Common Crawl
- The Pile
Stay current with cutting-edge work:
- arXiv cs.CL (Computation and Language)
- arXiv cs.AI (Artificial Intelligence)
- Papers with Code (Latest research with code)
- ACL Anthology (NLP conference papers)
- NeurIPS, ICML, ICLR (ML conferences)
- Discord: Use our course Discord for questions, discussions, and collaboration
- Office Hours: Schedule time to discuss project ideas, troubleshoot issues, or get feedback
- Peers: Discuss ideas with other teams—collaboration across teams is encouraged
- GenAI Tools: Use ChatGPT, Claude, GitHub Copilot extensively—document what you use
Don't underestimate the time required. Even with GenAI assistance, projects take longer than expected. Starting early gives you time to iterate, handle unexpected challenges, and produce polished results.
Not everything will work on the first try. That's normal in research. Build in time to try multiple approaches, debug issues, and refine your methods based on what you learn.
This is where you can be truly ambitious. Use tools like Claude, ChatGPT, and GitHub Copilot to:
- Implement complex architectures you haven't built before
- Write data processing pipelines quickly
- Debug tricky errors
- Generate boilerplate code
- Understand unfamiliar libraries or papers
- Brainstorm ideas and approaches
Remember: You're responsible for understanding and validating what AI generates. Always test, verify, and critically evaluate AI-generated code.
It's better to do one thing really well than three things poorly. If you find your project is too ambitious, narrow the focus rather than delivering incomplete work.
Technical sophistication is important, but so is clear communication. Make sure your code, writeup, and presentation are understandable to intelligent non-experts.
Keep track of experiments, decisions, and results as you go. This makes writing the final report much easier and helps you understand what worked and why.
Different team members may excel at different aspects (coding, experimentation, writing, visualization). Divide labor strategically but ensure everyone understands the full project.
Choose a project you're genuinely excited about. Passion and curiosity will sustain you through challenges and lead to better results.
Note: As this is a new course, we don't yet have prior student projects to showcase. However, here are examples of the caliber of work we hope to see:
- Cognitive bias evaluation framework: Systematic testing of whether LLMs exhibit human-like cognitive biases, with comparisons across model families
- Domain-adapted medical QA system: Fine-tuned model for answering medical questions with retrieval augmentation, evaluated against human experts
- Interactive mathematical reasoning tutor: Socratic dialogue system that guides students through proofs and problem-solving
- Multimodal scientific figure understanding: System that can interpret graphs, diagrams, and figures from research papers and answer questions about them
- Debate agent with evidence retrieval: Autonomous system that can argue positions on controversial topics by retrieving and citing evidence
Your project can match or exceed these examples. With modern tools and GenAI assistance, sophisticated projects are within reach.
This assignment is submitted via GitHub Classroom. Follow these steps:
-
Accept the assignment: Click the assignment link provided in Canvas or by your instructor
- Repository: github.com/ContextLab/final-project-llm-course
- This creates your own private repository for the assignment
-
Clone your repository:
git clone https://github.com/ContextLab/final-project-llm-course-YOUR_USERNAME.git
-
Complete your work:
- Work in Google Colab, Jupyter, or your preferred environment
- Save your notebook, writeup, and presentation materials to the repository
-
Commit and push your changes:
git add . git commit -m "Complete final project" git push
-
Verify submission: Check that your latest commit appears in your GitHub repository before the deadline
- Project Proposal: Week 8 (submit via GitHub Classroom)
- Final Submission (Code, Writeup, Presentation): March 9, 2026 at 11:59 PM EST
Your project must run in Google Colab. This means:
- Use publicly accessible datasets (or include download code)
- Include all installation commands in the notebook (
!pip install ...) - Don't rely on local files or proprietary data
- Test in a fresh Colab session before submission
- Consider compute constraints (RAM, GPU availability)
- Use
!nvidia-smito check GPU allocation if needed
- Use open-source models (Hugging Face) or provide API keys instructions
- For large models, use quantized versions or API access
- Include code to automatically download datasets
- Document any manual setup steps clearly
- Set random seeds for reproducible results
- Include version numbers for key libraries
- Save and load model checkpoints appropriately
- Document any hyperparameters used
- Use markdown cells liberally to explain your approach
- Include docstrings for functions
- Comment complex code sections
- Provide a "Quick Start" section at the top of your notebook
- Include a table of contents for longer notebooks
This final project is your chance to create something remarkable. You've learned about the full spectrum of language models—from simple pattern matching to sophisticated transformers. You've implemented these systems, evaluated them, and thought critically about their capabilities and limitations.
Now it's time to apply that knowledge to a problem you care about. With GenAI tools at your disposal, you can tackle projects that would have been impossible for a student team just a few years ago. Take advantage of this unique moment in history.
The best projects will:
- Demonstrate technical sophistication and mastery
- Provide novel insights or create genuinely useful systems
- Show critical thinking about AI capabilities and limitations
- Be executed with care, rigor, and attention to detail
- Communicate results clearly and engagingly
We're excited to see what you create. Good luck, and remember: ambitious goals combined with thoughtful execution lead to extraordinary results.
Questions? Use the course Discord or attend office hours. We're here to help you succeed.
Ready to start? Form your team, start brainstorming, and prepare to build something amazing.