INTRODUCTION
Welcome to 6.861*!
Our world is inundated with digitized information, most of which is text (e.g., webpages, social media posts, news articles, long-form stories, etc.). How can we use computers to “understand”, process, and leverage this information to perform meaningful tasks? Natural Language Processing (NLP) is concerned with exactly this. This course will give you the opportunity to learn the latest machine learning/deep learning approaches for solving the most popular and powerful NLP tasks. Further, a core component of the course is research: you’ll produce an original research project while working in groups of 3-4 students.
LEARNING OBJECTIVES
By the end of the course, you will be able to:
- understand the theoretical concepts behind the common NLP tasks and models
- write effective programming solutions to popular problems in NLP
- tackle your own, novel goals with text data once this course is over (e.g., if you have downloaded thousands of tweets over the past week, you’ll be able to come up with reasonable solutions to (1) identify sentiments about any phrase; (2) make classification predictions; (3) identify aliases for any entity, and much more)
- conduct substantial, original NLP research (e.g., critically read papers published in top conferences, understand them, and execute your own ideas so as to answer novel research questions)
PREREQUISITES
- NLP: No previous experience expected or necessary
- Machine Learning: basic knowledge of Machine Learning (e.g., feed-forward neural nets, backpropagation, what train/dev/test splits are, regularization), such as from 6.390 (aka 6.036)
- Probability and Statistics: (e.g., 6.370, 6.380, 18.05)
- Multivariable Calculus: (e.g., 18.02)
- Linear Algebra: (e.g., 18.061)
- Algorithms: (e.g., 6.120, 6.121)
- Programming: knowledge of Python and at least one class with substantial object-oriented programming (e.g., 6.100A)
SOFTWARE
You will program extensively using (a) Python and (b) PyTorch. If you are not already familiar with PyTorch, you will be expected to learn it on your own. We will also make use of Python libraries such as NumPy and scikit-learn.
COURSE STRUCTURE
The main delivery of information will be via Lectures, which will occur every time class meets (aside from Research Project presentations). Your learning will be assessed via three homework assignments, a midterm exam, and a significant research project (in groups of 3-4 students). Your research project will require you to read/skim dozens of research papers on your own, based on your interests.
LECTURES
Every class session will contain a lecture. Lectures will concern:
- Lecture 1: Introduction + ML Basics
- Lecture 2: Classification (linear models, neural nets)
- Lecture 3: Sequence models 1 (ngrams, log-linear LMs, word2vec)
- Lecture 4: Sequence models 2 (RNNs)
- Lecture 5: Sequence models 3 (seq2seq + attention)
- Lecture 6: Transformers
- Lecture 7: Pretraining 1 (BERT and GPT)
- Lecture 8: Pretraining 2 (SFT and RLHF)
- Lecture 9: Efficient training (MoE, quantization, LoRA)
- Lecture 10: Doing research
- Lecture 11: Decoding 1 (prompting, CoT, and agents)
- Lecture 12: Decoding 2 (search and sampling)
- Lecture 13: Multimodality
- Lecture 14: Midterm review
- Lecture 15: Midterm
- Lecture 16: Beyond transformers (SSMs, Mamba, …)
- Lecture 17: NLP Engineering
- Lecture 18: Interpretability
- Lecture 19: Guest Lecture: Speech
- Lecture 20: Structured prediction
- Lecture 21: Guest Lecture: Intellectual property (SERC)
- Lecture 22: Bias & fairness
- Lecture 23: Human language processing
- Lecture 24: Conclusion
HOMEWORK ASSIGNMENTS (30%)
There will be three equally-weighted, individual homework assignments. See the Collaboration Policy below for details. Students will have a total of three free late days to use throughout the semester without any penalty. NOTE: valid excuses (e.g., medical excuses) do not count toward your three allotted “free” late days. Any late days used beyond these three will result in a deduction of 10 points per day.
For example, suppose a student has already used all three free late days earlier in the semester and then turns in another homework assignment one day late. If the graded homework received an 88%, it will be reduced to a 78% for being one day late. If that student had instead turned in the assignment two days late (a grand total of five late days used), that assignment would have received a 68%.
If any particular homework assignment is late beyond three unexcused days, it will not be accepted or graded. The grade will be 0%.
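The late-day arithmetic above can be sketched in a few lines of Python. This is only an illustrative sketch, not an official grading tool: the function name `late_adjusted_score` and its parameters are hypothetical, and it assumes that "late beyond three unexcused days" counts every non-excused day an assignment is late, whether or not a free late day covers it.

```python
def late_adjusted_score(raw_score, days_late, free_days_left,
                        penalty_per_day=10.0, max_days_late=3):
    """Return (adjusted_score, free_days_remaining) for one homework.

    Hypothetical helper. Assumes days_late already excludes excused days
    (e.g., medical excuses) and that free late days are spent first.
    """
    if days_late < 0 or free_days_left < 0:
        raise ValueError("days_late and free_days_left must be non-negative")

    # More than three unexcused late days: not accepted, graded as 0%.
    # Assumption: no free late days are consumed in that case.
    if days_late > max_days_late:
        return 0.0, free_days_left

    # Spend remaining free late days first; the rest cost 10 points each.
    free_used = min(days_late, free_days_left)
    penalized_days = days_late - free_used
    adjusted = max(0.0, raw_score - penalty_per_day * penalized_days)
    return adjusted, free_days_left - free_used


# The two cases from the example above: no free late days remain.
print(late_adjusted_score(88, days_late=1, free_days_left=0))  # (78.0, 0)
print(late_adjusted_score(88, days_late=2, free_days_left=0))  # (68.0, 0)
```

If you are unsure how a particular situation would be counted, ask the course staff rather than relying on this sketch.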
SPECIAL TOPIC RESPONSES (5%)
Each student is expected to attend every guest lecture and submit a short write-up (maximum of 1-2 paragraphs) that mentions a few items you liked or learned from the talk, along with a few items you either had trouble understanding or would have liked to see presented differently. This should not be a surface-level summary of the talk, and it should be in your own original words, not those of an LLM.
MIDTERM EXAM (20%)
The midterm is intended to assess students’ knowledge of foundational content. It will be conducted in class (Oct 29) on paper, closed-book, and will include a combination of multiple-choice and free-response questions. The midterm will not ask students to write any code on paper. Further details will be presented closer to the exam date.
RESEARCH PROJECT (45%)
Throughout the semester, students will work in groups of three or four on a research project of their choosing. To help facilitate ideas for projects, we will maintain an on-going, collaborative list from all students.
Project assessment (percentages listed below are out of the total course grade):
- Phase 1: Proposal (10%) (due 10/15/2024)
- Phase 2: Related Work + Introduction (ungraded)
- [OPTIONAL] Self-/peer- check-in (ungraded)
- Phase 3: Paper Progress Report (5%) (due 11/12/2024)
- Phase 4: FINAL DELIVERABLES:
  - research paper (5-8 pages for groups of 3 students; 6-8 pages for groups of 4 students) (20%) (due 12/10/2024)
  - final poster (uploaded by 12/5/2024)
  - poster session (10%) (12/5/2024 and 12/10/2024)
  - code (ungraded)
- [OPTIONAL] Self-/peer- check-in
Further details are listed here and will be discussed in class.
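The percentages in the section headings and in the project breakdown above add up to 100%. As a rough sketch of how they combine, the snippet below computes a weighted course grade. The dictionary keys and the function name `course_grade` are hypothetical, and the sketch assumes each component is scored from 0 to 100 and combined linearly with no curving or rounding, which this syllabus does not specify.

```python
# Weights (in % of the total course grade) taken from this syllabus.
WEIGHTS = {
    "homework": 30,                 # three equally weighted assignments
    "special_topic_responses": 5,
    "midterm": 20,
    "project_proposal": 10,         # Phase 1
    "project_progress_report": 5,   # Phase 3
    "project_paper": 20,            # Phase 4
    "project_poster_session": 10,   # Phase 4
}


def course_grade(scores):
    """Weighted average of component scores, each on a 0-100 scale.

    Hypothetical helper; assumes a plain linear combination.
    """
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise KeyError(f"missing component scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


# Sanity check: the weights cover the whole course grade.
assert sum(WEIGHTS.values()) == 100

# Example: 90 average on homework, full marks everywhere else.
example = {name: 100 for name in WEIGHTS}
example["homework"] = 90
print(course_grade(example))  # 97.0
```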
RESOURCES
No textbook is required.
- Canvas: homework assignments
- Piazza (accessible from Canvas): announcements, technical discussion, and clarifying questions
- Project Ideas (coming soon): on-going spreadsheet to collaboratively find and create research projects
- Emergency Helpline: for private concerns, issues, and questions (not course-content related)
- Supplemental Resources: a compilation of useful, external resources
HELP
Students are encouraged to regularly read and actively participate in Piazza. This is intended to be a shared learning environment for all students in the class. Do not post any code on the forum, though. Doing so will violate our academic integrity policy.
TAs will hold scheduled Office Hours every week. The expectation is that you’ve already thought about the content and have very specific questions. That is, the intent of Office Hours is not to simply repeat the same content again, but to help clarify your understanding by addressing issues you’re currently facing. The TAs may help you with your code, if you’ve demonstrated that you have already put forth significant effort and aren’t relying on them. In an attempt to help as many students as I can in a timely fashion, I will not assist with code.
COURSE POLICIES
COLLABORATION POLICY
The homework assignments must be completed individually. However, no single student should feel alone in the course. So, we encourage you to discuss the assignments with your fellow classmates, but this must be at the conceptual level. That is, no student should ever see another student’s solutions or code. Your code must be written exclusively by you. If you post or share your homework assignment online (even if it only contains the questions and not solutions), this violates our academic policy and you will be reported to the university. This includes posting your assignment on GitHub. Do not do this. In other words, your homework assignment is a private copy that only you can see. If you’re unsure if something is allowed, please speak with us first. Any violation of the above constitutes Academic Dishonesty and will be reported.
We discourage you from using publicly-available code online, as you’ll learn more if you write your code from scratch. However, if you find useful code online that you wish to use, that is perfectly fine, but you must cite it.
We do not allow the use of Generative AI (e.g., ChatGPT, Copilot, etc.). Evidence of such use will violate our academic policy.
As a reminder, if a student cheats, it is not only harmful to one’s own education but it also impacts everyone else in the course – as it creates an unfair environment and sacrifices the integrity of the entire course. For this reason, we actively check to ensure your code hasn’t been plagiarized or posted online.
CI RECITATION ABSENCE POLICY
Students are allowed two (2) absences for recitation. For each additional absence after the 2nd, a student may receive a 1% deduction of the communication portion of the student’s Research Paper final draft grade. This deduction applies to individuals, not to groups. Instructors may excuse an absence if the student (1) attends a different recitation in the same week as the recitation they missed and notifies their recitation instructor and the instructor of the recitation they attended; OR, (2) makes arrangements with their recitation instructor to excuse the absence.
ACADEMIC HONESTY
Ethical behavior is an important trait of anyone who works in Computer Science, Machine Learning, Data Science, or NLP: from ethically handling data, to thinking through the ramifications of one’s models, to attributing the code and work of others. Thus, in this course, we place strong emphasis on Academic Honesty.
COMMUNICATION FROM STAFF TO STUDENTS
Class announcements will be made through Canvas.