Syllabus

INTRODUCTION

Welcome to 6.861*!

Our world is inundated with digitized information, most of which is text (e.g., webpages, social media posts, news articles, long-form stories, etc.). How can we use computers to “understand”, process, and leverage this information to perform meaningful tasks? Natural Language Processing (NLP) is concerned with exactly this. This course will give you the opportunity to learn the latest machine learning and deep learning approaches to the most popular and powerful NLP tasks. Further, a core component of the course is research: you will produce an original research project while working in groups of 3-4 students.

LEARNING OBJECTIVES

By the end of the course, you will be able to:

  • understand the theoretical concepts behind the common NLP tasks and models
  • write effective programming solutions to popular problems in NLP
  • tackle your own, novel goals with text data once this course is over (e.g., if you have downloaded thousands of tweets over the past week, you’ll be able to come up with reasonable solutions to (1) identify sentiments about any phrase; (2) make classification predictions; (3) identify aliases for any entity, and much more)
  • conduct substantial, original NLP research (e.g., critically read papers published in top conferences, understand them, and execute your own ideas so as to answer novel research questions)

PREREQUISITES

  • NLP: No previous experience expected or necessary
  • Machine Learning: basic knowledge of Machine Learning (e.g., feed-forward neural nets, backpropagation, train/dev/test splits, regularization), for example from 6.390 (aka 6.036)
  • Probability and Statistics: (e.g., 6.370, 6.380, 18.05)
  • Multivariable Calculus: (e.g., 18.02)
  • Linear Algebra: (e.g., 18.061)
  • Algorithms: (e.g., 6.120, 6.121)
  • Programming: knowledge of Python and at least one class with substantial object-oriented programming (e.g., 6.100A)

SOFTWARE

You will program extensively in (a) Python and (b) PyTorch. If you are not already familiar with PyTorch, you will be expected to learn it on your own. We will also make use of Python libraries such as NumPy and scikit-learn.

COURSE STRUCTURE

The main delivery of information will be via Lectures, which will occur every time class meets (aside from Research Project presentations). Your learning will be assessed via three homework assignments, a mid-term exam, and a significant research project (in groups of 3-4 students). Your research project will require you to read/skim dozens of research papers on your own, based on your interests.

LECTURES

Every class session will contain a lecture. Lectures will concern:

  • Lecture 1: Introduction + ML Basics (Logistic Regression; SGD)
  • Lecture 2: Text Classification (linear classifier; BoW; TF-IDF)
  • Lecture 3: Word Representations (matrix factorization; word2vec)
  • Lecture 4: Language Modelling (MLP; RNN)
  • Lecture 5: Attention
  • Lecture 6: Transformers Part 1
  • Lecture 7: Transformers Part 2
  • Lecture 8: Large Language Models (LLMs) Part 1
  • Lecture 9: Large Language Models (LLMs) Part 2
  • Lecture 10: Structured Models: Hidden Markov Models (HMMs)
  • Lecture 11: Structured Models: Trees
  • Lecture 12: Structured Models: Conditional Random Fields (CRFs)
  • Lecture 13: Structured Models: Latent Variable Models
  • Lecture 14: Mid-term Review
  • Lecture 15: Doing Research
  • Lecture 16: NLP Engineering
  • Lecture 17: Guest Lecture: Ethics and NLP
  • Lecture 18: Interpretability
  • Lecture 19: Guest Lecture: Speech
  • Lecture 20: Guest Lecture: Human Language Processing
  • Lecture 21: Guest Lecture: SERC Ethics
  • Lecture 22: Guest Lecture: TBD
  • Lecture 23: Conclusion

HOMEWORK ASSIGNMENTS (30%)

There will be three equally weighted, individual homework assignments. See the Collaboration Policy below for details. Students will have a total of three free late days to use throughout the semester without any penalty. NOTE: valid excuses (e.g., medical excuses) do not count toward your three allotted “free” late days. Any late days used beyond these three will result in a deduction of 10 percentage points per day.

For example, suppose a student has already used all three free late days earlier in the semester and then turns in another homework assignment one day late. If the graded homework received an 88%, it will be reduced to 78% for being a day late. If that student had instead turned in the assignment two days late (a grand total of five late days used), that assignment would have received 68%.

If any particular homework assignment is late beyond three unexcused days, it will not be accepted or graded. The grade will be 0%.
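The late-day arithmetic above can be sketched in a few lines of Python. This is only an illustration, not an official grading script: the function name and its parameters are hypothetical, and it assumes that free late days are applied before any penalty, that excused days have already been subtracted, that a score never drops below 0%, and that an assignment rejected for being more than three unexcused days late does not consume free late days.

```python
MAX_UNEXCUSED_DAYS_LATE = 3  # beyond this, the assignment is not graded
PENALTY_PER_DAY = 10         # percentage points per penalized late day


def apply_late_policy(raw_score, unexcused_days_late, free_days_left):
    """Return (adjusted_score, free_days_left_afterwards).

    Hypothetical helper: assumes free late days are spent first and
    that excused days were already removed from unexcused_days_late.
    """
    if unexcused_days_late < 0 or free_days_left < 0:
        raise ValueError("day counts must be non-negative")

    # More than three unexcused days late: not accepted, grade is 0%.
    # Assumption: no free late days are consumed in this case.
    if unexcused_days_late > MAX_UNEXCUSED_DAYS_LATE:
        return 0.0, free_days_left

    free_used = min(unexcused_days_late, free_days_left)
    penalized_days = unexcused_days_late - free_used

    # Assumption: the adjusted score is floored at 0%.
    adjusted = max(0.0, float(raw_score) - PENALTY_PER_DAY * penalized_days)
    return adjusted, free_days_left - free_used
```

With no free late days remaining, apply_late_policy(88, 1, 0) gives an adjusted score of 78 and apply_late_policy(88, 2, 0) gives 68, matching the example above.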

MID-TERM EXAM (25%)

The midterm is intended to assess students’ knowledge of foundational content. It will be conducted in class (Oct 31) on paper, closed-book, and will include a combination of multiple-choice and free-response questions. The midterm will not ask students to write any code on paper. Further details will be presented closer to the exam date.

RESEARCH PROJECT (45%)

Throughout the semester, students will work in groups of 3-4 on a research project of their choosing. To help generate project ideas, we will maintain an ongoing, collaborative list of ideas contributed by all students.

Project assessment (percentages listed below are out of the total course grade):

  • Phase 1: Proposal (10%)
  • Phase 2: Related Work + Introduction (ungraded)
  • [OPTIONAL] Self-/peer- check-in (ungraded)
  • Phase 3: Paper Progress Report (5%)
  • Phase 4: FINAL DELIVERABLES:
    • research paper (5-8 pages for groups of 3 students; 6-8 pages for groups of 4 students) (20%)
    • poster session (10%)
    • code (ungraded)
    • [OPTIONAL] Self-/peer- check-in

Further details are listed here and will be discussed in class.
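As a rough illustration of how the stated weights combine into a course grade, here is a minimal Python sketch. It is not an official grading script: the dictionary keys and the function name are hypothetical, each component score is assumed to be a percentage from 0 to 100, and the 30% homework weight is assumed to split into 10% per assignment because the three assignments are equally weighted.

```python
# Weights in percent of the total course grade, taken from the syllabus.
# Key names are hypothetical labels chosen for this sketch.
WEIGHTS = {
    "homework_1": 10,
    "homework_2": 10,
    "homework_3": 10,
    "midterm": 25,
    "project_proposal": 10,
    "project_progress_report": 5,
    "project_paper": 20,
    "project_poster": 10,
}


def course_grade(scores):
    """Combine per-component percentages (0-100) into a course percentage."""
    if sum(WEIGHTS.values()) != 100:
        raise ValueError("weights must sum to 100")
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise KeyError(f"missing scores for: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100
```

The four graded project components (10 + 5 + 20 + 10) add up to the 45% allotted to the research project, and together with homework (30%) and the midterm (25%) the weights total 100%.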

RESOURCES

No textbook is required.

HELP

Students are encouraged to regularly read and actively participate in Piazza. This is intended to be a shared learning environment for all students in the class. Do not post any code on the forum, though. Doing so will violate our academic integrity policy.

TAs will hold scheduled Office Hours every week. The expectation is that you’ve already thought about the content and have very specific questions. That is, the intent of Office Hours is not to simply repeat the same content again, but to help clarify your understanding by addressing issues you’re currently facing. The TAs may help you with your code if you’ve demonstrated that you have already put forth significant effort and aren’t relying on them. In an attempt to help as many students as I can in a timely fashion, I will not assist with code.

COURSE POLICIES

COLLABORATION POLICY

The homework assignments must be completed individually. However, no student should feel alone in the course. So, we encourage you to discuss the assignments with your fellow classmates, but this must be at the conceptual level. That is, no student should ever see another student’s solutions or code. Your code must be written exclusively by you. If you post or share your homework assignment online (even if it only contains the questions and not solutions), this violates our academic policy and you will be reported to the university. This includes posting your assignment on GitHub. Do not do this. In other words, your homework assignment is a private copy that only you can see. If you’re unsure if something is allowed, please speak with us first. Any violation of the above constitutes Academic Dishonesty and will be reported.

We discourage you from using publicly-available code online, as you’ll learn more if you write your code from scratch. However, if you find useful code online that you wish to use, that is perfectly fine, but you must cite it.

We do not allow the use of Generative AI (e.g., ChatGPT, Copilot, etc.). Evidence of such use will be treated as a violation of our academic policy.

As a reminder, cheating not only harms a student’s own education but also affects everyone else in the course, as it creates an unfair environment and sacrifices the integrity of the entire course. For this reason, we actively check to ensure your code hasn’t been plagiarized or posted online.

ACADEMIC HONESTY

Ethical behavior is an important trait of anyone who works in the fields of Computer Science, Machine Learning, Data Science, and NLP: from ethically handling data, to thinking through the ramifications of one’s models, to attributing the code and work of others. Thus, in this course, we place strong emphasis on Academic Honesty.

COMMUNICATION FROM STAFF TO STUDENTS

Class announcements will be made through Canvas.