CS 124: From Languages to Information

Winter 2014 Dan Jurafsky

The web is a vast world of unstructured information — text and speech in multiple languages, social networks, tags, and all sorts of human interactions. Learn how to make sense of it!


Online Offering

From Languages to Information is offered online!

What this means:

Schedule

Week Date Homework In-class Video Lectures and Readings
1 Jan 7 and 9 -
  • Tue: Intro Lecture* [pptx] [pdf]

  • Thurs: Group Work: Text Processing with Unix tools [pptx] [pdf]
Basic Text Processing [slides pptx] [slides pdf]
  • J+M Section 2.1 Regular Expressions (17-26)
  • J+M section 3.9 Word and Sentence Tokenization (68-72)
  • MR+S Chapter 2: Term vocabulary and postings lists (only pages 18-33)
  • Ken Church's tutorial Unix for Poets, pages 1-19
Edit Distance [slides pptx] [slides pdf]
  • J+M section 3.11: Minimum Edit Distance (pages 72-77)
2 Jan 14 and 16

Homework 1: Spamlord

Due Fri Jan 17, 5:00pm

    Tuesday: No group exercises: open office hours in classroom
Thursday, Jan 16: Guest Lecturer*: Rob Munro, Idibon: "NLP for Social Good" [pptx]  [thoughts]
Language Modeling [slides pptx] [slides pdf] (skip the video/slides on Good Turing Smoothing)
  • J+M Chapter 4, N-grams, pages 83-100 (4.1-4.5.1), 104 (4.6), 109-111 (4.9.1). The rest of the chapter is optional advanced reading.
Spelling Correction and the Noisy Channel [slides pptx] [slides pdf]
  • Peter Norvig (2007) How to Write a Spelling Corrector. Read the first half (the "Future Work" and following sections are optional)
  • J+M Section 5.9: The Noisy Channel Model for Spelling
3 Jan 21 and 23

Homework 2: AutoCorrect!

Due Fri Jan 24, 5:00pm

Naive Bayes and Text Classification [slides pptx] [slides pdf]
  • MR+S Chapter 13: Text classification and Naive Bayes (only sections 13.0, 13.1, 13.2; pages 234-243)
Sentiment Analysis [slides pptx] [slides pdf]
4 Jan 28 and 30

Homework 3: Thumbs up!

Due Fri Jan 31, 5:00pm

  • Thursday Jan 30: Guest Lecturer*: Andrew Maas, Stanford and Coursera: "NLP for Online Education" [questions]
Information Retrieval (I) [slides pptx] [slides pdf]
  • MR+S Chapter 1: Boolean Retrieval (pages 1-17)
  • The rest of MR+S Chapter 2: Term vocabulary and postings lists (only pages 33-42)
Information Retrieval (II) [slides pptx] [slides pdf]
  • MR+S Chapter 6: Scoring, term weighting, and the vector space model, (only pages 100 and 107-116)
  • MR+S Chapter 8: Evaluation in Information Retrieval (only pages 139-149)
5 Feb 4 and 6

Homework 4: Search!

Due Fri Feb 7, 5:00pm

Tuesday: Group Work on Information Retrieval and Answer Key


Thursday Feb 6: Guest Lecturer*: Jennifer Chu-Carroll, IBM T. J. Watson Research Center [questions]
Relation Extraction [slides pptx] [slides pdf]
  • J+M Chapter 22: Information Extraction first 5 pages (pages 725-729) and section 22.2 (pages 734-743)
Question Answering [slides pptx] [slides pdf]
6 Feb 11 and 13

Homework 5: Jeopardy!

Due Fri Feb 14, 5:00pm

  • Tuesday: Group Work on Question Answering in the Mobile Domain
  • Thursday: Dan Lecture: "The Computational Social Science of Food, Language, and Scientific Innovation, or NLP goes Social"
Machine Translation 1 [slides pptx] [slides pdf]
  • J+M Chapter 25: Machine Translation, pages 859-879 (=IE 895-915)
Machine Translation 2 [slides pptx] [slides pdf]
  • J+M Chapter 25: Machine Translation, pages 879-897 (=IE 915-933)
    7 Feb 18 and 20
    Tuesday: Group Work on Machine Translation
    Word Meaning and Word Similarity [slides pptx] [slides pdf]
    • J+M Chapter 19: Lexical Semantics (pages 611-619 = IE 645-653)
    • J+M Chapter 20: Computational Lexical Semantics (pages 652-670 = IE 686-704)
    8 Feb 25 and 27

    Homework 6: Translate!

    Due Fri Feb 28, 5:00pm

    Tuesday: Dan Lecture (POS Tagging), Group Work on PA 6
    Thursday: Group Work on PA 6
    9 Mar 4 and 6

    Homework 6 Peer Grading

    Due Fri Mar 7, 5:00pm

      Tuesday: Dan's Lecture on Social Networks*
    • Thursday Mar 6: Guest Lecturer*: Xiao Li, Facebook
    Web graphs, Links, and PageRank [slides pptx] [slides pdf]
    10 Mar 11 and 13 -
    Tuesday: Peter Norvig, Google*
    Thursday: Course Review, Discussion of Practice Final and its Solutions
    Social Networks [slides pptx] [slides pdf]
    - Mar 20 - -
    Final Exam

    You can take it on either of these two days (but not both):
    Tuesday Mar 18, 12:15pm-3:15pm Location: Hewlett 201
    Thursday Mar 20, 12:15pm-3:15pm Location: CUBAUD (Cubberley Auditorium)

    (Also, here's the Practice Final and its Solutions)

    Course Information

    Logistics

    Instructor
    Dan Jurafsky (jurafsky@stanford.edu)
    Office: Margaret Jacks 117
    Office Hours: Tuesdays 2-3 or by appointment
    Teaching Assistants

    Samuel Bowman
    Milind Ganjoo
    Victoria Kwong
    Ashish Mathew
    Suril Shah
    Natalia Silveira

    TA Office Hours
    • Tuesdays 1:15 to 3:00 p.m.
    • Wednesdays 7:00 to 10:00 p.m. (Group Coding Session)
    • Thursdays 6:00 to 8:00 p.m.
    TA Office
    Huang 203 (plus Huang 219 for overflow), except on Tue Feb 11 and Thu Mar 6, when we'll be in Huang 218 instead of 203.
    Class Time

    Tuesday and Thursday 3:15-4:30pm in 420-040

    Email

    If you have a question that is not confidential or personal, post it on the Piazza forum; responses tend to be quicker and reach a wider audience. To contact the teaching staff directly, we strongly encourage you to come to office hours. If that is not possible, you can email non-technical questions to the course staff list, cs124-win1314-staff@lists.stanford.edu. We cannot reply to email sent to individual staff members. If you have a matter to discuss privately, please come to office hours, or use cs124-win1314-staff@lists.stanford.edu to make an appointment. For grading questions, please talk to us after class or during office hours.

    We use the mailing list generated by Axess to convey messages to the class. We will assume that all students read these messages.

    Honor Code

    Since we occasionally reuse homeworks from previous years, we expect students not to copy, refer to, or look at the solutions in preparing their answers. It is an honor code violation to intentionally refer to a previous year's solutions. This applies both to the official solutions and to solutions that you or someone else may have written up in a previous year. It is also an honor code violation to find some way to look at the test set, to interfere in any way with programming assignment scoring, or to tamper with the submit script.

    Textbooks
    • Required: Jurafsky and Martin. 2009. Speech and Language Processing (2nd Edition). Pearson
    • Recommended: Manning, Raghavan, and Schutze. 2008. Introduction to Information Retrieval. Cambridge University Press.

    Readings from MR+S are required, but the book is available online.

    Course Description

    Extracting meaning, information, and structure from human language text, speech, web pages, genome sequences, social networks, or any less structured information. Methods include: string algorithms, edit distance, language modeling, naive Bayes, inverted indices, vector semantics. Applications such as information retrieval, question answering, text classification, social network models, machine translation, genomic sequence alignment, word meaning extraction.

    Prerequisites

    CS 103, CS 107 and CS 109.

    Required Work

    Video Lectures

    Each week, we will ask you to watch a set of video lectures (2 to 2.5 hours total). The videos will have some in-video questions embedded in them, which you should answer. You are required to watch the videos, but the embedded quizzes are not counted toward the final grade.

    Automated Review Quizzes

    After you watch a week's video lectures, we will ask you to take an open-notes, open-book review quiz (about 5 questions) on the content you just learned. Each review quiz may be attempted several times, with a 10-minute wait between attempts. The questions, as well as the options for each question, are randomly selected from a larger pool each time you take a quiz. We will take the highest score over all attempts for each quiz. The first two attempts are not penalized; each subsequent attempt incurs a cumulative 20% penalty (e.g., the maximum possible score is 80% on the 3rd attempt and 60% on the 4th). Review quizzes for each week are due at 11:59pm on Tuesday of the following week. There are no late days for review quizzes.
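As a rough illustration of the quiz scoring rule, here is a minimal Python sketch. It assumes the penalty is subtracted from the raw score in percentage points (which matches the stated 80% and 60% maximums); the actual grading system may apply it differently. The function and constant names are hypothetical and not part of any course tool.

```python
# Sketch of the review-quiz scoring rule, under stated assumptions.
# Assumption: the cumulative penalty is subtracted from the raw score
# (in percentage points) and a score never drops below 0.

FREE_ATTEMPTS = 2                # the first two attempts are not penalized
PENALTY_PER_EXTRA_ATTEMPT = 20   # percentage points per attempt after that


def attempt_penalty(attempt):
    """Penalty (in percentage points) for a 1-indexed attempt number."""
    return PENALTY_PER_EXTRA_ATTEMPT * max(0, attempt - FREE_ATTEMPTS)


def recorded_quiz_score(raw_scores):
    """Best penalized score over all attempts, given raw scores (0-100)
    listed in the order the attempts were made."""
    best = 0
    for attempt, raw in enumerate(raw_scores, start=1):
        penalized = max(0, raw - attempt_penalty(attempt))
        best = max(best, penalized)
    return best
```

For example, under these assumptions, raw scores of 60, 70, and 100 on three attempts would be recorded as 80, since the perfect third attempt carries a 20-point penalty.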

    Class Participation

    Since lectures are online, the in-class sessions on Tuesdays and Thursdays will be used for problem-solving, reviews, discussions, guest speakers from industry, and presentations of state-of-the-art research. Attendance at the guest lectures, the first lecture, my lecture on networks, and possibly one other in-person lecture is required (this is the 5% class participation part of your grade). You can earn extra credit for class participation by answering questions on the class forum and by asking good questions of the invited speakers.

    Programming Assignments

    6 programming assignments (in Java or Python, your choice). Each assignment is due at 5:00pm on the Friday listed in the schedule.

    Programming Assignment Collaboration: You may talk to anybody you want about the assignments and bounce ideas off each other. But you must write the actual programs yourself.

    Late homeworks

    You have 4 free late (calendar) days to use on the first 5 programming assignments (HW 6 is a peer-graded assignment, and late days may not be applied). Once these are exhausted, any PA turned in late will be penalized 20% per late day. Each 24 hours or part thereof that a homework is late uses up one full late day. However, no assignment will be accepted more than four days after its due date.
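The late-day arithmetic above can be sketched in Python as follows. This is only an illustration: it assumes free late days are used up automatically before any penalty applies, and that the 20% penalty is taken in percentage points of the assignment score. All names here are hypothetical.

```python
import math
from datetime import datetime

SECONDS_PER_DAY = 24 * 60 * 60
MAX_DAYS_LATE = 4           # nothing is accepted more than four days late
PENALTY_PER_LATE_DAY = 20   # percentage points, once free days run out


def late_days(due, submitted):
    """Each 24 hours, or part thereof, past the deadline counts as a full day."""
    seconds_late = (submitted - due).total_seconds()
    if seconds_late <= 0:
        return 0
    return math.ceil(seconds_late / SECONDS_PER_DAY)


def apply_late_policy(days_late, free_days_left):
    """Return (accepted, penalty_percent, free_days_left_afterwards).

    Assumption: free late days are consumed first; only the remaining
    late days are penalized.
    """
    if days_late > MAX_DAYS_LATE:
        # Not accepted, so no penalty is computed and no free days are spent.
        return (False, 0, free_days_left)
    used = min(days_late, free_days_left)
    penalized_days = days_late - used
    return (True, PENALTY_PER_LATE_DAY * penalized_days, free_days_left - used)
```

For instance, a submission made one minute more than 24 hours after a 5:00pm deadline counts as two late days under the "or part thereof" rule.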

    Readings

    We will expect you to do a significant amount of textbook reading in this course.

    Final exam: You can take the final on either of these two days (but not both)

    Tuesday Mar 18, 12:15pm-3:15pm Location: Hewlett 201
    Thursday Mar 20, 12:15pm-3:15pm Location: CUBAUD (Cubberley Auditorium)

    Final grade
    • 57% homeworks
    • 29% final exam
    • 9% weekly review quizzes
    • 5% attendance and participation
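To see how these weights combine, here is a small Python sketch that computes a weighted final percentage from per-component scores. The component names are hypothetical labels chosen for this example, and the sketch ignores any extra credit for participation.

```python
# Weights from the grade breakdown above, in percent.
WEIGHTS = {
    "homeworks": 57,
    "final_exam": 29,
    "review_quizzes": 9,
    "participation": 5,
}


def final_grade(scores):
    """Weighted final percentage, given each component's score on a 0-100 scale."""
    assert sum(WEIGHTS.values()) == 100  # the four components cover the whole grade
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100
```

For example, scores of 90 on homeworks, 80 on the final exam, and 100 on both review quizzes and participation would give (57*90 + 29*80 + 9*100 + 5*100) / 100 = 88.5.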