To my linguistics homepage

LING 581
Advanced Computational Linguistics
Spring 2024

This on-campus course continues the introductory LING/C SC/PSYC 538 Computational Linguistics [1]. It is designed to give students more in-depth knowledge of, and hands-on experience with, techniques and software than is possible in 538.

Part of this course will involve more advanced and in-depth exploration of fundamental topics covered in 538, e.g. with respect to writing grammars.

The larger part of the course involves projects using software packages.

As part of the course, students will be expected to develop the skills to install, run and perform project work on their own machines.

Projects:

  1. Parsing algorithms.
  2. Treebanks (phrase-structure/dependency-based): e.g. Penn Treebank, lookup software.
  3. Part-of-speech taggers.
  4. The use and modification of statistical parsers trained on treebanks.
  5. Advanced linguistic theories: the Minimalist Program.
  6. Ontologies and Semantic Networks: WordNet etc.
  7. Question-Answering (QA)
  8. more...

[1] Note: 538 is a prerequisite for this class. For on-campus students, 538 is offered in Fall semesters only.

Software

We will use the programming languages Python (3.x), Perl and Prolog, corpora from the LDC and other sources, and other software packages, e.g. (Java-based) parsers. All software used will be freely available.
Students are expected to have their own laptop and sufficient privileges to install packages on it. Linux and macOS will be supported. Windows 10 has only partial support, so students on a Windows PC should install Linux as well.

Grading

Students will be given a series of tasks to accomplish. Satisfactory completion of all tasks will result in a superior grade.

Readings

Required reading will come from the draft version of the 538 course textbook, Speech and Language Processing, 3rd edition (Jurafsky & Martin), and from project documentation (manuals), papers and/or dissertations to be made available on-line.

Instructor: Sandiway Fong sandiway AT arizona.edu
Office: 311 Douglass

Administrivia

Location: Modern Languages 203
Time: Tuesday-Thursday 9:30-10:45 am

Syllabus

See lecture 1 slides and syllabus.pdf.

Lecture Notes

Available in both Adobe PDF and Microsoft Powerpoint formats.

January

Date  Lecture Notes (PDF / Powerpoint)  Number of Slides  Panopto  Topic
1/11 lecture1.pdf lecture1.pptx 51 Viewer Syllabus. Scheduling. Computational Requirements. WSL2 on Windows 10/11.
Xcode and command line tools. nltk and Python 3.
AT&T 1953 prediction about translation. Apple's Neural Engine.
U. of Arizona and Wonder: ChatGPT and Google.
A brief peek at the next lecture.
1/16 lecture2.pdf lecture2.pptx 78 Viewer The Large Language Model (LLM) lecture: the Good, the Bad and the Ugly.
ChatGPT and other models.
Slides updated: 6am Jan 18th
1/18 lecture3.pdf lecture3.pptx 6 Viewer The Large Language Model (LLM) lecture: contd.
Homework 2.
6pm: corrected slide for due date.
1/23 lecture4.pdf lecture4.pptx 30 Viewer Homework 2 Review.
SWI-Prolog revisited: cheat sheet.
Definite clause grammar (DCG) rules.
Language membership question and enumeration.
Extra argument for nonterminals: recovering a parse tree.
Examples:
apbp.prolog
sheeptalk.prolog
sheeptalk2.prolog
Slides updated: 11:05am
1/25 lecture5.pdf lecture5.pptx 31 Viewer Context-sensitive grammars in Prolog: three methods.
The case of the context-sensitive language {a^n b^n c^n | n > 0}
abc_parse.prolog
abc_count.prolog
abc_cs.prolog
Homework 3.
1/30 lecture6.pdf lecture6.pptx 27 Viewer Homework 3 review.
The Cross Serial Dependencies lecture.
Developing the Prolog context-sensitive grammar in 4 stages.
g1.prolog
g2.prolog
g3.prolog
g4.prolog
Terminal log: terminal6.txt
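
For readers who want a quick feel for the language {a^n b^n c^n | n > 0} from the 1/25 lecture, here is a minimal Python sketch of a membership test based on counting. It is only an illustration: it is not the course's Prolog code (abc_count.prolog is not reproduced here), and the function name is_anbncn is made up for this example.

```python
import re

def is_anbncn(s: str) -> bool:
    """Return True if s has the form a^n b^n c^n for some n > 0."""
    # Step 1: check the overall shape: some a's, then b's, then c's.
    m = re.fullmatch(r"(a+)(b+)(c+)", s)
    if m is None:
        return False
    # Step 2: the three blocks must all have the same length.
    a_block, b_block, c_block = m.groups()
    return len(a_block) == len(b_block) == len(c_block)

print(is_anbncn("aabbcc"))  # True
print(is_anbncn("aabbc"))   # False
```

The regular expression alone only checks the order of the symbols; the length comparison is what takes the test beyond a regular (and beyond a context-free) language.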

February

Date  Lecture Notes (PDF / Powerpoint)  Number of Slides  Panopto  Topic
2/1 lecture7.pdf lecture7.pptx 24 Viewer Turn to writing our own CFGs for natural language:
  1. Agreement.
  2. The problem with Prolog left recursion.
  3. A grammar transformation: left recursive to right recursive BUT structure preserving.
Homework 4.
Files:
nl1.prolog / nl2.prolog / nl3.prolog
left.prolog / left2.prolog
Terminal log: terminal7.txt
2/6 lecture8.pdf lecture8.pptx 21 Viewer Homework 4 Aside: stacking PPs.
Anaphora
Homework 4 Questions?
2/8 lecture9.pdf lecture9.pptx 35 Viewer Homework 4 Review: live programming.
Other systems mentioned: Berkeley Neural Parser, ChatGPT, DALL-E 2, and FrameNet.
Files:
x.prolog / xt.prolog / npt.prolog
Slides updated: 11:30am
2/13 lecture10.pdf lecture10.pptx Viewer Lecture canceled due to sickness.
2/15 lecture11.pdf lecture11.pptx 16 Viewer Homework 4 Review: live programming completed.
Homework 5
2/20 lecture12.pdf lecture12.pptx 40 Viewer Homework 5 Review.
Memoization aka Dynamic Programming: example of Fibonacci numbers.
File: fibonacci.prolog
The CKY algorithm for Context-Free Parsing.
CFG parsing in nltk.
File: cnf.txt
Terminal log: terminal12.txt
2/22 lecture13.pdf lecture13.pptx 34 Viewer Announcement about next week.
Homework 6
Dotted rules.
The Shift Reduce Parsing Algorithm.
2/27 lecture14.pdf lecture14.pptx Viewer Pre-recorded Lecture. Crash Blossoms Homework 7.
Other Popular Parsers: CoreNLP, Stanza and Berkeley Neural Parser.
Homework: install Stanza into nltk, find Crash Blossoms, and run them on parsers.
2/29     Viewer No Lecture. See Lecture 14.
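
The memoization idea from the 2/20 lecture (Fibonacci numbers) can be sketched in a few lines of Python. This is an assumed illustration using the standard library's functools.lru_cache, not a translation of the course file fibonacci.prolog.

```python
from functools import lru_cache

@lru_cache(maxsize=None)  # remember every result already computed
def fib(n: int) -> int:
    """Return the n-th Fibonacci number, with fib(0) = 0 and fib(1) = 1."""
    if n < 2:
        return n
    # Without the cache this recursion recomputes the same subproblems
    # exponentially often; with it, each fib(k) is computed only once.
    return fib(n - 1) + fib(n - 2)

print(fib(10))  # 55
print(fib(50))  # 12586269025
```

The same trick of storing answers to subproblems in a table is what makes chart-based parsing such as CKY efficient.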

March

Date  Lecture Notes (PDF / Powerpoint)  Number of Slides  Panopto  Topic
3/5 Viewer Spring Break: no class
3/7 Viewer Spring Break: no class
3/12 lecture15.pdf lecture15.pptx Viewer Lecture canceled.
3/14 lecture16.pdf lecture16.pptx 27 Viewer Poll: any interest in my Keio lecture?
Crash Blossoms workflow.
A possible practical use.
WordNet: introduction. Use in nltk.
3/19 lecture17.pdf lecture17.pptx 24 Viewer WordNet: similarity.
word2vec: similarity. Using gensim.
hypernyms.py
Terminal log:
terminal17.txt
3/21 lecture18.pdf lecture18.pptx 32 Viewer A bit more on vector arithmetic and word embeddings.
A look at an example of possible uses of WordNet: Semantic Opposition.
Homework 8 (simple practice with the browser)
Terminal log: terminal18.txt
3/26 lecture19.pdf lecture19.pptx 19 Viewer Homework 8 Review
A look at an example of possible uses of WordNet: Semantic Opposition and lexical semantics.
3/28 lecture20.pdf lecture20.pptx 20 Viewer Semantic Opposition: ChatGPT vs. WordNet Breadth-first Search.
Live demo!
bfs.py
Terminal log: terminal20.txt
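
As a rough idea of what a breadth-first search over a lexical network involves (3/28 lecture), here is a small self-contained Python sketch. The graph below is invented toy data, not WordNet, and bfs_path is a hypothetical name; the course file bfs.py is not reproduced here.

```python
from collections import deque

# Toy relation graph: hypothetical edges for illustration only, NOT WordNet data.
GRAPH = {
    "hot": ["warm", "cold"],
    "warm": ["cool"],
    "cold": ["cool", "frozen"],
    "cool": [],
    "frozen": [],
}

def bfs_path(graph, start, goal):
    """Return a shortest path from start to goal as a list of nodes, or None."""
    queue = deque([[start]])   # each queue entry is a path found so far
    seen = {start}             # nodes already reached, to avoid revisiting
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None

print(bfs_path(GRAPH, "hot", "frozen"))  # ['hot', 'cold', 'frozen']
```

Because the queue is processed in first-in, first-out order, the first path that reaches the goal uses the fewest links.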

April

Date  Lecture Notes (PDF / Powerpoint)  Number of Slides  Panopto  Topic
4/2 lecture21.pdf lecture21.pptx 25 Viewer Install the Penn Treebank (PTB) corpus from the course website.
Homework 9: install tregex
PTB POS tagset and Syntax tagsets.
Introduction to tregex.
Terminal log: terminal21.txt
Slides updated: 11am
4/4 lecture22.pdf lecture22.pptx 30 Viewer Assume everyone has downloaded TREEBANK_3.zip.
Installing the full PTB into nltk: usage
from nltk.corpus import ptb
tregex: macOS possible problem
tregex: tutorial part 2 on searching
Terminal log: terminal22.txt
4/9 lecture23.pdf lecture23.pptx 22 Viewer tregex: tutorial part 3 on searching
Homework 10
Whiteboard images: 1 2 3
Terminal log: terminal23.txt
4/11 lecture24.pdf lecture24.pptx 25 Viewer tregex on Windows 11 and Ubuntu: make sure jre is installed.
Homework 10 questions?
TREEBANK_3 Theory: empty categories
Example: subject and object relative clauses.
Summary of tregex search
4/16 lecture25.pdf lecture25.pptx 41 Viewer Homework 10 Review
Statistical Parsers: unimplemented theory
Big unanswered question: why does break have so many different senses?
Let's take a look at the verb break using nltk ptb. How many examples of break are there? Can we extract the verb frames?
4/18 lecture26.pdf lecture26.pptx 14 Viewer Optional homeworks 11 and 12.
On corpus ptb from nltk. Make sure it's working properly before attempting the optional homeworks.
Terminal log: terminal26.txt
Slides modified: 11:10am
4/23 lecture27.pdf lecture27.pptx 28 Viewer More on nltk and ptb: (1) Zipf's law, (2) productions, and (3) looking for words with multiple POS tags and graphing the distribution.
Python program: zipf.py
4/25 lecture28.pdf lecture28.pptx 31 Viewer Revisiting the verb break: an example project pipeline.
Is the data out there rich? Comparing the ptb with the CHILDES database.
Berkeley Neural Parser: benepar installation and use with nltk and spaCy.
The CHILDES database: pre-processing a .csv file.
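
Zipf's law (4/23 lecture) says, roughly, that a word's frequency is inversely proportional to its frequency rank, so rank times frequency stays in the same ballpark across a large corpus. The Python sketch below builds the rank-frequency table that such a check starts from. It uses a made-up toy sentence rather than the ptb, the name rank_frequency is hypothetical, and the course program zipf.py is not reproduced here.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, word, frequency) triples, most frequent word first."""
    counts = Counter(tokens)
    return [
        (rank, word, freq)
        for rank, (word, freq) in enumerate(counts.most_common(), start=1)
    ]

# Toy text only: far too small to actually exhibit Zipf's law.
text = "the cat sat on the mat and the dog sat on the log"
for rank, word, freq in rank_frequency(text.split()):
    print(rank, word, freq, rank * freq)
```

On a real corpus, the usual next step is to plot log(frequency) against log(rank) and look for an approximately straight line.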

May

Date  Lecture Notes (PDF / Powerpoint)  Number of Slides  Panopto  Topic
5/1 lecture29.pdf lecture29.pptx 22 Viewer Reminder about optional homeworks.
Another experiment with the ptb. A code walk-through.
Treebanks provide statistics about CFG rules and how often they appear in the corpus.
Suppose we add rules in order from most frequent on down until we can parse a (given) sentence.
What do the parses look like?
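
The first step of the experiment described above, counting how often each CFG rule occurs in a treebank and listing the rules from most to least frequent, can be sketched as follows. The two trees are invented toy examples written as nested tuples, not PTB data, and the helper name productions is chosen for this sketch. With nltk installed, its Tree objects offer a productions() method that plays a similar role.

```python
from collections import Counter

def productions(tree):
    """Yield (lhs, rhs) rules from a tree written as (label, child, ...).

    Children are either nested tuples (subtrees) or plain strings (words).
    """
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield (label, rhs)
    for child in children:
        if not isinstance(child, str):
            yield from productions(child)

# Toy "treebank": two hand-made trees, for illustration only.
TREES = [
    ("S", ("NP", ("Det", "the"), ("N", "dog")), ("VP", ("V", "barks"))),
    ("S", ("NP", ("Det", "the"), ("N", "cat")),
          ("VP", ("V", "sees"), ("NP", ("Det", "the"), ("N", "dog")))),
]

rule_counts = Counter(rule for tree in TREES for rule in productions(tree))

# Rules from most frequent on down, as in the experiment.
for (lhs, rhs), count in rule_counts.most_common():
    print(count, lhs, "->", " ".join(rhs))
```

A grammar could then be grown by adding rules in this order until a parser accepts the target sentence; that second step is not shown here.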

