
LING 581
Advanced Computational Linguistics
Spring 2024

This on-campus course continues the introductory LING/C SC/PSYC 538 Computational Linguistics¹. It is designed to give students more in-depth knowledge of, and hands-on experience with, techniques and software than is possible in 538.

Part of this course will involve more advanced and in-depth exploration of fundamental topics covered in 538, e.g. with respect to writing grammars.

The larger part of the course involves projects using software packages.

As part of the course, students will be expected to develop the skills to install, run and perform project work on their own machines.

Projects:

Parsing algorithms.
Treebanks (phrase-structure/dependency-based): e.g. Penn Treebank, lookup software.
Part-of-speech taggers.
The use and modification of statistical parsers trained on Treebanks.
Advanced linguistic theories: the Minimalist Program.
Ontologies and Semantic Networks: WordNet etc.
Question-Answering (QA)
more...

¹Note: 538 is a prerequisite for this class. For on-campus students, 538 is offered in Fall semesters only.

Software

We will use the programming languages Python (3.x), Perl and Prolog, corpora from the LDC and other sources, and other software packages, e.g. (Java-based) parsers. All software used will be freely available.
Students are expected to have their own laptop and to possess sufficient privileges to install packages on it. Linux and macOS will be supported. Windows 10 is only partially supported; students on a Windows PC should install Linux as well.

Grading

Students will be given a series of tasks to accomplish. Satisfactory completion of all tasks will result in a superior grade.

Readings

Required reading will be from the draft version of the 538 course textbook, Speech and Language Processing, 3rd edition (Jurafsky & Martin), and from project documentation (manuals) and papers and/or dissertations to be made available online.

Instructor: Sandiway Fong sandiway AT arizona.edu
Office: 311 Douglass

Administrivia

Location	Modern Languages 203
Time	Tuesday-Thursday 9:30-10:45 am

Syllabus

See lecture 1 slides and syllabus.pdf.

Lecture Notes

Available in both Adobe PDF and Microsoft PowerPoint formats.

January

Date	PDF	PowerPoint	Number of Slides	Panopto	Topic
1/11	lecture1.pdf	lecture1.pptx	51	Viewer	Syllabus. Scheduling. Computational Requirements. WSL2 on Windows 10/11. Xcode and command line tools. nltk and Python 3. AT&T 1953 prediction about translation. Apple's Neural Engine. U. of Arizona and Wonder: ChatGPT and Google. A brief peek at the next lecture.
1/16	lecture2.pdf	lecture2.pptx	78	Viewer	The Large Language Model (LLM) lecture: the Good, the Bad and the Ugly. ChatGPT and other models. Slides updated: 6am Jan 18th
1/18	lecture3.pdf	lecture3.pptx	6	Viewer	The Large Language Model (LLM) lecture: contd. Homework 2. 6pm: corrected slide for due date.
1/23	lecture4.pdf	lecture4.pptx	30	Viewer	Homework 2 Review. SWI-Prolog revisited: cheat sheet. Definite clause grammar (DCG) rules. Language membership question and enumeration. Extra argument for nonterminals: recovering a parse tree. Examples: apbp.prolog sheeptalk.prolog sheeptalk2.prolog Slides updated: 11:05am
1/25	lecture5.pdf	lecture5.pptx	31	Viewer	Context-sensitive grammars in Prolog: three methods. The case of the context-sensitive language {aⁿbⁿcⁿ | n > 0}. Files: abc_parse.prolog abc_count.prolog abc_cs.prolog Homework 3.
1/30	lecture6.pdf	lecture6.pptx	27	Viewer	Homework 3 review. The Cross Serial Dependencies lecture. Developing the Prolog context-sensitive grammar in 4 stages. g1.prolog g2.prolog g3.prolog g4.prolog Terminal log: terminal6.txt
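
For readers who want a quick feel for the language {aⁿbⁿcⁿ | n > 0} from lecture 5, here is a minimal Python sketch of a recognizer that works by counting. It is only an illustration and is not a translation of the course's Prolog files; the function name is_anbncn is made up, and the idea that counting corresponds to one of the three methods is an assumption based on the file name abc_count.prolog.

```python
import re


def is_anbncn(s: str) -> bool:
    """Return True if s is a^n b^n c^n for some n > 0 (hypothetical helper)."""
    # The regular expression only checks the shape a+ b+ c+ ...
    m = re.fullmatch(r"(a+)(b+)(c+)", s)
    if m is None:
        return False
    # ... the equal-count condition is the part a regular (or context-free)
    # grammar cannot express, so it is checked separately.
    a_run, b_run, c_run = m.groups()
    return len(a_run) == len(b_run) == len(c_run)


if __name__ == "__main__":
    for s in ["abc", "aabbcc", "aabbc", "abcabc", ""]:
        print(repr(s), is_anbncn(s))
```

The split into a shape check plus a count check is a design choice for readability; a Prolog grammar would typically carry the count in an extra argument instead.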

February

Date	PDF	PowerPoint	Number of Slides	Panopto	Topic
2/1	lecture7.pdf	lecture7.pptx	24	Viewer	Turn to writing our own CFGs for natural language: Agreement. The problem with Prolog left recursion. A grammar transformation: left recursive to right recursive BUT structure preserving. Homework 4. Files: nl1.prolog / nl2.prolog / nl3.prolog left.prolog / left2.prolog Terminal log: terminal7.txt
2/6	lecture8.pdf	lecture8.pptx	21	Viewer	Homework 4. Aside: stacking PPs. Anaphora. Homework 4 questions?
2/8	lecture9.pdf	lecture9.pptx	35	Viewer	Homework 4 Review: live programming. Other systems mentioned: Berkeley Neural Parser, ChatGPT, DALL-E 2, and FrameNet. Files: x.prolog / xt.prolog / npt.prolog Slides updated: 11:30am
2/13	lecture10.pdf	lecture10.pptx		Viewer	Lecture canceled due to sickness.
2/15	lecture11.pdf	lecture11.pptx	16	Viewer	Homework 4 Review: live programming completed. Homework 5
2/20	lecture12.pdf	lecture12.pptx	40	Viewer	Homework 5 Review. Memoization aka Dynamic Programming: example of Fibonacci numbers. File: fibonacci.prolog The CKY algorithm for Context-Free Parsing. CFG parsing in nltk. File: cnf.txt Terminal log: terminal12.txt
2/22	lecture13.pdf	lecture13.pptx	34	Viewer	Announcement about next week. Homework 6 Dotted rules. The Shift Reduce Parsing Algorithm.
2/27	lecture14.pdf	lecture14.pptx		Viewer	Pre-recorded Lecture. Crash Blossoms Homework 7. Other Popular Parsers: CoreNLP, Stanza and Berkeley Neural Parser. Homework: install Stanza into nltk, find Crash Blossoms, and run them on parsers.
2/29				Viewer	No Lecture. See Lecture 14.
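
As a rough companion to lecture 12, the sketch below shows memoization on Fibonacci numbers and a CKY recognizer for a grammar in Chomsky normal form. It uses only the Python standard library; it is not the course's fibonacci.prolog or cnf.txt, and the toy grammar (LEXICAL, BINARY) and the function names are hypothetical.

```python
from functools import lru_cache
from itertools import product


@lru_cache(maxsize=None)
def fib(n: int) -> int:
    """Fibonacci with memoization: each fib(k) is computed only once."""
    return n if n < 2 else fib(n - 1) + fib(n - 2)


# Toy CNF grammar, made up for illustration.
LEXICAL = {"the": {"Det"}, "dog": {"N"}, "cat": {"N"}, "saw": {"V"}}
BINARY = {
    ("Det", "N"): {"NP"},
    ("V", "NP"): {"VP"},
    ("NP", "VP"): {"S"},
}


def cky_recognize(words, start="S"):
    """Return True if the toy grammar derives words from start."""
    n = len(words)
    if n == 0:
        return False
    # table[i][j] holds the nonterminals that span words[i:j].
    table = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        table[i][i + 1] = set(LEXICAL.get(word, ()))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):  # split point
                for left, right in product(table[i][k], table[k][j]):
                    table[i][j] |= BINARY.get((left, right), set())
    return start in table[0][n]


if __name__ == "__main__":
    print(fib(30))
    print(cky_recognize("the dog saw the cat".split()))
```

Both functions rely on the same idea: results for smaller subproblems (smaller n, shorter spans) are stored and reused rather than recomputed.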

March

Date	PDF	PowerPoint	Number of Slides	Panopto	Topic
3/5				Viewer	Spring Break: no class
3/7				Viewer	Spring Break: no class
3/12	lecture15.pdf	lecture15.pptx		Viewer	Lecture canceled.
3/14	lecture16.pdf	lecture16.pptx	27	Viewer	Poll: any interest in my Keio lecture? Crash Blossoms workflow. A possible practical use. WordNet: introduction. Use on nltk.
3/19	lecture17.pdf	lecture17.pptx	24	Viewer	WordNet: similarity. word2vec: similarity. Using gensim. hypernyms.py Terminal log: terminal17.txt
3/21	lecture18.pdf	lecture18.pptx	32	Viewer	A bit more on vector arithmetic and word embeddings. A look at an example of possible uses of WordNet: Semantic Opposition. Homework 8 (simple practice with the browser) Terminal log: terminal18.txt
3/26	lecture19.pdf	lecture19.pptx	19	Viewer	Homework 8 Review A look at an example of possible uses of WordNet: Semantic Opposition and lexical semantics.
3/28	lecture20.pdf	lecture20.pptx	20	Viewer	Semantic Opposition: ChatGPT vs. WordNet Breadth-first Search. Live demo! bfs.py Terminal log: terminal20.txt
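
The breadth-first search from lecture 20 can be sketched on a small hand-made graph as follows. This is not the course's bfs.py: the graph below is a hypothetical stand-in for hypernym links (no WordNet data is used), and bfs_path is a made-up name.

```python
from collections import deque

# Hypothetical toy graph: each word points to its more general terms.
GRAPH = {
    "dog": ["canine"],
    "cat": ["feline"],
    "canine": ["carnivore"],
    "feline": ["carnivore"],
    "carnivore": ["mammal"],
    "mammal": ["animal"],
}


def bfs_path(graph, start, goal):
    """Return a shortest path from start to goal as a list, or None."""
    if start == goal:
        return [start]
    queue = deque([[start]])  # queue of partial paths, shortest first
    seen = {start}
    while queue:
        path = queue.popleft()
        for nxt in graph.get(path[-1], ()):
            if nxt in seen:
                continue
            if nxt == goal:
                return path + [nxt]
            seen.add(nxt)
            queue.append(path + [nxt])
    return None


if __name__ == "__main__":
    print(bfs_path(GRAPH, "dog", "mammal"))
    print(bfs_path(GRAPH, "dog", "cat"))
```

Because the toy links are directed (specific to general), there is no path from dog to cat here; a search over real WordNet relations would need to decide which relations to follow and in which direction.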

April

Date	PDF	PowerPoint	Number of Slides	Panopto	Topic
4/2	lecture21.pdf	lecture21.pptx	25	Viewer	Install the Penn Treebank (PTB) corpus from the course website. Homework 9: install tregex PTB POS tagset and Syntax tagsets. Introduction to tregex. Terminal log: terminal21.txt Slides updated: 11am
4/4	lecture22.pdf	lecture22.pptx	30	Viewer	Assume everyone has downloaded TREEBANK_3.zip. Installing the full PTB into nltk: usage from nltk.corpus import ptb tregex: macOS possible problem tregex: tutorial part 2 on searching Terminal log: terminal22.txt
4/9	lecture23.pdf	lecture23.pptx	22	Viewer	tregex: tutorial part 3 on searching Homework 10 Whiteboard images: 1 2 3 Terminal log: terminal23.txt
4/11	lecture24.pdf	lecture24.pptx	25	Viewer	tregex on Windows 11 and Ubuntu: make sure jre is installed. Homework 10 questions? TREEBANK_3 Theory: empty categories Example: subject and object relative clauses. Summary of tregex search
4/16	lecture25.pdf	lecture25.pptx	41	Viewer	Homework 10 Review Statistical Parsers: unimplemented theory Big unanswered question: why does break have so many different senses? Let's take a look at the verb break using nltk ptb. How many examples of break are there? Can we extract the verb frames?
4/18	lecture26.pdf	lecture26.pptx	14	Viewer	Optional homeworks 11 and 12. On corpus ptb from nltk. Make sure it's working properly before attempting the optional homeworks. Terminal log: terminal26.txt Slides modified: 11:10am
4/23	lecture27.pdf	lecture27.pptx	28	Viewer	More on nltk and ptb: (1) Zipf's law, (2) productions, and (3) looking for words with multiple POS tags and graphing the distribution. Python program: zipf.py
4/25	lecture28.pdf	lecture28.pptx	31	Viewer	Revisiting the verb break: an example project pipeline. Is the data out there rich? Comparing the ptb with the Childes database. Berkeley Neural Parser: benepar installation and use with nltk and spaCy. The Childes database: pre-processing a .csv file.
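
A small illustration of the rank/frequency tabulation behind Zipf's law (lecture 27). This is a sketch and not the course's zipf.py; the toy token list and the function name rank_frequency are made up. With the PTB installed, the tokens would come from the corpus instead.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, word, frequency) triples, most frequent word first."""
    counts = Counter(token.lower() for token in tokens)
    return [
        (rank, word, freq)
        for rank, (word, freq) in enumerate(counts.most_common(), start=1)
    ]


if __name__ == "__main__":
    # Hypothetical toy data; far too small to show a real Zipfian curve.
    tokens = "the cat saw the dog and the dog saw the cat".split()
    for rank, word, freq in rank_frequency(tokens):
        # Zipf's law predicts that rank * freq stays roughly constant
        # in a large corpus.
        print(rank, word, freq, rank * freq)
```

Plotting log(rank) against log(frequency) for a large corpus is the usual way to inspect the distribution; an approximately straight line is the pattern Zipf's law describes.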

May

Date	PDF	PowerPoint	Number of Slides	Panopto	Topic
5/1	lecture29.pdf	lecture29.pptx	22	Viewer	Reminder about optional homeworks. Another experiment with the ptb. A code walk-through. Treebanks provide statistics about CFG rules and how often they appear in the corpus. Suppose we add rules in order from most frequent on down until we can parse a (given) sentence. What do the parses look like?
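
The experiment described for lecture 29 (add rules from most frequent on down until a given sentence parses) might be sketched as below. This is not the code walked through in class. The rule counts, the TAGS lexicon and the function names are invented for illustration, and the sketch assumes binary rules only, which real treebank rules are not.

```python
# Hypothetical ((lhs, (left, right)), count) pairs; real counts would
# come from a treebank.
RULE_COUNTS = [
    (("NP", ("Det", "N")), 50),
    (("S", ("NP", "VP")), 40),
    (("VP", ("V", "NP")), 30),
    (("VP", ("VP", "PP")), 10),
    (("PP", ("P", "NP")), 8),
]
# Hypothetical one-tag-per-word lexicon.
TAGS = {"the": "Det", "dog": "N", "cat": "N", "park": "N", "saw": "V", "in": "P"}


def recognizes(rules, words, start="S"):
    """CKY-style check: do the given binary rules derive words from start?"""
    n = len(words)
    table = {
        (i, i + 1): ({TAGS[w]} if w in TAGS else set())
        for i, w in enumerate(words)
    }
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            cell = set()
            for k in range(i + 1, j):
                for lhs, (left, right) in rules:
                    if left in table[(i, k)] and right in table[(k, j)]:
                        cell.add(lhs)
            table[(i, j)] = cell
    return n > 0 and start in table[(0, n)]


def rules_needed(rule_counts, words):
    """Add rules in descending frequency until words parse; return them."""
    ordered = [
        rule for rule, _ in sorted(rule_counts, key=lambda rc: rc[1], reverse=True)
    ]
    for m in range(1, len(ordered) + 1):
        if recognizes(ordered[:m], words):
            return ordered[:m]
    return None  # no prefix of the rule list covers the sentence


if __name__ == "__main__":
    for sentence in ["the dog saw the cat", "the dog saw the cat in the park"]:
        used = rules_needed(RULE_COUNTS, sentence.split())
        print(sentence, "->", len(used) if used else None, "rules")
```

The sketch only reports how many rules were needed; recovering the parses themselves would require storing back-pointers in the table.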