This on-campus course continues the introductory LING/C SC/PSYC 538
Computational Linguistics.¹ It is designed to give students more
in-depth knowledge and hands-on experience with techniques and
software than is possible in 538.
Part of the course involves a more advanced exploration of
fundamental topics covered in 538, e.g. writing grammars.
The larger part of the course involves projects using software packages.
As part of the course, students will be expected to develop the skills to install and run the software, and to carry out project work, on their own machines.
Projects:
- Parsing algorithms.
- Treebanks (phrase-structure/dependency-based): e.g. Penn Treebank, lookup software.
- Part-of-speech taggers.
- The use and modification of statistical parsers trained on treebanks.
- Advanced linguistic theories: the Minimalist Program.
- Ontologies and Semantic Networks: WordNet etc.
- Question-Answering (QA).
- more...
¹ Note: 538 is a prerequisite for this class. For
on-campus students, 538 is offered in Fall semesters only.
Software
We will make use of the programming languages Python (3.x),
Perl and Prolog, corpora from the LDC and other
sources, and other software packages, e.g. (Java-based)
parsers. All software used will be freely available.
Students are expected to have their own laptop and to possess sufficient
privileges to install packages on it. Linux and macOS will
be supported. Windows 10 is only partially supported, so students on a
Windows PC should install Linux as well.
Grading
Students will be given a series of tasks to
accomplish. Satisfactory completion of
all tasks will result in a superior grade.
Readings
Required reading will be from the draft version of the 538 course textbook, Speech and Language
Processing, 3rd edition (Jurafsky & Martin), along with project documentation
(manuals) and papers and/or dissertations to be made available
online.
Instructor: Sandiway Fong sandiway AT arizona.edu
Office: 311 Douglass
Administrivia
Location | Modern Languages 203
Time | Tuesday-Thursday 9:30-10:45 am
Syllabus
See lecture 1 slides and syllabus.pdf.
Lecture Notes
Available in both Adobe PDF and Microsoft PowerPoint formats.
January
February
March
April
Date | Lecture Notes (PDF) | Lecture Notes (PowerPoint) | Number of Slides | Panopto | Topic
4/2 | lecture21.pdf | lecture21.pptx | 25 | Viewer | Install the Penn Treebank (PTB) corpus from the course website. Homework 9: install tregex. PTB POS tagset and syntax tagsets. Introduction to tregex. Terminal log: terminal21.txt. Slides updated: 11am.
4/4 | lecture22.pdf | lecture22.pptx | 30 | Viewer | Assume everyone has downloaded TREEBANK_3.zip. Installing the full PTB into nltk; usage: from nltk.corpus import ptb. tregex: possible macOS problem. tregex: tutorial part 2 on searching. Terminal log: terminal22.txt.
4/9 | lecture23.pdf | lecture23.pptx | 22 | Viewer | tregex: tutorial part 3 on searching. Homework 10. Whiteboard images: 1 2 3. Terminal log: terminal23.txt.
4/11 | lecture24.pdf | lecture24.pptx | 25 | Viewer | tregex on Windows 11 and Ubuntu: make sure a JRE is installed. Homework 10 questions? TREEBANK_3 theory: empty categories. Example: subject and object relative clauses. Summary of tregex search.
4/16 | lecture25.pdf | lecture25.pptx | 41 | Viewer | Homework 10 review. Statistical parsers: unimplemented theory. Big unanswered question: why does break have so many different senses? Let's take a look at the verb break using nltk ptb. How many examples of break are there? Can we extract the verb frames?
4/18 | lecture26.pdf | lecture26.pptx | 14 | Viewer | Optional homeworks 11 and 12, on the ptb corpus from nltk. Make sure it's working properly before attempting the optional homeworks. Terminal log: terminal26.txt. Slides modified: 11:10am.
4/23 | lecture27.pdf | lecture27.pptx | 28 | Viewer | More on nltk and ptb: (1) Zipf's law, (2) productions, and (3) looking for words with multiple POS tags and graphing the distribution. Python program: zipf.py.
4/25 | lecture28.pdf | lecture28.pptx | 31 | Viewer | Revisiting the verb break: an example project pipeline. Is the data out there rich? Comparing the ptb with the CHILDES database. Berkeley Neural Parser: benepar installation and use with nltk and spaCy. The CHILDES database: pre-processing a .csv file.
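The nltk/ptb exercises listed for April (Zipf's law, words with multiple POS tags, counting examples of the verb break) can be sketched on plain (word, tag) pairs. The following is a minimal illustration only, not the course's zipf.py: the function names, the toy data and the list of forms of break are all assumptions made for this example. With the full PTB installed into nltk, the same functions could presumably be fed from `ptb.words()` and `ptb.tagged_words()`.

```python
from collections import Counter, defaultdict

# Hypothetical list of surface forms of the verb "break" (an assumption).
BREAK_FORMS = {"break", "breaks", "broke", "broken", "breaking"}


def rank_frequency(words):
    """Return (rank, word, count) triples, most frequent word first.

    Zipf's law predicts that count is roughly proportional to 1 / rank.
    Words are lowercased before counting.
    """
    counts = Counter(w.lower() for w in words)
    return [(rank, w, c)
            for rank, (w, c) in enumerate(counts.most_common(), start=1)]


def words_with_multiple_tags(tagged_words):
    """Map each (lowercased) word seen with more than one POS tag to its tags."""
    tags = defaultdict(set)
    for word, tag in tagged_words:
        tags[word.lower()].add(tag)
    return {w: sorted(t) for w, t in tags.items() if len(t) > 1}


def tag_count_distribution(tagged_words):
    """Count how many word types have 1 tag, 2 tags, and so on.

    This is one possible reading of "the distribution" to be graphed.
    """
    tags = defaultdict(set)
    for word, tag in tagged_words:
        tags[word.lower()].add(tag)
    return Counter(len(t) for t in tags.values())


def count_verb_tokens(tagged_words, forms):
    """Count tokens whose form is in `forms` and whose tag is a PTB verb tag.

    All PTB verb tags (VB, VBD, VBG, VBN, VBP, VBZ) start with "VB".
    """
    return sum(1 for word, tag in tagged_words
               if word.lower() in forms and tag.startswith("VB"))


# Toy data standing in for ptb.tagged_words() (made up for illustration).
toy_tagged = [
    ("The", "DT"), ("glass", "NN"), ("broke", "VBD"), (".", "."),
    ("They", "PRP"), ("break", "VBP"), ("the", "DT"), ("rules", "NNS"), (".", "."),
    ("A", "DT"), ("short", "JJ"), ("break", "NN"), ("helps", "VBZ"), (".", "."),
]

if __name__ == "__main__":
    for rank, word, count in rank_frequency(w for w, _ in toy_tagged)[:3]:
        print(rank, word, count)
    print(words_with_multiple_tags(toy_tagged))
    print(count_verb_tokens(toy_tagged, BREAK_FORMS))
```

Matching on surface forms plus the tag prefix is a deliberate simplification: it avoids a lemmatizer, at the cost of having to list the forms by hand.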
May
Date | Lecture Notes (PDF) | Lecture Notes (PowerPoint) | Number of Slides | Panopto | Topic
5/1 | lecture29.pdf | lecture29.pptx | 22 | Viewer | Reminder about optional homeworks. Another experiment with the ptb: a code walk-through. Treebanks provide statistics about CFG rules and how often they appear in the corpus. Suppose we add rules in order from most frequent on down until we can parse a (given) sentence. What do the parses look like?
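The 5/1 experiment (add CFG rules from most frequent on down until a given sentence can be parsed) can be sketched without any external packages. The code below is a hedged illustration, not the lecture's walk-through: trees are nested tuples rather than nltk trees, the toy treebank is made up, and `recognizes` is a simple bottom-up chart recognizer that only answers yes or no rather than returning parses. It assumes no empty right-hand sides and that words and category labels do not share names. With nltk, the rule counts would more likely come from `tree.productions()` over `ptb.parsed_sents()`.

```python
from collections import Counter


def productions(tree):
    """Yield (lhs, rhs) rules from a tree written as (label, child, ...).

    Leaves are plain strings; rhs is a tuple of child labels or words.
    """
    label, *children = tree
    yield (label, tuple(c if isinstance(c, str) else c[0] for c in children))
    for child in children:
        if not isinstance(child, str):
            yield from productions(child)


def seq_spans(rhs, i, j, chart):
    """True if the symbol sequence rhs can cover words[i:j] given the chart."""
    if len(rhs) == 1:
        return rhs[0] in chart[(i, j)]
    last_split = j - (len(rhs) - 1)  # leave at least one word per later symbol
    return any(rhs[0] in chart[(i, k)] and seq_spans(rhs[1:], k, j, chart)
               for k in range(i + 1, last_split + 1))


def recognizes(rules, start, words):
    """Bottom-up chart recognition: can `start` derive `words` under `rules`?"""
    n = len(words)
    chart = {}  # (i, j) -> set of symbols that derive words[i:j]
    for length in range(1, n + 1):
        for i in range(n - length + 1):
            j = i + length
            cell = chart[(i, j)] = {words[i]} if length == 1 else set()
            changed = True
            while changed:  # repeat so chains of unary rules are closed
                changed = False
                for lhs, rhs in rules:
                    if lhs not in cell and seq_spans(rhs, i, j, chart):
                        cell.add(lhs)
                        changed = True
    return n > 0 and start in chart[(0, n)]


def rules_needed(trees, start, words):
    """Add rules in decreasing frequency until `words` is recognized.

    Returns the list of rules added so far, or None if even the full
    rule set does not cover the sentence.
    """
    counts = Counter(rule for tree in trees for rule in productions(tree))
    grammar = []
    for rule, _ in counts.most_common():
        grammar.append(rule)
        if recognizes(grammar, start, words):
            return grammar
    return None


# Toy treebank (made up for illustration).
toy_trees = [
    ("S", ("NP", ("DT", "the"), ("NN", "dog")), ("VP", ("VBD", "barked"))),
    ("S", ("NP", ("DT", "the"), ("NN", "cat")), ("VP", ("VBD", "slept"))),
    ("S", ("NP", ("NNP", "Kim")),
          ("VP", ("VBD", "saw"), ("NP", ("DT", "the"), ("NN", "dog")))),
]

if __name__ == "__main__":
    grammar = rules_needed(toy_trees, "S", ["the", "cat", "barked"])
    for lhs, rhs in grammar or []:
        print(lhs, "->", " ".join(rhs))
```

Re-running the recognizer after every added rule is wasteful but keeps the sketch short; on a real treebank one would add rules in batches or use an existing chart parser.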