Speakers: Takashi Morita (MIT) & Timothy O’Donnell (McGill)
Title: Bayesian learning of Japanese sublexica
Date/Time: Thursday, November 30th, 5:00-6:30pm
Location: 46-5165
Abstract:
Languages have long borrowed words from one another.
A borrowing language often permits a different set of sound sequences (phonotactics) from the lending language.
While loanwords may be reshaped to fit the borrower’s phonotactics, they can also introduce new sound patterns into the language.
Accordingly, native words and loanwords can exhibit different phonotactics within a single language, and linguists have proposed that such a language’s lexicon is better explained by a mixture of multiple phonotactic grammars: words are classified into sublexica (e.g. native vs. loan), and words belonging to different sublexica are subject to different phonotactic constraints.
This approach, however, raises a non-trivial learnability question: can learners classify words into the correct sublexica?
Words are not labeled with their sublexicon, so learners need to infer the classification.
In this study, we investigate Bayesian unsupervised learning of sublexica.
We focus on Japanese data (coded in the International Phonetic Alphabet), whose sublexical phonotactics have been proposed in the linguistic literature.
We show that even a simple Dirichlet process mixture of n-gram models yields remarkably successful classification.
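
For readers unfamiliar with the model class, the following is a minimal sketch of what a Dirichlet process mixture of n-gram models can look like. It is not the speakers' implementation. The abstract does not state the n-gram order, the priors, or the inference algorithm, so the bigram order, the symmetric Dirichlet smoothing parameter beta, the concentration parameter alpha, and the collapsed Gibbs sampler are all assumptions made for illustration, as are the names Cluster and gibbs.

```python
import math
import random
from collections import Counter


def bigrams(word):
    """Pad a word (a sequence of segments) with boundary symbols and list its bigrams."""
    padded = ["<s>"] + list(word) + ["</s>"]
    return list(zip(padded, padded[1:]))


class Cluster:
    """Bigram counts for one hypothesized sublexicon (illustrative, not the authors' code)."""

    def __init__(self):
        self.pair = Counter()     # counts of (previous segment, next segment)
        self.context = Counter()  # counts of previous segment
        self.size = 0             # number of words currently assigned

    def update(self, word, sign):
        """Add (sign=+1) or remove (sign=-1) a word's bigram counts."""
        for prev, nxt in bigrams(word):
            self.pair[(prev, nxt)] += sign
            self.context[prev] += sign
        self.size += sign

    def log_predictive(self, word, vocab_size, beta):
        """Log probability of a word given the cluster's counts, under a
        symmetric Dirichlet(beta) prior on each bigram distribution.
        Counts are updated within the word so repeated bigrams are handled exactly."""
        logp = 0.0
        seen_pair, seen_ctx = Counter(), Counter()
        for prev, nxt in bigrams(word):
            num = self.pair[(prev, nxt)] + seen_pair[(prev, nxt)] + beta
            den = self.context[prev] + seen_ctx[prev] + beta * vocab_size
            logp += math.log(num / den)
            seen_pair[(prev, nxt)] += 1
            seen_ctx[prev] += 1
        return logp


def gibbs(words, alpha=1.0, beta=0.1, sweeps=50, seed=0):
    """Collapsed Gibbs sampling over cluster assignments (Chinese restaurant process prior).
    Returns one integer cluster label per word."""
    rng = random.Random(seed)
    # Possible "next" symbols: every observed segment plus the end-of-word marker.
    vocab_size = len({seg for w in words for seg in w} | {"</s>"})
    first = Cluster()
    clusters = [first]
    assign = [first] * len(words)
    for w in words:
        first.update(w, +1)
    for _ in range(sweeps):
        for i, w in enumerate(words):
            assign[i].update(w, -1)                        # remove word i from its cluster
            clusters = [c for c in clusters if c.size > 0]  # drop emptied clusters
            candidates = clusters + [Cluster()]             # last candidate is a new cluster
            logw = [
                math.log(c.size if c.size else alpha)       # CRP prior weight
                + c.log_predictive(w, vocab_size, beta)     # bigram likelihood
                for c in candidates
            ]
            top = max(logw)
            weights = [math.exp(x - top) for x in logw]
            chosen = rng.choices(candidates, weights=weights)[0]
            chosen.update(w, +1)
            assign[i] = chosen
            if chosen is candidates[-1]:
                clusters.append(chosen)
    ids = {}
    return [ids.setdefault(id(c), len(ids)) for c in assign]


if __name__ == "__main__":
    # Hypothetical toy input: each character stands in for one segment.
    toy_words = ["kaze", "sakana", "tori", "paati", "fakkusu", "tiimu"]
    print(gibbs(toy_words, sweeps=20))
```

The sampler never sees sublexicon labels: the number of clusters and the assignment of each word are both inferred from the segment sequences alone, which is the unsupervised setting described in the abstract. With a handful of toy words the output is not meaningful; the sketch only shows the structure of the computation.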