A statistical model for unsupervised linguistic inference


Michael Lamar, SLU

Colloquium
When: Thu, Mar 03, 2011, 11:00 AM to 12:00 PM
Where: RH 223

Abstract: A long-standing problem in computational linguistics is the categorization of words from a natural-language corpus into parts of speech without the use of annotated training data. After a brief discussion of why this problem is of interest and why it has proven difficult to solve, we will introduce a statistical model that has proven quite successful in attacking it. The model embeds words into a high-dimensional Euclidean space in such a way as to maximize the likelihood of observing the sequence of words found in the corpus. This embedding of categorical data allows classical machine-learning clustering techniques (e.g., k-means clustering) to group the words in a way consistent with the linguistic notion of part of speech. The performance of this approach will be compared with that of several other recent statistical models based on Markov chains.
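The clustering step described in the abstract can be sketched with a toy example. The code below is illustrative only, not the speaker's implementation: it stands in for learned word embeddings with two synthetic blobs of points in a Euclidean space (imagined as a "noun-like" and a "verb-like" region) and groups them with a plain k-means loop.

```python
import numpy as np

def kmeans(X, k, init_idx=None, iters=50, seed=0):
    """Plain k-means: assign each point to its nearest centroid,
    then recompute each centroid as the mean of its assigned points."""
    rng = np.random.default_rng(seed)
    if init_idx is None:
        init_idx = rng.choice(len(X), size=k, replace=False)
    centroids = X[np.asarray(init_idx)].astype(float)
    for _ in range(iters):
        # Pairwise distances (points x centroids), then nearest-centroid labels.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return labels

# Hypothetical stand-in for learned embeddings: two tight blobs in R^5.
rng = np.random.default_rng(1)
nouns = rng.normal(0.0, 0.1, size=(20, 5))
verbs = rng.normal(3.0, 0.1, size=(20, 5))
X = np.vstack([nouns, verbs])

# Seed one centroid in each blob so the toy demo converges deterministically.
labels = kmeans(X, k=2, init_idx=[0, 20])
assert len(set(labels[:20])) == 1 and len(set(labels[20:])) == 1
assert labels[0] != labels[20]
```

In the actual setting, the rows of `X` would be the word embeddings learned by maximizing corpus likelihood, and the recovered clusters would be compared against gold part-of-speech tags.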

Reception to follow.
