2008-03-12

BrownBag Mar 13th at 12:30: Accurate POS Tagger for Languages with 1000+ Morphosyntactic Descriptors

Miha Grčar & Jan Rupnik

At the Brown Bag we will present the work that was done in the context of two seminar assignments at Jozef Stefan International Postgraduate School.

In the context of the Language Technologies course we have implemented a part-of-speech tagger that is able to handle large tagsets. The mission was (and still is) to implement an accurate POS tagger for the Slovene language. We will talk about how the rich inflectional morphology of Slovene affects POS tagging, why we chose an SVM-based approach over hidden Markov models, which issues we had to deal with given the chosen technology and its constraints, how we constructed feature vectors for training, and how we reduced the size of the model.

One of our approaches to reducing the size of the model, while hoping to improve its efficiency and accuracy at the same time, was to divide the model into several smaller models. We trained a decision tree (presented at the Knowledge Discovery course) to divide the model in the training phase and to decide which of the smaller models to use for tagging a particular word in the tagging phase.
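
To make the idea of routing words to smaller models concrete, here is a minimal toy sketch. It is not our actual implementation: the hand-written route() function is a hypothetical stand-in for the learned decision tree, a simple most-frequent-tag counter stands in for each SVM model, and the coarse tags in the sample data are made up for illustration.

```python
from collections import Counter, defaultdict

def route(word):
    """Hypothetical routing rule; stands in for the learned decision tree."""
    if word[0].isupper():
        return "capitalized"
    if len(word) <= 3:
        return "short"
    return "other"

def train(tagged_words):
    """Build one small model per partition (here: word -> tag counts)."""
    models = defaultdict(lambda: defaultdict(Counter))
    for word, tag in tagged_words:
        models[route(word)][word.lower()][tag] += 1
    return models

def tag(models, word, default="UNK"):
    """Pick the sub-model with the same routing rule, then look the word up."""
    seen = models.get(route(word), {}).get(word.lower())
    return seen.most_common(1)[0][0] if seen else default

# Toy training data with made-up coarse tags (assumption, not a real tagset).
training = [("Ljubljana", "NOUN"), ("je", "VERB"), ("mesto", "NOUN"),
            ("in", "CONJ"), ("je", "VERB")]
models = train(training)
print(tag(models, "je"), tag(models, "mesto"), tag(models, "hiša"))
```

The point of the sketch is only that the same routing decision is used in both phases, so each sub-model sees just its own share of the words and can stay small.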

We were successful with several of our approaches and managed to implement a tagger that is highly accurate on known words. It is still beaten by TnT when it comes to unknown words. However, we have many promising ideas on how to fix this ...
