Recent years have seen explosive growth in the number of texts available in digital libraries, such as Google Books and the dLib.si library of the National and University Library of Slovenia. Most of these books are old: they promote our cultural heritage and are out of copyright, which makes publishing them on the Web much easier. However, digital historical texts bring with them a number of problems. Full-text search on them is difficult, as the spelling of words has changed over time and there is no support for lemmatisation. Furthermore, such books are typically available only as PDF scans, and automatic OCR is of poor quality, especially for materials from before 1900.
This talk presents work from the last two years on producing a set of language resources for historical Slovene, with associated tools, aimed at alleviating these problems and at enabling computer-supported studies of historical Slovene. I will present an annotated reference corpus of 1,000 pages of historical texts, a text collection of a few million words, a computational lexicon, and a tool for text annotation. In both the resources and the tools, each word is first modernised, and then tagged and lemmatised. The modernisation relies on transcription rules; it makes the text easier to read for today's speakers and allows standard tagging and lemmatisation models to be used on the text. I will also present the workflow and tools used in developing the resources, and show the results.
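The modernise-then-annotate order described above can be illustrated with a minimal sketch. It is only a toy under stated assumptions: the rewrite rules, the two-entry lexicon, and the function names `modernise` and `annotate` are hypothetical and are not the project's actual rule set, lexicon, or tools.

```python
import re

# Hypothetical transcription rules as (pattern, replacement) pairs.
# Order matters: the digraph "ſh" must be rewritten before the single "ſ".
RULES = [
    (r"zh", "č"),
    (r"ſh", "š"),
    (r"ſ", "s"),
]

# Toy lexicon of modern word forms: form -> (lemma, coarse tag).
# A real system would use a trained tagger and lemmatiser instead.
LEXICON = {
    "človeka": ("človek", "Noun"),
    "hiša": ("hiša", "Noun"),
}


def modernise(word, rules=RULES):
    """Apply each rewrite rule in turn to a historical word form."""
    for pattern, replacement in rules:
        word = re.sub(pattern, replacement, word)
    return word


def annotate(tokens):
    """Modernise each token first, then look up its lemma and tag."""
    annotated = []
    for token in tokens:
        modern = modernise(token)
        lemma, tag = LEXICON.get(modern, (None, None))  # None if unknown
        annotated.append(
            {"original": token, "modern": modern, "lemma": lemma, "tag": tag}
        )
    return annotated


if __name__ == "__main__":
    for entry in annotate(["zhloveka", "hiſha"]):
        print(entry)
```

Because the lookup runs on the modernised form, the lexicon (or, in a full system, the tagging and lemmatisation models) only needs to cover contemporary spelling, which is the benefit the modernisation step is meant to provide.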
This work in progress, to be finished in 2012, is supported by the EU project IMPACT "Improving Access to Text" and the Google Humanities Research Award "Developing computational models for historical Slovene".