Sufficient amounts of language data (text corpora) are essential for methods of computational linguistics and natural language processing. Rapid development of computer technology allows processing of much larger datasets than before. However, data at that scale is not available. Currently, the largest Czech corpora contain at most hundreds of millions of tokens (Czech National Corpus), which is not sufficient for many methods. Building text corpora is a time-consuming and expensive process that cannot possibly satisfy the needs of current research in the field. The proposed project aims to build a text corpus at least ten times larger than currently available corpora at a far lower cost. The corpus will be built from data publicly available on the internet. Automatically downloaded data will be filtered, cleaned, and linguistically processed. Because processing is fully automatic, the language quality of the resulting corpus will be lower than that of classical corpora, but its significant advantage will be its size. (en)
Compilation of a large Czech text corpus from data available on the internet, and its basic linguistic processing by automatic methods.