This HTML5 document contains 44 embedded RDF statements represented using HTML+Microdata notation.

The embedded RDF content will be recognized by any processor of HTML5 Microdata.

Namespace Prefixes

Prefix   IRI
n13      http://linked.opendata.cz/ontology/domain/vavai/riv/typAkce/
dcterms  http://purl.org/dc/terms/
n8       http://localhost/temp/predkladatel/
n7       http://purl.org/net/nknouf/ns/bibtex#
n18      http://linked.opendata.cz/resource/domain/vavai/projekt/
n5       http://linked.opendata.cz/resource/domain/vavai/riv/tvurce/
n21      http://linked.opendata.cz/ontology/domain/vavai/
n10      https://schema.org/
s        http://schema.org/
skos     http://www.w3.org/2004/02/skos/core#
n3       http://linked.opendata.cz/ontology/domain/vavai/riv/
n14      http://linked.opendata.cz/resource/domain/vavai/vysledek/RIV%2F00216208%3A11320%2F10%3A10078038%21RIV11-GA0-11320___/
n2       http://linked.opendata.cz/resource/domain/vavai/vysledek/
rdf      http://www.w3.org/1999/02/22-rdf-syntax-ns#
n6       http://linked.opendata.cz/ontology/domain/vavai/riv/klicoveSlovo/
n17      http://linked.opendata.cz/ontology/domain/vavai/riv/duvernostUdaju/
xsdh     http://www.w3.org/2001/XMLSchema#
n20      http://linked.opendata.cz/ontology/domain/vavai/riv/jazykVysledku/
n12      http://linked.opendata.cz/ontology/domain/vavai/riv/aktivita/
n16      http://linked.opendata.cz/ontology/domain/vavai/riv/obor/
n4       http://linked.opendata.cz/ontology/domain/vavai/riv/druhVysledku/
n15      http://reference.data.gov.uk/id/gregorian-year/
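
The prefixed names used in the statements are shorthand for full IRIs: the prefix is replaced by its IRI from the table and the local part is appended. A minimal sketch of that expansion follows, assuming Python 3.9 or later; the `PREFIXES` dictionary copies only a few rows of the table, and the `expand` helper is a hypothetical name introduced for illustration, not part of any library.

```python
# A small subset of the prefix table; extend it with further rows as needed.
PREFIXES = {
    "n2": "http://linked.opendata.cz/resource/domain/vavai/vysledek/",
    "n3": "http://linked.opendata.cz/ontology/domain/vavai/riv/",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "dcterms": "http://purl.org/dc/terms/",
}


def expand(prefixed_name: str, prefixes: dict[str, str] = PREFIXES) -> str:
    """Expand 'prefix:local' into a full IRI (hypothetical helper)."""
    # Split on the first colon only; the local part may contain further colons.
    prefix, sep, local = prefixed_name.partition(":")
    if not sep or prefix not in prefixes:
        raise KeyError(f"unknown prefix in {prefixed_name!r}")
    return prefixes[prefix] + local


print(expand("skos:prefLabel"))
print(expand("n3:idVysledku"))
```

The helper raises `KeyError` for a prefix missing from the dictionary rather than guessing, which seems the safer default for a lookup table like this one.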

Statements

Subject Item
n2:RIV%2F00216208%3A11320%2F10%3A10078038%21RIV11-GA0-11320___
rdf:type
skos:Concept n21:Vysledek
dcterms:description
Large corpora are essential to modern methods of computational linguistics and natural language processing. In this paper, we describe an ongoing project whose aim is to build a largest corpus of Czech texts. We are building the corpus from Czech Internet web pages, using (and, if needed, developing) advanced downloading, cleaning and automatic linguistic processing tools. Our concern is to keep the whole process language independent and thus applicable also for building web corpora of other languages. In the paper, we briefly describe the crawling, cleaning, and part-of-speech tagging procedures. Using a prototype corpus, we provide a comparison with a current corpora (in particular, SYN2005, part of the Czech National Corpora). We analyse part-of-speech tag distribution, OOV word ratio, average sentence length and Spearman rank correlation coefficient of the distance of ranks of 500 most frequent words. Our results show that our prototype corpus is now quite homogenous. The challenging task is to fi
dcterms:title
Building a Web Corpus of Czech
skos:prefLabel
Building a Web Corpus of Czech
skos:notation
RIV/00216208:11320/10:10078038!RIV11-GA0-11320___
n3:aktivita
n12:P
n3:aktivity
P(GA405/09/0278), P(LC536)
n3:dodaniDat
n15:2011
n3:domaciTvurceVysledku
n5:7699611 n5:1305522 n5:2787865
n3:druhVysledku
n4:D
n3:duvernostUdaju
n17:S
n3:entitaPredkladatele
n14:predkladatel
n3:idSjednocenehoVysledku
249360
n3:idVysledku
RIV/00216208:11320/10:10078038
n3:jazykVysledku
n20:eng
n3:klicovaSlova
czech; corpus; building
n3:klicoveSlovo
n6:building n6:czech n6:corpus
n3:kontrolniKodProRIV
[47E7329605E8]
n3:mistoKonaniAkce
Valletta, Malta
n3:mistoVydani
Valletta, Malta
n3:nazevZdroje
Proceedings of the 7th International Conference on Language Resources and Evaluation (LREC 2010)
n3:obor
n16:AI
n3:pocetDomacichTvurcuVysledku
3
n3:pocetTvurcuVysledku
3
n3:projekt
n18:LC536 n18:GA405%2F09%2F0278
n3:rokUplatneniVysledku
n15:2010
n3:tvurceVysledku
Spoustová, Drahomíra; Spousta, Miroslav; Pecina, Pavel
n3:typAkce
n13:WRD
n3:zahajeniAkce
2010-05-17+02:00
s:numberOfPages
4
n7:hasPublisher
European Language Resources Association
n10:isbn
2-9517408-6-7
n8:organizacniJednotka
11320
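
The local part of the subject IRI is the percent-encoded form of the `skos:notation` value listed among the statements. The sketch below, using only Python's standard `urllib.parse` module, decodes the local part so that the two can be compared; the variable names are illustrative assumptions.

```python
from urllib.parse import unquote

# Local part of the subject item, copied from the statements above.
local_name = "RIV%2F00216208%3A11320%2F10%3A10078038%21RIV11-GA0-11320___"

# Percent-decoding turns %2F into "/", %3A into ":" and %21 into "!".
notation = unquote(local_name)
print(notation)
```

The decoded string can then be compared directly with the `skos:notation` value of the record.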