About: Analysis of Czech Web 1T 5-gram corpus and its comparison with Czech National Corpus Data

An Entity of Type : http://linked.opendata.cz/ontology/domain/vavai/Vysledek, within Data Space : linked.opendata.cz associated with source document(s)

Attributes   Values
rdf:type
Description
  • In this paper, the newly issued Czech Web 1T 5-gram corpus, created by Google and LDC, is analysed and compared with a reference n-gram corpus obtained from the Czech National Corpus. The original 5-grams from both corpora were post-processed, and statistical trigram language models with various vocabulary sizes and parameters were created. Corpus statistics such as unique and total word and n-gram counts, before and after post-processing, are presented and discussed, with particular focus on cleaning invalid tokens from the Web 1T data. Tools from the HTK Toolkit were used for the evaluation; accuracy, OOV rates, and perplexity were measured using sentence transcriptions from the Czech SPEECON database. (en)
Title
  • Analysis of Czech Web 1T 5-gram corpus and its comparison with Czech National Corpus Data (en)
skos:prefLabel
  • Analysis of Czech Web 1T 5-gram corpus and its comparison with Czech National Corpus Data (en)
skos:notation
  • RIV/68407700:21230/10:00169505!RIV11-GA0-21230___
http://linked.open...avai/riv/aktivita
http://linked.open...avai/riv/aktivity
  • P(GA102/08/0707), Z(MSM6840770014)
http://linked.open...iv/cisloPeriodika
  • 2010933819
http://linked.open...vai/riv/dodaniDat
http://linked.open...aciTvurceVysledku
http://linked.open.../riv/druhVysledku
http://linked.open...iv/duvernostUdaju
http://linked.open...titaPredkladatele
http://linked.open...dnocenehoVysledku
  • 246367
http://linked.open...ai/riv/idVysledku
  • RIV/68407700:21230/10:00169505
http://linked.open...riv/jazykVysledku
http://linked.open.../riv/klicovaSlova
  • statistical language model; text corpora; Czech Web 1T 5-gram; Czech National Corpus; HTK Toolkit (en)
http://linked.open.../riv/klicoveSlovo
http://linked.open...odStatuVydavatele
  • DE - Federal Republic of Germany
http://linked.open...ontrolniKodProRIV
  • [C04124E0B1DC]
http://linked.open...i/riv/nazevZdroje
  • Lecture Notes in Artificial Intelligence
http://linked.open...in/vavai/riv/obor
http://linked.open...ichTvurcuVysledku
http://linked.open...cetTvurcuVysledku
http://linked.open...vavai/riv/projekt
http://linked.open...UplatneniVysledku
http://linked.open...v/svazekPeriodika
  • 6231
http://linked.open...iv/tvurceVysledku
  • Pollák, Petr
  • Procházka, Václav
http://linked.open...ain/vavai/riv/wos
  • 000288619400024
http://linked.open...n/vavai/riv/zamer
issn
  • 0302-9743
number of pages
http://localhost/t...ganizacniJednotka
  • 21230