About: Downdating lexicon and language model for automatic transcription of Czech historical spoken documents

An Entity of Type : http://linked.opendata.cz/ontology/domain/vavai/Vysledek, within Data Space : linked.opendata.cz associated with source document(s)

Attributes | Values
rdf:type
Description
  • This paper deals with the task of adapting an existing Czech large-vocabulary continuous speech recognition (LVCSR) system to the language used in earlier historical periods (before 1990). The goal is to fit its lexicon and language model (LM) so that the system can be employed for the automatic transcription of old spoken documents in the Czech Radio archive. The main problem is the lack of texts in electronic form from the 1945-1990 period. The only available source that is large enough consists of digitized copies of Rudé Právo, the newspaper of the former Communist Party of Czechoslovakia, the actual ruling body in the state. The newspaper has been scanned and converted into text using OCR software. However, the number of OCR errors is very high, so we have to apply several text pre-processing techniques to obtain a corpus suitable for "downdating" the lexicon and language model (i.e. adapting them to the past). The proposed techniques helped us a) to reduce the number of out-of-vocabulary strings from 8.5 million to 6.4 million, b) to identify 6.7 thousand history-conditioned word candidates to be added to the lexicon, and c) to build a more appropriate LM. The adapted LVCSR system was evaluated on broadcast news from 1969-1989, where its word error rate decreased from 17.05% to 14.33%. (en)
Title
  • Downdating lexicon and language model for automatic transcription of Czech historical spoken documents (en)
skos:prefLabel
  • Downdating lexicon and language model for automatic transcription of Czech historical spoken documents (en)
skos:notation
  • RIV/46747885:24220/13:#0002791!RIV14-MK0-24220___
http://linked.open...avai/riv/aktivita
http://linked.open...avai/riv/aktivity
  • P(DF11P01OVV013)
http://linked.open...vai/riv/dodaniDat
http://linked.open...aciTvurceVysledku
http://linked.open.../riv/druhVysledku
http://linked.open...iv/duvernostUdaju
http://linked.open...titaPredkladatele
http://linked.open...dnocenehoVysledku
  • 70523
http://linked.open...ai/riv/idVysledku
  • RIV/46747885:24220/13:#0002791
http://linked.open...riv/jazykVysledku
http://linked.open.../riv/klicovaSlova
  • historical speech recognition; oral archives; lexicon (en)
http://linked.open.../riv/klicoveSlovo
http://linked.open...ontrolniKodProRIV
  • [ECC46B6B8DB4]
http://linked.open...v/mistoKonaniAkce
  • Czech Republic, Pilsen
http://linked.open...i/riv/mistoVydani
  • Germany, Berlin
http://linked.open...i/riv/nazevZdroje
  • 16th International Conference, TSD 2013
http://linked.open...in/vavai/riv/obor
http://linked.open...ichTvurcuVysledku
http://linked.open...cetTvurcuVysledku
http://linked.open...vavai/riv/projekt
http://linked.open...UplatneniVysledku
http://linked.open...iv/tvurceVysledku
  • Chaloupka, Josef
  • Málek, Jiří
  • Nouza, Jan
  • Červa, Petr
http://linked.open...vavai/riv/typAkce
http://linked.open.../riv/zahajeniAkce
issn
  • 0302-9743
number of pages
http://bibframe.org/vocab/doi
  • 10.1007/978-3-642-40585-3_26
http://purl.org/ne...btex#hasPublisher
  • Springer-Verlag Berlin Heidelberg
https://schema.org/isbn
  • 9783642405846
http://localhost/t...ganizacniJednotka
  • 24220