About: Downdating lexicon and language model for automatic transcription of Czech historical spoken documents



An Entity of Type: http://linked.opendata.cz/ontology/domain/vavai/Vysledek, within Data Space: linked.opendata.cz, associated with source document(s)

Attributes	Values
rdf:type	skos:Concept http://linked.opendata.cz/ontology/domain/vavai/Vysledek
Description	This paper deals with the task of adaptation of an existing Czech large-vocabulary speech recognition (LVCSR) system to the language used in previous historical epochs (before 1990). The goal is to fit its lexicon and language model (LM) so that the system could be employed for the automatic transcription of old spoken documents in the Czech Radio archive. The main problem is the lack of texts (in electronic form) from the 1945-1990 period. The only available and large enough source is digitized copies of Rudé Právo, the newspaper of the former Communist party of Czechoslovakia, the actual ruling body in the state. The newspaper has been scanned and converted into text via OCR software. However, the number of OCR errors is very high and so we have to apply several text pre-processing techniques to get a corpus suitable for the lexicon and language model 'downdating' (i.e., adaptation to the past). The proposed techniques helped us a) to reduce the number of out-of-vocabulary strings from 8.5 to 6.4 million, b) to identify 6.7 thousand history-conditioned word candidates to be added to the lexicon and c) to build a more appropriate LM. The adapted LVCSR system was evaluated on broadcast news from 1969-1989, where its word-error-rate decreased from 17.05 to 14.33%. (en)
Title	Downdating lexicon and language model for automatic transcription of Czech historical spoken documents (en)
skos:prefLabel	Downdating lexicon and language model for automatic transcription of Czech historical spoken documents (en)
skos:notation	RIV/46747885:24220/13:#0002791!RIV14-MK0-24220___
http://linked.open...avai/riv/aktivita	P
http://linked.open...avai/riv/aktivity	P(DF11P01OVV013)
http://linked.open...vai/riv/dodaniDat	2014
http://linked.open...aciTvurceVysledku	Chaloupka, Josef; Červa, Petr; Nouza, Jan; Málek, J.
http://linked.open.../riv/druhVysledku	D - Article in proceedings
http://linked.open...iv/duvernostUdaju	S - Complete and true data not subject to protection under special legal regulations
http://linked.open...titaPredkladatele	Technická univerzita v Liberci / Fakulta mechatroniky, informatiky a mezioborových studií
http://linked.open...dnocenehoVysledku	70523
http://linked.open...ai/riv/idVysledku	RIV/46747885:24220/13:#0002791
http://linked.open...riv/jazykVysledku	eng - English
http://linked.open.../riv/klicovaSlova	historical speech recognition; oral archives; lexicon (en)
http://linked.open.../riv/klicoveSlovo	oral archives; lexicon; historical speech recognition
http://linked.open...ontrolniKodProRIV	[ECC46B6B8DB4]
http://linked.open...v/mistoKonaniAkce	Czech Republic, Pilsen
http://linked.open...i/riv/mistoVydani	Germany, Berlin
http://linked.open...i/riv/nazevZdroje	16th International Conference, TSD 2013
http://linked.open...in/vavai/riv/obor	JC
http://linked.open...ichTvurcuVysledku	4 (xsd:int)
http://linked.open...cetTvurcuVysledku	4 (xsd:int)
http://linked.open...vavai/riv/projekt	Disclosure of the Czech Radio archive for sophisticated search
http://linked.open...UplatneniVysledku	2013
http://linked.open...iv/tvurceVysledku	Chaloupka, Josef; Málek, Jiří; Nouza, Jan; Červa, Petr
http://linked.open...vavai/riv/typAkce	WRD - Worldwide
http://linked.open.../riv/zahajeniAkce	2013-09-01 (xsd:date)
issn	0302-9743
number of pages	8 (xsd:int)
http://bibframe.org/vocab/doi	10.1007/978-3-642-40585-3_26
http://purl.org/ne...btex#hasPublisher	Springer-Verlag Berlin Heidelberg
https://schema.org/isbn	9783642405846
http://localhost/t...ganizacniJednotka	24220
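
The Description reports its results as before/after figures. A minimal sketch of how those figures translate into relative reductions is given here. The helper name `relative_reduction` and the variable names are illustrative assumptions and do not come from the record or the paper; only the four input numbers are taken from the Description.

```python
def relative_reduction(before: float, after: float) -> float:
    """Return the relative reduction from `before` to `after`, in percent."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before * 100.0


# Figures quoted in the Description field of this record.
wer_before, wer_after = 17.05, 14.33  # word error rate, in percent
oov_before, oov_after = 8.5, 6.4      # out-of-vocabulary strings, in millions

# Hypothetical variable names; the record itself reports only the raw figures.
wer_rel = relative_reduction(wer_before, wer_after)
oov_rel = relative_reduction(oov_before, oov_after)

print(f"WER: {wer_before - wer_after:.2f} points absolute, {wer_rel:.1f}% relative")
print(f"OOV strings: {oov_rel:.1f}% relative reduction")
```

Under these inputs the word error rate falls by 2.72 percentage points, about 16% in relative terms, and the count of out-of-vocabulary strings falls by roughly a quarter.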
