About: Using Various Types of Multimedia Resources to Train System for Automatic Transcription of Czech Historical Oral Archives

An Entity of Type : http://linked.opendata.cz/ontology/domain/vavai/Vysledek, within Data Space : linked.opendata.cz associated with source document(s)

Attributes    Values
rdf:type
Description
  • Historical spoken documents represent a unique segment of national cultural heritage. In order to disclose the large Czech Radio audio archive to the research community and to the public, we have been developing a system that automatically transcribes the archive files, indexes them and makes them searchable. The transcription of contemporary (1 or 2 decades old) documents is based on a lexicon and a statistical language model (LM) built from a large amount of recent texts available in electronic form. For the older periods (before 1990), however, digital texts do not exist. Therefore, we needed a) to find resources that represent the language of those times, b) to convert them from their original form to text, c) to use this text to create epoch-specific lexicons and LMs, and finally d) to apply them in the developed speech recognition system. In our case, the main resources were scanned historical newspapers, shorthand notes from the national parliament and subtitles from retro TV programs. When converted into text, they allowed us to build a more appropriate lexicon and to produce a preliminary version of the transcriptions. These were reused for unsupervised retraining of the final LM. In this way, we significantly improved the accuracy of automatically transcribed radio news broadcast in the 1969-1989 era, from an initial 83 % to 88 %. (en)
Title
  • Using Various Types of Multimedia Resources to Train System for Automatic Transcription of Czech Historical Oral Archives (en)
skos:prefLabel
  • Using Various Types of Multimedia Resources to Train System for Automatic Transcription of Czech Historical Oral Archives (en)
skos:notation
  • RIV/46747885:24220/13:#0002790!RIV14-MK0-24220___
http://linked.open...avai/predkladatel
http://linked.open...avai/riv/aktivita
http://linked.open...avai/riv/aktivity
  • P(DF11P01OVV013)
http://linked.open...vai/riv/dodaniDat
http://linked.open...aciTvurceVysledku
http://linked.open.../riv/druhVysledku
http://linked.open...iv/duvernostUdaju
http://linked.open...titaPredkladatele
http://linked.open...dnocenehoVysledku
  • 113196
http://linked.open...ai/riv/idVysledku
  • RIV/46747885:24220/13:#0002790
http://linked.open...riv/jazykVysledku
http://linked.open.../riv/klicovaSlova
  • historical audio archives; speech-to-text transcription; OCR; lexicon building; machine learning (en)
http://linked.open.../riv/klicoveSlovo
http://linked.open...ontrolniKodProRIV
  • [705C47E2708E]
http://linked.open...v/mistoKonaniAkce
  • Italy, Naples
http://linked.open...i/riv/mistoVydani
  • Germany, Berlin
http://linked.open...i/riv/nazevZdroje
  • New Trends in Image Analysis and Processing - ICIAP 2013
http://linked.open...in/vavai/riv/obor
http://linked.open...ichTvurcuVysledku
http://linked.open...cetTvurcuVysledku
http://linked.open...vavai/riv/projekt
http://linked.open...UplatneniVysledku
http://linked.open...iv/tvurceVysledku
  • Chaloupka, Josef
  • Nouza, Jan
  • Kuchařová, Michaela
http://linked.open...vavai/riv/typAkce
http://linked.open.../riv/zahajeniAkce
issn
  • 0302-9743
number of pages
http://bibframe.org/vocab/doi
  • 10.1007/978-3-642-41190-8_25
http://purl.org/ne...btex#hasPublisher
  • Springer-Verlag Berlin Heidelberg
https://schema.org/isbn
  • 9783642411892
http://localhost/t...ganizacniJednotka
  • 24220