About: Language Identification on the Web: Extending the Dictionary Method

An Entity of Type : http://linked.opendata.cz/ontology/domain/vavai/Vysledek, within Data Space : linked.opendata.cz associated with source document(s)

Attributes  Values
rdf:type
Description
  • Automated language identification of written text is a well-established research domain that has received considerable attention in the past. By now, efficient and effective algorithms based on character $n$-grams are in use, mainly with identification based on Markov Processes or on character $n$-gram profiles. In this paper we investigate the limitations of these approaches when applied to real-world web pages. The challenges to be overcome include language identification on very short texts, correctly handling texts of unknown language and texts comprised of multiple languages. We propose and evaluate a new method, which constructs language models based on word relevance and addresses these limitations. We also extend our method to allow us to efficiently and automatically segment the input text into blocks of individual languages, in case of multiple-language documents. (en)
Title
  • Language Identification on the Web: Extending the Dictionary Method (en)
skos:prefLabel
  • Language Identification on the Web: Extending the Dictionary Method (en)
skos:notation
  • RIV/00216224:14330/09:00067120!RIV14-MSM-14330___
http://linked.open...avai/riv/aktivita
http://linked.open...avai/riv/aktivity
  • P(LC536), S
http://linked.open...vai/riv/dodaniDat
http://linked.open...aciTvurceVysledku
http://linked.open.../riv/druhVysledku
http://linked.open...iv/duvernostUdaju
http://linked.open...titaPredkladatele
http://linked.open...dnocenehoVysledku
  • 323178
http://linked.open...ai/riv/idVysledku
  • RIV/00216224:14330/09:00067120
http://linked.open...riv/jazykVysledku
http://linked.open.../riv/klicovaSlova
  • machine learning; language segmentation; language identification (en)
http://linked.open.../riv/klicoveSlovo
http://linked.open...ontrolniKodProRIV
  • [A5273553D9CC]
http://linked.open...v/mistoKonaniAkce
  • Mexico City, Mexico
http://linked.open...i/riv/mistoVydani
  • Mexico City, Mexico
http://linked.open...i/riv/nazevZdroje
  • Computational Linguistics and Intelligent Text Processing, 10th International Conference, CICLing 2009, Proceedings.
http://linked.open...in/vavai/riv/obor
http://linked.open...ichTvurcuVysledku
http://linked.open...cetTvurcuVysledku
http://linked.open...vavai/riv/projekt
http://linked.open...UplatneniVysledku
http://linked.open...iv/tvurceVysledku
  • Řehůřek, Radim
  • Kolkus, Milan
http://linked.open...vavai/riv/typAkce
http://linked.open...ain/vavai/riv/wos
  • 000265681200029
http://linked.open.../riv/zahajeniAkce
issn
  • 0302-9743
number of pages
http://bibframe.org/vocab/doi
  • 10.1007/978-3-642-00382-0_29
http://purl.org/ne...btex#hasPublisher
  • Springer-Verlag
https://schema.org/isbn
  • 9783642003813
http://localhost/t...ganizacniJednotka
  • 14330